yacy_search_server

Commit Graph

Author	SHA1	Message	Date
orbiter	22dbbcfa56	better (and corrected) recognition of intranet and internet-addresses. This corrects the isLocal property that is used by network definitions to restrict index ranges to local and global addresses. Address locations (intranet or internet) had been partly identified by the top level domain of the host address. Since intranet addresses can also be addressed using a host name that is in a country domain it is necessary to do a dns resolving for each check. The check is supported by a local dns cache so the intranet/internet check should not affect network traffic too much. To ensure that the cache works properly the cache class was upgraded to better concurrency data structures. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6977 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	b6fb239e74	redesign of parser interface: some file types are containers for several files. These containers had been parsed in such a way that the set of resulting parsed content was merged into one single document before parsing. Using this parser infrastructure it is not possible to parse document containers that contain individual files. An example is an rss file where the rss messages can be treated as individual documents with their own url reference. Another example is a surrogate file which was treated with a special operation outside of the parser infrastructure. This commit introduces a redesigned parser interface and a new abstract parser implementation. The new parser interface now has only one entry point and always returns a set of parsed documents. In case of single documents the parser method returns a set of one document. To be compliant with the new interface, the zip and tar parsers had also been completely redesigned. All parsers are now much simpler and cleaner in their structure. The switchboard operations had been extended to operate with sets of parsed files, not single parsed files. Additionally, parsing of jar manifest files had been added. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6955 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	150cf42a1b	migrated all my LGPL 3 -licensed files to the LGPL 2.1 because LGPL 3 is not compatible to the GPL 2 see http://www.gnu.org/licenses/license-list.html for explanation Since (as far as I know) nobody else has ever contributed to these files I may be allowed to just apply an older license. You may consider this as a dual-licensing and may use and optionally replicate the older files under GPL 3. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6952 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	5d00888c95	- added animated visualization for DHT-in and DHT-out in network graphic - found and fixed a possible memory leak in YaCy internal RSS feed system - some refactoring in RSS feed mechanisms to make this possible git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6950 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	dcd01698b4	added a 'transition feature' that shall lower the barrier to move from ggle to yacy (yes!): Here a new concept called 'search heuristics' is introduced. A heuristic is a kind of 'shortcut' to good results in IT, here for good search results. In this case it will be used to get a very transparent way to compare what YaCy is able to produce as search result and what ggle produces as search result. Here is what you can do now: - add the phrase 'heuristic:scroogle' to your search query, like 'oil spill heuristic:scroogle' and then a call to scroogle is made to get anonymous search results from ggle. - these results are _not_ taken as meta-search results, but are used to instantly feed a crawling and indexing process. This happens very fast, here 20 results from scroogle are taken and loaded all simultaneously, parsed and indexed immediately and from the results of the parsed content the search result is fed, along with the normal p2p search - when new results from that heuristic (more to come) get part of the search results, then it is verified if such results are redundant to existing (they had been part of the normal YaCy search result anyway) or if they had been completely new to YaCy. - in the search results the new search results from heuristics are marked with a 'H ++' and search results from heuristics that had been already found by YaCy are marked with a 'H ='. That means: - you can now see YaCy and Scroogle search results in one result page but you also see that you would not have 'missed' the ggle results when you would only have used YaCy. - to make it short: YaCy now subsumes g**gle results. If you use only YaCy, you miss nothing. to come: a configuration page that lets you configure the usage of heuristics and get this feature by default. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6944 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	777195e8d1	more abstraction for access of LoaderDispatcher and cache git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6937 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	7bcfa033c9	more abstraction of the htcache when using the LoaderDispatcher: a cache access shall not be made directly to the cache any more, all loading attempts shall use the LoaderDispatcher. To control the usage of the cache, an enum instance from CrawlProfile.CacheStrategy shall be used. Some direct loading methods without the usage of a cache strategy have been removed. This also affects the verify-option of the yacysearch servlet. If there is a 'verify=false' now after this commit this does not necessarily mean that no snippets are generated. Instead, all snippets that can be retrieved using the cache only are presented. This still means that the search hit was not verified because the snippet was generated using the cache. If a cache-based generation of snippets is not possible, then verify=false causes the link not to be rejected. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6936 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	73f03e05ee	fixed a bug in snippet fetch strategy: cache only does not help if resource can only be found in web git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6930 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	87087f12fe	- scanned remote search process and enhanced some data structures and synchronizations here and there - removed concurrency overhead for small number of index normalizations as it happens during remote search - removed 'load only parseable' constraint for snippet fetch because some resources may not have any url file extension and these had therefore not been parseable and searchable since they may become parseable after loading when their mime type is known - this partly fixes some problems with http://forum.yacy-websuche.de/viewtopic.php?p=20300#p20300 but more changes are necessary to get all expected search results git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6926 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	b03caaa57a	better handling of OOM situations git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6918 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	60e71876ad	- more abstraction (HashMap -> Map) - more concurrency-awareness (HashMap -> ConcurrentHashMap) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6910 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	a83772c71b	fixes and enhancements for balancer: - crawl lists for each domain now uses a HandleSet which should use less memory than LinkedLists - but: fill more entries into the domain lists (all available entries) - fixes to selection criteria (best domain selection) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6909 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	9cde05418f	fixed url crawl list display git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6908 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	30b337fa9f	fixes to balancer when crawling filesystem (problem was: host == null) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6906 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	844853243a	fixed balancer time guessing git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6905 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	3f93a0cc8f	redesign of remote proxy settings git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6903 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	11639aef35	- added new protocol loader for 'file'-type URLs - it is now possible to crawl the local file system with an intranet peer - redesign of URL handling - refactoring: created LGPLed package cora: 'content retrieval api' which may be used externally by other applications without yacy core elements because it has no dependencies to other parts of yacy git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6902 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	6950d8a33d	fixes to SMB crawler git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6900 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	2a8f70f0ca	- fix for caching of OSM tiles. if you want that this fix applies to your peer, please delete the crawl profiles - fix for initial generation of crawl profiles (one more reason to remove your crawl profiles) - more String -> byte[] migration - more logging for cache store/hit git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6874 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	2126c03a62	- removed download-limit that can be given for the crawler for non-crawler download tasks. This was necessary because the same procedure was used for other downloads like for the download of dictionary files where a limit is not useful. The limit still stays for the indexer - migrated the opengeodb downloader to a new version of the opengeodb-dump git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6873 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	cf43bdc87e	This is a large bugfix and enhancement commit to support a better location detection for data - fixes to http file server session handling - fixes and enhancements to metadata date/time handling - added dc:publisher metadata field and updated all document parsers - fixed bug in metadata read procedure - enhanced dublin core and rss parser to understand more fields more properly - enhanced url selection in case that multiple urls are given in surrogates - fix for condenser; failure when last word does not end with termination symbol git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6863 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	c45117f81f	fixed dates in metadata git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6860 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	7ab207d93a	better presentation of search result metadata and fixes to htcache loading git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6851 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	40a8d132d9	tried to fix 100% CPU when calling Balancer.top() see also: http://forum.yacy-websuche.de/viewtopic.php?p=19978#p19978 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6844 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	90c3e5d6f6	- cleanup, removed unused imports - added crawling queue sizes to /api/status_p.xml, syntax same as in queues_p.html - fixed a bug in queue enumeration that caused an out of bounds exception git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6842 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	4cd5418963	removed finalize methods because of a hint in http://java.sun.com/javase/6/webnotes/trouble/TSG-VM/html/memleaks.html#gbyvh The finalize method prevents the memory used by objects containing the finalize method from being collected and made available by the garbage collector. Instead, the memory allocated by such classes is enqueued to a java-internal finalize queue runner. This slows down all operations that use a lot of objects containing finalize methods. This fix does not remove all finalize methods, only those that may be used for throw-away objects that are allocated many times. This should cause a better run-time performance and fewer OutOfMemoryErrors git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6835 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	bfa35d6d20	possible fix for ZURL.list counter git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6834 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	8c40f1cb8e	self-healing for broken table files (may cause other problems, but better than nothing) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6826 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	7b69d79727	enhanced remove() operation: in many cases it is not necessary to return the removed object to the caller. For such cases the delete() operation was introduced which is sometimes much cheaper in operation since it does not need to create objects to hold the removed content and it does not need to read those objects. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6824 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	64f29f990e	a collection of performance hacks and code cleanup: - removed usage of URL-Caches which could have been a memory leak - removed unused classes and methods - removed not necessary synchronizations - added synchronization hacks where possible - fine-tuned crawling speed to prevent IO of balancer - fixed a bug in IODispatcher that may have caused that no merges were done - reduced number of parameters in very often called methods (compare methods) - reduced complexity of data structures of now massively used HandleSet class - reduction of new String() and getBytes() usage / new methods to support this transition git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6820 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	8b8107b2a3	reduced IO-load and synchronization/blocking - enhanced the Balancer performance when building new domain stacks using a new Table buffer - added the new Table buffer BufferedObjectIndex class - changed order of access to LURL-read (preferring segment over Crawl Queues), which reduces blocking time on balancer - fixed PPM setting in Crawler_p servlet (had doubled values) - reduced synchronization in IndexCell because it is not necessary: reduced blocking during indexing/merging/dumping - removed did-you-mean cache in IndexCell because that caused too much overhead and more memory usage but was not very useful. This also reduced deadlocks that could be caused when searches are performed during indexing. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6819 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	1a8a134e0c	continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775 and continued in SVN 6790 The result should be a less usage of new String() and less memory usage (since a String-encapsulated byte[] has 40 bytes overhead) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6815 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	48b9371735	changed balancer re-load counter. causes less blocking here doing intranet indexing. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6812 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	55d8e686ea	performance hacks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6807 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
hermens	4ec0092677	more null == proxy fixes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6794 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
hermens	2f90f0ad56	Remove asserts blocking proxy use cases git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6793 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	25aef069a6	continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6790 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
low012	b97ad0f380	- some minor changes for better code readability - added more SVN properties git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6787 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	ba51d140e1	added more info in assert in balancer git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6782 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	1e8e79b9ef	redesign of reference hash (URL-hash) parameter hand-over: pass value as byte[], not as String. This should cause that less byte[] <-> String conversions are made during time-critical tasks. This redesign is not yet complete, more to come .. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6775 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	90dd197ae7	- no latency for local crawls - catch interrupted exception during 'fast' crawls in workflow processor git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6759 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	c855fc48c6	only load robots.txt for http and https protocol git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6753 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	3300930fc5	- (almost) fixed FTP crawler - integrated/fixed SMB crawler git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6742 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	9623d9e6d2	added a smb loader component for the YaCy crawler git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6737 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	b88f5fbb4b	slightly changed crawling policy git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6723 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	7684a575c4	fix for deletion of error database each time when YaCy starts up git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6721 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	c09a995930	better logging of double occurrences of urls in the crawler git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6718 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	727dd9b193	- fixed a bug in robots.txt parser - moved storage of robots.txt entries to WorkTables, so it is now possible to browse the robots entries with the table browser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6710 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	46c4f8b68a	better look-ahead into the crawl queue: show more on crawl monitor git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6699 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	564927ce72	redesign of CrawlResult data structures because of OOM occurrences during URL deletion processes. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6675 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
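
Note on commit 22dbbcfa56: the message describes deciding intranet versus internet by resolving the host address instead of looking at its top level domain, with a local DNS cache built on concurrent data structures. The following is a minimal sketch of that idea only. The class name IntranetCheck and the methods resolve and isLocal are hypothetical and are not taken from the YaCy sources; treating unresolvable hosts as non-local is likewise an assumption.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch; not YaCy's actual implementation. */
class IntranetCheck {

    // host name -> resolved address; a concurrent map lets many threads share the cache safely
    private static final ConcurrentMap<String, InetAddress> DNS_CACHE = new ConcurrentHashMap<>();

    /** Resolves a host once and serves later lookups from the cache. */
    static InetAddress resolve(String host) throws UnknownHostException {
        InetAddress cached = DNS_CACHE.get(host);
        if (cached != null) return cached;
        // literal IP addresses are parsed here without a DNS query
        InetAddress resolved = InetAddress.getByName(host);
        InetAddress previous = DNS_CACHE.putIfAbsent(host, resolved);
        return previous != null ? previous : resolved;
    }

    /** Classifies a host by its resolved address, not by its top level domain. */
    static boolean isLocal(String host) {
        try {
            InetAddress a = resolve(host);
            return a.isLoopbackAddress() || a.isSiteLocalAddress()
                || a.isLinkLocalAddress() || a.isAnyLocalAddress();
        } catch (UnknownHostException e) {
            return false; // assumption: unresolvable hosts are treated as non-local
        }
    }

    public static void main(String[] args) {
        System.out.println(isLocal("192.168.1.10")); // private range
        System.out.println(isLocal("8.8.8.8"));      // public address
    }
}
```

A host name in a country domain that resolves to a private address would be classified as local here, which is the case the commit message says the old top-level-domain check got wrong.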


310 Commits (572e429efff59c130ddd7c762708bb6a0a8d00c9)