yacy_search_server

Commit Graph

Author	SHA1	Message	Date
orbiter	3a807e10cf	- added a cache for active crawl profiles to the crawl switchboard - moved the domain cache for domain counter from the crawl switchboard to the crawl profiles. the crawl domain counter is now therefore relative for each crawl start, not for the whole crawler. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8018 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
orbiter	cf4fd525ee	added directDocByURL attribute in crawl profile git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7985 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
orbiter	b250e6466d	implemented crawl restrictions for IP pattern and country lists git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7980 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
orbiter	5ad7f9612b	added crawl settings for three new filters for each crawl: - must-match for IPs (IPs that are known after DNS resolving for each URL in the crawl queue) - must-not-match for IPs - must-match against a list of country codes (allows only loading from hosts that are hosted in given countries) note: the settings and input environment are there with that commit, but the values are not yet evaluated git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7976 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
low012	c7b95e8c81	- Invalid crawl profiles (containing invalid mustmatch/mustnotmatch filters) will be moved from active crawls to invalid crawls (new file: DATA/INDEX/freeworld/QUEUES/crawlProfilesInvalid.heap). This file cannot be edited yet, but it should be easy to extend the CrawlProfileEditor accordingly. - Corrupt crawlProfilesPassive.heap would cause crawlProfilesActive.heap to be deleted. Don't know if this ever happened, but will not happen anymore. - Cleaned up a little bit. - Added some comments. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7827 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	115abc8917	- more attributes for search progress bar - moved cache strategy to cora package git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7778 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	4bea3f9714	hack to reduce resource contention caused by massive UTF8 decodings which use java.nio resources: used an ASCII String <-> byte[] conversion wherever possible. Many Strings in YaCy are hashes which are pure ASCII (base64 hashes). The new ASCII String <-> byte[] conversion methods have less computation overhead than the UTF8 conversion. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7746 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	958ff4778e	enhanced location search: search is now done using verify=false (instead of verify=cacheonly) which will cause that much more targets can be found. This showed a bug where no location information was used from the metadata (and other metadata information) if cache=false is requested. The bug was fixed. Added also location parsing from wikimedia dumps. A wikipedia dump can now also be a source for a location search. Fixed many smaller bugs in connection with location search. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7657 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	19fd13d3bc	Added federated index storage to solr. YaCy supports now the storage to remote solr indexes. More federated storage (and search) methods may follow. The remote index scheme is the same as produced by the SolrCell; see http://wiki.apache.org/solr/ExtractingRequestHandler Because this default scheme is used, the default example scheme can be used as solr configuration This is also the same scheme that solr uses if documents are imported with apache tika. federated solr storage is switched off by default. To use this, do the following: - set federated.service.solr.indexing.enabled = true - download solr from http://www.apache.org/dyn/closer.cgi/lucene/solr/ - extract the solr (3.1) package, 'cd example' and start solr with 'java -jar start.jar' - start yacy and then start a crawler. The crawler will fill both, YaCy and solr indexes. - to check whats in solr after indexing, open http://localhost:8983/solr/admin/ Until now it is not possible to use the solr index to search with YaCy in that solr index. This functionality is now available for two reasons: 1) to compare the functionality of Solr and YaCy and to compare the search speed 2) to use YaCy as a search appliance for people who need a crawler or other source harvesting methods that YaCy provides (like dublin core reading, wikimedia dump reading, rss feed reader etc) if people still want to use solr instead of YaCy. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7654 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	b1a8d0c020	enhancements to web cache and less strict caching rules git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7620 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	cb1f49d0f2	replaced all 'new String' with default encoding (missing) or UTF-8 encoding with a String generation method that uses a pre-defined Charset constant for UTF-8. This avoids a cache-lookup for the Charset object using String hashing of the String 'UTF-8'. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7558 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	4588b5a291	- fixed document number limitation for crawls that restrict the number of documents per domain - some restructuring of the document counting and logging structures was necessary - better abstraction of CrawlProfiles - added deletion of logs to the index deletion option (if the index is deleted using the servlets) which is necessary to reset the domain counters for the page limitation - more refactoring to get the LibraryProvider more clean - some refactoring of the Condenser class git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7478 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	45b1ab3d07	custom + generic skins: - added a generic skin which is filled with actual color assignment using a servlet - enabled css servlets - added a generic color scheme in configuration file - added configuration input in Customization/Appearance servlet - added a jquery color picker widget - placed color picked widget to input field of generic colour definition input fields git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7235 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	0d363a94d7	more performance hacks this makes YaCy search results VERY fast for all verify=false search cases and it enhances the search speed also for all other snippet-fetch cases. With this change my peer performed 100 Queries Per Second (!!!) while doing 10 queries simultaneously (!!!) in an intranet index of 20000 URLs on my 16-core Mac Check this yourself by doing: cd bin ./searchtestmulti.sh after finishing the run, divide 1000 by the given time per query (which is the qps for one thread) and then multiply again by 10 (because 10 search threads have been started) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7231 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	aacf572a26	- enhancements for search speed - bug fixes in many classes including basic data structure classes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7217 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	2c549ae341	fixed a number of small bugs: - better crawl star for files paths and smb paths - added time-out wrapper for dns resolving and reverse resolving to prevent blockings - fixed intranet scanner result list check boxes - prevented htcache usage in case of file and smb crawling (not necessary, documents are locally available) - fixed rss feed loader - fixes sitemap loader which had not been restricted to single files (crawl-depth must be zero) - clearing of crawl result lists when a network switch was done - higher maximum file size for crawler git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7214 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	f6eebb6f99	replaced auto-dom filter with easy-to-understand Site Link-List crawler option - nobody understands the auto-dom filter without a lengthy introduction about the function of a crawler - nobody ever used the auto-dom filter other than with a crawl depth of 1 - the auto-dom filter was buggy since the filter did not survive a restart and then a search index contained waste - the function of the auto-dom filter was in fact to just load a link list from the given start url and then start separate crawls for all these urls restricted by their domain - the new Site Link-List option shows the target urls in real-time during input of the start url (like the robots check) and gives a transparent feed-back what it does before it can be used - the new option also fits into the easy site-crawl start menu git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7213 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	65eaf30f77	redesign of crawl profiles data structure. target will be: - permanent storage of auto-dom statistics in profile - storage of profiles in WorkTable data structure not finished yet. No functional change yet. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7088 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	0f276dd63f	- MapHeap now implements Map<byte[], Map<String, String>> - refactoring of method names to comply with Map method names git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7072 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	70dd26ec95	added the new crawl scheduling function to the crawl start menu: - the scheduler extends the option for re-crawl timing. Many people misunderstood the re-crawl timing feature because that was just a criterion for the url double-check and not a scheduler. Now the scheduler setting is combined with the re-crawl setting and people will have the choice between no re-crawl, re-crawl as was possible so far and a scheduled re-crawl. The 'classic' re-crawl time is set automatically when the scheduling function is selected - removed the bookmark-based scheduler. This scheduler was not able to transport all attributes of a crawl start and did therefore not support special crawling starts i.e. for forums and wikis - since the old scheduler was not able to crawl special forums and wikis, the must-not-match filter was statically fixed to all bad pages for these special use cases. Since the new scheduler can handle these filters, it is possible to remove the default settings for the filters - removed the busy thread that was used to trigger the bookmark-based scheduler - removed the crontab for the bookmark-based scheduler git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7051 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	7bcfa033c9	more abstraction of the htcache when using the LoaderDispatcher: a cache access shall not be made directly to the cache any more, all loading attempts shall use the LoaderDispatcher. To control the usage of the cache, an enum instance from CrawlProfile.CacheStrategy shall be used. Some direct loading methods without the usage of a cache strategy have been removed. This affects also the verify-option of the yacysearch servlet. If there is a 'verify=false' now after this commit this does not necessarily mean that no snippets are generated. Instead, all snippets that can be retrieved using the cache only are presented. This still means that the search hit was not verified because the snippet was generated using the cache. If a cache-based generation of snippets is not possible, then the verify=false causes that the link is not rejected. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6936 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	73f03e05ee	fixed a bug in snippet fetch strategy: cache only does not help if resource can only be found in web git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6930 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	b03caaa57a	better handling of OOM situations git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6918 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	60e71876ad	- more abstraction (HashMap -> Map) - more concurrency-awareness (HashMap -> ConcurrentHashMap) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6910 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	2a8f70f0ca	- fix for caching of OSM tiles. if you want that this fix applies to your peer, please delete the crawl profiles - fix for initial generation of crawl profiles (one more reason to remove your crawl profiles) - more String -> byte[] migration - more logging for cache store/hit git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6874 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	2126c03a62	- removed download-limit that can be given for the crawler for non-crawler download tasks. This was necessary because the same procedure was used for other downloads like for the download of dictionary files where a limit is not useful. The limit still stays for the indexer - migrated the opengeodb downloader to a new version of the opengeodb-dump git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6873 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	4cd5418963	removed finalize methods because of a hint in http://java.sun.com/javase/6/webnotes/trouble/TSG-VM/html/memleaks.html#gbyvh The finalize method prevents that the memory, used by the objects containing the finalize method, is collected and available for the garbage collector. Instead, the memory allocated by such classes is enqueued to a java-internal finalize queue runner. This slows down all operations that use a lot of objects containing finalize methods. this fix does not remove all finalize methods, but such that may be used for throw-away objects that are allocated many times. This should cause a better run-time performance and less OutOfMemoryErrors git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6835 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	64f29f990e	a collection of performance hacks and code cleanup: - removed usage of URL-Caches which could have been a memory leak - removed unused classes and methods - removed not necessary synchronizations - added synchronization hacks where possible - fine-tuned crawling speed to prevent IO of balancer - fixed a bug in IODispatcher that may have caused that no merges were done - reduced number of parameters in very often called methods (compare methods) - reduced complexity of data structures of now massively used HandleSet class - reduction of new String() and getBytes() usage / new methods to support this transition git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6820 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	25aef069a6	continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6790 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
low012	b97ad0f380	- some minor changes for better code readability - added more SVN properties git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6787 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	5df628a2a4	- added BEncoder class - added BEncodedHeap class that encodes B data structures and stores that to a heap - refactoring of MapView, this is now named MapHeap to fit into the naming scheme of the BEncodedHeap git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6579 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	a06f7ddb33	more PMD recommendations git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6572 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	362b7a929b	added extensive memory protection logic to avoid out of memory errors that may be caused by the RowCollection memory allocation function git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6521 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	e34e63a039	preset of proper HashMap dimensions: should prevent re-hashing and increase performance git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6511 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	4a5100789f	replaced _all_ size() == 0 with isEmpty() and all size() > 0 with !isEmpty(). The isEmpty() method is much faster in some cases, especially when used to access badly balanced hashtables where an size() operation becomes a large iteration. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6510 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	4431b9767e	added about 450 replacements for printStackTrace() methods to pipe such traces into the log at DATA/LOG/ git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6458 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	e3025ee691	- new icon for OAI-PMH loading action - added many stack trace outputs for exceptions in crawl profile handler to find the 'missing profile handle' bug - caught one more timeout exception in httpd file loader git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6457 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	a8ce192f63	- shifted main classes to new package net.yacy - fixed some bugs in last commit git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6427 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	ce8dc575ca	refactoring git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6398 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	bea3b99aff	moved table and util classes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6397 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	c0e0e1f422	moved blob classes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6395 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	4446acc8cd	moved kelondro order git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6392 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	f677d534b1	start of a really extensive refactoring which will produce a hierarchical package structure with the domain yacy.net as package root - moved here the logging classes as part of the new net.yacy.kelondro package git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6391 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	ce972ff4ef	update to default ranking profile which has now some settings to deny some phpbb3 pages which are redundant in the index when crawling phpbb3. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6288 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	c6c97f23ad	- added cache usage properties to crawl start - added special rule to balancer to omit forced delays if cache is used exclusively - extended the htCache size by default to 32GB git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6241 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	c4ae2cd03f	fixed bug that caused deletion of crawl profiles at every application startup git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6240 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	161d2fd2ef	redesign of access to the HTCache (now http.client.Cache): - better control to the cache by using combined request-header and content access methods - refactoring of many classes to comply to this new access method - make sure that the cache is always written if something was loaded - some redesign of the process how http response results are fed into the new indexing queue - introduction of a cache read policy: * never use the cache * use the cache if the entry exists * use the cache if the proxy freshness rule confirms * use only the cache and go never online - added configuration options for the crawl profiles to use the new cache policies. There is not yet an input during crawl start to set the policy but this will be added in another step. - set the default policies for the existing crawl profiles. If you want them to appear in your default profiles you must delete the crawl profiles database; otherwise the policy is 'proxy freshness rule' - enhanced some cache access methods in such a way that unnecessary retrievals are omitted (i.e. for size computation). That should reduce some IO but also a lot of CPU computation because sizes were computed after decompression of content after retrieval of the content from the disc. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6239 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	5bb8074150	removed the indexing queue. This queue was superfluous since the introduction of the blocking queues last year, where documents are parsed, analysed and stored in the index with concurrency. - The indexing queue was a historic data structure that was introduced at the very beginning at the project as a part of the switchboard organisation object structure. Without the indexing queue the switchboard queue becomes also superfluous. It has been removed as well. - Removing the switchboard queue requires that all servlets are called without an opaque generic ('<?>'). That caused that all servlets had to be modified. - Many servlets displayed the indexing queue or the size of that queue. In the past months the indexer was so fast that mostly the indexing queue appeared empty, so there was no use of it any more. Because the queue has been removed, the display in the servlets had also to be removed. - The surrogate work task had been a part of the indexing queue control structure. Without the indexing queue the surrogates needed its own task management. That has been integrated here. - Because the indexing queue had a special queue entry object and properties attached to this object, the properties had to be moved to the queue entry object which is part of the new indexing queue within the blocking queue, the Response Object. That object has now also the new properties of the removed indexing queue entry object. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6225 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	222850414e	simplification of the code: removed unused classes, methods and variables git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6154 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	ae015e8e98	refactoring of blob package classes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6088 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago


80 Commits (c659310e89e8a4b4b2d1de0b13c67f604843ad1e)