Commit Graph

75 Commits (72003b109b879b6770d9f4390e8ee39a620c1c79)

Author SHA1 Message Date
Michael Peter Christen 788288eb9e added the generation of 50 (!!) new solr field in the core 'webgraph'.
12 years ago
Michael Peter Christen 84f82541e8 search process enhancements
12 years ago
Michael Peter Christen 2d9e577ad0 replaced the custom robots.txt loader by the standard http loader
12 years ago
Michael Peter Christen 5f0ab25382 removed the option to prevent removal of & parts inside of the
12 years ago
Michael Peter Christen 1533bfd63b refactoring
12 years ago
Michael Peter Christen 00c1c777fa refactoring
12 years ago
Michael Peter Christen 24d9db1613 snippet retrieval loading processes may use a smaller minimum load time
12 years ago
orbiter 0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
13 years ago
Michael Peter Christen c3db015410 prevent loading of content from the cache when retrieval with IFFRESH is
13 years ago
Michael Peter Christen b0c408788b made class methods static where possible
13 years ago
Michael Peter Christen 7c1ba99755 removed more unused method parameters
13 years ago
Michael Peter Christen ea10766bfd cleaned unnecessary nested code
13 years ago
Michael Peter Christen 1825f165b8 better integration of blacklist according to use case
13 years ago
Michael Peter Christen 03280fb161 removed segments-concept and the Segments class:
13 years ago
Michael Peter Christen 77f795756c fixing redirects and status codes: storing of status code in
13 years ago
Michael Peter Christen 7dc59979bc fix for npe, possibly for http://bugs.yacy.net/view.php?id=195
13 years ago
Roland 'Quix0r' Haeder edaa09b9b1 Rewrote all String blacklist types to enum 'BlacklistType', closes bug
13 years ago
Michael Peter Christen 7e0ddbd275 added a "fromCache" flag in Response object to omit one cache.has()
13 years ago
Michael Peter Christen 659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD
13 years ago
Michael Peter Christen 33d1062c79 refactoring: the cache belongs to the crawler
13 years ago
reger a95f645a61 Bugfix class repository.Loaddispatcher fixed download file limit of 10000
13 years ago
Michael Peter Christen ef5192f8c9 using the generic document parser for crawl starts instead of the html
13 years ago
Michael Christen 216a287a85 Merge commit '6d4e08ed06c5cd28c45981b2ebe31c7f7ec6fd83' into quix0r
13 years ago
Michael Christen 361146dd7a better error handling for file loader
13 years ago
Roland 'Quix0r' Haeder a3083d13bf Blacklist checks are now always turned on, in media searches (e.g. image search) images matching blacklist entries are no longer shown to the user
13 years ago
orbiter e22f8497c9 - tested the ARC methods
13 years ago
orbiter d2ea250d99 refactoring:
13 years ago
orbiter 49e5ca579f added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), the all embedded image, audio and video links from all parsed documents are added to the search index as individual document. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled.
13 years ago
orbiter 1c007188ad bugfixes in html parser
13 years ago
sixcooler 5f8a5ca32d - not doing merge-jobs while short on Memory
13 years ago
orbiter 2c58af6874 - added a short memory status simulation mode
13 years ago
sixcooler 59b767eebd stop loading via http at defined maximum of bytes - even size is unknown before loading
13 years ago
orbiter 115abc8917 - more attributes for search progress bar
14 years ago
orbiter 0c1b29f3c9 - applied many small performance hacks
14 years ago
orbiter 4bea3f9714 hack to reduce resource contention caused by massive UTF8 decodings which use java.nio resources:
14 years ago
orbiter d8e934c085 better abstraction of http client identification
14 years ago
orbiter 96c32e87b0 fixes to crawler and new user-agent crawl-delay handling
14 years ago
orbiter b1a8d0c020 enhancements to web cache and less strict caching rules
14 years ago
orbiter 30aed9824a moved getBytes() to UTF8.getBytes() to use a default String encoding
14 years ago
orbiter 4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
14 years ago
f1ori a321c7673d * adminAccountForLocalhost only for localhost
14 years ago
orbiter 6c1b14c8e1 - more control in access tracker: count number of returned search results (not only info how much is in the index)
14 years ago
f1ori 9d2159582f * fix system update if urls are in blacklist (for example for very general blacklists like *.de)
14 years ago
orbiter a563b05b60 enhanced crawler:
14 years ago
orbiter aacf572a26 - enhancements for search speed
14 years ago
orbiter d2fd93135c - moved yacybot user agent string definition to MultiProtocolURI since there are basic access mechanisms where the bot string is needed
14 years ago
f1ori 8fe1102452 fix http://forum.yacy-websuche.de/viewtopic.php?p=20889#p18426
14 years ago
orbiter 24502fe3de performance hacks
14 years ago
orbiter ffaa9a1c51 avoiding double-loading of the same resource from the web in case that a seond attempt to load the resource is started while the first attempt is still loading the content from the web. This will delay the second attempt to the time when the first attempt has finished with the possible result that the second attempt reads only from the web cache, not from the web.
14 years ago
orbiter ae07e11bc5 enhanced image search result display: concurrent loading of images before they are displayed
14 years ago