Commit Graph

95 Commits (4a35570c903da7058ff3d5eef5265919f3c648ec)

Author SHA1 Message Date
reger 48aed15c48 skip loader wait cycle on concurrent access in nocache configuration.
10 years ago
Michael Peter Christen 1735dbc9d9 enhanced image search: bugfixes and performance enhancements
10 years ago
Michael Peter Christen ebd0be2cea fixes and speed updates for search process
10 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
10 years ago
Michael Peter Christen fb3dd56b02 fix for processing of noindex flag in http header
11 years ago
Michael Peter Christen 10cf8215bd added crawl depth for failed documents
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen 6bd8c6f195 fix for wrong status codes of error pages
11 years ago
Michael Peter Christen b08375da33 fix for bad/missing values of size_i
11 years ago
Michael Peter Christen 1a4a69c226 set more logger to 'final static'
11 years ago
Michael Peter Christen 91a875dff5 self-healing of mistakenly deactivated crawl profiles. This fixes a bug
11 years ago
Michael Peter Christen 2602be8d1e - removed ZURL data structure; removed also the ZURL data file
11 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
Michael Peter Christen 765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
11 years ago
Roland Haeder aaedc0405d Fixes and avoid of catching bad exceptions (some):
11 years ago
Michael Peter Christen 89c0aa0e74 added collection_sxt to error documents
11 years ago
Michael Peter Christen 5878c1d599 - refactoring of log to ConcurrentLog:
12 years ago
Michael Peter Christen 8f2d3ce2f9 reduced locking situation in crawler: shifted synchronized location and
12 years ago
Michael Peter Christen 77faeada4d small memory leak patch
12 years ago
Michael Peter Christen 788288eb9e added the generation of 50 (!!) new solr field in the core 'webgraph'.
12 years ago
Michael Peter Christen 84f82541e8 search process enhancements
12 years ago
Michael Peter Christen 2d9e577ad0 replaced the custom robots.txt loader by the standard http loader
12 years ago
Michael Peter Christen 5f0ab25382 removed the option to prevent removal of & parts inside of the
12 years ago
Michael Peter Christen 1533bfd63b refactoring
12 years ago
Michael Peter Christen 00c1c777fa refactoring
12 years ago
Michael Peter Christen 24d9db1613 snippet retrieval loading processes may use a smaller minimum load time
12 years ago
orbiter 0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
13 years ago
Michael Peter Christen c3db015410 prevent loading of content from the cache when retrieval with IFFRESH is
13 years ago
Michael Peter Christen b0c408788b made class methods static where possible
13 years ago
Michael Peter Christen 7c1ba99755 removed more unused method parameters
13 years ago
Michael Peter Christen ea10766bfd cleaned unnecessary nested code
13 years ago
Michael Peter Christen 1825f165b8 better integration of blacklist according to use case
13 years ago
Michael Peter Christen 03280fb161 removed segments-concept and the Segments class:
13 years ago
Michael Peter Christen 77f795756c fixing redirects and status codes: storing of status code in
13 years ago
Michael Peter Christen 7dc59979bc fix for npe, possibly for http://bugs.yacy.net/view.php?id=195
13 years ago
Roland 'Quix0r' Haeder edaa09b9b1 Rewrote all String blacklist types to enum 'BlacklistType', closes bug
13 years ago
Michael Peter Christen 7e0ddbd275 added a "fromCache" flag in Response object to omit one cache.has()
13 years ago
Michael Peter Christen 659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD
13 years ago
Michael Peter Christen 33d1062c79 refactoring: the cache belongs to the crawler
13 years ago
reger a95f645a61 Bugfix class repository.Loaddispatcher fixed download file limit of 10000
13 years ago
Michael Peter Christen ef5192f8c9 using the generic document parser for crawl starts instead of the html
13 years ago
Michael Christen 216a287a85 Merge commit '6d4e08ed06c5cd28c45981b2ebe31c7f7ec6fd83' into quix0r
13 years ago
Michael Christen 361146dd7a better error handling for file loader
13 years ago
Roland 'Quix0r' Haeder a3083d13bf Blacklist checks are now always turned on, in media searches (e.g. image search) images matching blacklist entries are no longer shown to the user
13 years ago
orbiter e22f8497c9 - tested the ARC methods
13 years ago
orbiter d2ea250d99 refactoring:
13 years ago
orbiter 49e5ca579f added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), the all embedded image, audio and video links from all parsed documents are added to the search index as individual document. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled.
13 years ago
orbiter 1c007188ad bugfixes in html parser
13 years ago
sixcooler 5f8a5ca32d - not doing merge-jobs while short on Memory
13 years ago