yacy_search_server

History

Michael Peter Christen e6a87e0426 enhanced crawler a main problem when crawling is long waiting time cuased by crawl-delay values from robots.txt entries. that attribute is not supported by google and interpreted by yandex and bing in different ways. In large crawls there is always one host which blocks the whole crawl with extreme large values. YaCy now still obeys crawl-delay but limits them to 10 seconds. Additionally the blocking logic when loading new robots.txt was analyzed and a deadlock was removed. Furthermore the construction of new queue lists was redesigned and it was ensured that always a large list of different hosts for host-balancing is provided for the loader.	4 years ago
..
yacy	enhanced crawler	4 years ago

Michael Peter Christen e6a87e0426 enhanced crawler

a main problem when crawling is long waiting time cuased by crawl-delay
values from robots.txt entries. that attribute is not supported by
google and interpreted by yandex and bing in different ways. In large
crawls there is always one host which blocks the whole crawl with
extreme large values. YaCy now still obeys crawl-delay but limits them
to 10 seconds.
Additionally the blocking logic when loading new robots.txt was analyzed
and a deadlock was removed. Furthermore the construction of new queue
lists was redesigned and it was ensured that always a large list of
different hosts for host-balancing is provided for the loader.

yacy

enhanced crawler