yacy_search_server

Commit Graph

Author	SHA1	Message	Date
luccioman	aa9ddf3c23	Added control over Robots.txt active threads maximum number.	8 years ago
reger	fcc29c36f0	test case for HostBalancer issue in intranet mode	8 years ago

luccioman

aa9ddf3c23

Added control over Robots.txt active threads maximum number.

When starting a crawl from a file containing thousands of links, the
"crawler.MaxActiveThreads" configuration setting effectively prevents
the crawler from saturating the system with too many outgoing HTTP
connection threads.
However, robots.txt loading was not affected by this setting, so the
number of concurrently loading threads kept growing without bound until
most of the connections timed out.

To improve performance control, added a thread pool for Robots.txt,
used consistently in its ensureExist() and massCrawlCheck() methods.
The maximum size of the Robots.txt thread pool can now be configured on
the /PerformanceQueues_p.html page, or with the new
"robots.txt.MaxActiveThreads" setting, which is initialized with the
same default value as the crawler setting.
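
The idea of capping concurrent robots.txt loads with a fixed-size pool can
be sketched as follows. This is only an illustration and not the code from
the commit: the class BoundedRobotsLoader, its methods, and the simulated
load task are hypothetical names, and the sketch assumes that excess tasks
may simply wait in an unbounded queue until a worker thread is free.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: bounds the number of concurrently running load tasks. */
public class BoundedRobotsLoader {
    private final ThreadPoolExecutor pool;
    private final AtomicInteger active = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    /** maxActiveThreads plays the role of a "robots.txt.MaxActiveThreads"-like limit. */
    public BoundedRobotsLoader(int maxActiveThreads) {
        // Core size == max size: at most maxActiveThreads workers ever run;
        // further tasks wait in the queue instead of spawning new threads.
        this.pool = new ThreadPoolExecutor(maxActiveThreads, maxActiveThreads,
                60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        // Let idle workers terminate so an unused pool holds no threads.
        this.pool.allowCoreThreadTimeOut(true);
    }

    /** Queues one load task and records how many tasks run at the same time. */
    public void submit(Runnable loadTask) {
        pool.execute(() -> {
            int now = active.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);
            try {
                loadTask.run();
            } finally {
                active.decrementAndGet();
            }
        });
    }

    /** Waits for queued tasks to finish and returns the observed peak concurrency. */
    public int shutdownAndGetPeak() {
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    /** Simulates loading robots.txt for many hosts with a short sleep per host. */
    public static int demo(int hosts, int maxThreads) {
        BoundedRobotsLoader loader = new BoundedRobotsLoader(maxThreads);
        for (int i = 0; i < hosts; i++) {
            loader.submit(() -> {
                try {
                    Thread.sleep(5); // stand-in for an HTTP fetch
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        return loader.shutdownAndGetPeak();
    }
}
```

Calling BoundedRobotsLoader.demo(50, 4) submits 50 simulated loads, yet the
peak concurrency never exceeds 4. Starting one new thread per host instead
would let the thread count grow with the size of the link file, which is the
behaviour the commit message describes.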

reger

fcc29c36f0

test case for HostBalancer issue in intranet mode

With the file:// protocol, two host queues access the same cache file concurrently:
http://mantis.tokeek.de/view.php?id=668
The reason seems to be that the host queues get a different hosthash key on reopen.
The internal queue key and its external representation (the directory name, currently hostname.port) must be aligned to fix it (not done yet).

2 Commits (2a87b08cea67f8f2ae46e318c1c3945e8520ec53)