yacy_search_server

Commit Graph

Author	SHA1	Message	Date
reger	379e9b330d	use supplied url port to get robots.txt in crawlers hostqueue	9 years ago
reger	b5371ea8c1	read/init crawl queue in a thread to speed-up YaCy start on large existing crawler queues	9 years ago
reger	3e742d1e34	Init remote crawler on demand. If the remote crawl option is not activated, skip init of remoteCrawlJob to save the resources of the queue and idling thread. Deployment of the remoteCrawlJob is deferred until the option is activated.	10 years ago
Michael Peter Christen	5bb52f79be	reduce number of calls to queue.size() because that may be a bottleneck during crawling	10 years ago
Michael Peter Christen	a34f837592	better delete all files in path when removing host crawl stack	10 years ago
orbiter	4ae7aead28	addon to latest fix	10 years ago
Michael Peter Christen	49d91b94c3	npe fix in crawler	10 years ago
orbiter	e9163e7e10	fix for malformed hostpath names in crawl balancer	10 years ago
Michael Peter Christen	06ab72d1af	enhanced crawler host round-robin strategy	11 years ago
Michael Peter Christen	49886fab08	enhanced debugging	11 years ago
orbiter	d7d38f9135	made number of open files in crawler configurable and increased default maximum number of open files from 100 to 1000. This number can be changed with the attribute crawler.onDemandLimit	11 years ago
orbiter	97983ba89f	fixed generics warnings for generic array instantiation that appeared after migration to Java 7	11 years ago
reger	1600414450	fix NPE on continuing crawls after YaCy restart (Agent is then null)	11 years ago
Michael Peter Christen	c1c1be8f02	fix for slow crawling and better logging in balancer	11 years ago
orbiter	2f63bd0261	enhanced Host Balancer strategy: fair round robin	11 years ago
Michael Peter Christen	8b32dd5f9e	special strategy for balancer: do not remove targets with zero wait time from the queue	11 years ago
Michael Peter Christen	9c6228d948	fix for deadlocks in crawler	11 years ago
Michael Peter Christen	06afb568e2	new Strategies in Balancer: - doublecheck cache now records the crawl depth as well - doublecheck cache is available from the outside (made static) - no more need to crawl hosts with lowest depth first, instead all hosts which have only singleton entries are preferred to reduce the number of files.	11 years ago
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader shall prevent the number of file pointers from exceeding the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	11 years ago
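
The queue layout described in da86f150ab (one queue per host, split by crawl depth, with the smallest depth served first and hosts treated fairly) can be illustrated with a minimal in-memory sketch. This is not YaCy's HostBalancer code: the class name `HostQueuesSketch`, its `push`/`pop` methods, and the use of in-memory maps instead of one file per host and depth are assumptions made for illustration only, and later commits in this list changed the selection strategy.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: per-host queues, each split by crawl depth.
class HostQueuesSketch {
    // host key (assumed "host:port") -> (crawl depth -> FIFO of URLs).
    // LinkedHashMap keeps insertion order, which is used as the rotation order.
    private final Map<String, TreeMap<Integer, Deque<String>>> hosts = new LinkedHashMap<>();

    public void push(String hostKey, int depth, String url) {
        hosts.computeIfAbsent(hostKey, k -> new TreeMap<>())
             .computeIfAbsent(depth, d -> new ArrayDeque<>())
             .addLast(url);
    }

    /** Returns the next URL to load, or null if all queues are empty. */
    public String pop() {
        // Pick the host whose smallest pending depth is minimal overall.
        // On ties, the host nearest the front of the rotation wins.
        String bestHost = null;
        int bestDepth = Integer.MAX_VALUE;
        for (Map.Entry<String, TreeMap<Integer, Deque<String>>> e : hosts.entrySet()) {
            int d = e.getValue().firstKey(); // maps stored here are never empty
            if (d < bestDepth) {
                bestDepth = d;
                bestHost = e.getKey();
            }
        }
        if (bestHost == null) return null;

        // Remove the host, take one URL, then re-insert the host at the end
        // of the rotation so other hosts at the same depth get their turn.
        TreeMap<Integer, Deque<String>> depths = hosts.remove(bestHost);
        Deque<String> queue = depths.get(bestDepth);
        String url = queue.pollFirst();
        if (queue.isEmpty()) depths.remove(bestDepth);
        if (!depths.isEmpty()) hosts.put(bestHost, depths);
        return url;
    }

    public static void main(String[] args) {
        HostQueuesSketch q = new HostQueuesSketch();
        q.push("a.example:80", 2, "http://a.example/deep");
        q.push("a.example:80", 0, "http://a.example/");
        q.push("b.example:80", 0, "http://b.example/");
        String url;
        while ((url = q.pop()) != null) System.out.println(url);
    }
}
```

In this sketch the depth-0 URLs of both hosts are returned before the depth-2 URL, and hosts with equal minimal depth alternate. The on-disk variant described in the commit message would replace each inner `Deque` with a file that is opened only on access.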

19 Commits (4eddabee4221f0d8deec55a157b520e1e1c140ce)