yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	e6a87e0426	enhanced crawler: a main problem when crawling is the long waiting time caused by crawl-delay values from robots.txt entries. That attribute is not supported by Google and is interpreted by Yandex and Bing in different ways. In large crawls there is always one host which blocks the whole crawl with extremely large values. YaCy now still obeys crawl-delay but limits it to 10 seconds. Additionally the blocking logic when loading new robots.txt was analyzed and a deadlock was removed. Furthermore the construction of new queue lists was redesigned and it was ensured that a large list of different hosts for host-balancing is always provided for the loader.	3 years ago
Lina Ceballos	a96752f5ab	adding SPDX license and copyright headers	4 years ago
reger	379e9b330d	use supplied url port to get robots.txt in crawlers hostqueue	9 years ago
reger	90686a75a2	fix flux factor (additional crawl delay by access count) calculation	9 years ago
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create tens of thousands of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader shall prevent the number of file pointers from exceeding the system limit, which is usually about 10,000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	11 years ago
Michael Peter Christen	6ada0daae9	making latency_factor and maximum number of same hosts in loader queue settings available in Crawler_p.html servlet for steering.	11 years ago
Michael Peter Christen	0168f80c28	new crawling factors can now be changed during runtime	11 years ago
Michael Peter Christen	77531850b5	reverted crawling strategy from latest commit.	11 years ago
Michael Peter Christen	c0da966dfa	enhanced crawler speed	11 years ago
orbiter	0e8d752462	refactoring	11 years ago
Michael Peter Christen	5e31bad711	- the webgraph shall store all links which appear on a web page and not only the unique links! This made it necessary to adapt a large portion of the parser and link processing classes to carry a different type of link collection, which carries the property attributes that are attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI, have been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring	11 years ago
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on a per-crawl-start basis. Every crawl start can have a different user agent.	11 years ago
Michael Peter Christen	16d1d744fa	added url_file_name_s in default collection schema for the file name without the file extension. This part of the file path is removed from the multi-field url_paths_sxt, which no longer has the file name as the last part of the path list. The same applies to the new fields source_file_name_s and target_file_name_s in the webgraph schema.	12 years ago
Michael Peter Christen	77faeada4d	small memory leak patch	12 years ago
Michael Peter Christen	a3cd3852ab	introduced a better place to update the lastacc time value in latency	12 years ago
Michael Peter Christen	864abcd33d	removed Latency update after URL selection because that causes a completely wrong behaviour when cache fresh cases appear. Makes re-crawling MUCH faster!	12 years ago
Michael Peter Christen	756772fbd3	fix for waitingtime computation for intranet configuration	12 years ago
Michael Peter Christen	0fe8be7981	enhanced data structures for balancer and latency computation which should produce a somewhat better prognosis about forced waiting times.	12 years ago
Michael Peter Christen	b2ffd49817	less latency	12 years ago
Michael Peter Christen	0833937c1c	better balancing and duetime-computation also for no-delay intranet hosts	12 years ago
Michael Peter Christen	2d9e577ad0	replaced the custom robots.txt loader by the standard http loader	12 years ago
orbiter	8952153ecf	update to Balancer algorithm: - create a load list from the current list of known hosts - do not create this list for each Balancer.pop access - create the list from those hosts which have a zero-waiting time - select 1/3 from that list which have the most urls waiting - get hosts from the waiting list in random order - fixes for some delta-time computations - always load all urls from hosts which have never been loaded before	12 years ago
Michael Peter Christen	00c1c777fa	refactoring	12 years ago
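
The crawl-delay limit described in commit e6a87e0426 can be illustrated with a minimal sketch. This is not the YaCy implementation: the class name `CrawlDelayCap`, the method `effectiveDelayMillis`, and the constant `MAX_CRAWL_DELAY_MILLIS` are hypothetical names chosen for illustration. The only detail taken from the commit message is that crawl-delay is still obeyed but limited to 10 seconds. Treating non-positive values as "no delay" is an additional assumption.

```java
import java.util.concurrent.TimeUnit;

public class CrawlDelayCap {

    // Hypothetical constant: the 10-second upper bound named in the commit message.
    static final long MAX_CRAWL_DELAY_MILLIS = TimeUnit.SECONDS.toMillis(10);

    /**
     * Returns the delay to apply for a host, given the crawl-delay value
     * read from its robots.txt (in milliseconds).
     * Assumption: zero or negative values mean "no delay requested".
     */
    static long effectiveDelayMillis(long robotsCrawlDelayMillis) {
        if (robotsCrawlDelayMillis <= 0) {
            return 0L;
        }
        // Obey the requested delay, but never wait longer than the cap.
        return Math.min(robotsCrawlDelayMillis, MAX_CRAWL_DELAY_MILLIS);
    }

    public static void main(String[] args) {
        // A modest delay is passed through unchanged.
        System.out.println(effectiveDelayMillis(TimeUnit.SECONDS.toMillis(2)));
        // An extreme delay (one hour) is limited to the cap.
        System.out.println(effectiveDelayMillis(TimeUnit.SECONDS.toMillis(3600)));
    }
}
```

With a cap like this, a single host that announces a very large crawl-delay can no longer stall the whole crawl for longer than the cap per request.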

23 Commits (60c9986a0e4a4f4f75e1ba93e6cf1ac0f9f1fed6)