yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	9fcd8f1bda	added canonical filter. attention: this is on by default! (it should do the right thing)	2 years ago
Michael Christen	4304e07e6f	crawl profile adoption to new tag valency attribute	2 years ago
Michael Peter Christen	e6a87e0426	enhanced crawler. A main problem when crawling is the long waiting time caused by crawl-delay values from robots.txt entries. That attribute is not supported by google and is interpreted by yandex and bing in different ways. In large crawls there is always one host which blocks the whole crawl with extremely large values. YaCy now still obeys crawl-delay but limits the values to 10 seconds. Additionally the blocking logic when loading new robots.txt was analyzed and a deadlock was removed. Furthermore the construction of new queue lists was redesigned and it was ensured that always a large list of different hosts for host-balancing is provided for the loader.	3 years ago
Lina Ceballos	a96752f5ab	adding SPDX license and copyright headers	4 years ago
sgaebel	9bc2297161	fixes deleting during recrawl	4 years ago
sgaebel	80785b785e	adds deleting during recrawl	4 years ago
sgaebel	8d2e7262d9	Recrawl: - set the chunksize to 100 to meet the max of the embedded solr - re-enable sorting (the case where we switched it off should be gone) - enable recrawling on remote-solr	6 years ago
luccioman	80fb1026d0	Create recrawl requests with the relevant crawl profile. Recrawl default profile was previously effectively used for crawl stacker acceptance check, but request entries were indeed still created with the "snippetGlobalText" profile.	7 years ago
luccioman	8b572b7337	Commit Solr index before simulating or starting recrawl job. This ensures up-to-date simulation query results, and recrawl processing.	7 years ago
luccioman	b712a0671e	Added a specific default crawl profile for the recrawl job. - with only light constraint on known indexed documents load date, as it can already be controlled by the selection query, and the goal of the job is indeed to recrawl selected documents now - using the iffresh cache strategy	7 years ago
luccioman	4e03335625	Added more details to the recrawl job report	7 years ago
luccioman	433e241e4f	Added a report info box about eventual last terminated recrawl job. For easier monitoring of recrawls.	7 years ago
luccioman	b2af25b14f	Added a stop condition to the Recrawl busy thread	7 years ago
luccioman	421728d25a	Made possible to customize selection query before launching a recrawl	7 years ago
luccioman	46f37e38dc	Customized Threads with generic name for easier monitoring.	7 years ago
reger	7a64bebb86	init Recrawl job chunk size to max crawl loader during job start, to use some system preferences and allow injection of recrawl urls before queue is empty. During recrawl the balancer often hangs on the very last urls on hosts with huge delay time; by allowing injection earlier, progress is more balanced. Max number of injected crawl urls by recrawl job is 2 * max loader.	9 years ago
reger	fb75fea446	use recrawljob w/o sort results by date. This is a workaround for existing index (not fully reindexed) since intro of schema with docvalues to prevent solr exception causing recrawljob to fail with org.apache.solr.core.SolrCore java.lang.IllegalStateException: unexpected docvalues type NONE for field 'load_date_dt' (expected=NUMERIC). Use UninvertingReader or index with docvalues.	9 years ago
reger	98ab655917	on reindex delete index document with invalid url if discovered	9 years ago
Michael Peter Christen	dbbad23e12	removed warnings	9 years ago
reger	72f6a0b0b2	enhance recrawl job - allow to modify the query to select documents to process (after job has started) - allow to include failed urls (httpstatus <> 200)	10 years ago
reger	cd7c0e0aae	detail optimization of RecrawlThread	10 years ago
reger	ace71a8877	Initial (experimental) implementation of index update/re-crawl job added to IndexReIndexMonitor_p.html. Selects existing documents from index and feeds them to the crawler. Currently only the field fresh_date_dt is used to determine documents for recrawl (fresh_date_dt:[* TO NOW-1DAY]). Documents are added in small chunks (200) to the crawler, only if no other crawl is running.	10 years ago

22 Commits (0663ae3c99e8e0c920d67c7d5419fb7fa8ca99d7)