Instead of loading the full Solr document, an index holding only the last
loading time was created. This prevents Solr from having to fetch from its
index while the index is being built. Excessive re-loading of documents
during indexing has been shown to produce deadlocks, which should now be
prevented.
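A minimal sketch of the idea, assuming an in-memory map keyed by URL hash
(the class and method names are illustrative, not YaCy's actual data
structure):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Lightweight last-loading-time index, consulted instead of fetching the
// full Solr document while indexing is in progress.
public class LastLoadTimeIndex {

    // maps a URL hash to the epoch milliseconds of its last load
    private final Map<String, Long> lastLoad = new ConcurrentHashMap<>();

    /** Record the load time when a document is stored. */
    public void put(String urlHash, long loadTimeMillis) {
        this.lastLoad.put(urlHash, loadTimeMillis);
    }

    /** Answer the re-load double-check without touching Solr. */
    public boolean loadedAfter(String urlHash, long dateMillis) {
        Long loaded = this.lastLoad.get(urlHash);
        return loaded != null && loaded.longValue() > dateMillis;
    }
}
```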
A main problem when crawling is the long waiting time caused by Crawl-delay
values in robots.txt entries. That directive is not supported by Google and
is interpreted by Yandex and Bing in different ways. In large crawls there
is always one host that blocks the whole crawl with extremely large values.
YaCy still obeys Crawl-delay, but now limits it to 10 seconds.
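The clamping rule, sketched (constant and method names are illustrative,
not YaCy's actual robots.txt handling code):

```java
public class CrawlDelayClamp {

    /** Hard upper bound for the robots.txt Crawl-delay, in milliseconds. */
    private static final long MAX_CRAWL_DELAY_MS = 10_000L;

    /** Still obey Crawl-delay, but never let one host stall the whole crawl. */
    public static long effectiveDelayMs(long robotsTxtDelayMs) {
        return Math.min(robotsTxtDelayMs, MAX_CRAWL_DELAY_MS);
    }
}
```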
Additionally, the blocking logic used when loading new robots.txt files was
analyzed and a deadlock was removed. Furthermore, the construction of new
queue lists was redesigned to ensure that the loader is always provided
with a large list of different hosts for host balancing (see the sketch
below).
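One way to picture the host balancing, as a hypothetical sketch (YaCy's
actual balancer is more involved): keep one queue per host and give the
loader at most one URL per host per round.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class HostBalancer {

    // one pending-URL queue per host
    private final Map<String, Queue<String>> queuesByHost = new ConcurrentHashMap<>();

    public void push(String host, String url) {
        this.queuesByHost
            .computeIfAbsent(host, h -> new ConcurrentLinkedQueue<>())
            .add(url);
    }

    /** Build a loader list containing at most one URL from each known host. */
    public List<String> nextRound() {
        List<String> round = new ArrayList<>();
        for (Queue<String> queue : this.queuesByHost.values()) {
            String url = queue.poll();
            if (url != null) round.add(url);
        }
        return round;
    }
}
```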
Added finer control over which parsed documents can trigger the addition of
their links to the crawl stack, complementary to the existing crawl depth
parameter.
- set the chunk size to 100 to match the maximum of the embedded Solr (see
the sketch below)
- re-enable sorting (the case where we switched it off should be gone)
- enable recrawling on remote Solr
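A sketch of the chunking rule; the names are illustrative and the limit of
100 is the assumed embedded-Solr maximum mentioned above:

```java
import java.util.List;
import java.util.function.Consumer;

public class SolrChunks {

    private static final int CHUNK_SIZE = 100; // assumed embedded-Solr maximum

    /** Hand documents to Solr in batches of at most CHUNK_SIZE. */
    public static <T> void pushInChunks(List<T> docs, Consumer<List<T>> push) {
        for (int i = 0; i < docs.size(); i += CHUNK_SIZE) {
            push.accept(docs.subList(i, Math.min(i + CHUNK_SIZE, docs.size())));
        }
    }
}
```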
New "Media Type detection" section in the advanced crawl start page
allow to choose between :
- not loading URLs with unknown or unsupported file extension without
checking the actual Media Type (relying Content-Type header for now).
This was the old default behavior, faster, but not really accurate.
- always cross check URL file extension against the actual Media Type.
This lets properly parse URLs ending with an apparently odd file
extension, but which have actually a supported Media Type such as
text/html.
Sample URLs with misleading file extensions added as documentation in
the crawl start page.
fixes issue #244
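The cross-check can be pictured as a HEAD request for the Content-Type
before deciding whether to parse; this is a hypothetical helper checking
only text/html as an example, not YaCy's actual loader code:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class MediaTypeCheck {

    /** Ask the server for the actual Media Type without downloading the body. */
    public static boolean isActuallyHtml(URL url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("HEAD"); // headers only
        try {
            String contentType = connection.getContentType(); // e.g. "text/html; charset=UTF-8"
            return contentType != null && contentType.startsWith("text/html");
        } finally {
            connection.disconnect();
        }
    }
}
```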
As reported by @vikulin in issue #187, crawling websites that use a raw
IPv6 address as the host name in their URL failed on Microsoft Windows
platforms (FAT32 or NTFS filesystems) when the YaCy crawler created the
crawl queue folder, as the ':' character, which is part of an IPv6 address,
is forbidden in file names on these filesystems.
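One possible workaround, sketched here (the replacement scheme is
illustrative, not necessarily the one YaCy applies): derive the folder name
from the host with the ':' characters replaced.

```java
public class HostFolderName {

    /** Make a raw IPv6 host usable as a folder name on FAT32/NTFS. */
    public static String hostToFolderName(String host) {
        // "2001:db8::1" becomes "2001_db8__1", which is valid on Windows
        return host.replace(':', '_');
    }
}
```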
When using 'From Link-List of URL' as a crawl start, with lists on the
order of a thousand links or more, the maximum size (32 KB) of the
failreason_s Solr field was exceeded by the string representation of the
URL must-match filter when a crawl URL was rejected for not matching.
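An illustrative fix, assuming the field value is simply truncated before
writing (the constant and names are assumptions, and character count stands
in for the byte-size limit here):

```java
public class FailReason {

    private static final int FAILREASON_MAX_CHARS = 32 * 1024;

    /** Keep the fail reason within the Solr string field limit. */
    public static String truncate(String reason) {
        if (reason == null || reason.length() <= FAILREASON_MAX_CHARS) return reason;
        return reason.substring(0, FAILREASON_MAX_CHARS);
    }
}
```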
This makes it possible to set up much more advanced document crawl filters
by filtering on one or more indexed document fields before inserting into
the index.
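A sketch of such a per-field filter, with hypothetical class and field
names: a parsed document is only inserted into the index if a given indexed
field matches a must-match pattern.

```java
import java.util.Map;
import java.util.regex.Pattern;

public class FieldMustMatchFilter {

    private final String fieldName;  // e.g. a Solr field such as "title"
    private final Pattern mustMatch; // pattern the field value must match

    public FieldMustMatchFilter(String fieldName, String mustMatchRegex) {
        this.fieldName = fieldName;
        this.mustMatch = Pattern.compile(mustMatchRegex);
    }

    /** True if the document may be inserted into the index. */
    public boolean accepts(Map<String, String> documentFields) {
        String value = documentFields.get(this.fieldName);
        return value != null && this.mustMatch.matcher(value).matches();
    }
}
```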
Initializing thread names through the Thread constructor parameter is
faster, as the constructor already sets a thread name even when no
customized one is given, while an additional call to Thread.setName()
internally performs synchronized access, possibly runs an access check on
the security manager, and makes a native call.
Profiling a running YaCy server revealed that the total processing time
spent in Thread.setName() for a typical p2p search was in the range of
seconds.
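The two patterns side by side; both threads end up named, but the first
avoids the extra setName() overhead:

```java
public class ThreadNaming {
    public static void main(String[] args) {
        Runnable task = () -> System.out.println(Thread.currentThread().getName());

        // Preferred: the constructor assigns the name directly.
        Thread fast = new Thread(task, "search-worker-1");

        // Previous pattern: a default name is computed, then replaced
        // through the synchronized, security-checked setName() call.
        Thread slow = new Thread(task);
        slow.setName("search-worker-2");

        fast.start();
        slow.start();
    }
}
```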
The Recrawl default profile was previously effectively used for the crawl
stacker acceptance check, but request entries were indeed still created
with the "snippetGlobalText" profile.
Associate cached content with the last redirection location instead of the
first URL of a redirection chain (see the sketch below):
- for proper base URL processing in parsers (fixes mantis 636 -
http://mantis.tokeek.de/view.php?id=636)
- to prevent duplicated content in the Solr index when recrawling a
redirected URL
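A hypothetical helper showing the idea: follow the redirection chain and
return the last location, which then serves as the cache key instead of the
first URL.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectResolver {

    /** Resolve a redirection chain to its final location. */
    public static URL finalLocation(URL start) throws IOException {
        URL current = start;
        for (int hops = 0; hops < 5; hops++) { // guard against redirect loops
            HttpURLConnection connection = (HttpURLConnection) current.openConnection();
            connection.setInstanceFollowRedirects(false);
            int status = connection.getResponseCode();
            if (status < 300 || status >= 400) {
                return current; // not a redirect: this is the final location
            }
            String location = connection.getHeaderField("Location");
            if (location == null) return current;
            current = new URL(current, location); // Location may be relative
        }
        return current;
    }
}
```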
- with only a light constraint on the load date of already indexed
documents, as it can already be controlled by the selection query, and the
goal of the job is indeed to recrawl the selected documents now
- using the iffresh cache strategy