yacy_search_server

Commit Graph

Author	SHA1	Message	Date
luccioman	46b5249c20	Removed time condition on HostBalancer initialization in JUnit test. Its initialization in main application usage remains asynchronous.	7 years ago
luccioman	8b572b7337	Commit Solr index before simulating or starting recrawl job. This ensures up-to-date simulation query results, and recrawl processing.	7 years ago
luccioman	7baa99f26f	Fixed stored URL in web cache when redirection(s) occurs. Associate cached content to the last redirection location, instead of the first URL of a redirection(s) chain : - for proper base URL processing in parsers (fixes mantis 636 - http://mantis.tokeek.de/view.php?id=636) - to prevent duplicated content in Solr index when recrawling a redirected URL	7 years ago
luccioman	9ddf92d143	Removed unncessary reflection usage for workflow tasks. This improves code readability and maintainability (calls hierarchy are easier to read) and eventually performance.	7 years ago
luccioman	897d3d30cc	Added new recrawl job profile to the list of default crawl profiles	7 years ago
luccioman	b712a0671e	Added a specific default crawl profile for the recrawl job. - with only light constraint on known indexed documents load date, as it can already been controlled by the selection query, and the goal of the job is indeed to recrawl selected documents now - using the iffresh cache strategy	7 years ago
luccioman	adf3fa493d	Added comments about crawl profiles recrawl cycles	7 years ago
luccioman	3638e16c2e	More comprehensive log on rejected recrawls caused by date constraint	7 years ago
luccioman	d47afe6fab	Use a constant for crawler reject reason prefix with specific processing	7 years ago
luccioman	4e03335625	Added more details to the recrawl job report	7 years ago
luccioman	433e241e4f	Added a report info box about eventual last terminated recrawl job For easier monitoring of recrawls.	7 years ago
luccioman	b2af25b14f	Added a stop condition to the Recrawl busy thread	7 years ago
luccioman	421728d25a	Made possible to customize selection query before launching a recrawl	7 years ago
luccioman	09c4ee56a7	Added optional https support for remote crawl and profile operations	7 years ago
luccioman	5db1c9155a	Do locale independant case conversion on hosts, schemes, and file exts. Required for proper operation when the default system locale is Turkish, as dottless and dotted i characters have specific case conversion rules in this language.	7 years ago
Michael Peter Christen	25573bd5ab	added a crawl filter based on <div> tag class names When a crawl is started, a new field to exclude content from scraping is available. The field can be identified with the class name of div tags. All text contained in such a div tag where the configured class name(s) match are not indexed, while the remaining page is indexed.	7 years ago
luccioman	46f37e38dc	Customized Threads with generic name for easier monitoring.	7 years ago
luccioman	046be566e1	Updated a license header typo.	7 years ago
Apply55gx	3c905a2a5c	fix typo	7 years ago
luccioman	6cec2cdcb5	Use unredirected robots.txt URL when adding an entry to the table.	8 years ago
luccioman	3f0446f14b	Ensure proper synchronous robots entry retrieval on first check. Previously, when checking for the first time the robots.txt policy on a unknown host (not cached in the robots table), result was always empty in the /getpageinfo_p.xml api and in the /CrawlCheck_p.html page. Next calls returned however the correct information.	8 years ago
luccioman	5a646540cc	Support parsing gzip files from servers with redundant headers. Some web servers provide both 'Content-Encoding : "gzip"' and 'Content-Type : "application/x-gzip"' HTTP headers on their ".gz" files. This was annoying to fail on such resources which are not so uncommon, while non conforming (see RFC 7231 section 3.1.2.2 for "Content-Encoding" header specification https://tools.ietf.org/html/rfc7231#section-3.1.2.2)	8 years ago
luccioman	11a7f923d4	Distinguish response parsing failures from unexpected exceptions.	8 years ago
luccioman	452a17a8d5	Finer control on bounded input streams with custom stream implementation	8 years ago
luccioman	1e84956721	Support loading local files with a per request specified maximum size. Consistently with the HTTP loader implementation.	8 years ago
luccioman	bf55f1d6e5	Started support of partial parsing on large streamed resources. Thus enable getpageinfo_p API to return something in a reasonable amount of time on resources over MegaBytes size range. Support added first with the generic XML parser, for other formats regular crawler limits apply as usual.	8 years ago
luccioman	433bdb7c0d	Respect maxFileSize limit also when streaming HTTP and when relevant. Constraint applied consistently with HTTP content full load in byte array.	8 years ago
luccioman	8da3174867	Ensure lower case conversion consistency with any default locale. Especially for Turkish speaking users using "tr" as their system default locale : strings for technical stuff (URLs, tag names, constants...) must not be lower cased with the default locale, as 'I' doesn't becomes 'i' like in other locales such as "en", but becomes 'ı'.	8 years ago
luccioman	9dd790087d	Added HT Cache basic statistics (hit rate)	8 years ago
luccioman	28b451a0b3	Made Cache compression level and lock timeout user configurable	8 years ago
luccioman	a7394b479b	Limit the synchronization blocking time on some Cache operations. Using a Reentrant lock instead of the intrinsic synchronization lock permits limiting the blocking time to acquire a lock. Useful on a very busy Cache concurrently accessed by many threads : when the time to acquire a lock is too high, getting/storing content on the cache becomes inefficient, and it is then better to fall back to loading remote resources. Illustrated by the CacheTest stress test and some traces reported in mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )	8 years ago
luccioman	8399275142	Properly close file output streams even on exceptions scenarios.	8 years ago
luccioman	d98c04853d	Ensure proper closing of file input streams.	8 years ago
luccioman	a9cb083fa1	Improved consistency between loader openInputStream and load functions	8 years ago
luccioman	b1da92648e	Fixed surrogates import monitoring page (/CrawlResults.html?process=7) This page was always empty, as described in mantis 740 (http://mantis.tokeek.de/view.php?id=740)	8 years ago
luccioman	f66438442e	Extended Mediawiki dump import to remote URLs. When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote file now is directly streamed and processed, allowing import of several GB dumps even with a low memory remote peer, and without need to manually download the dump file first.	8 years ago
reger	ce87025462	further avoid to set connect info properties as header value following comment "use of properties as header values is discouraged" in case where (proxy)HTTPClient overwrites values with supplied url. Use defined request.referer procedure in response class.	8 years ago
luccioman	39e081ef38	Fixed display of crawler pending URLs counts in HostBrowser.html page. As described in mantis 722 (http://mantis.tokeek.de/view.php?id=722) Also updated some Javadoc.	8 years ago
luccioman	0da1e6ba16	Factored code re-implementing DigestURL.hosthash() method. This ensure consistent implementation of the url host hash generation and easier usage finding in source code. Also added a unit test for this function.	8 years ago
luccioman	c1401d821e	Adjusted crawl depth control for FTP crawl start URLs.	8 years ago
luccioman	3ca695390c	FTP crawl start URLs : applied crawl profile depth control Applied rules : - when the FTP URL denotes a file resource, stack it as any start URL : eventually embedded links can be followed applying the usual depth rules - when the FTP URL denotes a directory, list files under this directory and stack them for crawl, and repeat the process on sub folders until crawl depth is reached	8 years ago
reger	c50e23c495	reduce creation of empty legacy RequestHeader() in situation where null is acceptable (less for garbage collection).	8 years ago
reger	87f6631a2a	adjust Cache getHeader to prev. changes/commit	8 years ago
reger	0d2964cf2b	expanded error message on rejected crawl url due to faile dns lookup close of http://mantis.tokeek.de/view.php?id=678	8 years ago
luccioman	aa9ddf3c23	Added control over Robots.txt active threads maximum number. When starting a crawl from a file containing thousands of links, configuration setting "crawler.MaxActiveThreads" is effective to prevent saturating the system with too many outgoing HTTP connections threads launched by the crawler. But robots.txt was not affected by this setting and was indefinitely increasing the number of concurrently loading threads until most ot the connections timed out. To improve performance control, added a pool of threads for Robots.txt, consistently used in its ensureExist() and massCrawlCheck() methods. The Robots.txt threads pool max size can now be configured in the /PerformanceQueus_p.html page, or with the new "robots.txt.MaxActiveThreads" setting, initialized with the same default value as the crawler.	8 years ago
reger	e0816ef2e5	use human readable date format in CrawlStacker error message "double in: local index, oldDate = "	8 years ago
luccioman	f0639d810c	Customized name for Threads still using the default "Thread-n" pattern. This makes threads monitoring easier to read.	8 years ago
luccioman	db3b9db9c2	Crawl from local file : faster task end when manually terminating crawl.	8 years ago
luccioman	47af33a04c	Advanced Crawl from local file : better processing of large files. Applied strategy : when there is no restriction on domains or sub-path(s), stack anchor links once discovered by the content scraper instead of waiting the complete parsing of the file. This makes it possible to handle a crawling start file with thousands of links in a reasonable amount of time. Performance limitation : even if the crawl start faster with a large file, the content of the parsed file still is fully loaded in memory.	8 years ago
luccioman	6f49ece22f	Fixed redirected URLs processing as crawl start point. See mantis 699 (http://mantis.tokeek.de/view.php?id=699) for details.	8 years ago

1 2 3 4 5 ...

349 Commits (539925a27595782f64a42c0c109741586377c07d)