349 Commits (539925a27595782f64a42c0c109741586377c07d)

Author SHA1 Message Date
luccioman 46b5249c20 Removed time condition on HostBalancer initialization in JUnit test.
7 years ago
luccioman 8b572b7337 Commit Solr index before simulating or starting recrawl job.
7 years ago
luccioman 7baa99f26f Fixed stored URL in web cache when redirection(s) occur.
7 years ago
luccioman 9ddf92d143 Removed unnecessary reflection usage for workflow tasks.
7 years ago
luccioman 897d3d30cc Added new recrawl job profile to the list of default crawl profiles
7 years ago
luccioman b712a0671e Added a specific default crawl profile for the recrawl job.
7 years ago
luccioman adf3fa493d Added comments about crawl profiles recrawl cycles
7 years ago
luccioman 3638e16c2e More comprehensive log on rejected recrawls caused by date constraint
7 years ago
luccioman d47afe6fab Use a constant for crawler reject reason prefix with specific processing
7 years ago
luccioman 4e03335625 Added more details to the recrawl job report
7 years ago
luccioman 433e241e4f Added a report info box about eventual last terminated recrawl job
7 years ago
luccioman b2af25b14f Added a stop condition to the Recrawl busy thread
7 years ago
luccioman 421728d25a Made it possible to customize selection query before launching a recrawl
7 years ago
luccioman 09c4ee56a7 Added optional https support for remote crawl and profile operations
7 years ago
luccioman 5db1c9155a Do locale independent case conversion on hosts, schemes, and file exts.
7 years ago
Michael Peter Christen 25573bd5ab added a crawl filter based on <div> tag class names
7 years ago
luccioman 46f37e38dc Customized Threads with generic name for easier monitoring.
7 years ago
luccioman 046be566e1 Updated a license header typo.
7 years ago
Apply55gx 3c905a2a5c fix typo
7 years ago
luccioman 6cec2cdcb5 Use unredirected robots.txt URL when adding an entry to the table.
8 years ago
luccioman 3f0446f14b Ensure proper synchronous robots entry retrieval on first check.
8 years ago
luccioman 5a646540cc Support parsing gzip files from servers with redundant headers.
8 years ago
luccioman 11a7f923d4 Distinguish response parsing failures from unexpected exceptions.
8 years ago
luccioman 452a17a8d5 Finer control on bounded input streams with custom stream implementation
8 years ago
luccioman 1e84956721 Support loading local files with a per request specified maximum size.
8 years ago
luccioman bf55f1d6e5 Started support of partial parsing on large streamed resources.
8 years ago
luccioman 433bdb7c0d Respect maxFileSize limit also when streaming HTTP and when relevant.
8 years ago
luccioman 8da3174867 Ensure lower case conversion consistency with any default locale.
8 years ago
luccioman 9dd790087d Added HT Cache basic statistics (hit rate)
8 years ago
luccioman 28b451a0b3 Made Cache compression level and lock timeout user configurable
8 years ago
luccioman a7394b479b Limit the synchronization blocking time on some Cache operations.
8 years ago
luccioman 8399275142 Properly close file output streams even on exceptions scenarios.
8 years ago
luccioman d98c04853d Ensure proper closing of file input streams.
8 years ago
luccioman a9cb083fa1 Improved consistency between loader openInputStream and load functions
8 years ago
luccioman b1da92648e Fixed surrogates import monitoring page (/CrawlResults.html?process=7)
8 years ago
luccioman f66438442e Extended Mediawiki dump import to remote URLs.
8 years ago
reger ce87025462 further avoid to set connect info properties as header value
8 years ago
luccioman 39e081ef38 Fixed display of crawler pending URLs counts in HostBrowser.html page.
8 years ago
luccioman 0da1e6ba16 Factored code re-implementing DigestURL.hosthash() method.
8 years ago
luccioman c1401d821e Adjusted crawl depth control for FTP crawl start URLs.
8 years ago
luccioman 3ca695390c FTP crawl start URLs : applied crawl profile depth control
8 years ago
reger c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
8 years ago
reger 87f6631a2a adjust Cache getHeader to prev. changes/commit
8 years ago
reger 0d2964cf2b expanded error message on rejected crawl url due to failed dns lookup
8 years ago
luccioman aa9ddf3c23 Added control over Robots.txt active threads maximum number.
8 years ago
reger e0816ef2e5 use human readable date format in CrawlStacker error message
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
8 years ago
luccioman db3b9db9c2 Crawl from local file : faster task end when manually terminating crawl.
8 years ago
luccioman 47af33a04c Advanced Crawl from local file : better processing of large files.
8 years ago
luccioman 6f49ece22f Fixed redirected URLs processing as crawl start point.
8 years ago