325 Commits (f8f1959ebb3f96b66e75d7d83cd70ae9714e85bd)

Author SHA1 Message Date
luccioman 1e84956721 Support loading local files with a per request specified maximum size.
7 years ago
luccioman bf55f1d6e5 Started support of partial parsing on large streamed resources.
7 years ago
luccioman 433bdb7c0d Respect maxFileSize limit also when streaming HTTP and when relevant.
7 years ago
luccioman 8da3174867 Ensure lower case conversion consistency with any default locale.
7 years ago
luccioman 9dd790087d Added HT Cache basic statistics (hit rate)
8 years ago
luccioman 28b451a0b3 Made Cache compression level and lock timeout user configurable
8 years ago
luccioman a7394b479b Limit the synchronization blocking time on some Cache operations.
8 years ago
luccioman 8399275142 Properly close file output streams even on exceptions scenarios.
8 years ago
luccioman d98c04853d Ensure proper closing of file input streams.
8 years ago
luccioman a9cb083fa1 Improved consistency between loader openInputStream and load functions
8 years ago
luccioman b1da92648e Fixed surrogates import monitoring page (/CrawlResults.html?process=7)
8 years ago
luccioman f66438442e Extended Mediawiki dump import to remote URLs.
8 years ago
reger ce87025462 further avoid to set connect info properties as header value
8 years ago
luccioman 39e081ef38 Fixed display of crawler pending URLs counts in HostBrowser.html page.
8 years ago
luccioman 0da1e6ba16 Factored code re-implementing DigestURL.hosthash() method.
8 years ago
luccioman c1401d821e Adjusted crawl depth control for FTP crawl start URLs.
8 years ago
luccioman 3ca695390c FTP crawl start URLs : applied crawl profile depth control
8 years ago
reger c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
8 years ago
reger 87f6631a2a adjust Cache getHeader to prev. changes/commit
8 years ago
reger 0d2964cf2b expanded error message on rejected crawl url due to faile dns lookup
8 years ago
luccioman aa9ddf3c23 Added control over Robots.txt active threads maximum number.
8 years ago
reger e0816ef2e5 use human readable date format in CrawlStacker error message
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
8 years ago
luccioman db3b9db9c2 Crawl from local file : faster task end when manually terminating crawl.
8 years ago
luccioman 47af33a04c Advanced Crawl from local file : better processing of large files.
8 years ago
luccioman 6f49ece22f Fixed redirected URLs processing as crawl start point.
8 years ago
luccioman 7263d17436 Removed mentions of deprecated LURL-db.
8 years ago
luccioman 54cfcc3f56 CrawlCheck_p.html : also display info about disallowed URLs.
8 years ago
luccioman 8b341e9818 Robots : properly handle URLs including non ASCII characters
8 years ago
luccioman dcdea2d02f Fixed shutdown for crawler.MaxActiveThreads value greater than 200
8 years ago
luccioman 3ee4f56c39 Improved ErrorCache behavior when switching networks
8 years ago
Michael Peter Christen 5e165a8150 removed unused imports
8 years ago
reger 7ab41d4ff1 use directories original lastmodified date in file- & smbloader in response
8 years ago
reger 708bcbb042 one more replacement to use cached hosthash vs. calculated
8 years ago
reger 22db449f2a to prevent crawler to concurrently access and alter same crawl queue
8 years ago
reger 8d58a48029 remove wrong log line in CrawlSwitchboard
8 years ago
reger a6ba1faa80 introduce a translation edit servlet Translator_p.html YaCy's UI text translation
9 years ago
reger eb2a00b1d8 fix NPE on missing crawldepth_i
9 years ago
reger 7be1c7a05a fix logger name
9 years ago
reger 7789c32c82 delete crawl queue on init exception
9 years ago
reger 379e9b330d use supplied url port to get robots.txt in crawlers hostqueue
9 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader methode.
9 years ago
sixcooler 5cb7ba0dc4 fix for connections not getting closed to get favicon.ico during seach
9 years ago
Ryszard Goń a98c395023 Add the Autocrawl thread
9 years ago
Ryszard Goń 1728cd30c6 Create autocrawl profiles
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger b7e8358645 make use of header.getContentType where possible (mime is normalized afterwards)
9 years ago
Michael Peter Christen d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
9 years ago
reger b5371ea8c1 read/init crawl queue in a thread
9 years ago
reger 90686a75a2 fix flux factor (additional crawl delay by access count) calculation
9 years ago