313 Commits (df5970df6d4de27ef96641aadc2591c219e87a36)

Author SHA1 Message Date
reger ce87025462 further avoid to set connect info properties as header value
8 years ago
luccioman 39e081ef38 Fixed display of crawler pending URLs counts in HostBrowser.html page.
8 years ago
luccioman 0da1e6ba16 Factored code re-implementing DigestURL.hosthash() method.
8 years ago
luccioman c1401d821e Adjusted crawl depth control for FTP crawl start URLs.
8 years ago
luccioman 3ca695390c FTP crawl start URLs : applied crawl profile depth control
8 years ago
reger c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
8 years ago
reger 87f6631a2a adjust Cache getHeader to prev. changes/commit
8 years ago
reger 0d2964cf2b expanded error message on rejected crawl url due to failed dns lookup
8 years ago
luccioman aa9ddf3c23 Added control over Robots.txt active threads maximum number.
8 years ago
reger e0816ef2e5 use human readable date format in CrawlStacker error message
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
8 years ago
luccioman db3b9db9c2 Crawl from local file : faster task end when manually terminating crawl.
8 years ago
luccioman 47af33a04c Advanced Crawl from local file : better processing of large files.
8 years ago
luccioman 6f49ece22f Fixed redirected URLs processing as crawl start point.
8 years ago
luccioman 7263d17436 Removed mentions of deprecated LURL-db.
8 years ago
luccioman 54cfcc3f56 CrawlCheck_p.html : also display info about disallowed URLs.
8 years ago
luccioman 8b341e9818 Robots : properly handle URLs including non ASCII characters
8 years ago
luccioman dcdea2d02f Fixed shutdown for crawler.MaxActiveThreads value greater than 200
8 years ago
luccioman 3ee4f56c39 Improved ErrorCache behavior when switching networks
8 years ago
Michael Peter Christen 5e165a8150 removed unused imports
9 years ago
reger 7ab41d4ff1 use directories original lastmodified date in file- & smbloader in response
9 years ago
reger 708bcbb042 one more replacement to use cached hosthash vs. calculated
9 years ago
reger 22db449f2a to prevent crawler to concurrently access and alter same crawl queue
9 years ago
reger 8d58a48029 remove wrong log line in CrawlSwitchboard
9 years ago
reger a6ba1faa80 introduce a translation edit servlet Translator_p.html YaCy's UI text translation
9 years ago
reger eb2a00b1d8 fix NPE on missing crawldepth_i
9 years ago
reger 7be1c7a05a fix logger name
9 years ago
reger 7789c32c82 delete crawl queue on init exception
9 years ago
reger 379e9b330d use supplied url port to get robots.txt in crawlers hostqueue
9 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
9 years ago
sixcooler 5cb7ba0dc4 fix for connections not getting closed to get favicon.ico during search
9 years ago
Ryszard Goń a98c395023 Add the Autocrawl thread
9 years ago
Ryszard Goń 1728cd30c6 Create autocrawl profiles
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger b7e8358645 make use of header.getContentType where possible (mime is normalized afterwards)
9 years ago
Michael Peter Christen d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
9 years ago
reger b5371ea8c1 read/init crawl queue in a thread
9 years ago
reger 90686a75a2 fix flux factor (additional crawl delay by access count) calculation
9 years ago
luc 4af27289e5 Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger 297fdb60d3 throw exception if crawler hostqueue can't create hostpath directory.
9 years ago
luc 755efac17d Use same max file size when loading all resource bytes or opening stream
9 years ago
luc f01d49c37a Process large or local file images dealing directly with content
9 years ago
luc 5bbb2e1730 Ensure resource is closed when reading a full file InputStream
9 years ago
reger 7a64bebb86 init Recrawl job chunk size to max crawl loader during job start, to use some system preferences
9 years ago
reger fb75fea446 use recrawljob w/o sort results by date
9 years ago
reger 43c27aa550 upd to solr/lucene 5.3.1
9 years ago
reger 98ab655917 on reindex delete index document with invalid url
10 years ago
reger 367fe388b9 fix exception throw after sendError in DefaultServlet
10 years ago
Michael Peter Christen 8f90767889 fix for filesystem crawl
10 years ago
Michael Peter Christen dbbad23e12 removed warnings
10 years ago