304 Commits (baf6d21cfe530e2b585e0a78d4c7aaf45e73db05)

Author SHA1 Message Date
reger e0816ef2e5 use human readable date format in CrawlStacker error message
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
8 years ago
luccioman db3b9db9c2 Crawl from local file : faster task end when manually terminating crawl.
8 years ago
luccioman 47af33a04c Advanced Crawl from local file : better processing of large files.
8 years ago
luccioman 6f49ece22f Fixed redirected URLs processing as crawl start point.
8 years ago
luccioman 7263d17436 Removed mentions of deprecated LURL-db.
8 years ago
luccioman 54cfcc3f56 CrawlCheck_p.html : also display info about disallowed URLs.
8 years ago
luccioman 8b341e9818 Robots : properly handle URLs including non ASCII characters
8 years ago
luccioman dcdea2d02f Fixed shutdown for crawler.MaxActiveThreads value greater than 200
8 years ago
luccioman 3ee4f56c39 Improved ErrorCache behavior when switching networks
8 years ago
Michael Peter Christen 5e165a8150 removed unused imports
9 years ago
reger 7ab41d4ff1 use directories original lastmodified date in file- & smbloader in response
9 years ago
reger 708bcbb042 one more replacement to use cached hosthash vs. calculated
9 years ago
reger 22db449f2a to prevent crawler to concurrently access and alter same crawl queue
9 years ago
reger 8d58a48029 remove wrong log line in CrawlSwitchboard
9 years ago
reger a6ba1faa80 introduce a translation edit servlet Translator_p.html YaCy's UI text translation
9 years ago
reger eb2a00b1d8 fix NPE on missing crawldepth_i
9 years ago
reger 7be1c7a05a fix logger name
9 years ago
reger 7789c32c82 delete crawl queue on init exception
9 years ago
reger 379e9b330d use supplied url port to get robots.txt in crawlers hostqueue
9 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
9 years ago
sixcooler 5cb7ba0dc4 fix for connections not getting closed to get favicon.ico during search
9 years ago
Ryszard Goń a98c395023 Add the Autocrawl thread
9 years ago
Ryszard Goń 1728cd30c6 Create autocrawl profiles
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger b7e8358645 make use of header.getContentType where possible (mime is normalized afterwards)
9 years ago
Michael Peter Christen d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
9 years ago
reger b5371ea8c1 read/init crawl queue in a thread
9 years ago
reger 90686a75a2 fix flux factor (additional crawl delay by access count) calculation
9 years ago
luc 4af27289e5 Merge branch 'master' of https://github.com/yacy/yacy_search_server
9 years ago
reger 297fdb60d3 throw exception if crawler hostqueue can't create hostpath directory.
9 years ago
luc 755efac17d Use same max file size when loading all resource bytes or opening stream
9 years ago
luc f01d49c37a Process large or local file images dealing directly with content
9 years ago
luc 5bbb2e1730 Ensure resource is closed when reading a full file InputStream
9 years ago
reger 7a64bebb86 init Recrawl job chunk size to max crawl loader during job start, to use some system preferences
9 years ago
reger fb75fea446 use recrawljob w/o sort results by date
9 years ago
reger 43c27aa550 upd to solr/lucene 5.3.1
9 years ago
reger 98ab655917 on reindex delete index document with invalid url
10 years ago
reger 367fe388b9 fix exception throw after sendError in DefaultServlet
10 years ago
Michael Peter Christen 8f90767889 fix for filesystem crawl
10 years ago
Michael Peter Christen dbbad23e12 removed warnings
10 years ago
reger fa08ca207e ! finish running crawls before applying !
10 years ago
Michael Peter Christen fbeae20b3a try a healing of the cache if the index file is corrupted
10 years ago
Michael Peter Christen 3c4c69adea fix for
10 years ago
Michael Peter Christen 9c12555be5 added link to Snapshots in search results if the snapshot exists and
10 years ago
reger 72f6a0b0b2 enhance recrawl job
10 years ago
Michael Peter Christen 197f7449e5 All entities of crawl profiles are now editable in the crawl profile
10 years ago
reger 3e742d1e34 Init remote crawler on demand
10 years ago
reger cd7c0e0aae detail optimization of RecrawlThread
10 years ago
reger ace71a8877 Initial (experimental) implementation of index update/re-crawl job
10 years ago