98 Commits (607b39b427f76ec139df5d9d5479cf09a0d6fe4a)

Author SHA1 Message Date
luccioman 5a646540cc Support parsing gzip files from servers with redundant headers.
8 years ago
luccioman 11a7f923d4 Distinguish response parsing failures from unexpected exceptions.
8 years ago
luccioman 452a17a8d5 Finer control on bounded input streams with custom stream implementation
8 years ago
luccioman 1e84956721 Support loading local files with a per request specified maximum size.
8 years ago
luccioman bf55f1d6e5 Started support of partial parsing on large streamed resources.
8 years ago
luccioman 433bdb7c0d Respect maxFileSize limit also when streaming HTTP and when relevant.
8 years ago
luccioman 8da3174867 Ensure lower case conversion consistency with any default locale.
8 years ago
luccioman a9cb083fa1 Improved consistency between loader openInputStream and load functions
8 years ago
luccioman b1da92648e Fixed surrogates import monitoring page (/CrawlResults.html?process=7)
8 years ago
luccioman f66438442e Extended Mediawiki dump import to remote URLs.
8 years ago
reger ce87025462 further avoid to set connect info properties as header value
8 years ago
reger c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
8 years ago
luccioman 6f49ece22f Fixed redirected URLs processing as crawl start point.
8 years ago
Michael Peter Christen 5e165a8150 removed unused imports
9 years ago
reger 7ab41d4ff1 use directories original lastmodified date in file- & smbloader in response
9 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
9 years ago
sixcooler 5cb7ba0dc4 fix for connections not getting closed to get favicon.ico during search
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger b7e8358645 make use of header.getContentType where possible (mime is normalized afterwards)
9 years ago
luc 755efac17d Use same max file size when loading all resource bytes or opening stream
9 years ago
luc f01d49c37a Process large or local file images dealing directly with content
9 years ago
luc 5bbb2e1730 Ensure resource is closed when reading a full file InputStream
9 years ago
reger fa08ca207e ! finish running crawls before applying !
10 years ago
reger 141cd80456 correct log msg text
10 years ago
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen 783cf6fbc7 the LinkedBlockingQueue is much faster than the ArrayBlockingQueue
10 years ago
Michael Peter Christen 8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during
10 years ago
Michael Peter Christen 28683530cd fixes to usage of no-cache: use and recognize also the no-store
10 years ago
reger 568c991405 remove the unused Request variable
10 years ago
reger ff18129def ViewFile servlet: update index if newer,
10 years ago
Michael Peter Christen 25a64c51b3 moved snapshot generation out of the html handler to prevent that
10 years ago
Michael Peter Christen 97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
10 years ago
Michael Peter Christen ad0da5f246 added new web page snapshot infrastructure which will lead to the
10 years ago
Michael Peter Christen 84763126e0 added option to make the YaCy proxy act as the cache is never stale. If
10 years ago
Michael Peter Christen 67cd4c37bd activated the new apk parser which was already ready but not included in
11 years ago
Michael Peter Christen ebd0be2cea fixes and speed updates for search process
11 years ago
Michael Peter Christen eca9380e3d bugfix for crawler double-check: if an url is redirected, the
11 years ago
orbiter 22ce4fb4dd better error handling for remote solr queries and exists-checks
11 years ago
orbiter 4b06adb751 fix for file urls
11 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
11 years ago
Michael Peter Christen b893c42a0f bugfix for image search
11 years ago
Michael Peter Christen ba6ffddefc refactoring
11 years ago
Michael Peter Christen 10cf8215bd added crawl depth for failed documents
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen 6bd8c6f195 fix for wrong status codes of error pages
11 years ago
Michael Peter Christen b08375da33 fix for bad/missing values of size_i
11 years ago
reger dd5bf0b71b cleanup old reference to HTTPDemon.setAlternativeResolver
11 years ago
Michael Peter Christen bcd9dd9e1d enhanced concurrent loading by using a fixed set of concurrent loader
11 years ago