Commit Graph

106 Commits (13e42c2dd27894043892a2600679cbecfba05339)

Author SHA1 Message Date
Michael Peter Christen 9be36800a4 increased redirect depth by one
4 years ago
luccioman fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
6 years ago
luccioman e45afedee4 Added support for enclosures (media links) to the RSS loader
7 years ago
luccioman aaefd5219c Reduce log verbosity of RSS loader on feed items with no link
7 years ago
luccioman 17c7a85f18 Make StreamResponse usable in Java try-with-resources statements
7 years ago
luccioman 7baa99f26f Fixed stored URL in web cache when redirection(s) occurs.
7 years ago
luccioman 5db1c9155a Do locale independant case conversion on hosts, schemes, and file exts.
7 years ago
Michael Peter Christen 25573bd5ab added a crawl filter based on <div> tag class names
7 years ago
luccioman 5a646540cc Support parsing gzip files from servers with redundant headers.
7 years ago
luccioman 11a7f923d4 Distinguish response parsing failures from unexpected exceptions.
7 years ago
luccioman 452a17a8d5 Finer control on bounded input streams with custom stream implementation
7 years ago
luccioman 1e84956721 Support loading local files with a per request specified maximum size.
7 years ago
luccioman bf55f1d6e5 Started support of partial parsing on large streamed resources.
7 years ago
luccioman 433bdb7c0d Respect maxFileSize limit also when streaming HTTP and when relevant.
7 years ago
luccioman 8da3174867 Ensure lower case conversion consistency with any default locale.
7 years ago
luccioman a9cb083fa1 Improved consistency between loader openInputStream and load functions
8 years ago
luccioman b1da92648e Fixed surrogates import monitoring page (/CrawlResults.html?process=7)
8 years ago
luccioman f66438442e Extended Mediawiki dump import to remote URLs.
8 years ago
reger ce87025462 further avoid to set connect info properties as header value
8 years ago
reger c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
8 years ago
luccioman 6f49ece22f Fixed redirected URLs processing as crawl start point.
8 years ago
Michael Peter Christen 5e165a8150 removed unused imports
8 years ago
reger 7ab41d4ff1 use directories original lastmodified date in file- & smbloader in response
8 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader methode.
9 years ago
sixcooler 5cb7ba0dc4 fix for connections not getting closed to get favicon.ico during seach
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger b7e8358645 make use of header.getContentType where possible (mime is normalized afterwards)
9 years ago
luc 755efac17d Use same max file size when loading all resource bytes or opening stream
9 years ago
luc f01d49c37a Process large or local file images dealing directly with content
9 years ago
luc 5bbb2e1730 Ensure resource is closed when reading a full file InputStream
9 years ago
reger fa08ca207e ! finish running crawls before applying !
9 years ago
reger 141cd80456 correct log msg text
10 years ago
Michael Peter Christen fed26f33a8 enhanced timezone managament for indexed data:
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen 783cf6fbc7 the LinkedBlockingQueue is much faster than the ArrayBlockingQueue
10 years ago
Michael Peter Christen 8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during
10 years ago
Michael Peter Christen 28683530cd fixes to usage of no-cache: use and recognize also the no-store
10 years ago
reger 568c991405 remove the unused Request variable
10 years ago
reger ff18129def ViewFile servlet: update index if newer,
10 years ago
Michael Peter Christen 25a64c51b3 moved snapshot generation out of the html handler to prevent that
10 years ago
Michael Peter Christen 97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
10 years ago
Michael Peter Christen ad0da5f246 added new web page snapshot infrastructure which will lead to the
10 years ago
Michael Peter Christen 84763126e0 added option to make the YaCy proxy act as the cache is never stale. If
10 years ago
Michael Peter Christen 67cd4c37bd activated the new apk parser which was already ready but not included in
10 years ago
Michael Peter Christen ebd0be2cea fixes and speed updates for search process
10 years ago
Michael Peter Christen eca9380e3d bugfix for crawler double-check: if an url is redirected, the
10 years ago
orbiter 22ce4fb4dd better error handling for remote solr queries and exists-checks
10 years ago
orbiter 4b06adb751 fix for file urls
10 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
10 years ago