116 Commits (f1c70dce33f493414fb3e8e8911d78593b229725)

Author SHA1 Message Date
Michael Peter Christen 9fcd8f1bda added canonical filter
2 years ago
Michael Christen 4304e07e6f crawl profile adoption to new tag valency attribute
2 years ago
Daleth Darko 3ced06c731 Various javadoc fixes
3 years ago
Michael Peter Christen bd3f2483a1 replaced url and date retrieval by only url retrieval
3 years ago
Michael Peter Christen 1ead7b85b5 remove compiler warning
3 years ago
sgaebel cdf901270c always use HTTPClient by 'try with resources' pattern to free up
3 years ago
sgaebel 69adaa9f55 makes our HTTPClient closable
3 years ago
Michael Christen b2af745dd6 Merge pull request #404 from lnceballosz/master
4 years ago
sgaebel c69c462a15 replaces a expensive getLoadTimeURL() by exists()
4 years ago
Lina Ceballos a96752f5ab adding SPDX license and copyright headers
4 years ago
Michael Peter Christen 9be36800a4 increased redirect depth by one
4 years ago
luccioman fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
6 years ago
luccioman e45afedee4 Added support for enclosures (media links) to the RSS loader
7 years ago
luccioman aaefd5219c Reduce log verbosity of RSS loader on feed items with no link
7 years ago
luccioman 17c7a85f18 Make StreamResponse usable in Java try-with-resources statements
7 years ago
luccioman 7baa99f26f Fixed stored URL in web cache when redirection(s) occurs.
7 years ago
luccioman 5db1c9155a Do locale independant case conversion on hosts, schemes, and file exts.
7 years ago
Michael Peter Christen 25573bd5ab added a crawl filter based on <div> tag class names
7 years ago
luccioman 5a646540cc Support parsing gzip files from servers with redundant headers.
7 years ago
luccioman 11a7f923d4 Distinguish response parsing failures from unexpected exceptions.
7 years ago
luccioman 452a17a8d5 Finer control on bounded input streams with custom stream implementation
7 years ago
luccioman 1e84956721 Support loading local files with a per request specified maximum size.
7 years ago
luccioman bf55f1d6e5 Started support of partial parsing on large streamed resources.
7 years ago
luccioman 433bdb7c0d Respect maxFileSize limit also when streaming HTTP and when relevant.
7 years ago
luccioman 8da3174867 Ensure lower case conversion consistency with any default locale.
7 years ago
luccioman a9cb083fa1 Improved consistency between loader openInputStream and load functions
8 years ago
luccioman b1da92648e Fixed surrogates import monitoring page (/CrawlResults.html?process=7)
8 years ago
luccioman f66438442e Extended Mediawiki dump import to remote URLs.
8 years ago
reger ce87025462 further avoid to set connect info properties as header value
8 years ago
reger c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
8 years ago
luccioman 6f49ece22f Fixed redirected URLs processing as crawl start point.
8 years ago
Michael Peter Christen 5e165a8150 removed unused imports
8 years ago
reger 7ab41d4ff1 use directories original lastmodified date in file- & smbloader in response
8 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader methode.
9 years ago
sixcooler 5cb7ba0dc4 fix for connections not getting closed to get favicon.ico during seach
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger b7e8358645 make use of header.getContentType where possible (mime is normalized afterwards)
9 years ago
luc 755efac17d Use same max file size when loading all resource bytes or opening stream
9 years ago
luc f01d49c37a Process large or local file images dealing directly with content
9 years ago
luc 5bbb2e1730 Ensure resource is closed when reading a full file InputStream
9 years ago
reger fa08ca207e ! finish running crawls before applying !
9 years ago
reger 141cd80456 correct log msg text
10 years ago
Michael Peter Christen fed26f33a8 enhanced timezone managament for indexed data:
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen 783cf6fbc7 the LinkedBlockingQueue is much faster than the ArrayBlockingQueue
10 years ago
Michael Peter Christen 8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during
10 years ago
Michael Peter Christen 28683530cd fixes to usage of no-cache: use and recognize also the no-store
10 years ago
reger 568c991405 remove the unused Request variable
10 years ago
reger ff18129def ViewFile servlet: update index if newer,
10 years ago