
697 Commits (30d71c63596eaba23f40c47baed0482d84d8f0f9)

Author | SHA1 | Message | Date
luccioman | 780173008e | Implemented partial stream parsing of tar archives. | 8 years ago
luccioman | acab6a6def | Also handle text content when parsing XML within limits. | 8 years ago
luccioman | 8a94fef9e0 | Prevent unwanted cached bytes duplication on stream parsing. | 8 years ago
luccioman | 5a646540cc | Support parsing gzip files from servers with redundant headers. | 8 years ago
luccioman | eda7b0aeb6 | Merge branch 'master' of https://github.com/yacy/yacy_search_server | 8 years ago
reger | 3005be7349 | Clean up unmaintained and unused AugmentParser trail. | 8 years ago
luccioman | cb4f1358e1 | Added gzip parser support for max content bytes limit | 8 years ago
luccioman | 5216c681a9 | Added HTML parser support for maximum content bytes parsing limit | 8 years ago
luccioman | 651fad6da5 | Added RSS parser support for maximum content bytes parsing limit | 8 years ago
luccioman | 452a17a8d5 | Finer control on bounded input streams with custom stream implementation | 8 years ago
luccioman | f8f1959ebb | Added parsing within bounds implementation to the generic parser. | 8 years ago
luccioman | e0f400a0bd | Support trying multiple parsers even when streaming on large resources. | 8 years ago
luccioman | bf55f1d6e5 | Started support of partial parsing on large streamed resources. | 8 years ago
luccioman | 90a7c1affa | HTML parser : removed unnecessary remaining recursive processing | 8 years ago
luccioman | 9b1bb2545e | Refactored plain-text URLs detection implementation. | 8 years ago
luccioman | 8da3174867 | Ensure lower case conversion consistency with any default locale. | 8 years ago
luccioman | 286f3018bd | Made mime type and extension normalization locale independent. | 8 years ago
luccioman | 319231a458 | Added a generic XML parser, able to parse elements text and URLs. | 8 years ago
luccioman | d2a4a27f52 | Improved stream-oriented parsing entering conditions. | 8 years ago
luccioman | ce89492319 | Ensure system resource release by closing document stream. | 8 years ago
luccioman | 8399275142 | Properly close file output streams even on exceptions scenarios. | 8 years ago
luccioman | a04feac064 | Ensure file input streams proper closing in both success and failures | 8 years ago
luccioman | d98c04853d | Ensure proper closing of file input streams. | 8 years ago
luccioman | 306a82dd71 | Fixed scraper NullPointerException cases on malformed URLs. | 8 years ago
reger | 1737af37cf | Set request originator to own peer in warc importer | 8 years ago
reger | 039162fbf0 | Change warc importer to use defaultsurrogate-crawl profile, as reported | 8 years ago
reger | 077d062be3 | Adjust mergeDocuments to keep youngest last-modified date of document | 8 years ago
luccioman | 654801523e | Fixed StringIndexOutOfBoundsException case. | 8 years ago
luccioman | edd7ccac40 | Added some JavaDoc | 8 years ago
luccioman | 79fdf14b0a | Fixed regression introduced by commit 9ad4d16 | 8 years ago
Michael Peter Christen | 7678fd67e3 | copied fix from yacy_grid_parser for wrong array type | 8 years ago
reger | 9ad4d16829 | Add a responsHeader to the solr index export with a format identifier | 8 years ago
luccioman | 527d494c1a | Fixed "Unchecked conversion" compilation warnings. | 8 years ago
reger | c77e43a391 | Take out mailto collect in internal parsed document | 8 years ago
reger | bec34d3546 | Add url input field as source for WarcImporter | 8 years ago
luccioman | f66438442e | Extended Mediawiki dump import to remote URLs. | 8 years ago
reger | ba339a2a45 | Add servlet to import warc file from filesystem IndexImportWarc_p.html. | 8 years ago
reger | 510f11d374 | Implement surrogate import from Warc archives (as first option handle | 8 years ago
reger | 209a7374bd | remove unused import pdfParser | 8 years ago
reger | de1c1c16db | Improve pdf text extraction resource handling. | 8 years ago
reger | 18c7563dbe | Extend DCEntry.getLanguage convert to ISO639-1 codes for more languages | 8 years ago
reger | f254fcfc67 | fix htmlParser <script> text extraction on code containing expression | 8 years ago
Michael Peter Christen | 02d0b3172c | Merge branch 'master' of https://github.com/yacy/yacy_search_server.git | 8 years ago
Michael Peter Christen | d4f45cf05e | added dc.date.modified and dc.date.created to date parser | 8 years ago
reger | df80c57842 | add ukr and pol to DCEntry.getLanguage ISO639-2 3-char language code | 8 years ago
luccioman | 6a4d51d8f9 | Cleaned up some Javadoc warnings. | 8 years ago
reger | 4c9be29a55 | fix concurrency issue with htmlParser using not current scraper data | 8 years ago
reger | b522d540b9 | Include itemprop latitude/longitude (see schema.org) in attribute | 8 years ago
reger | 083df255e4 | fix html tag attribute parsing containing attribute w/o value | 8 years ago
reger | cb95b7339a | include html5 <time> tag in content scraper, | 8 years ago