711 Commits (117a85987989210f3b3295778e12bbaf2f5cd733)

Author | SHA1 | Message | Date
luccioman | 9412881230 | Added basic support for autotagging microdata annotated item types. | 7 years ago
luccioman | 5a14d34a7d | Refactoring : documented and extracted autotagging processing functions. | 7 years ago
luccioman | 58b9834729 | Added HTML microdata typed items parsing capability. | 7 years ago
luccioman | 733cacdbb8 | Revised the RDFaParser main launcher for minimal proper operation. | 7 years ago
Michael Peter Christen | 25573bd5ab | added a crawl filter based on <div> tag class names | 7 years ago
luccioman | e2f6427a63 | Added a basic JUnit test for the Visio parser (vsdParser) | 7 years ago
luccioman | 1e9cdaabd4 | Do locale neutral case conversion of HTML charset name. | 7 years ago
luccioman | e0eda84c24 | Remove old hard-coded holiday dates from DateDection class. | 7 years ago
luccioman | 46f37e38dc | Customized Threads with generic name for easier monitoring. | 7 years ago
luccioman | 32c9dfa768 | Added partial bzip2 stream parsing support and bzipParser Junit test | 7 years ago
luccioman | c6ae87168a | Added unit tests on the gzip parser. | 8 years ago
luccioman | 169ffdd1c7 | Finer control on max links to parse in the html parser. | 8 years ago
luccioman | e41d046a9d | Improved parsing support for OOXML spreadsheets (.xlsx) | 8 years ago
reger | 51a4e03c93 | Allow to stop currently running warc import (stop button) | 8 years ago
luccioman | 780173008e | Implemented partial stream parsing of tar archives. | 8 years ago
luccioman | acab6a6def | Also handle text content when parsing XML within limits. | 8 years ago
luccioman | 8a94fef9e0 | Prevent unwanted cached bytes duplication on stream parsing. | 8 years ago
luccioman | 5a646540cc | Support parsing gzip files from servers with redundant headers. | 8 years ago
luccioman | eda7b0aeb6 | Merge branch 'master' of https://github.com/yacy/yacy_search_server | 8 years ago
reger | 3005be7349 | Clean up unmaintained and unused AugmentParser trail. | 8 years ago
luccioman | cb4f1358e1 | Added gzip parser support for max content bytes limit | 8 years ago
luccioman | 5216c681a9 | Added HTML parser support for maximum content bytes parsing limit | 8 years ago
luccioman | 651fad6da5 | Added RSS parser support for maximum content bytes parsing limit | 8 years ago
luccioman | 452a17a8d5 | Finer control on bounded input streams with custom stream implementation | 8 years ago
luccioman | f8f1959ebb | Added parsing within bounds implementation to the generic parser. | 8 years ago
luccioman | e0f400a0bd | Support trying multiple parsers even when streaming on large resources. | 8 years ago
luccioman | bf55f1d6e5 | Started support of partial parsing on large streamed resources. | 8 years ago
luccioman | 90a7c1affa | HTML parser : removed unnecessary remaining recursive processing | 8 years ago
luccioman | 9b1bb2545e | Refactored plain-text URLs detection implementation. | 8 years ago
luccioman | 8da3174867 | Ensure lower case conversion consistency with any default locale. | 8 years ago
luccioman | 286f3018bd | Made mime type and extension normalization locale independent. | 8 years ago
luccioman | 319231a458 | Added a generic XML parser, able to parse elements text and URLs. | 8 years ago
luccioman | d2a4a27f52 | Improved stream-oriented parsing entering conditions. | 8 years ago
luccioman | ce89492319 | Ensure system resource release by closing document stream. | 8 years ago
luccioman | 8399275142 | Properly close file output streams even on exceptions scenarios. | 8 years ago
luccioman | a04feac064 | Ensure file input streams proper closing in both success and failures | 8 years ago
luccioman | d98c04853d | Ensure proper closing of file input streams. | 8 years ago
luccioman | 306a82dd71 | Fixed scraper NullPointerException cases on malformed URLs. | 8 years ago
reger | 1737af37cf | Set request originator to own peer in warc importer | 8 years ago
reger | 039162fbf0 | Change warc importer to use defaultsurrogate-crawl profile, as reported | 8 years ago
reger | 077d062be3 | Adjust mergeDocuments to keep youngest last-modified date of document | 8 years ago
luccioman | 654801523e | Fixed StringIndexOutOfBoundsException case. | 8 years ago
luccioman | edd7ccac40 | Added some JavaDoc | 8 years ago
luccioman | 79fdf14b0a | Fixed regression introduced by commit 9ad4d16 | 8 years ago
Michael Peter Christen | 7678fd67e3 | copied fix from yacy_grid_parser for wrong array type | 8 years ago
reger | 9ad4d16829 | Add a responsHeader to the solr index export with a format identifier | 8 years ago
luccioman | 527d494c1a | Fixed "Unchecked conversion" compilation warnings. | 8 years ago
reger | c77e43a391 | Take out mailto collect in internal parsed document | 8 years ago
reger | bec34d3546 | Add url input field as source for WarcImporter | 8 years ago
luccioman | f66438442e | Extended Mediawiki dump import to remote URLs. | 8 years ago