Commit Graph

449 Commits (78bd82f8ef571e1f16758c44a46b8aab83d1eb09)

Author | SHA1 | Message | Date
sgaebel | fc03c4b4fe | removes some warning and unused objects | 5 years ago
sgaebel | df9ea0a42a | removes some warnings: unused imports, params | 5 years ago
luccioman | e90405b6f0 | Support parsing audio URLs without file extension | 6 years ago
sgaebel | 811d40a6c4 | taking care of closing inputstreams, HTTPClient | 6 years ago
luccioman | 3fb449b3b6 | Properly resolve relative URLs against document URL in html base tags | 6 years ago
luccioman | 54fbe166ba | Updated pdf cache clear steps consistently with current pdfbox version | 7 years ago
luccioman | 685122363d | Added a parser for XZ compressed archives. | 7 years ago
Michael Christen | e0dc632020 | removed transformer | 7 years ago
luccioman | fa4399d5d2 | Small perf improvement : initialize threads names early when possible | 7 years ago
luccioman | cf62b571bd | Added RSS reader support for `enclosure` feed item sub element. | 7 years ago
luccioman | 3da2739bbd | Parse and index more common audio metadata text tag fields. | 7 years ago
luccioman | 846aba00fa | Added parsing of URLs eventually present in audio metadata tags | 7 years ago
Michael Peter Christen | 187075b878 | added nav filter | 7 years ago
luccioman | bcbd0ae1a4 | Enabled partial parsing of audio resources. | 7 years ago
luccioman | 978e2be95b | Let a chance for other parsers on audioTagParser error | 7 years ago
luccioman | 9e5846a26e | Small fix on svg parser error message | 7 years ago
luccioman | 11611dbdcf | Reuse existing File copy function to handle audio parser tmp files | 7 years ago
luccioman | f77f8f40f9 | Factored audio parser tag processing | 7 years ago
luccioman | fb6457f5bc | Fixed NPE case when on audio resource parsed with null tag | 7 years ago
luccioman | c3ff50c17a | Updated the list of audio file formats supported by the audioTagParser | 7 years ago
luccioman | eb20589e29 | Fixed issue #158 : completed div CSS class ignore in crawl | 7 years ago
luccioman | 9412881230 | Added basic support for autotagging microdata annotated item types. | 7 years ago
luccioman | 58b9834729 | Added HTML microdata typed items parsing capability. | 7 years ago
luccioman | 733cacdbb8 | Revised the RDFaParser main launcher for minimal proper operation. | 7 years ago
Michael Peter Christen | 25573bd5ab | added a crawl filter based on <div> tag class names | 7 years ago
luccioman | e2f6427a63 | Added a basic JUnit test for the Visio parser (vsdParser) | 7 years ago
luccioman | 1e9cdaabd4 | Do locale neutral case conversion of HTML charset name. | 7 years ago
luccioman | 32c9dfa768 | Added partial bzip2 stream parsing support and bzipParser Junit test | 7 years ago
luccioman | c6ae87168a | Added unit tests on the gzip parser. | 8 years ago
luccioman | 169ffdd1c7 | Finer control on max links to parse in the html parser. | 8 years ago
luccioman | e41d046a9d | Improved parsing support for OOXML spreadsheets (.xlsx) | 8 years ago
luccioman | 780173008e | Implemented partial stream parsing of tar archives. | 8 years ago
luccioman | acab6a6def | Also handle text content when parsing XML within limits. | 8 years ago
luccioman | 5a646540cc | Support parsing gzip files from servers with redundant headers. | 8 years ago
luccioman | eda7b0aeb6 | Merge branch 'master' of https://github.com/yacy/yacy_search_server | 8 years ago
reger | 3005be7349 | Clean up unmaintained and unused AugmentParser trail. | 8 years ago
luccioman | cb4f1358e1 | Added gzip parser support for max content bytes limit | 8 years ago
luccioman | 5216c681a9 | Added HTML parser support for maximum content bytes parsing limit | 8 years ago
luccioman | 651fad6da5 | Added RSS parser support for maximum content bytes parsing limit | 8 years ago
luccioman | 452a17a8d5 | Finer control on bounded input streams with custom stream implementation | 8 years ago
luccioman | f8f1959ebb | Added parsing within bounds implementation to the generic parser. | 8 years ago
luccioman | bf55f1d6e5 | Started support of partial parsing on large streamed resources. | 8 years ago
luccioman | 90a7c1affa | HTML parser : removed unnecessary remaining recursive processing | 8 years ago
luccioman | 9b1bb2545e | Refactored plain-text URLs detection implementation. | 8 years ago
luccioman | 8da3174867 | Ensure lower case conversion consistency with any default locale. | 8 years ago
luccioman | 319231a458 | Added a generic XML parser, able to parse elements text and URLs. | 8 years ago
luccioman | 8399275142 | Properly close file output streams even on exceptions scenarios. | 8 years ago
luccioman | a04feac064 | Ensure file input streams proper closing in both success and failures | 8 years ago
luccioman | d98c04853d | Ensure proper closing of file input streams. | 8 years ago
luccioman | 306a82dd71 | Fixed scraper NullPointerException cases on malformed URLs. | 8 years ago