Commit Graph

751 Commits (111cf4864293419233e5a2e3cb87e4c63a0294c1)

Author SHA1 Message Date
sgaebel cdf901270c always use HTTPClient by 'try with resources' pattern to free up
3 years ago
sgaebel 69adaa9f55 makes our HTTPClient closable
3 years ago
Michael Peter Christen 552ab7051b fix for warc importer
3 years ago
Michael Peter Christen e9c5e78868 replaced new Number(Number) with Number.instanceOf
4 years ago
Michael Peter Christen 9ef4503672 fixed some newInstance() warnings
4 years ago
jfhs 10bddc2c2d Decode HTML entities in all property values by default
4 years ago
jfhs 2135d259e3 Replace hardcoded html/xml entities with a file, support decoding all defined HTML entities
4 years ago
Michael Peter Christen d3526c52af fixed a problem in warc importer: do not fail if single WARC entries are
4 years ago
Michael Peter Christen d359d521a1 fixed warc importer
4 years ago
sgaebel fc03c4b4fe removes some warning and unused objects
5 years ago
sgaebel df9ea0a42a removes some warnings: unused imports, params
5 years ago
Michael Peter Christen e0ad8ca9da replaced json library from JSON.org with libandroid-json-java
5 years ago
luccioman e90405b6f0 Support parsing audio URLs without file extension
6 years ago
sgaebel c2398fd890 remove warnings: 'Statement unnecessarily nested within else clause'
6 years ago
sgaebel 811d40a6c4 taking care of closing inputstreams, HTTPClient
6 years ago
luccioman 3fb449b3b6 Properly resolve relative URLs against document URL in html base tags
6 years ago
luccioman fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
6 years ago
luccioman 54fbe166ba Updated pdf cache clear steps consistently with current pdfbox version
7 years ago
luccioman 685122363d Added a parser for XZ compressed archives.
7 years ago
luccioman 8a29551c54 Upgraded the OpenGeoDB dump URL
7 years ago
luccioman bb51555830 Removed remaining unsafe accesses to SimpleDateFormat instances.
7 years ago
luccioman e97580dfc7 Fixed unsafe concurrent access to generic SimpleDateFormat instances
7 years ago
Michael Christen e0dc632020 removed transformer
7 years ago
luccioman fa4399d5d2 Small perf improvement : initialize threads names early when possible
7 years ago
luccioman e357ade47d Reduced memory footprint of text snippet extraction
7 years ago
luccioman e115e57cc7 Reduced text snippet extraction processing time.
7 years ago
luccioman fb3032c530 Added a crawl filtering possibility on documents Media Type (MIME)
7 years ago
luccioman cf62b571bd Added RSS reader support for `enclosure` feed item sub element.
7 years ago
luccioman 3da2739bbd Parse and index more common audio metadata text tag fields.
7 years ago
luccioman 846aba00fa Added parsing of URLs eventually present in audio metadata tags
7 years ago
Michael Peter Christen 187075b878 added nav filter
7 years ago
luccioman bcbd0ae1a4 Enabled partial parsing of audio resources.
7 years ago
luccioman 978e2be95b Let a chance for other parsers on audioTagParser error
7 years ago
luccioman 9e5846a26e Small fix on svg parser error message
7 years ago
luccioman 11611dbdcf Reuse existing File copy function to handle audio parser tmp files
7 years ago
luccioman f77f8f40f9 Factored audio parser tag processing
7 years ago
luccioman 9a7a353d0e Removed some unnecessary intermediate list creation on array copy.
7 years ago
luccioman fb6457f5bc Fixed NPE case when on audio resource parsed with null tag
7 years ago
luccioman c3ff50c17a Updated the list of audio file formats supported by the audioTagParser
7 years ago
luccioman eb20589e29 Fixed issue #158 : completed div CSS class ignore in crawl
7 years ago
luccioman 9412881230 Added basic support for autotagging microdata annotated item types.
7 years ago
luccioman 5a14d34a7d Refactoring : documented and extracted autotagging processing functions.
7 years ago
luccioman 58b9834729 Added HTML microdata typed items parsing capability.
7 years ago
luccioman 733cacdbb8 Revised the RDFaParser main launcher for minimal proper operation.
7 years ago
Michael Peter Christen 25573bd5ab added a crawl filter based on <div> tag class names
7 years ago
luccioman e2f6427a63 Added a basic JUnit test for the Visio parser (vsdParser)
7 years ago
luccioman 1e9cdaabd4 Do locale neutral case conversion of HTML charset name.
7 years ago
luccioman e0eda84c24 Remove old hard-coded holiday dates from DateDection class.
7 years ago
luccioman 46f37e38dc Customized Threads with generic name for easier monitoring.
7 years ago
luccioman 32c9dfa768 Added partial bzip2 stream parsing support and bzipParser Junit test
7 years ago