Commit Graph

769 Commits (09783ae89eb11688d86e4a3c1c539fb8566fde49)

Author SHA1 Message Date
luccioman 978e2be95b Let a chance for other parsers on audioTagParser error
7 years ago
luccioman 9e5846a26e Small fix on svg parser error message
7 years ago
luccioman 11611dbdcf Reuse existing File copy function to handle audio parser tmp files
7 years ago
luccioman f77f8f40f9 Factored audio parser tag processing
7 years ago
luccioman 9a7a353d0e Removed some unnecessary intermediate list creation on array copy.
7 years ago
luccioman fb6457f5bc Fixed NPE case when on audio resource parsed with null tag
7 years ago
luccioman c3ff50c17a Updated the list of audio file formats supported by the audioTagParser
7 years ago
luccioman eb20589e29 Fixed issue #158 : completed div CSS class ignore in crawl
7 years ago
luccioman 9412881230 Added basic support for autotagging microdata annotated item types.
7 years ago
luccioman 5a14d34a7d Refactoring : documented and extracted autotagging processing functions.
7 years ago
luccioman 58b9834729 Added HTML microdata typed items parsing capability.
7 years ago
luccioman 733cacdbb8 Revised the RDFaParser main launcher for minimal proper operation.
7 years ago
Michael Peter Christen 25573bd5ab added a crawl filter based on <div> tag class names
7 years ago
luccioman e2f6427a63 Added a basic JUnit test for the Visio parser (vsdParser)
7 years ago
luccioman 1e9cdaabd4 Do locale neutral case conversion of HTML charset name.
7 years ago
luccioman e0eda84c24 Remove old hard-coded holiday dates from DateDection class.
7 years ago
luccioman 46f37e38dc Customized Threads with generic name for easier monitoring.
7 years ago
luccioman 32c9dfa768 Added partial bzip2 stream parsing support and bzipParser Junit test
7 years ago
luccioman c6ae87168a Added unit tests on the gzip parser.
7 years ago
luccioman 169ffdd1c7 Finer control on max links to parse in the html parser.
7 years ago
luccioman e41d046a9d Improved parsing support for OOXML spreadsheets (.xlsx)
7 years ago
reger 51a4e03c93 Allow to stop currently running warc import (stop button)
7 years ago
luccioman 780173008e Implemented partial stream parsing of tar archives.
7 years ago
luccioman acab6a6def Also handle text content when parsing XML within limits.
7 years ago
luccioman 8a94fef9e0 Prevent unwanted cached bytes duplication on stream parsing.
7 years ago
luccioman 5a646540cc Support parsing gzip files from servers with redundant headers.
7 years ago
luccioman eda7b0aeb6 Merge branch 'master' of https://github.com/yacy/yacy_search_server
7 years ago
reger 3005be7349 Clean up unmaintained and unused AugmentParser trail.
7 years ago
luccioman cb4f1358e1 Added gzip parser support for max content bytes limit
7 years ago
luccioman 5216c681a9 Added HTML parser support for maximum content bytes parsing limit
7 years ago
luccioman 651fad6da5 Added RSS parser support for maximum content bytes parsing limit
7 years ago
luccioman 452a17a8d5 Finer control on bounded input streams with custom stream implementation
7 years ago
luccioman f8f1959ebb Added parsing within bounds implementation to the generic parser.
7 years ago
luccioman e0f400a0bd Support trying multiple parsers even when streaming on large resources.
7 years ago
luccioman bf55f1d6e5 Started support of partial parsing on large streamed resources.
7 years ago
luccioman 90a7c1affa HTML parser : removed unnecessary remaining recursive processing
7 years ago
luccioman 9b1bb2545e Refactored plain-text URLs detection implementation.
7 years ago
luccioman 8da3174867 Ensure lower case conversion consistency with any default locale.
7 years ago
luccioman 286f3018bd Made mime type and extension normalization locale independent.
7 years ago
luccioman 319231a458 Added a generic XML parser, able to parse elements text and URLs.
7 years ago
luccioman d2a4a27f52 Improved stream-oriented parsing entering conditions.
8 years ago
luccioman ce89492319 Ensure system resource release by closing document stream.
8 years ago
luccioman 8399275142 Properly close file output streams even on exceptions scenarios.
8 years ago
luccioman a04feac064 Ensure file input streams proper closing in both success and failures
8 years ago
luccioman d98c04853d Ensure proper closing of file input streams.
8 years ago
luccioman 306a82dd71 Fixed scraper NullPointerException cases on malformed URLs.
8 years ago
reger 1737af37cf Set request originator to own peer in warc importer
8 years ago
reger 039162fbf0 Change warc importer to use defaultsurrogate-crawl profile, as reported
8 years ago
reger 077d062be3 Adjust mergeDocuments to keep youngest last-modified date of document
8 years ago
luccioman 654801523e Fixed StringIndexOutOfBoundsException case.
8 years ago
luccioman edd7ccac40 Added some JavaDoc
8 years ago
luccioman 79fdf14b0a Fixed regression introduced by commit 9ad4d16
8 years ago
Michael Peter Christen 7678fd67e3 copied fix from yacy_grid_parser for wrong array type
8 years ago
reger 9ad4d16829 Add a responsHeader to the solr index export with a format identifier
8 years ago
luccioman 527d494c1a Fixed "Unchecked conversion" compilation warnings.
8 years ago
reger c77e43a391 Take out mailto collect in internal parsed document
8 years ago
reger bec34d3546 Add url input field as source for WarcImporter
8 years ago
luccioman f66438442e Extended Mediawiki dump import to remote URLs.
8 years ago
reger ba339a2a45 Add servlet to import warc file from filesystem IndexImportWarc_p.html.
8 years ago
reger 510f11d374 Implement surrogate import from Warc archives (as first option handle
8 years ago
reger 209a7374bd remove unused import pdfParser
8 years ago
reger de1c1c16db Improve pdf text extraction resource handling.
8 years ago
reger 18c7563dbe Extend DCEntry.getLanguage convert to ISO639-1 codes for more languages
8 years ago
reger f254fcfc67 fix htmlParser <script> text extraction on code containing expression
8 years ago
Michael Peter Christen 02d0b3172c Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
8 years ago
Michael Peter Christen d4f45cf05e added dc.date.modified and dc.date.created to date parser
8 years ago
reger df80c57842 add ukr and pol to DCEntry.getLanguage ISO639-2 3-char language code
8 years ago
luccioman 6a4d51d8f9 Cleaned up some Javadoc warnings.
8 years ago
reger 4c9be29a55 fix concurrency issue with htmlParser using not current scraper data
8 years ago
reger b522d540b9 Include itemprop latitude/longitude (see schema.org) in attribute
8 years ago
reger 083df255e4 fix html tag attribute parsing containing attribute w/o value
8 years ago
reger cb95b7339a include html5 <time> tag in content scraper,
8 years ago
reger c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
8 years ago
luccioman d27adc2b92 Fixed language detector initialization and NullPointerException cases.
8 years ago
luccioman 3f561c1635 Fixed a NullPointerException case.
8 years ago
luccioman 3092a8ced5 Fixed thread name consistency for improved monitoring.
8 years ago
luccioman eec5779889 Added a name prefix to pooled threads for easier monitoring.
8 years ago
reger 8fe28a83f2 harmonize used lastmodified date for rwi and fulltext in storeDocument
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
8 years ago
luccioman 47af33a04c Advanced Crawl from local file : better processing of large files.
8 years ago
luccioman 7717a3d43d Fixed license headers on files created to improve favicon management.
8 years ago
luccioman 6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
8 years ago
reger b752bcfecb adjust date in text detection to ignore some program version strings
8 years ago
reger b017e97421 optimize condenser language detection a little.
8 years ago
reger ae3717d087 adjust Tokenizer sentence count to ignore repeated punktuation (like !!!! )
8 years ago
reger 474f0476c6 adjust Tokenizer sentence count on trailing text after last recognized sentence
8 years ago
reger 14f7577231 add support for older Word versions (Word6/Word95) to docParser
8 years ago
reger 1a79c64495 generalize DateDetection with holiday date rules readily available in icu
8 years ago
reger 6f68f08354 correct DateDetection Silvester date
8 years ago
reger efcb6a1e74 fix supported mime XML -> xml for rssParser (mime normalized to lower case for comparison)
8 years ago
luccioman b1b8e69da8 Fixed NullPointerException cases
8 years ago
reger a4465c97d6 as requested, disable/remove old swf parser
8 years ago
reger 96467c5467 remove not needed counter in Tokeninzer (completing last changes)
8 years ago
reger 272cdd496a reactivate sentence counter in WordTokenizer for phrasepos ranking,
8 years ago
Michael Peter Christen 5e165a8150 removed unused imports
8 years ago
reger e310ec5f70 fix posInText ranking calculation to score 0 on no position info
8 years ago
reger 4c7a77662a eleminate dependency on file-extension in storeDocument but use supported mime-type
8 years ago
reger ebde21079a refactor xlsParser to include Excel file attribute (like author) in parser result doc.
8 years ago
reger 27163af0e1 improve detection of referenced links by taking http and https link protocol
8 years ago
luccioman 6e96c7341a Merge remote-tracking branch 'origin/master'
8 years ago