Commit Graph

684 Commits (2a87b08cea67f8f2ae46e318c1c3945e8520ec53)

Author SHA1 Message Date
luccioman 90a7c1affa HTML parser : removed unnecessary remaining recursive processing
7 years ago
luccioman 9b1bb2545e Refactored plain-text URLs detection implementation.
7 years ago
luccioman 8da3174867 Ensure lower case conversion consistency with any default locale.
7 years ago
luccioman 286f3018bd Made mime type and extension normalization locale independent.
7 years ago
luccioman 319231a458 Added a generic XML parser, able to parse elements text and URLs.
7 years ago
luccioman d2a4a27f52 Improved stream-oriented parsing entering conditions.
8 years ago
luccioman ce89492319 Ensure system resource release by closing document stream.
8 years ago
luccioman 8399275142 Properly close file output streams even on exceptions scenarios.
8 years ago
luccioman a04feac064 Ensure file input streams proper closing in both success and failures
8 years ago
luccioman d98c04853d Ensure proper closing of file input streams.
8 years ago
luccioman 306a82dd71 Fixed scraper NullPointerException cases on malformed URLs.
8 years ago
reger 1737af37cf Set request originator to own peer in warc importer
8 years ago
reger 039162fbf0 Change warc importer to use defaultsurrogate-crawl profile, as reported
8 years ago
reger 077d062be3 Adjust mergeDocuments to keep youngest last-modified date of document
8 years ago
luccioman 654801523e Fixed StringIndexOutOfBoundsException case.
8 years ago
luccioman edd7ccac40 Added some JavaDoc
8 years ago
luccioman 79fdf14b0a Fixed regression introduced by commit 9ad4d16
8 years ago
Michael Peter Christen 7678fd67e3 copied fix from yacy_grid_parser for wrong array type
8 years ago
reger 9ad4d16829 Add a responsHeader to the solr index export with a format identifier
8 years ago
luccioman 527d494c1a Fixed "Unchecked conversion" compilation warnings.
8 years ago
reger c77e43a391 Take out mailto collect in internal parsed document
8 years ago
reger bec34d3546 Add url input field as source for WarcImporter
8 years ago
luccioman f66438442e Extended Mediawiki dump import to remote URLs.
8 years ago
reger ba339a2a45 Add servlet to import warc file from filesystem IndexImportWarc_p.html.
8 years ago
reger 510f11d374 Implement surrogate import from Warc archives (as first option handle
8 years ago
reger 209a7374bd remove unused import pdfParser
8 years ago
reger de1c1c16db Improve pdf text extraction resource handling.
8 years ago
reger 18c7563dbe Extend DCEntry.getLanguage convert to ISO639-1 codes for more languages
8 years ago
reger f254fcfc67 fix htmlParser <script> text extraction on code containing expression
8 years ago
Michael Peter Christen 02d0b3172c Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
8 years ago
Michael Peter Christen d4f45cf05e added dc.date.modified and dc.date.created to date parser
8 years ago
reger df80c57842 add ukr and pol to DCEntry.getLanguage ISO639-2 3-char language code
8 years ago
luccioman 6a4d51d8f9 Cleaned up some Javadoc warnings.
8 years ago
reger 4c9be29a55 fix concurrency issue with htmlParser using not current scraper data
8 years ago
reger b522d540b9 Include itemprop latitude/longitude (see schema.org) in attribute
8 years ago
reger 083df255e4 fix html tag attribute parsing containing attribute w/o value
8 years ago
reger cb95b7339a include html5 <time> tag in content scraper,
8 years ago
reger c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
8 years ago
luccioman d27adc2b92 Fixed language detector initialization and NullPointerException cases.
8 years ago
luccioman 3f561c1635 Fixed a NullPointerException case.
8 years ago
luccioman 3092a8ced5 Fixed thread name consistency for improved monitoring.
8 years ago
luccioman eec5779889 Added a name prefix to pooled threads for easier monitoring.
8 years ago
reger 8fe28a83f2 harmonize used lastmodified date for rwi and fulltext in storeDocument
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
8 years ago
luccioman 47af33a04c Advanced Crawl from local file : better processing of large files.
8 years ago
luccioman 7717a3d43d Fixed license headers on files created to improve favicon management.
8 years ago
luccioman 6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
8 years ago
reger b752bcfecb adjust date in text detection to ignore some program version strings
8 years ago
reger b017e97421 optimize condenser language detection a little.
8 years ago
reger ae3717d087 adjust Tokenizer sentence count to ignore repeated punktuation (like !!!! )
8 years ago