Commit Graph

617 Commits (fcc29c36f0ec8dfe34f266c239078b7b3b651fe2)

Author SHA1 Message Date
reger 9e94989237 upd to PDFBox 2.0.1 9 years ago
reger 24b0fa2a38 extend snapshot Html2Image.pdf2image to use PDFBox image export capability 9 years ago
reger 1d940e5a94 upd commons-compress 1.11 9 years ago
reger 764f5100f0 fix delete of temp file after odt & ooxml parser 9 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method. 9 years ago
reger 6f0b073bf3 override detected language (statistic langdetect) only with TLD determined 9 years ago
reger b65e2b527d include use of condenser's content text for language detection. 9 years ago
reger 2048b7e057 support scraping start-/enddate from html tag with property "datetime" 9 years ago
reger 900d4584ba complete resource cleanup of lists in contentscraper's close() 9 years ago
reger 1f18653de0 pass parsed swf content through htmlscraper 9 years ago
reger 18ecf57792 add support of compressed swf to swfParser 9 years ago
reger ff27824964 fix swfParser reading file signature 9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded 9 years ago
reger 46ac0867ff fix poison mediawikiimporter output queue also after ExecutionException 9 years ago
reger a7591d3ed0 fix mediawikiimporter number format exception on coordinate parsing 9 years ago
reger e84d94f8ca fix mime table for ms office / open office documents 9 years ago
reger 45b9bd8403 adjust MultiProtocolURL.protocol detection to handle mailto with "://" in parameters, 9 years ago
reger 0c5548a7ff fix (todo) remove redundant holding of email link nameproperty in parser document 9 years ago
reger 6b7c10cef8 fix dc:date in mediawikiimporter/document.writexml to use lastmodified 9 years ago
reger 14803d58cd let html scraper accept html5 <link rel="icon"> for favicon links 9 years ago
reger 4d2b934487 prevent mailto links getting into parser result document's in/outbound link collection 9 years ago
luc 8ebefa4233 Fixed MediaWiki import : DCEntry conversion to SolrInputDocument was 9 years ago
luc 7736ee5a42 Updated MediaWimporter main() : display usage in console and stop 9 years ago
luc 27d11f8671 Fixed isSolrDump function : PushBackInputStream was not unread when 9 years ago
Michael Peter Christen 135a123a77 less logging in new language detection 9 years ago
Michael Peter Christen d6e9834040 Merge branch 'master' of 9 years ago
Michael Peter Christen d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server 9 years ago
reger e163ea88f6 fix vsdParser (Visio) parser return statement 9 years ago
luc f0478bb14d BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys 9 years ago
reger 0d3c5b223e have psParser cleanup temp file 9 years ago
reger 7d0d19cb8e avoid File.deleteOnExit() on temp files 9 years ago
reger 02e4489a23 set tmpfile.deleteOnExit by default, 9 years ago
reger 52a9040ae6 Sort out double keywords (dc_subject) early in parsed documents 9 years ago
reger 20e18d79f8 harmonize document title for archive parsers 9 years ago
reger 112ae013f4 update bzip and bzip parser process, 9 years ago
reger e76a90837b update zip and tar parser process, 9 years ago
reger 8532565c7d optimize order of parsers to try 9 years ago
reger 5d71fc70e3 fix tarParser early exit on looping content 9 years ago
reger 2fcf6f104c fix bzipParser recognition 9 years ago
reger bbe9df2bb3 fix MediawikiImporter for bz2 dump 10 years ago
reger c6687dd560 fix a system.out to log.fine 10 years ago
Michael Peter Christen ac034db8bc Merge branch 'master' of https://github.com/luccioman/yacy_search_server 10 years ago
luc 5902ce032e Corrected NullPointerException case when ImageIO reader is not found for 10 years ago
reger c6495a5b62 add a log entry on parsing ajax crawling scheme snapshot 10 years ago
reger 9252e36aeb implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content 10 years ago
Michael Peter Christen 7d075a1d76 added log lines 10 years ago
luc d6522fa4a2 Integrated haraldk/TwelveMonkeys library to first add TIF image format 10 years ago
reger 78e8c6f3e5 refactor special handling (static override) of SUPPORTED_EXTENSIONS/MIME_TYPES 10 years ago
reger d54c5d310a add links with image extension not automatically to image links. 10 years ago
reger 851e8f6c8a check jpeg file signature in genericImageParser 10 years ago