Commit Graph

619 Commits (ebde21079ade3a0a897bc82b44a5462ff43d4dd9)

Author SHA1 Message Date
reger ebde21079a refactor xlsParser to include Excel file attribute (like author) in parser result doc.
9 years ago
reger 27163af0e1 improve detection of referenced links by taking http and https link protocol
9 years ago
reger 9e94989237 upd to PDFBox 2.0.1
9 years ago
reger 24b0fa2a38 extend snapshot Html2Image.pdf2image to use PDFBox image export capability
9 years ago
reger 1d940e5a94 upd commons-compress 1.11
9 years ago
reger 764f5100f0 fix delete of temp file after odt & ooxml parser
9 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
9 years ago
reger 6f0b073bf3 override detected language (statistic langdetect) only with TLD determined
9 years ago
reger b65e2b527d include use of condenser's content text for language detection.
9 years ago
reger 2048b7e057 support scraping start-/enddate from html tag with property "datetime"
9 years ago
reger 900d4584ba complete resource cleanup of lists in contentscraper's close()
9 years ago
reger 1f18653de0 pass parsed swf content through htmlscraper
9 years ago
reger 18ecf57792 add support of compressed swf to swfParser
9 years ago
reger ff27824964 fix swfParser reading file signature
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger 46ac0867ff fix poison mediawikiimporter output queue also after ExecutionException
9 years ago
reger a7591d3ed0 fix mediawikiimporter number format exception on coordinate parsing
9 years ago
reger e84d94f8ca fix mime table for ms office / open office documents
9 years ago
reger 45b9bd8403 adjust MultiProtocolURL.protocol detection to handle mailto with "://" in parameters,
9 years ago
reger 0c5548a7ff fix (todo) remove redundant holding of email link nameproperty in parser document
9 years ago
reger 6b7c10cef8 fix dc:date in mediawikiimporter/document.writexml to use lastmodified
9 years ago
reger 14803d58cd let html scraper accept html5 <link rel="icon"> for favicon links
9 years ago
reger 4d2b934487 prevent mailto links getting into parser result document's in/outbound link collection
9 years ago
luc 8ebefa4233 Fixed MediaWiki import : DCEntry conversion to SolrInputDocument was
9 years ago
luc 7736ee5a42 Updated MediaWimporter main() : display usage in console and stop
9 years ago
luc 27d11f8671 Fixed isSolrDump function : PushBackInputStream was not unread when
9 years ago
Michael Peter Christen 135a123a77 less logging in new language detection
9 years ago
Michael Peter Christen d6e9834040 Merge branch 'master' of
9 years ago
Michael Peter Christen d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
9 years ago
reger e163ea88f6 fix vsdParser (Visio) parser return statement
9 years ago
luc f0478bb14d BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys
9 years ago
reger 0d3c5b223e have psParser cleanup temp file
9 years ago
reger 7d0d19cb8e avoid File.deleteOnExit() on temp files
9 years ago
reger 02e4489a23 set tmpfile.deleteOnExit by default,
9 years ago
reger 52a9040ae6 Sort out double keywords (dc_subject) early in parsed documents
9 years ago
reger 20e18d79f8 harmonize document title for archive parsers
9 years ago
reger 112ae013f4 update bzip and bzip parser process,
9 years ago
reger e76a90837b update zip and tar parser process,
9 years ago
reger 8532565c7d optimize order of parsers to try
9 years ago
reger 5d71fc70e3 fix tarParser early exit on looping content
9 years ago
reger 2fcf6f104c fix bzipParser recognition
9 years ago
reger bbe9df2bb3 fix MediawikiImporter for bz2 dump
9 years ago
reger c6687dd560 fix a system.out to log.fine
9 years ago
Michael Peter Christen ac034db8bc Merge branch 'master' of https://github.com/luccioman/yacy_search_server
9 years ago
luc 5902ce032e Corrected NullPointerException case when ImageIO reader is not found for
9 years ago
reger c6495a5b62 add a log entry on parsing ajax crawling scheme snapshot
9 years ago
reger 9252e36aeb implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content
9 years ago
Michael Peter Christen 7d075a1d76 added log lines
9 years ago
luc d6522fa4a2 Integrated haraldk/TwelveMonkeys library to first add TIF image format
9 years ago
reger 78e8c6f3e5 refactor special handling (static override) of SUPPORTED_EXTENSIONS/MIME_TYPES
9 years ago