Commit Graph

625 Commits (a4465c97d6b222253f34bc2b4e4d4ad276f645a3)

Author SHA1 Message Date
reger a4465c97d6 as requested, disable/remove old swf parser
9 years ago
reger 96467c5467 remove not needed counter in Tokenizer (completing last changes)
9 years ago
reger 272cdd496a reactivate sentence counter in WordTokenizer for phrasepos ranking,
9 years ago
Michael Peter Christen 5e165a8150 removed unused imports
9 years ago
reger e310ec5f70 fix posInText ranking calculation to score 0 on no position info
9 years ago
reger 4c7a77662a eliminate dependency on file-extension in storeDocument but use supported mime-type
9 years ago
reger ebde21079a refactor xlsParser to include Excel file attribute (like author) in parser result doc.
9 years ago
reger 27163af0e1 improve detection of referenced links by taking http and https link protocol
9 years ago
reger 9e94989237 upd to PDFBox 2.0.1
9 years ago
reger 24b0fa2a38 extend snapshot Html2Image.pdf2image to use PDFBox image export capability
9 years ago
reger 1d940e5a94 upd commons-compress 1.11
9 years ago
reger 764f5100f0 fix delete of temp file after odt & ooxml parser
9 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
9 years ago
reger 6f0b073bf3 override detected language (statistic langdetect) only with TLD determined
9 years ago
reger b65e2b527d include use of condenser's content text for language detection.
9 years ago
reger 2048b7e057 support scraping start-/enddate from html tag with property "datetime"
9 years ago
reger 900d4584ba complete resource cleanup of lists in contentscraper's close()
9 years ago
reger 1f18653de0 pass parsed swf content through htmlscraper
9 years ago
reger 18ecf57792 add support of compressed swf to swfParser
9 years ago
reger ff27824964 fix swfParser reading file signature
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger 46ac0867ff fix poison mediawikiimporter output queue also after ExecutionException
9 years ago
reger a7591d3ed0 fix mediawikiimporter number format exception on coordinate parsing
9 years ago
reger e84d94f8ca fix mime table for ms office / open office documents
9 years ago
reger 45b9bd8403 adjust MultiProtocolURL.protocol detection to handle mailto with "://" in parameters,
9 years ago
reger 0c5548a7ff fix (todo) remove redundant holding of email link nameproperty in parser document
9 years ago
reger 6b7c10cef8 fix dc:date in mediawikiimporter/document.writexml to use lastmodified
9 years ago
reger 14803d58cd let html scraper accept html5 <link rel="icon"> for favicon links
9 years ago
reger 4d2b934487 prevent mailto links getting into parser result document's in/outbound link collection
9 years ago
luc 8ebefa4233 Fixed MediaWiki import : DCEntry conversion to SolrInputDocument was
9 years ago
luc 7736ee5a42 Updated MediaWimporter main() : display usage in console and stop
9 years ago
luc 27d11f8671 Fixed isSolrDump function : PushBackInputStream was not unread when
9 years ago
Michael Peter Christen 135a123a77 less logging in new language detection
9 years ago
Michael Peter Christen d6e9834040 Merge branch 'master' of
10 years ago
Michael Peter Christen d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
10 years ago
reger e163ea88f6 fix vsdParser (Visio) parser return statement
10 years ago
luc f0478bb14d BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys
10 years ago
reger 0d3c5b223e have psParser cleanup temp file
10 years ago
reger 7d0d19cb8e avoid File.deleteOnExit() on temp files
10 years ago
reger 02e4489a23 set tmpfile.deleteOnExit by default,
10 years ago
reger 52a9040ae6 Sort out double keywords (dc_subject) early in parsed documents
10 years ago
reger 20e18d79f8 harmonize document title for archive parsers
10 years ago
reger 112ae013f4 update bzip and bzip parser process,
10 years ago
reger e76a90837b update zip and tar parser process,
10 years ago
reger 8532565c7d optimize order of parsers to try
10 years ago
reger 5d71fc70e3 fix tarParser early exit on looping content
10 years ago
reger 2fcf6f104c fix bzipParser recognition
10 years ago
reger bbe9df2bb3 fix MediawikiImporter for bz2 dump
10 years ago
reger c6687dd560 fix a system.out to log.fine
10 years ago
Michael Peter Christen ac034db8bc Merge branch 'master' of https://github.com/luccioman/yacy_search_server
10 years ago