Commit Graph

390 Commits (86dc198698bcb11b366f6d7e7191ef2ef79de455)

Author SHA1 Message Date
reger 4c9be29a55 fix concurrency issue with htmlParser using not current scraper data
8 years ago
reger b522d540b9 Include itemprop latitude/longitude (see schema.org) in attribute
8 years ago
reger 083df255e4 fix html tag attribute parsing containing attribute w/o value
8 years ago
reger cb95b7339a include html5 <time> tag in content scraper,
8 years ago
reger c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
8 years ago
luccioman 47af33a04c Advanced Crawl from local file : better processing of large files.
8 years ago
luccioman 7717a3d43d Fixed license headers on files created to improve favicon management.
8 years ago
luccioman 6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
8 years ago
reger 14f7577231 add support for older Word versions (Word6/Word95) to docParser
8 years ago
reger efcb6a1e74 fix supported mime XML -> xml for rssParser (mime normalized to lower case for comparison)
8 years ago
luccioman b1b8e69da8 Fixed NullPointerException cases
8 years ago
reger a4465c97d6 as requested, disable/remove old swf parser
8 years ago
Michael Peter Christen 5e165a8150 removed unused imports
8 years ago
reger 4c7a77662a eliminate dependency on file-extension in storeDocument but use supported mime-type
8 years ago
reger ebde21079a refactor xlsParser to include Excel file attribute (like author) in parser result doc.
8 years ago
luccioman 6e96c7341a Merge remote-tracking branch 'origin/master'
8 years ago
reger 9e94989237 upd to PDFBox 2.0.1
9 years ago
reger 24b0fa2a38 extend snapshot Html2Image.pdf2image to use PDFBox image export capability
9 years ago
reger 1d940e5a94 upd commons-compress 1.11
9 years ago
reger 764f5100f0 fix delete of temp file after odt % ooxml parser
9 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
9 years ago
luc 3cc5619d93 Improved HTML icons indexing and rendering in search results.
9 years ago
reger 2048b7e057 support scraping start-/enddate from html tag with property "datetime"
9 years ago
reger 900d4584ba complete resource cleanup of lists in contentscraper's close()
9 years ago
reger 1f18653de0 pass parsed swf content through htmlscraper
9 years ago
reger 18ecf57792 add support of compressed swf to swfParser
9 years ago
reger ff27824964 fix swfParser reading file signature
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger e84d94f8ca fix mime table for ms office / open office documents
9 years ago
reger 14803d58cd let html scraper accept html5 <link rel="icon"> for favicon links
9 years ago
Michael Peter Christen d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
9 years ago
reger e163ea88f6 fix vsdParser (Visio) parser return statement
9 years ago
luc f0478bb14d BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys
9 years ago
reger 0d3c5b223e have psParser cleanup temp file
9 years ago
reger 7d0d19cb8e avoid File.deleteOnExit() on temp files
9 years ago
reger 02e4489a23 set tmpfile.deleteOnExit by default,
9 years ago
reger 20e18d79f8 harmonize document title for archive parsers
9 years ago
reger 112ae013f4 update bzip and bzip parser process,
9 years ago
reger e76a90837b update zip and tar parser process,
9 years ago
reger 5d71fc70e3 fix tarParser early exit on looping content
9 years ago
reger 2fcf6f104c fix bzipParser recognition
9 years ago
reger c6687dd560 fix a system.out to log.fine
9 years ago
reger c6495a5b62 add a log entry on parsing ajax crawling scheme snapshot
9 years ago
reger 9252e36aeb implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content
9 years ago
reger 78e8c6f3e5 refactor special handling (static override) of SUPPORTED_EXTENSIONS/MIME_TYPES
9 years ago
reger d54c5d310a add links with image extension not automatically to image links.
9 years ago
reger 851e8f6c8a check jpeg file signature in genericImageParser
9 years ago
reger d5330391de remove some unused var allocation in parser
9 years ago
reger 7c82cd4415 add a end condition to svgParser for wrong content
9 years ago