Commit Graph

399 Commits (7678fd67e39de38253fd1267f4ac5692f9c096c6)

Author SHA1 Message Date
Michael Peter Christen 7678fd67e3 copied fix from yacy_grid_parser for wrong array type
8 years ago
luccioman 527d494c1a Fixed "Unchecked conversion" compilation warnings.
8 years ago
luccioman f66438442e Extended Mediawiki dump import to remote URLs.
8 years ago
reger 209a7374bd remove unused import pdfParser
8 years ago
reger de1c1c16db Improve pdf text extraction resource handling.
8 years ago
reger f254fcfc67 fix htmlParser <script> text extraction on code containing expression
8 years ago
Michael Peter Christen 02d0b3172c Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
8 years ago
Michael Peter Christen d4f45cf05e added dc.date.modified and dc.date.created to date parser
8 years ago
luccioman 6a4d51d8f9 Cleaned up some Javadoc warnings.
8 years ago
reger 4c9be29a55 fix concurrency issue with htmlParser using not current scraper data
8 years ago
reger b522d540b9 Include itemprop latitude/longitude (see schema.org) in attribute
8 years ago
reger 083df255e4 fix html tag attribute parsing containing attribute w/o value
8 years ago
reger cb95b7339a include html5 <time> tag in content scraper,
8 years ago
reger c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
9 years ago
luccioman 47af33a04c Advanced Crawl from local file : better processing of large files.
9 years ago
luccioman 7717a3d43d Fixed license headers on files created to improve favicon management.
9 years ago
luccioman 6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
9 years ago
reger 14f7577231 add support for older Word versions (Word6/Word95) to docParser
9 years ago
reger efcb6a1e74 fix supported mime XML -> xml for rssParser (mime normalized to lower case for comparison)
9 years ago
luccioman b1b8e69da8 Fixed NullPointerException cases
9 years ago
reger a4465c97d6 as requested, disable/remove old swf parser
9 years ago
Michael Peter Christen 5e165a8150 removed unused imports
9 years ago
reger 4c7a77662a eliminate dependency on file-extension in storeDocument but use supported mime-type
9 years ago
reger ebde21079a refactor xlsParser to include Excel file attribute (like author) in parser result doc.
9 years ago
luccioman 6e96c7341a Merge remote-tracking branch 'origin/master'
9 years ago
reger 9e94989237 upd to PDFBox 2.0.1
9 years ago
reger 24b0fa2a38 extend snapshot Html2Image.pdf2image to use PDFBox image export capability
9 years ago
reger 1d940e5a94 upd commons-compress 1.11
9 years ago
reger 764f5100f0 fix delete of temp file after odt & ooxml parser
9 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
9 years ago
luc 3cc5619d93 Improved HTML icons indexing and rendering in search results.
9 years ago
reger 2048b7e057 support scraping start-/enddate from html tag with property "datetime"
9 years ago
reger 900d4584ba complete resource cleanup of lists in contentscraper's close()
9 years ago
reger 1f18653de0 pass parsed swf content through htmlscraper
9 years ago
reger 18ecf57792 add support of compressed swf to swfParser
9 years ago
reger ff27824964 fix swfParser reading file signature
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger e84d94f8ca fix mime table for ms office / open office documents
9 years ago
reger 14803d58cd let html scraper accept html5 <link rel="icon"> for favicon links
9 years ago
Michael Peter Christen d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
9 years ago
reger e163ea88f6 fix vsdParser (Visio) parser return statement
9 years ago
luc f0478bb14d BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys
10 years ago
reger 0d3c5b223e have psParser cleanup temp file
10 years ago
reger 7d0d19cb8e avoid File.deleteOnExit() on temp files
10 years ago
reger 02e4489a23 set tmpfile.deleteOnExit by default,
10 years ago
reger 20e18d79f8 harmonize document title for archive parsers
10 years ago
reger 112ae013f4 update bzip and bzip parser process,
10 years ago
reger e76a90837b update zip and tar parser process,
10 years ago
reger 5d71fc70e3 fix tarParser early exit on looping content
10 years ago