780 Commits (6d5e9ff53f4090e24a0cbe601df5665fb10b6ddf)

Author SHA1 Message Date
luccioman 319231a458 Added a generic XML parser, able to parse elements text and URLs.
7 years ago
luccioman d2a4a27f52 Improved stream-oriented parsing entering conditions.
8 years ago
luccioman ce89492319 Ensure system resource release by closing document stream.
8 years ago
luccioman 8399275142 Properly close file output streams even on exceptions scenarios.
8 years ago
luccioman a04feac064 Ensure file input streams proper closing in both success and failures
8 years ago
luccioman d98c04853d Ensure proper closing of file input streams.
8 years ago
luccioman 306a82dd71 Fixed scraper NullPointerException cases on malformed URLs.
8 years ago
reger 1737af37cf Set request originator to own peer in warc importer
8 years ago
reger 039162fbf0 Change warc importer to use defaultsurrogate-crawl profile, as reported
8 years ago
reger 077d062be3 Adjust mergeDocuments to keep youngest last-modified date of document
8 years ago
luccioman 654801523e Fixed StringIndexOutOfBoundsException case.
8 years ago
luccioman edd7ccac40 Added some JavaDoc
8 years ago
luccioman 79fdf14b0a Fixed regression introduced by commit 9ad4d16
8 years ago
Michael Peter Christen 7678fd67e3 copied fix from yacy_grid_parser for wrong array type
8 years ago
reger 9ad4d16829 Add a responsHeader to the solr index export with a format identifier
8 years ago
luccioman 527d494c1a Fixed "Unchecked conversion" compilation warnings.
8 years ago
reger c77e43a391 Take out mailto collect in internal parsed document
8 years ago
reger bec34d3546 Add url input field as source for WarcImporter
8 years ago
luccioman f66438442e Extended Mediawiki dump import to remote URLs.
8 years ago
reger ba339a2a45 Add servlet to import warc file from filesystem IndexImportWarc_p.html.
8 years ago
reger 510f11d374 Implement surrogate import from Warc archives (as first option handle
8 years ago
reger 209a7374bd remove unused import pdfParser
8 years ago
reger de1c1c16db Improve pdf text extraction resource handling.
8 years ago
reger 18c7563dbe Extend DCEntry.getLanguage convert to ISO639-1 codes for more languages
8 years ago
reger f254fcfc67 fix htmlParser <script> text extraction on code containing expression
8 years ago
Michael Peter Christen 02d0b3172c Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
8 years ago
Michael Peter Christen d4f45cf05e added dc.date.modified and dc.date.created to date parser
8 years ago
reger df80c57842 add ukr and pol to DCEntry.getLanguage ISO639-2 3-char language code
8 years ago
luccioman 6a4d51d8f9 Cleaned up some Javadoc warnings.
8 years ago
reger 4c9be29a55 fix concurrency issue with htmlParser using not current scraper data
8 years ago
reger b522d540b9 Include itemprop latitude/longitude (see schema.org) in attribute
8 years ago
reger 083df255e4 fix html tag attribute parsing containing attribute w/o value
8 years ago
reger cb95b7339a include html5 <time> tag in content scraper,
8 years ago
reger c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
8 years ago
luccioman d27adc2b92 Fixed language detector initialization and NullPointerException cases.
8 years ago
luccioman 3f561c1635 Fixed a NullPointerException case.
8 years ago
luccioman 3092a8ced5 Fixed thread name consistency for improved monitoring.
8 years ago
luccioman eec5779889 Added a name prefix to pooled threads for easier monitoring.
8 years ago
reger 8fe28a83f2 harmonize used lastmodified date for rwi and fulltext in storeDocument
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
8 years ago
luccioman 47af33a04c Advanced Crawl from local file : better processing of large files.
8 years ago
luccioman 7717a3d43d Fixed license headers on files created to improve favicon management.
8 years ago
luccioman 6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
8 years ago
reger b752bcfecb adjust date in text detection to ignore some program version strings
8 years ago
reger b017e97421 optimize condenser language detection a little.
8 years ago
reger ae3717d087 adjust Tokenizer sentence count to ignore repeated punktuation (like !!!! )
8 years ago
reger 474f0476c6 adjust Tokenizer sentence count on trailing text after last recognized sentence
8 years ago
reger 14f7577231 add support for older Word versions (Word6/Word95) to docParser
8 years ago
reger 1a79c64495 generalize DateDetection with holiday date rules readily available in icu
8 years ago
reger 6f68f08354 correct DateDetection Silvester date
8 years ago
reger efcb6a1e74 fix supported mime XML -> xml for rssParser (mime normalized to lower case for comparison)
8 years ago
luccioman b1b8e69da8 Fixed NullPointerException cases
8 years ago
reger a4465c97d6 as requested, disable/remove old swf parser
8 years ago
reger 96467c5467 remove not needed counter in Tokeninzer (completing last changes)
8 years ago
reger 272cdd496a reactivate sentence counter in WordTokenizer for phrasepos ranking,
8 years ago
Michael Peter Christen 5e165a8150 removed unused imports
8 years ago
reger e310ec5f70 fix posInText ranking calculation to score 0 on no position info
8 years ago
reger 4c7a77662a eleminate dependency on file-extension in storeDocument but use supported mime-type
8 years ago
reger ebde21079a refactor xlsParser to include Excel file attribute (like author) in parser result doc.
8 years ago
reger 27163af0e1 improve detection of referenced links by taking http and https link protocol
8 years ago
luccioman 6e96c7341a Merge remote-tracking branch 'origin/master'
8 years ago
reger 9e94989237 upd to PDFBox 2.0.1
9 years ago
reger 24b0fa2a38 extend snapshot Html2Image.pdf2image to use PDFBox image export capability
9 years ago
reger 1d940e5a94 upd commons-compress 1.11
9 years ago
reger 764f5100f0 fix delete of temp file after odt % ooxml parser
9 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader methode.
9 years ago
luc 9f712146df Display icons in ViewFile "links" mode.
9 years ago
reger 6f0b073bf3 override detected language (statistic langdetect) only with TLD determided
9 years ago
reger b65e2b527d include use of condenser's content text for language detection.
9 years ago
luc 3cc5619d93 Improved HTML icons indexing and rendering in search results.
9 years ago
reger 2048b7e057 support scraping start-/enddate from html tag with property "datetime"
9 years ago
reger 900d4584ba complet resource cleanup of lists in contentscraper's close()
9 years ago
reger 1f18653de0 pass parsed swf content trough htmlscraper
9 years ago
reger 18ecf57792 add support of compressed swf to swfParser
9 years ago
reger ff27824964 fix swfParser reading file signature
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger 46ac0867ff fix poison mediawikiimporter output queue also after ExecutionException
9 years ago
reger a7591d3ed0 fix mediawikiimporter number format exception on coordinate parsing
9 years ago
reger e84d94f8ca fix mime table for ms office / open office documents
9 years ago
reger 45b9bd8403 adjust MultiProtocolURL.protocol detection to handle mailto with "://" in parameters,
9 years ago
reger 0c5548a7ff fix (todo) remove redundant holding of email link nameproperty in parser document
9 years ago
reger 6b7c10cef8 fix dc:date in mediawikiimporter/document.writexml to use lastmodified
9 years ago
reger 14803d58cd let html scraper accept html5 <link rel="icon"> for favicon links
9 years ago
reger 4d2b934487 prevent mailto links getting into parser result document's in/outbound link collection
9 years ago
luc 8ebefa4233 Fixed MediaWiki import : DCEntry conversion to SolrInputDocument was
9 years ago
luc 7736ee5a42 Updated MediaWimporter main() : display usage in console and stop
9 years ago
luc 27d11f8671 Fixed isSolrDump function : PushBackInputStream was not unread when
9 years ago
Michael Peter Christen 135a123a77 less logging in new language detection
9 years ago
Michael Peter Christen d6e9834040 Merge branch 'master' of
9 years ago
Michael Peter Christen d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
9 years ago
reger e163ea88f6 fix vsdParser (Visio) parser return statement
9 years ago
luc f0478bb14d BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys
9 years ago
reger 0d3c5b223e have psParser cleanup temp file
9 years ago
reger 7d0d19cb8e avoid File.deleteOnExit() on temp files
9 years ago
reger 02e4489a23 set tmpfile.deleteOnExit by default,
9 years ago
reger 52a9040ae6 Sort out double keywords (dc_subject) early in parsed documents
9 years ago
reger 20e18d79f8 harmonize document title for archive parsers
9 years ago
reger 112ae013f4 update bzip and bzip parser process,
9 years ago
reger e76a90837b update zip and tar parser process,
9 years ago
reger 8532565c7d optimize order of parsers to try
9 years ago
reger 5d71fc70e3 fix tarParser early exit on looping content
9 years ago
reger 2fcf6f104c fix bzipParser recognition
9 years ago
reger bbe9df2bb3 fix MediawikiImporter for bz2 dump
9 years ago
reger c6687dd560 fix a system.out to log.fine
9 years ago
Michael Peter Christen ac034db8bc Merge branch 'master' of https://github.com/luccioman/yacy_search_server
9 years ago
luc 5902ce032e Corrected NullPointerException case when ImageIO reader is not found for
9 years ago
reger c6495a5b62 add a log entry on parsing ajax crawling scheme snapshot
9 years ago
reger 9252e36aeb implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content
9 years ago
Michael Peter Christen 7d075a1d76 added log lines
9 years ago
luc d6522fa4a2 Integrated haraldk/TwelveMonkeys library to first add TIF image format
9 years ago
reger 78e8c6f3e5 refactor special handling (static override) of SUPPORTED_EXTENSIONS/MIME_TYPES
9 years ago
reger d54c5d310a add links with image extension not automatically to image links.
9 years ago
reger 851e8f6c8a check jpeg file signature in genericImageParser
9 years ago
reger d5330391de remove some unused var allocation in parser
9 years ago
reger 7c82cd4415 add a end condition to svgParser for wrong content
9 years ago
reger 356d4d1301 remove rdfParser from init (current function identical with genericParser)
9 years ago
reger c647d899e3 add svgParser to parse metadate from svg images
9 years ago
reger bad34804fe optimize parseInt for <img> tag attribute parsing
9 years ago
reger 2f51baff4f check for loading error (includs unsupported formats)
9 years ago
reger a3195d78ae add Portuguese month names to date recognition
9 years ago
reger d2cc11ea8f fix html parser taking <style> content as text.
9 years ago
reger 1e8369e18b use a parsed date in Document.toString
9 years ago
reger 41c4eade51 extract modification date from vCard (vcfParser)
9 years ago
reger 8768896975 extract lastmodified from openoffice doc
9 years ago
sixcooler a3dd4be749 added / corrected charste to be 1.7 compatible.
9 years ago
Michael Peter Christen df3314ac1a added a new facet type based on a probabilistic classifier using
9 years ago
Michael Peter Christen 7b412e8c07 added msg (text emails) format; should be handled by html parser.
9 years ago
Ryszard Goń 59096935d0 Use language-detection library for increased accuracy
10 years ago
Michael Peter Christen 90f75c8c3d added enrichment of synonyms and vocabularies for imported documents
10 years ago
Michael Peter Christen 7829480b82 refactoring: separated condenser and tokenizer
10 years ago
Michael Peter Christen 593de05922 enhanced surrogate import process speed (dramatically!)
10 years ago
reger 7478338a40 remove augmented parsing activation from frontend
10 years ago
reger 11aa2edfe1 remove RDFa parser activation from frontend
10 years ago
Michael Peter Christen d0aff91f23 fix for index import
10 years ago
Michael Peter Christen b43811d38c added surrogate import process for exported solr dumps.
10 years ago
reger 8a9622c31c fix string OoB on getImagelinks with long alttext
10 years ago
Michael Peter Christen ff29b0e503 added option to re-index exported xml snapshot dumps to
10 years ago
Michael Peter Christen 6f4fe4b175 revert of 8a7c68e4c7
10 years ago
Michael Peter Christen fed26f33a8 enhanced timezone managament for indexed data:
10 years ago
Michael Peter Christen b060ba900d added parsing of contentprop attribute in html tags for
10 years ago
Michael Peter Christen 4cb4f67f38 added parsing of dd, dt and article html fields. The parsed result is
10 years ago
Michael Peter Christen 4d00175157 <experimental> added parsing of <article> html element.
10 years ago
reger 2e8c24e02a fix link to DeReWo download file
10 years ago
Michael Peter Christen 893889bc7b added special terms for on: - Date modifier: tomorrow, today; i.e.:
10 years ago
Michael Peter Christen 535f1ebe3b added a new way of content browsing in search results:
10 years ago
reger 2d2299f484 fix mimetype of rss items in rss parser
10 years ago
Michael Peter Christen b432049d59 enhanced date parsing time
10 years ago
reger a0f04db9ea add extracted description/subject to pptParser
10 years ago
reger 7e35518787 add extracted description/subject to docParser
10 years ago
Michael Peter Christen 1f5b5c0111 npe fix for latest scraper feature
10 years ago