Commit Graph

561 Commits (23f6294a2dcb3ab4601ce0260e1a16e17e7d7b22)

Author SHA1 Message Date
reger 2f51baff4f check for loading error (includes unsupported formats)
9 years ago
reger a3195d78ae add Portuguese month names to date recognition
10 years ago
reger d2cc11ea8f fix html parser taking <style> content as text.
10 years ago
reger 1e8369e18b use a parsed date in Document.toString
10 years ago
reger 41c4eade51 extract modification date from vCard (vcfParser)
10 years ago
reger 8768896975 extract lastmodified from openoffice doc
10 years ago
sixcooler a3dd4be749 added / corrected charset to be 1.7 compatible.
10 years ago
Michael Peter Christen df3314ac1a added a new facet type based on a probabilistic classifier using
10 years ago
Michael Peter Christen 7b412e8c07 added msg (text emails) format; should be handled by html parser.
10 years ago
Michael Peter Christen 90f75c8c3d added enrichment of synonyms and vocabularies for imported documents
10 years ago
Michael Peter Christen 7829480b82 refactoring: separated condenser and tokenizer
10 years ago
Michael Peter Christen 593de05922 enhanced surrogate import process speed (dramatically!)
10 years ago
reger 7478338a40 remove augmented parsing activation from frontend
10 years ago
reger 11aa2edfe1 remove RDFa parser activation from frontend
10 years ago
Michael Peter Christen d0aff91f23 fix for index import
10 years ago
Michael Peter Christen b43811d38c added surrogate import process for exported solr dumps.
10 years ago
reger 8a9622c31c fix string OoB on getImagelinks with long alttext
10 years ago
Michael Peter Christen ff29b0e503 added option to re-index exported xml snapshot dumps to
10 years ago
Michael Peter Christen 6f4fe4b175 revert of 8a7c68e4c7
10 years ago
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
Michael Peter Christen b060ba900d added parsing of contentprop attribute in html tags for
10 years ago
Michael Peter Christen 4cb4f67f38 added parsing of dd, dt and article html fields. The parsed result is
10 years ago
Michael Peter Christen 4d00175157 <experimental> added parsing of <article> html element.
10 years ago
reger 2e8c24e02a fix link to DeReWo download file
10 years ago
Michael Peter Christen 893889bc7b added special terms for on: - Date modifier: tomorrow, today; i.e.:
10 years ago
Michael Peter Christen 535f1ebe3b added a new way of content browsing in search results:
10 years ago
reger 2d2299f484 fix mimetype of rss items in rss parser
10 years ago
Michael Peter Christen b432049d59 enhanced date parsing time
10 years ago
reger a0f04db9ea add extracted description/subject to pptParser
10 years ago
reger 7e35518787 add extracted description/subject to docParser
10 years ago
Michael Peter Christen 1f5b5c0111 NPE fix for latest scraper feature
10 years ago
Michael Peter Christen ee97302a23 hack to make date detection faster (while it becomes a bit incomplete
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen de3e373913 using precompiled CommonPattern.TAB for split
10 years ago
Michael Peter Christen 1f5047b15f using precompiled pattern CommonPattern.SEMICOLON for splits
10 years ago
Michael Peter Christen 69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
10 years ago
reger 5ca0762179 fix: eom on parsing ico file by genericImageParser
10 years ago
Michael Peter Christen 4144c7cc52 do not write frame links to webgraph
10 years ago
reger 3ac1d14a21 improve TextParser.mimeOf(fileextension) by returning 1st defined in supported list.
10 years ago
Michael Peter Christen d2792a43fd do not write iframe and embed links into webgraph, but use them anyway
10 years ago
Michael Peter Christen 6ad43c4a8b removed debug code
10 years ago
Michael Peter Christen 9e588944fa prevent NPE during initialization of very large vocabularies
10 years ago
Michael Peter Christen 8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during
10 years ago
Michael Peter Christen 65125439fe added query modifier 'on'. This makes it possible to search for date
10 years ago
Michael Peter Christen 8b5d074715 fix for image parser (there is a class missing!)
10 years ago
reger 9edc7308aa update to metadata-extractor-2.7.0.jar
10 years ago
Michael Peter Christen bbf0ac40c3 add the actual DateDetection class... (missed in latest commit)
10 years ago
Michael Peter Christen 66b5a56976 Added and integrated new date detection class which can identify date
10 years ago
Michael Peter Christen 6a1865f507 refactoring date -> lastModified
10 years ago
Michael Peter Christen 8df8ffbb6d enhanced the snapshot functionality:
10 years ago