588 Commits (f0478bb14d8e5aa8e82b5c3f83e99eb0bb3d4676)

Author SHA1 Message Date
luc f0478bb14d BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys
9 years ago
reger 0d3c5b223e have psParser cleanup temp file
9 years ago
reger 7d0d19cb8e avoid File.deleteOnExit() on temp files
9 years ago
reger 02e4489a23 set tmpfile.deleteOnExit by default,
9 years ago
reger 52a9040ae6 Sort out double keywords (dc_subject) early in parsed documents
9 years ago
reger 20e18d79f8 harmonize document title for archive parsers
9 years ago
reger 112ae013f4 update bzip and bzip parser process,
9 years ago
reger e76a90837b update zip and tar parser process,
9 years ago
reger 8532565c7d optimize order of parsers to try
9 years ago
reger 5d71fc70e3 fix tarParser early exit on looping content
9 years ago
reger 2fcf6f104c fix bzipParser recognition
9 years ago
reger bbe9df2bb3 fix MediawikiImporter for bz2 dump
9 years ago
reger c6687dd560 fix a system.out to log.fine
9 years ago
Michael Peter Christen ac034db8bc Merge branch 'master' of https://github.com/luccioman/yacy_search_server
9 years ago
luc 5902ce032e Corrected NullPointerException case when ImageIO reader is not found for
9 years ago
reger c6495a5b62 add a log entry on parsing ajax crawling scheme snapshot
9 years ago
reger 9252e36aeb implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content
9 years ago
Michael Peter Christen 7d075a1d76 added log lines
9 years ago
luc d6522fa4a2 Integrated haraldk/TwelveMonkeys library to first add TIF image format
9 years ago
reger 78e8c6f3e5 refactor special handling (static override) of SUPPORTED_EXTENSIONS/MIME_TYPES
9 years ago
reger d54c5d310a add links with image extension not automatically to image links.
9 years ago
reger 851e8f6c8a check jpeg file signature in genericImageParser
9 years ago
reger d5330391de remove some unused var allocation in parser
9 years ago
reger 7c82cd4415 add an end condition to svgParser for wrong content
9 years ago
reger 356d4d1301 remove rdfParser from init (current function identical with genericParser)
9 years ago
reger c647d899e3 add svgParser to parse metadata from svg images
9 years ago
reger bad34804fe optimize parseInt for <img> tag attribute parsing
9 years ago
reger 2f51baff4f check for loading error (includes unsupported formats)
10 years ago
reger a3195d78ae add Portuguese month names to date recognition
10 years ago
reger d2cc11ea8f fix html parser taking <style> content as text.
10 years ago
reger 1e8369e18b use a parsed date in Document.toString
10 years ago
reger 41c4eade51 extract modification date from vCard (vcfParser)
10 years ago
reger 8768896975 extract lastmodified from openoffice doc
10 years ago
sixcooler a3dd4be749 added / corrected charset to be 1.7 compatible.
10 years ago
Michael Peter Christen df3314ac1a added a new facet type based on a probabilistic classifier using
10 years ago
Michael Peter Christen 7b412e8c07 added msg (text emails) format; should be handled by html parser.
10 years ago
Michael Peter Christen 90f75c8c3d added enrichment of synonyms and vocabularies for imported documents
10 years ago
Michael Peter Christen 7829480b82 refactoring: separated condenser and tokenizer
10 years ago
Michael Peter Christen 593de05922 enhanced surrogate import process speed (dramatically!)
10 years ago
reger 7478338a40 remove augmented parsing activation from frontend
10 years ago
reger 11aa2edfe1 remove RDFa parser activation from frontend
10 years ago
Michael Peter Christen d0aff91f23 fix for index import
10 years ago
Michael Peter Christen b43811d38c added surrogate import process for exported solr dumps.
10 years ago
reger 8a9622c31c fix string OoB on getImagelinks with long alttext
10 years ago
Michael Peter Christen ff29b0e503 added option to re-index exported xml snapshot dumps to
10 years ago
Michael Peter Christen 6f4fe4b175 revert of 8a7c68e4c7
10 years ago
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
Michael Peter Christen b060ba900d added parsing of contentprop attribute in html tags for
10 years ago
Michael Peter Christen 4cb4f67f38 added parsing of dd, dt and article html fields. The parsed result is
10 years ago
Michael Peter Christen 4d00175157 <experimental> added parsing of <article> html element.
10 years ago