Commit Graph

83 Commits (b882f8590062e563f0c3a3e22d295bb552abe12d)

Author SHA1 Message Date
Michael Peter Christen 25573bd5ab added a crawl filter based on <div> tag class names
7 years ago
luccioman 8a94fef9e0 Prevent unwanted cached bytes duplication on stream parsing.
7 years ago
luccioman 5a646540cc Support parsing gzip files from servers with redundant headers.
7 years ago
luccioman 452a17a8d5 Finer control on bounded input streams with custom stream implementation
7 years ago
luccioman e0f400a0bd Support trying multiple parsers even when streaming on large resources.
7 years ago
luccioman bf55f1d6e5 Started support of partial parsing on large streamed resources.
7 years ago
luccioman 286f3018bd Made mime type and extension normalization locale independent.
7 years ago
luccioman 319231a458 Added a generic XML parser, able to parse elements text and URLs.
7 years ago
luccioman d2a4a27f52 Improved stream-oriented parsing entering conditions.
8 years ago
luccioman ce89492319 Ensure system resource release by closing document stream.
8 years ago
luccioman 6a4d51d8f9 Cleaned up some Javadoc warnings.
8 years ago
reger a4465c97d6 as requested, disable/remove old swf parser
8 years ago
Michael Peter Christen 5e165a8150 removed unused imports
8 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
9 years ago
reger 8532565c7d optimize order of parsers to try
9 years ago
reger 356d4d1301 remove rdfParser from init (current function identical with genericParser)
9 years ago
reger c647d899e3 add svgParser to parse metadata from svg images
9 years ago
reger 7478338a40 remove augmented parsing activation from frontend
10 years ago
reger 11aa2edfe1 remove RDFa parser activation from frontend
10 years ago
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen 69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
10 years ago
Michael Peter Christen 3073c69aee Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
reger eaccce3467 added metadataImageParser for tif and psd (Photoshop) images.
10 years ago
Michael Peter Christen 67cd4c37bd activated the new apk parser which was already ready but not included in
10 years ago
Michael Peter Christen aee5b108e5 added linkScraperParser, a parser which ignores the text like the
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
reger bd1685c94a fix not needed getFileExtension().toLower (double)
11 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
reger b4016ff324 - remove possible double initialization of rdfa parser
11 years ago
reger aa1a1f1d2c - small adjustment to make sure genericParser is tried last
11 years ago
Michael Peter Christen 5878c1d599 - refactoring of log to ConcurrentLog:
12 years ago
Michael Peter Christen 16d1d744fa added url_file_name_s in default collection schema for the file name
12 years ago
Michael Peter Christen b5ee88c6af added more logging to get info which url causes performance problems
12 years ago
Michael Peter Christen 5f0ab25382 removed the option to prevent removal of &amp; parts inside of the
12 years ago
apfelmaennchen 88b062210c Added a parser for audio file tags (e.g. ID3 tags for MP3 files) based
12 years ago
orbiter 482afed07c reduced logging overhead (a bit)
13 years ago
orbiter 0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
13 years ago
Michael Peter Christen 0301aba1e9 removed unused method parameters
13 years ago
Michael Peter Christen ea10766bfd cleaned unnecessary nested code
13 years ago
orbiter 78fc3cf8f8 refactoring and new usage of SentenceReader: this class appeared as one
13 years ago
Michael Peter Christen de903a53a0 parser refactoring & hacks
13 years ago
Michael Peter Christen 786be7d175 better integration of RDFaParser
13 years ago
Michael Peter Christen de3ef8ad73 removed unimportant warnings
13 years ago
Michael Peter Christen 50c576599b allow multiple parser options instead of printing an error
13 years ago
Michael Peter Christen 5fc6524ca8 - moved triple store to net.yacy.cora.lod (should be generalized there
13 years ago
cominch 90512640bf Added config switches for custom parser
13 years ago
cominch bcbd8eee33 Add several parsers, for RDFa and rdf files.
13 years ago
Michael Peter Christen 659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD
13 years ago