Commit Graph

60 Commits (9938c8137840d86e38bf0fb35cdb4d4040e5ec83)

Author SHA1 Message Date
Michael Peter Christen fed26f33a8 enhanced timezone managament for indexed data:
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen a1ee101079 recognize more html file extensions
10 years ago
Michael Peter Christen 98f45c9032 fix for image alt attachment to AnchorURLs in html parser.
10 years ago
Michael Peter Christen aee5b108e5 added linkScraperParser, a parser which ignores the text like the
11 years ago
orbiter 88f4af90da removed warnings
11 years ago
Michael Peter Christen 67beef657f strong redesign of html parser: object recursion is now made using a
11 years ago
reger 49e76a1c55 make use of detected charset in htmlParser if none is given.
11 years ago
reger 6932aa4d7a use configured admin-username for api calls
11 years ago
orbiter 3cb6c7861f fixed shutdown authenticaton problem
11 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
Michael Peter Christen 35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
11 years ago
Michael Peter Christen 765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
11 years ago
Michael Peter Christen 58fe986cca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen cf12835f20 replaced the single-text description solr field with a multi-value
11 years ago
reger 92d3f71b16 htmlParser: closes input stream -> changed it to leave it open for a reset (used by AugmentParser - even if this is practically not used),
11 years ago
orbiter 17ae51e741 increased number of links limitation from 1000 to 10000 for rss feeds
12 years ago
Michael Peter Christen 788288eb9e added the generation of 50 (!!) new solr field in the core 'webgraph'.
12 years ago
Michael Peter Christen 6a4878940b fix in html parser and bookmark generation
12 years ago
Michael Peter Christen 528d6763fa - added new solr fields:
12 years ago
Michael Peter Christen 8efc1c1078 - fixed a memory leak (or bad usage) during parsing/snippet fetch
13 years ago
Michael Peter Christen ad09b786bf clean up parser data
13 years ago
Michael Peter Christen 276a66a793 Adding a limit of 1000 links that a parser shall store during indexing.
13 years ago
Michael Peter Christen ce8d4b87d9 fixes for new eclipse 'Juno' warning 'Resource leak'.
13 years ago
Michael Peter Christen 50c576599b allow multiple parser options instead of printing an error
13 years ago
Michael Peter Christen 5fc6524ca8 - moved triple store to net.yacy.cora.lod (should be generalized there
13 years ago
cominch bcbd8eee33 Add several parsers, for RDFa and rdf files.
13 years ago
Michael Peter Christen 8d63a5887c bugfixes
13 years ago
Michael Christen 1f4afb4dc0 performance hacks
13 years ago
Al Sutton d73c84f9a0 Allow initial buffer size definition in TransformWriter, and use available() method to set it in htmlParser. In this situation a ByteArrayInputStream is used so the available() method gives a good size estimation and avoid the buffer needing to be continually grown
13 years ago
orbiter 49e5ca579f added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), the all embedded image, audio and video links from all parsed documents are added to the search index as individual document. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled.
13 years ago
orbiter 610b01e1c3 - added a 'add every media object linked in a html document as a new document' to the html parser. This causes that all image, app, video or audio file that is linked in a html file is added as document. In fact that means that parsing a single html document may cause that a number of documents is inserted into the search index.
13 years ago
orbiter 1c007188ad bugfixes in html parser
13 years ago
sixcooler 9170a434ed throwing an exception again in FileUtils.copy(reader, writer)
13 years ago
orbiter 9248a4eef4 reduce teh effect of 'Bildersuche findet generierte HTML-Seiten als Bilder'
14 years ago
orbiter d8e934c085 better abstraction of http client identification
14 years ago
orbiter b77b8cac0c - enhanced html parser: recognized much more details in the content
14 years ago
orbiter 9b25d07295 - added geo information parsing to html parser
14 years ago
lotus cb6d307bba adding extension for parser
14 years ago
low012 2a6499364d *) minor changes
14 years ago
low012 c0274bd123 *) minor changes
14 years ago
orbiter d2fd93135c - moved yacybot user agent string definition to MultiProtocolURI since there are basic access mechanisms where the bot string is needed
14 years ago
f1ori e670e1ef8e add charset auto-detection for htmlParser
14 years ago
f1ori ddcd5ae78c fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=2989
14 years ago
f1ori 8fe1102452 fix http://forum.yacy-websuche.de/viewtopic.php?p=20889#p18426
14 years ago
orbiter 22047ffad5 enhanced computation speed of many replaceAll string operations
14 years ago
orbiter 0010cd9db1 Support for indexing of RSS feeds!
14 years ago
orbiter 5e7081cd19 refactoring towards a unified loading mechanism for MultiProtocolURIs
14 years ago
orbiter b6fb239e74 redesign of parser interface:
15 years ago