Commit Graph

39 Commits (28b30231c32731e393133430a81b65b1d36cc1f0)

Author SHA1 Message Date
Michael Peter Christen 8efc1c1078 - fixed a memory leak (or bad usage) during parsing/snippet fetch
13 years ago
Michael Peter Christen ad09b786bf clean up parser data
13 years ago
Michael Peter Christen 276a66a793 Adding a limit of 1000 links that a parser shall store during indexing.
13 years ago
Michael Peter Christen ce8d4b87d9 fixes for new eclipse 'Juno' warning 'Resource leak'.
13 years ago
Michael Peter Christen 50c576599b allow multiple parser options instead of printing an error
13 years ago
Michael Peter Christen 5fc6524ca8 - moved triple store to net.yacy.cora.lod (should be generalized there
13 years ago
cominch bcbd8eee33 Add several parsers, for RDFa and rdf files.
13 years ago
Michael Peter Christen 8d63a5887c bugfixes
13 years ago
Michael Christen 1f4afb4dc0 performance hacks
13 years ago
Al Sutton d73c84f9a0 Allow initial buffer size definition in TransformWriter, and use available() method to set it in htmlParser. In this situation a ByteArrayInputStream is used so the available() method gives a good size estimation and avoid the buffer needing to be continually grown
13 years ago
orbiter 49e5ca579f added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), the all embedded image, audio and video links from all parsed documents are added to the search index as individual document. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled.
13 years ago
orbiter 610b01e1c3 - added a 'add every media object linked in a html document as a new document' to the html parser. This causes that all image, app, video or audio file that is linked in a html file is added as document. In fact that means that parsing a single html document may cause that a number of documents is inserted into the search index.
13 years ago
orbiter 1c007188ad bugfixes in html parser
13 years ago
sixcooler 9170a434ed throwing an exception again in FileUtils.copy(reader, writer)
13 years ago
orbiter 9248a4eef4 reduce teh effect of 'Bildersuche findet generierte HTML-Seiten als Bilder'
14 years ago
orbiter d8e934c085 better abstraction of http client identification
14 years ago
orbiter b77b8cac0c - enhanced html parser: recognized much more details in the content
14 years ago
orbiter 9b25d07295 - added geo information parsing to html parser
14 years ago
lotus cb6d307bba adding extension for parser
14 years ago
low012 2a6499364d *) minor changes
14 years ago
low012 c0274bd123 *) minor changes
14 years ago
orbiter d2fd93135c - moved yacybot user agent string definition to MultiProtocolURI since there are basic access mechanisms where the bot string is needed
14 years ago
f1ori e670e1ef8e add charset auto-detection for htmlParser
14 years ago
f1ori ddcd5ae78c fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=2989
14 years ago
f1ori 8fe1102452 fix http://forum.yacy-websuche.de/viewtopic.php?p=20889#p18426
14 years ago
orbiter 22047ffad5 enhanced computation speed of many replaceAll string operations
14 years ago
orbiter 0010cd9db1 Support for indexing of RSS feeds!
14 years ago
orbiter 5e7081cd19 refactoring towards a unified loading mechanism for MultiProtocolURIs
14 years ago
orbiter b6fb239e74 redesign of parser interface:
15 years ago
orbiter 87087f12fe - scanned remote search process and enhanced some data structure and synchronizations here and there
15 years ago
orbiter de4f30bb2e UTF-8 fix
15 years ago
orbiter 3a1cebb598 bugfixes
15 years ago
orbiter 11639aef35 - added new protocol loader for 'file'-type URLs
15 years ago
orbiter cf43bdc87e This is a large bugfix and enhancement commit to support a better location detection for data
15 years ago
orbiter 54af9e6b49 - added parsing of robots meta-tag in html headers to detect a noindexing request
15 years ago
lotus 38a3d55afd added more possible php extensions for html
15 years ago
orbiter 7d400b17d0 html parser support for .cfm files
15 years ago
orbiter 007f8297de added php3 as extension type for html
15 years ago
orbiter b79f4f062f refactoring of yacy documents and parsers: they depend now only on the kelondro classes
15 years ago