Commit Graph

55 Commits (45cf553bc3389eca9e3fdebcc1a919db5d26588c)

Author SHA1 Message Date
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
reger b4016ff324 - remove possible double initialization of rdfa parser
11 years ago
reger aa1a1f1d2c - small adjustment to make sure genericParser is tried last
11 years ago
Michael Peter Christen 5878c1d599 - refactoring of log to ConcurrentLog:
12 years ago
Michael Peter Christen 16d1d744fa added url_file_name_s in default collection schema for the file name
12 years ago
Michael Peter Christen b5ee88c6af added more logging to get info which url causes performance problems
12 years ago
Michael Peter Christen 5f0ab25382 removed the option to prevent removal of & parts inside of the
12 years ago
apfelmaennchen 88b062210c Added a parser for audio file tags (e.g. ID3 tags for MP3 files) based
12 years ago
orbiter 482afed07c reduced logging overhead (a bit)
13 years ago
orbiter 0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
13 years ago
Michael Peter Christen 0301aba1e9 removed unused method parameters
13 years ago
Michael Peter Christen ea10766bfd cleaned unnecessary nested code
13 years ago
orbiter 78fc3cf8f8 refactoring and new usage of SentenceReader: this class appeared as one
13 years ago
Michael Peter Christen de903a53a0 parser refactoring & hacks
13 years ago
Michael Peter Christen 786be7d175 better integration of RDFaParser
13 years ago
Michael Peter Christen de3ef8ad73 removed unimportant warnings
13 years ago
Michael Peter Christen 50c576599b allow multiple parser options instead of printing an error
13 years ago
Michael Peter Christen 5fc6524ca8 - moved triple store to net.yacy.cora.lod (should be generalized there
13 years ago
cominch 90512640bf Added config switches for custom parser
13 years ago
cominch bcbd8eee33 Add several parsers, for RDFa and rdf files.
13 years ago
Michael Peter Christen 659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD
13 years ago
Michael Peter Christen a1a5b015d8 refactoring: moved document Classification to cora package
13 years ago
Michael Peter Christen 2e5cd6a1b2 fixed parser extension deny list generation and usage
13 years ago
Michael Peter Christen 7f9b6b7a0c added switches to ConfigParser to accept/deny documents by their
13 years ago
Al Sutton f02ea27b31 Added missing closure of ByteArrayInputSteam
13 years ago
orbiter 85d6bf4ac4 fixed urls to media content during indexing
13 years ago
orbiter 49e5ca579f added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), the all embedded image, audio and video links from all parsed documents are added to the search index as individual document. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled.
13 years ago
sixcooler ce248cc8dd less byte-arrays of response-content, less byte-array <-> stream conversation
13 years ago
orbiter 0c1b29f3c9 - applied many small performance hacks
14 years ago
low012 11ea966f9e *) added SID file (Commodore 64) sound file parser
14 years ago
low012 936e976c23 *) added FreeMind (http://freemind.sourceforge.net/) mindmap parser
14 years ago
orbiter a563b05b60 enhanced crawler:
14 years ago
orbiter 4e2c14efbb fixed bugs in parser and ftp client
14 years ago
orbiter b769cce433 - added a catch-all parser for all documents that cannot be parsed: they will contributed with their document url for the search index only
14 years ago
orbiter 9d080f387e change in handling of the all-visible home path for storage in YaCy:
14 years ago
orbiter e10cd115a9 - added a new RSS reader interface. This is not finished but you can now load and look at RSS feeds. It will be used to index RSS feeds in a way that is appropriate for such kind of data.
14 years ago
orbiter 933dc1a600 removed old rss parser (will be replaced with parser from cora package)
14 years ago
orbiter b6fb239e74 redesign of parser interface:
15 years ago
orbiter 11639aef35 - added new protocol loader for 'file'-type URLs
15 years ago
orbiter 5ab5ac80fe fix for NPE in TextParser
15 years ago
orbiter 3247f0e901 fix for deadlocks caused by self-blocking access to TreeMap in concurrent environments. The TreeMap was replaced by a ConcurrentHashMap and additional care that the strings are compared all in lowercase
15 years ago
orbiter 11983bc936 redesigned some parts of the parser entry point:
15 years ago
orbiter 24e5faee75 added exif parsing for jpg images
15 years ago
lotus 85ca96227f fix for re-enable parser
15 years ago
orbiter 69c29acb6e no exception thread dump if parser cannot parse becuase that mime-type/extension is in the deny-set
15 years ago
orbiter 56e0d9bd01 - testings with image parser
15 years ago
orbiter fbd24c2d84 integrated the torrent parser
15 years ago
orbiter 4a5100789f replaced _all_ size() == 0 with isEmpty() and all size() > 0 with !isEmpty(). The isEmpty() method is much faster in some cases, especially when used to access badly balanced hashtables where an size() operation becomes a large iteration.
15 years ago
orbiter d2938c44a1 - added bmp parser to the document parsers
15 years ago