Commit Graph

29 Commits (f8f1959ebb3f96b66e75d7d83cd70ae9714e85bd)

Author SHA1 Message Date
luccioman 6a4d51d8f9 Cleaned up some Javadoc warnings.
8 years ago
reger 14f7577231 add support for older Word versions (Word6/Word95) to docParser
8 years ago
reger 06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
9 years ago
luc 571bc55937 Refactoring: use StandardCharsets constants instead of hard-coded
9 years ago
reger 8768896975 extract lastmodified from openoffice doc
9 years ago
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
reger 7e35518787 add extracted description/subject to docParser
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen 69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
10 years ago
reger cb2c17d236 extract author and keywords in .doc and .ppt parser
11 years ago
Michael Peter Christen 77aeb288a2 suppress deprecation warning (for now); TODO: find alternatives
11 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
Michael Peter Christen 35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
11 years ago
Roland Haeder 841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
11 years ago
Michael Peter Christen 528d6763fa - added new solr fields:
12 years ago
orbiter 78fc3cf8f8 refactoring and new usage of SentenceReader: this class appeared as one
13 years ago
Michael Peter Christen 5fc6524ca8 - moved triple store to net.yacy.cora.lod (should be generalized there
13 years ago
orbiter b77b8cac0c - enhanced html parser: recognized much more details in the content
14 years ago
orbiter 9b25d07295 - added geo information parsing to html parser
14 years ago
orbiter 694fa3a2a5 - replaced more direct string-based UTF-8 conversions by predefined UTF-8 conversion
14 years ago
orbiter e1b6916423 always try to guess the size of a StringBuilder to prevent too many memory re-allocations
14 years ago
low012 c0274bd123 *) minor changes
14 years ago
orbiter 0010cd9db1 Support for indexing of RSS feeds!
14 years ago
orbiter b6fb239e74 redesign of parser interface:
15 years ago
orbiter 11639aef35 - added new protocol loader for 'file'-type URLs
15 years ago
orbiter cf43bdc87e This is a large bugfix and enhancement commit to support a better location detection for data
15 years ago
orbiter 54af9e6b49 - added parsing of robots meta-tag in html headers to detect a noindexing request
15 years ago
orbiter b79f4f062f refactoring of yacy documents and parsers: they depend now only on the kelondro classes
15 years ago