
109 Commits (9938c8137840d86e38bf0fb35cdb4d4040e5ec83)

Author SHA1 Message Date
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
Michael Peter Christen b060ba900d added parsing of contentprop attribute in html tags for
10 years ago
Michael Peter Christen 4cb4f67f38 added parsing of dd, dt and article html fields. The parsed result is
10 years ago
Michael Peter Christen 4d00175157 <experimental> added parsing of <article> html element.
10 years ago
Michael Peter Christen 535f1ebe3b added a new way of content browsing in search results:
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen 4144c7cc52 do not write frame links to webgraph
10 years ago
Michael Peter Christen d2792a43fd do not write iframe and embed links into webgraph, but use them anyway
10 years ago
Michael Peter Christen 98f45c9032 fix for image alt attachment to AnchorURLs in html parser.
10 years ago
Michael Peter Christen b44626e55b fixed target_alt_t in webgraph
10 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
10 years ago
Michael Peter Christen 8acae852a0 write <em>-tagged texts also into the bold_txt field
11 years ago
orbiter 97983ba89f fixed generics warnings for generic array instantiation that appeared
11 years ago
reger 2eb7682772 add html5 audio/video <source> tag to html content scraper
11 years ago
reger 0b6db04e40 fix contentscraper img height/width parsing
11 years ago
reger 86f6975edc exclude html tags in in/outboundlinks_anchortext_txt parsed text
11 years ago
Michael Peter Christen ce1d1b2fa0 fix for maximum tag length in parser
11 years ago
Michael Peter Christen 67beef657f strong redesign of html parser: object recursion is now made using a
11 years ago
reger 1437c45383 merge rc1/master
11 years ago
reger 5c4ba9b5db merge rc1 master
11 years ago
reger 70c51775ae Merge remote-tracking branch 'origin/master' into jetty
11 years ago
Michael Peter Christen 31920385f7 set anchor rel attribute of all links to "nofollow" if the html meta
11 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
reger f7f86d8a5d update to Jetty 9 jars
11 years ago
Michael Peter Christen 35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
11 years ago
Michael Peter Christen cf12835f20 replaced the single-text description solr field with a multi-value
11 years ago
Roland Haeder 841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
11 years ago
Michael Peter Christen 5878c1d599 - refactoring of log to ConcurrentLog:
12 years ago
Michael Peter Christen 16d1d744fa added url_file_name_s in default collection schema for the file name
12 years ago
Michael Peter Christen 50421171c3 added new schema fields:
12 years ago
Michael Peter Christen 788288eb9e added the generation of 50 (!!) new solr fields in the core 'webgraph'.
12 years ago
orbiter 5dfd6359cb redesign of the QueryParams class: introduced QueryGoal which holds the
12 years ago
Michael Peter Christen b7ac1da6a3 gsa results shall have only one title in metadata and that should be the
12 years ago
Michael Peter Christen 5f0ab25382 removed the option to prevent removal of &amp; parts inside of the
12 years ago
orbiter 68d0f8de03 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1
12 years ago
reger bfb0d4c69b - add language detection from <html lang="xx"> tag
12 years ago
Michael Peter Christen 7e3e45fd04 added Open Graph Metadata default fields, see http://ogp.me/ns#
12 years ago
Michael Peter Christen c3e5f667a7 added schema.org breadcrumb counter to parser and solr schema
12 years ago
Michael Peter Christen 411d0e839b added an underline text field to solr to record all underlined texts
12 years ago
Michael Peter Christen e54ac38095 - some corrections in usage of getFile() and getFileName()
12 years ago
Michael Peter Christen 528d6763fa - added new solr fields:
12 years ago
orbiter 0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
13 years ago
Michael Peter Christen b1e7c11fba fix for pattern matcher in html parser
13 years ago
orbiter 7f851d62a7 replaced HashARC with SizeLimited Objects which are less costly
13 years ago
orbiter 78fc3cf8f8 refactoring and new usage of SentenceReader: this class appeared as one
13 years ago
Michael Peter Christen ad09b786bf clean up parser data
13 years ago
Michael Peter Christen 276a66a793 Adding a limit of 1000 links that a parser shall store during indexing.
13 years ago
Michael Peter Christen de903a53a0 parser refactoring & hacks
13 years ago
Michael Peter Christen 508a81b86c added solr field 'refresh_s' which stores the refresh url contained in
13 years ago