Commit Graph

168 Commits (f8f1959ebb3f96b66e75d7d83cd70ae9714e85bd)

Author SHA1 Message Date
luccioman bf55f1d6e5 Started support of partial parsing on large streamed resources.
7 years ago
luccioman 90a7c1affa HTML parser : removed unnecessary remaining recursive processing
7 years ago
luccioman 9b1bb2545e Refactored plain-text URLs detection implementation.
7 years ago
luccioman 8da3174867 Ensure lower case conversion consistency with any default locale.
7 years ago
luccioman 319231a458 Added a generic XML parser, able to parse elements text and URLs.
7 years ago
luccioman 306a82dd71 Fixed scraper NullPointerException cases on malformed URLs.
8 years ago
Michael Peter Christen 7678fd67e3 copied fix from yacy_grid_parser for wrong array type
8 years ago
reger f254fcfc67 fix htmlParser <script> text extraction on code containing expression
8 years ago
Michael Peter Christen 02d0b3172c Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
8 years ago
Michael Peter Christen d4f45cf05e added dc.date.modified and dc.date.created to date parser
8 years ago
luccioman 6a4d51d8f9 Cleaned up some Javadoc warnings.
8 years ago
reger b522d540b9 Include itemprop latitude/longitude (see schema.org) in attribute
8 years ago
reger 083df255e4 fix html tag attribute parsing containing attribute w/o value
8 years ago
reger cb95b7339a include html5 <time> tag in content scraper,
8 years ago
luccioman 47af33a04c Advanced Crawl from local file : better processing of large files.
8 years ago
luccioman 7717a3d43d Fixed license headers on files created to improve favicon management.
8 years ago
luccioman 6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
8 years ago
luccioman b1b8e69da8 Fixed NullPointerException cases
8 years ago
luc 3cc5619d93 Improved HTML icons indexing and rendering in search results.
9 years ago
reger 2048b7e057 support scraping start-/enddate from html tag with property "datetime"
9 years ago
reger 900d4584ba complet resource cleanup of lists in contentscraper's close()
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger 14803d58cd let html scraper accept html5 <link rel="icon"> for favicon links
9 years ago
reger d54c5d310a add links with image extension not automatically to image links.
9 years ago
reger bad34804fe optimize parseInt for <img> tag attribute parsing
9 years ago
reger d2cc11ea8f fix html parser taking <style> content as text.
9 years ago
Michael Peter Christen fed26f33a8 enhanced timezone managament for indexed data:
10 years ago
Michael Peter Christen b060ba900d added parsing of contentprop attribute in html tags for
10 years ago
Michael Peter Christen 4cb4f67f38 added parsing of dd, dt and article html fields. The parsed result is
10 years ago
Michael Peter Christen 4d00175157 <experimental> added parsing of <article> html element.
10 years ago
Michael Peter Christen 535f1ebe3b added a new way of content browsing in search results:
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen 1f5047b15f using precompiled pattern CommonPattern.SEMICOLON for splits
10 years ago
Michael Peter Christen 4144c7cc52 do not write frame links to webgraph
10 years ago
Michael Peter Christen d2792a43fd do not write iframe and embed links into webgraph, but use them anyway
10 years ago
Michael Peter Christen 8df8ffbb6d enhanced the snapshot functionality:
10 years ago
reger 28456dfc09 skip creation of unused Bluelist contenttransformer
10 years ago
Michael Peter Christen 98f45c9032 fix for image alt attachment to AnchorURLs in html parser.
10 years ago
Michael Peter Christen b44626e55b fixed target_alt_t in webgraph
10 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
10 years ago
Michael Peter Christen f3a6b6e21e fix for bad URL decoding
11 years ago
Michael Peter Christen 8acae852a0 write <em>-tagged texts also into the bold_txt field
11 years ago
orbiter 97983ba89f fixed generics warnings for generic array instantiation that appeared
11 years ago
reger 2eb7682772 add html5 audio/video <source> tag to html content scraper
11 years ago
reger 0b6db04e40 fix contentscraper img height/width parsing
11 years ago
reger 86f6975edc exclude html tags in in/outboundlinks_anchortext_txt parsed text
11 years ago
Michael Peter Christen ce1d1b2fa0 fix for maximum tag length in parser
11 years ago
Michael Peter Christen 67beef657f strong redesign of html parser: object recursion is now made using a
11 years ago
reger 1437c45383 merge rc1/master
11 years ago
reger 5c4ba9b5db merge rc1 master
11 years ago