Commit Graph

376 Commits (7d5ba2afa4810fc52dc229a82a9020768fb6c150)

Author SHA1 Message Date
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen de3e373913 using precompiled CommonPattern.TAB for split
10 years ago
Michael Peter Christen 1f5047b15f using precompiled pattern CommonPattern.SEMICOLON for splits
10 years ago
Michael Peter Christen 69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
10 years ago
reger 5ca0762179 fix: eom on parsing ico file by genericImageParser
10 years ago
Michael Peter Christen 4144c7cc52 do not write frame links to webgraph
10 years ago
Michael Peter Christen d2792a43fd do not write iframe and embed links into webgraph, but use them anyway
10 years ago
Michael Peter Christen 6ad43c4a8b removed debug code
10 years ago
Michael Peter Christen 8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during
10 years ago
Michael Peter Christen 8b5d074715 fix for image parser (there is a class missing!)
10 years ago
reger 9edc7308aa update to metadata-extractor-2.7.0.jar
10 years ago
Michael Peter Christen 66b5a56976 Added and integrated new date detection class which can identify date
10 years ago
Michael Peter Christen 8df8ffbb6d enhanced the snapshot functionality:
10 years ago
reger 28456dfc09 skip creation of unused Bluelist contenttransformer
10 years ago
Michael Peter Christen a1ee101079 recognize more html file extensions
10 years ago
Michael Peter Christen 6a2a669db4 added loading of the synonyms file from addon/synonyms into the
10 years ago
Michael Peter Christen 07c5b57953 removed warnings
10 years ago
reger 59c6532a65 add link extraction to pdfParser
10 years ago
reger aa2e15d846 allow url parameter in worktable apicall
10 years ago
reger b0c87d8240 fix image search expand box, cut-off of 2nd capture line height
10 years ago
Michael Peter Christen 3073c69aee Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
reger eaccce3467 added metadataImageParser for tif and psd (Photoshop) images.
10 years ago
reger a69f5358ff use javax ImageIO getReader to add supported image extension/mime
10 years ago
Michael Peter Christen 67cd4c37bd activated the new apk parser which was already ready but not included in
10 years ago
orbiter b6d57f06eb enhanced the apk parser (up to being production-ready).
10 years ago
reger e9eae45b55 simplify rssreader and improve atom feed link extraction
10 years ago
Michael Peter Christen 98f45c9032 fix for image alt attachment to AnchorURLs in html parser.
10 years ago
Michael Peter Christen b44626e55b fixed target_alt_t in webgraph
10 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
10 years ago
Michael Peter Christen e039e78210 small bugfixes
10 years ago
Michael Peter Christen f3a6b6e21e fix for bad URL decoding
11 years ago
Michael Peter Christen aee5b108e5 added linkScraperParser, a parser which ignores the text like the
11 years ago
reger cb2c17d236 extract author and keywords in .doc and .ppt parser
11 years ago
orbiter fec673c9d1 Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
orbiter 4a66af716d added apkParser stub (work in progress)
11 years ago
Michael Peter Christen 8acae852a0 write <em>-tagged texts also into the bold_txt field
11 years ago
reger 3b559e7846 optimize pdfParser
11 years ago
reger 09f73b790f fix pdfParser not closed warning from pdfbox
11 years ago
orbiter 97983ba89f fixed generics warnings for generic array instantiation that appeared
11 years ago
orbiter 88f4af90da removed warnings
11 years ago
reger 2eb7682772 add html5 audio/video <source> tag to html content scraper
11 years ago
reger 0b6db04e40 fix contentscraper img height/width parsing
11 years ago
reger 86f6975edc exclude html tags in in/outboundlinks_anchortext_txt parsed text
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen ce1d1b2fa0 fix for maximum tag length in parser
11 years ago
Michael Peter Christen 67beef657f strong redesign of html parser: object recursion is now made using a
11 years ago
reger af6ad20728 fix: remove obsolete ref to yacy.home
11 years ago
reger 49e76a1c55 make use of detected charset in htmlParser if none is given.
11 years ago
Michael Peter Christen 8b44fcf0f4 added missing @Override annotation
11 years ago
reger bd1685c94a fix not needed getFileExtension().toLower (double)
11 years ago
Michael Peter Christen 022c6d3ce1 do YaCy p2p connections using a timeout-request which covers the http
11 years ago
reger 6932aa4d7a use configured admin-username for api calls
11 years ago
orbiter 3cb6c7861f fixed shutdown authentication problem
11 years ago
Michael Peter Christen 77aeb288a2 suppress deprecation warning (for now); TODO: find alternatives
11 years ago
reger f111f30ace Merge origin/master into jetty
11 years ago
orbiter 19a051bec8 more monitoring for postprocessing and enhanced layout in Crawler
11 years ago
reger 1437c45383 merge rc1/master
11 years ago
Michael Peter Christen 81d9e23532 fixed another memory leak in the PDF parser:
11 years ago
Michael Peter Christen a8253ca49c added missing unicode transformation in href link contents during
11 years ago
Michael Peter Christen 60187a4ec2 fix in html parser
11 years ago
reger 5c4ba9b5db merge rc1 master
11 years ago
reger 70c51775ae Merge remote-tracking branch 'origin/master' into jetty
11 years ago
Michael Peter Christen 31920385f7 set anchor rel attribute of all links to "nofollow" if the html meta
11 years ago
Michael Peter Christen 57e00baf26 fix for parsing of image links inside of anchor links (image-links)
11 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
reger f7f86d8a5d update to Jetty 9 jars
11 years ago
Michael Peter Christen 35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
11 years ago
Michael Peter Christen 765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
11 years ago
reger b4016ff324 - remove possible double initialization of rdfa parser
11 years ago
Michael Peter Christen 58fe986cca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen cf12835f20 replaced the single-text description solr field with a multi-value
11 years ago
reger 92d3f71b16 htmlParser: closes input stream -> changed it to leave it open for a reset (used by AugmentParser - even if this is practically not used),
11 years ago
reger aa1a1f1d2c - small adjustment to make sure genericParser is tried last
11 years ago
Roland Haeder 841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
11 years ago
Michael Peter Christen 5878c1d599 - refactoring of log to ConcurrentLog:
12 years ago
reger 83763ee4a4 jpeg parser: extract GPS location from meta data
12 years ago
Michael Peter Christen c4538d8d91 added metadata-extractor-2.6.2.jar to eclipse classpath, removed old lib
12 years ago
reger 3760e2616b bump up lib/metadata-extractor-2.6.2.jar (used for image parser) with needed code adjustments
12 years ago
Michael Peter Christen 16d1d744fa added url_file_name_s in default collection schema for the file name
12 years ago
reger 8d1c4c423d make imageparser fileextension detection case insensitive (extensions are often upper case)
12 years ago
Michael Peter Christen 3e1e358fdc calling pdf cache flush on class initialization because calling of the
12 years ago
Michael Peter Christen 5344a1c5f7 getting the trash out
12 years ago
reger 97ab5b90e8 - odt & ooxml (office document) parser correction to add content to fulltext index
12 years ago
Michael Peter Christen 50421171c3 added new schema fields:
12 years ago
orbiter 17ae51e741 increased number of links limitation from 1000 to 10000 for rss feeds
12 years ago
Michael Peter Christen 788288eb9e added the generation of 50 (!!) new solr field in the core 'webgraph'.
12 years ago
Michael Peter Christen 6a4878940b fix in html parser and bookmark generation
12 years ago
reger 168b1d130d Adding heuristic to get search results from configured systems which support opensearch specification
12 years ago
Michael Peter Christen 95712fdc8b update to pdf parser
12 years ago
Michael Peter Christen f5ca5cea44 - added field options to all solr queries. This can be used to restrict
12 years ago
orbiter 5dfd6359cb redesign of the QueryParams class: introduced QueryGoal which holds the
12 years ago
Michael Peter Christen d88eb657fd Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1
12 years ago
Michael Peter Christen a33e2742cb - removed unnecessary synchronized and deadlock in crawler
12 years ago
reger 722a447b0d - optimize code of augmented parsing to enhance document tags
12 years ago
Michael Peter Christen b991685782 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1
12 years ago
Michael Peter Christen b7ac1da6a3 gsa results shall have only one title in metadata and that should be the
12 years ago
reger 87aab9aa7c - fix: with augmented parsing = on; missing metadata in index (like title) due to overwriting metadata by adding multiple result docs from augmentparser with same url
12 years ago
Michael Peter Christen 21fe8339b4 - enhanced generation of url objects
12 years ago
Michael Peter Christen 5f0ab25382 removed the option to prevent removal of &amp; parts inside of the
12 years ago