Commit Graph

288 Commits (66f6797f5216b6353ddd0eafdba8b3a28817722c)

Author SHA1 Message Date
orbiter 97983ba89f fixed generics warnings for generic array instantiation that appeared 11 years ago
orbiter 88f4af90da removed warnings 11 years ago
reger 2eb7682772 add html5 audio/video <source> tag to html content scraper 11 years ago
reger 0b6db04e40 fix contentscraper img height/width parsing 11 years ago
reger 86f6975edc exclude html tags in in/outboundlinks_anchortext_txt parsed text 11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues: 11 years ago
Michael Peter Christen ce1d1b2fa0 fix for maximum tag length in parser 11 years ago
Michael Peter Christen 67beef657f strong redesign of html parser: object recursion is now made using a 11 years ago
reger af6ad20728 fix: remove obsolete ref to yacy.home 11 years ago
reger 49e76a1c55 make use of detected charset in htmlParser if none is given. 11 years ago
Michael Peter Christen 8b44fcf0f4 added missing @Override annotation 11 years ago
reger bd1685c94a fix not needed getFileExtension().toLower (double) 11 years ago
Michael Peter Christen 022c6d3ce1 do YaCy p2p connections using a timeout-request which covers the http 11 years ago
reger 6932aa4d7a use configured admin-username for api calls 11 years ago
orbiter 3cb6c7861f fixed shutdown authentication problem 11 years ago
Michael Peter Christen 77aeb288a2 suppress deprecation warning (for now); TODO: find alternatives 11 years ago
reger f111f30ace Merge origin/master into jetty 11 years ago
orbiter 19a051bec8 more monitoring for postprocessing and enhanced layout in Crawler 11 years ago
reger 1437c45383 merge rc1/master 12 years ago
Michael Peter Christen 81d9e23532 fixed another memory leak in the PDF parser: 12 years ago
Michael Peter Christen a8253ca49c added missing unicode transformation in href link contents during 12 years ago
Michael Peter Christen 60187a4ec2 fix in html parser 12 years ago
reger 5c4ba9b5db merge rc1 master 12 years ago
reger 70c51775ae Merge remote-tracking branch 'origin/master' into jetty 12 years ago
Michael Peter Christen 31920385f7 set anchor rel attribute of all links to "nofollow" if the html meta 12 years ago
Michael Peter Christen 57e00baf26 fix for parsing of image links inside of anchor links (image-links) 12 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables 12 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not 12 years ago
reger f7f86d8a5d update to Jetty 9 jars 12 years ago
Michael Peter Christen 35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in 12 years ago
Michael Peter Christen 765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user 12 years ago
reger b4016ff324 - remove possible double initialization of rdfa parser 12 years ago
Michael Peter Christen 58fe986cca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 12 years ago
Michael Peter Christen cf12835f20 replaced the single-text description solr field with a multi-value 12 years ago
reger 92d3f71b16 htmlParser: closes input stream -> changed it to leave it open for a reset (used by AugmentParser - even if this is practically not used), 12 years ago
reger aa1a1f1d2c - small adjustment to make sure genericParser is tried last 12 years ago
Roland Haeder 841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler 12 years ago
Michael Peter Christen 5878c1d599 - refactoring of log to ConcurrentLog: 12 years ago
reger 83763ee4a4 jpeg parser: extract GPS location from meta data 12 years ago
Michael Peter Christen c4538d8d91 added metadata-extractor-2.6.2.jar to eclipse classpath, removed old lib 12 years ago
reger 3760e2616b bump up lib/metadata-extractor-2.6.2.jar (used for image parser) with needed code adjustments 12 years ago
Michael Peter Christen 16d1d744fa added url_file_name_s in default collection schema for the file name 12 years ago
reger 8d1c4c423d make imageparser fileextension detection case insensitive (extensions are often upper case) 12 years ago
Michael Peter Christen 3e1e358fdc calling pdf cache flush on class initialization because calling of the 12 years ago
Michael Peter Christen 5344a1c5f7 getting the trash out 12 years ago
reger 97ab5b90e8 - odt & ooxml (office document) parser correction to add content to fulltext index 12 years ago
Michael Peter Christen 50421171c3 added new schema fields: 12 years ago
orbiter 17ae51e741 increased number of links limitation from 1000 to 10000 for rss feeds 12 years ago
Michael Peter Christen 788288eb9e added the generation of 50 (!!) new solr field in the core 'webgraph'. 12 years ago
Michael Peter Christen 6a4878940b fix in html parser and bookmark generation 12 years ago