Commit Graph

448 Commits (8cbc1c970ae158e2b67e9c2e75c3ad9a4f155079)

Author SHA1 Message Date
Michael Peter Christen 77aeb288a2 suppress deprecation warning (for now); TODO: find alternatives
11 years ago
Michael Peter Christen 7603e879dc Merge branch 'master' into HEAD
11 years ago
orbiter 937273d4e3 added parsing of metadata to surrogate reading:
11 years ago
reger effea4bca0 Merge origin/master into jetty
11 years ago
orbiter 61409788eb less word hash computations (removing some overhead because of MD5
11 years ago
reger f111f30ace Merge origin/master into jetty
11 years ago
orbiter 19a051bec8 more monitoring for postprocessing and enhanced layout in Crawler
11 years ago
Michael Peter Christen 1a4a69c226 set more logger to 'final static'
11 years ago
orbiter 4234b0ed6c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
orbiter 909bbb49d8 added (partly commented) test code for url rewrite methods .. to be
11 years ago
reger 1437c45383 merge rc1/master
11 years ago
Michael Peter Christen 81d9e23532 fixed another memory leak in the PDF parser:
11 years ago
Michael Peter Christen a8253ca49c added missing unicode transformation in href link contents during
11 years ago
Michael Peter Christen 60187a4ec2 fix in html parser
11 years ago
reger f017066197 Merge origin/master into jetty
11 years ago
Michael Peter Christen 9bb7eab389 hacks to prevent storage of data longer than necessary during search and
11 years ago
Michael Peter Christen 1b4fa2947d - fixed a problem which ocurred when a document was not recognized with
11 years ago
reger 5c4ba9b5db merge rc1 master
11 years ago
reger 70c51775ae Merge remote-tracking branch 'origin/master' into jetty
11 years ago
orbiter 6e8377b8ad do not check all words with synonym library if the library is empty
11 years ago
Michael Peter Christen 31920385f7 set anchor rel attribute of all links to "nofollow" if the html meta
11 years ago
Michael Peter Christen 57e00baf26 fix for parsing of image links inside of anchor links (image-links)
11 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
reger f7f86d8a5d update to Jetty 9 jars
11 years ago
Michael Peter Christen 35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
11 years ago
Michael Peter Christen 765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
11 years ago
Michael Peter Christen 47b1c81d08 - refactoring
11 years ago
reger b4016ff324 - remove possible double initialization of rdfa parser
11 years ago
Michael Peter Christen 58fe986cca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen cf12835f20 replaced the single-text description solr field with a multi-value
11 years ago
reger 92d3f71b16 htmlParser: closes input stream -> changed it to leave it open for a reset (used by AugmentParser - even if this is practically not used),
11 years ago
reger aa1a1f1d2c - small adjustment to make sure genericParser is tried last
11 years ago
Roland Haeder 841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
11 years ago
Michael Peter Christen 5878c1d599 - refactoring of log to ConcurrentLog:
12 years ago
Michael Peter Christen e6f361f474 adding the canonical tag to crawl queues
12 years ago
reger 83763ee4a4 jpeg parser: extract GPS location from meta data
12 years ago
Michael Peter Christen c4538d8d91 added metadata-extractor-2.6.2.jar to eclipse classpath, removed old lib
12 years ago
reger 3760e2616b bump up lib/metadata-extractor-2.6.2.jar (used for image parser) with needed code adjustments
12 years ago
Michael Peter Christen 16d1d744fa added url_file_name_s in default collection schema for the file name
12 years ago
reger 8d1c4c423d make imageparser fileextension detection case insensitive (extensions are often upper case)
12 years ago
Michael Peter Christen 3e1e358fdc calling pdf cache flush on class initialization because calling of the
12 years ago
Michael Peter Christen 5344a1c5f7 getting the trash out
12 years ago
Michael Peter Christen 8f2d3ce2f9 reduced locking situation in crawler: shifted synchronized location and
12 years ago
reger 97ab5b90e8 - odt & ooxml (office document) parser correction to add content to fulltext index
12 years ago
orbiter 7de5b9cfa0 fix for http://bugs.yacy.net/view.php?id=233
12 years ago
Michael Peter Christen 25499eead5 - added a new field for the regular expression in crawl start
12 years ago
Michael Peter Christen 50421171c3 added new schema fields:
12 years ago
Michael Peter Christen 7ab5093321 added new solr title_exact_signature_l and
12 years ago
orbiter 17ae51e741 increased number of links limitation from 1000 to 10000 for rss feeds
12 years ago