Commit Graph

547 Commits (ff11ac89f7cc04475ed645c15e8e42c7c8672145)

Author SHA1 Message Date
reger 03a7a29db3 limit OAI import urn resolver try for Deutsche National Library
11 years ago
orbiter b6d57f06eb enhanced the apk parser (up to being production-ready).
11 years ago
orbiter c9e593cf78 removed warnings
11 years ago
reger e9eae45b55 simplify rssreader and improve atom feed link extraction
11 years ago
reger 8f77719091 fix "Ljava.lang.String" in crawl queue anchor name
11 years ago
Michael Peter Christen 98f45c9032 fix for image alt attachment to AnchorURLs in html parser.
11 years ago
orbiter 08409ec680 no idea why the words max was an ordered one. This change increases speed
11 years ago
Michael Peter Christen b44626e55b fixed target_alt_t in webgraph
11 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
11 years ago
Michael Peter Christen e039e78210 small bugfixes
11 years ago
Michael Peter Christen fb3dd56b02 fix for processing of noindex flag in http header
11 years ago
Michael Peter Christen f3a6b6e21e fix for bad URL decoding
11 years ago
Michael Peter Christen aee5b108e5 added linkScraperParser, a parser which ignores the text like the
11 years ago
reger 40133ba2d0 fix NPE in Condenser,
11 years ago
reger cb2c17d236 extract author and keywords in .doc and .ppt parser
11 years ago
orbiter fec673c9d1 Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
orbiter 4a66af716d added apkParser stub (work in progress)
11 years ago
reger 2d67f29244 adjust mergeDocument after parsing to
11 years ago
Michael Peter Christen 0d29b972cc Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
reger 7847a93558 fix AbstractParser.singleList not adding null strings
11 years ago
Michael Peter Christen 8acae852a0 write <em>-tagged texts also into the bold_txt field
11 years ago
reger 3b559e7846 optimize pdfParser
11 years ago
reger 09f73b790f fix pdfParser not closed warning from pdfbox
11 years ago
reger d8d318233e fix logging settings
11 years ago
orbiter 97983ba89f fixed generics warnings for generic array instantiation that appeared
11 years ago
orbiter 88f4af90da removed warnings
11 years ago
reger 8a7c68e4c7 content of surrogates/out never accessed (remove)
11 years ago
reger 2eb7682772 add html5 audio/video <source> tag to html content scraper
11 years ago
reger 0b6db04e40 fix contentscraper img height/width parsing
11 years ago
reger 121d25be38 recover sax fatal error on OAI-PMH import of xml with entity error
11 years ago
reger 86f6975edc exclude html tags in in/outboundlinks_anchortext_txt parsed text
11 years ago
Michael Peter Christen 5746aae3db add canonical links to the same crawldepth, not the next crawldepth
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen ce1d1b2fa0 fix for maximum tag length in parser
11 years ago
Michael Peter Christen 67beef657f strong redesign of html parser: object recursion is now made using a
11 years ago
reger af6ad20728 fix: remove obsolete ref to yacy.home
11 years ago
Michael Peter Christen cca851a417 introduced new solr field crawldepth_i which records the crawl depth of
11 years ago
reger 49e76a1c55 make use of detected charset in htmlParser if none is given.
11 years ago
Michael Peter Christen 8b44fcf0f4 added missing @Override annotation
11 years ago
reger 651d057e93 surrogate import translate dc:language 3-char codes
11 years ago
Michael Peter Christen 453bfd0f17 removed unused variables and warnings
11 years ago
reger 1d01672bd3 fix DCEntry.getIdentifier
11 years ago
reger 6306d28a6a OAI import get multivalued keywords (dc:subject)
11 years ago
reger 5c9dcc269d improve OAI-PMH import identifier recognition
11 years ago
Michael Peter Christen 6e59ca4ebf removed jena library and all code that depended on jena. When jena was
11 years ago
reger bd1685c94a fix not needed getFileExtension().toLower (double)
11 years ago
Michael Peter Christen 022c6d3ce1 do YaCy p2p connections using a timeout-request which covers the http
11 years ago
reger 6932aa4d7a use configured admin-username for api calls
11 years ago
orbiter 3cb6c7861f fixed shutdown authentication problem
11 years ago
Michael Peter Christen 77aeb288a2 suppress deprecation warning (for now); TODO: find alternatives
11 years ago
Michael Peter Christen 7603e879dc Merge branch 'master' into HEAD
11 years ago
orbiter 937273d4e3 added parsing of metadata to surrogate reading:
11 years ago
reger effea4bca0 Merge origin/master into jetty
11 years ago
orbiter 61409788eb less word hash computations (removing some overhead because of MD5
11 years ago
reger f111f30ace Merge origin/master into jetty
11 years ago
orbiter 19a051bec8 more monitoring for postprocessing and enhanced layout in Crawler
11 years ago
Michael Peter Christen 1a4a69c226 set more logger to 'final static'
11 years ago
orbiter 4234b0ed6c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
orbiter 909bbb49d8 added (partly commented) test code for url rewrite methods .. to be
11 years ago
reger 1437c45383 merge rc1/master
11 years ago
Michael Peter Christen 81d9e23532 fixed another memory leak in the PDF parser:
11 years ago
Michael Peter Christen a8253ca49c added missing unicode transformation in href link contents during
11 years ago
Michael Peter Christen 60187a4ec2 fix in html parser
11 years ago
reger f017066197 Merge origin/master into jetty
11 years ago
Michael Peter Christen 9bb7eab389 hacks to prevent storage of data longer than necessary during search and
11 years ago
Michael Peter Christen 1b4fa2947d - fixed a problem which occurred when a document was not recognized with
11 years ago
reger 5c4ba9b5db merge rc1 master
12 years ago
reger 70c51775ae Merge remote-tracking branch 'origin/master' into jetty
12 years ago
orbiter 6e8377b8ad do not check all words with synonym library if the library is empty
12 years ago
Michael Peter Christen 31920385f7 set anchor rel attribute of all links to "nofollow" if the html meta
12 years ago
Michael Peter Christen 57e00baf26 fix for parsing of image links inside of anchor links (image-links)
12 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
12 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
12 years ago
reger f7f86d8a5d update to Jetty 9 jars
12 years ago
Michael Peter Christen 35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
12 years ago
Michael Peter Christen 765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
12 years ago
Michael Peter Christen 47b1c81d08 - refactoring
12 years ago
reger b4016ff324 - remove possible double initialization of rdfa parser
12 years ago
Michael Peter Christen 58fe986cca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
12 years ago
Michael Peter Christen cf12835f20 replaced the single-text description solr field with a multi-value
12 years ago
reger 92d3f71b16 htmlParser: closes input stream -> changed it to leave it open for a reset (used by AugmentParser - even if this is practically not used),
12 years ago
reger aa1a1f1d2c - small adjustment to make sure genericParser is tried last
12 years ago
Roland Haeder 841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
12 years ago
Michael Peter Christen 5878c1d599 - refactoring of log to ConcurrentLog:
12 years ago
Michael Peter Christen e6f361f474 adding the canonical tag to crawl queues
12 years ago
reger 83763ee4a4 jpeg parser: extract GPS location from meta data
12 years ago
Michael Peter Christen c4538d8d91 added metadata-extractor-2.6.2.jar to eclipse classpath, removed old lib
12 years ago
reger 3760e2616b bump up lib/metadata-extractor-2.6.2.jar (used for image parser) with needed code adjustments
12 years ago
Michael Peter Christen 16d1d744fa added url_file_name_s in default collection schema for the file name
12 years ago
reger 8d1c4c423d make imageparser fileextension detection case insensitive (extensions are often upper case)
12 years ago
Michael Peter Christen 3e1e358fdc calling pdf cache flush on class initialization because calling of the
12 years ago
Michael Peter Christen 5344a1c5f7 getting the trash out
12 years ago
Michael Peter Christen 8f2d3ce2f9 reduced locking situation in crawler: shifted synchronized location and
12 years ago
reger 97ab5b90e8 - odt & ooxml (office document) parser correction to add content to fulltext index
12 years ago
orbiter 7de5b9cfa0 fix for http://bugs.yacy.net/view.php?id=233
12 years ago
Michael Peter Christen 25499eead5 - added a new field for the regular expression in crawl start
12 years ago
Michael Peter Christen 50421171c3 added new schema fields:
12 years ago
Michael Peter Christen 7ab5093321 added new solr title_exact_signature_l and
12 years ago
orbiter 17ae51e741 increased number of links limitation from 1000 to 10000 for rss feeds
12 years ago
Michael Peter Christen addba047e2 changes in ranking computation
12 years ago