Commit Graph

742 Commits (836953bd5bcc53699888c420e5b66f702629c320)

Author SHA1 Message Date
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
Michael Peter Christen b060ba900d added parsing of contentprop attribute in html tags for
10 years ago
Michael Peter Christen 4cb4f67f38 added parsing of dd, dt and article html fields. The parsed result is
10 years ago
Michael Peter Christen 4d00175157 <experimental> added parsing of <article> html element.
10 years ago
reger 2e8c24e02a fix link to DeReWo download file
10 years ago
Michael Peter Christen 893889bc7b added special terms for on: - Date modifier: tomorrow, today; i.e.:
10 years ago
Michael Peter Christen 535f1ebe3b added a new way of content browsing in search results:
10 years ago
reger 2d2299f484 fix mimetype of rss items in rss parser
10 years ago
Michael Peter Christen b432049d59 enhanced date parsing time
10 years ago
reger a0f04db9ea add extracted description/subject to pptParser
10 years ago
reger 7e35518787 add extracted description/subject to docParser
10 years ago
Michael Peter Christen 1f5b5c0111 npe fix for latest scraper feature
10 years ago
Michael Peter Christen ee97302a23 hack to make date detection faster (while it becomes a bit incomplete
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen de3e373913 using precompiled CommonPattern.TAB for split
10 years ago
Michael Peter Christen 1f5047b15f using precompiled pattern CommonPattern.SEMICOLON for splits
10 years ago
Michael Peter Christen 69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
10 years ago
reger 5ca0762179 fix: eom on parsing ico file by genericImageParser
10 years ago
Michael Peter Christen 4144c7cc52 do not write frame links to webgraph
10 years ago
reger 3ac1d14a21 improve TexParser.mimeOf( fileextension ) by returning 1st defined in supported list.
10 years ago
Michael Peter Christen d2792a43fd do not write iframe and embed links into webgraph, but use them anyway
10 years ago
Michael Peter Christen 6ad43c4a8b removed debug code
10 years ago
Michael Peter Christen 9e588944fa prevent NPE during initialization of very large vocabularies
10 years ago
Michael Peter Christen 8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during
10 years ago
Michael Peter Christen 65125439fe added query modifier 'on'. This makes it possible to search for date
10 years ago
Michael Peter Christen 8b5d074715 fix for image parser (there is a class missing!)
10 years ago
reger 9edc7308aa update to metadata-extractor-2.7.0.jar
10 years ago
Michael Peter Christen bbf0ac40c3 add the actual DateDetection class... (missed in latest commit)
10 years ago
Michael Peter Christen 66b5a56976 Added and integrated new date detection class which can identify date
10 years ago
Michael Peter Christen 6a1865f507 refactoring date -> lastModified
10 years ago
Michael Peter Christen 8df8ffbb6d enhanced the snapshot functionality:
10 years ago
reger 28456dfc09 skip creation of unused Bluelist contenttransformer
10 years ago
Michael Peter Christen 321840fde3 Replaced all fixed thread pools with cached thread pools. The cached
10 years ago
Michael Peter Christen a1ee101079 recognize more html file extensions
10 years ago
reger 0c97cc2440 skip unused call parameter for hashSentence()
10 years ago
reger 5790c7242e skip to tokenize punctuation as word in WordTokenizer
10 years ago
Michael Peter Christen 6a2a669db4 added loading of the synonyms file from addon/synonyms into the
10 years ago
Michael Peter Christen 07c5b57953 removed warnings
10 years ago
reger 59c6532a65 add link extraction to pdfParser
10 years ago
reger aa2e15d846 allow url parameter in worktable apicall
10 years ago
reger b0c87d8240 fix image search expand box, cut-off of 2nd capture line height
10 years ago
Michael Peter Christen 3073c69aee Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
reger eaccce3467 added metadataImageParser for tif and psd (Photoshop) images.
10 years ago
reger a69f5358ff use javax ImageIO getReader to add supported image extension/mime
10 years ago
Michael Peter Christen 67cd4c37bd activated the new apk parser which was already ready but not included in
10 years ago
reger 03a7a29db3 limit OAI import urn resolver try for Deutsche National Library
10 years ago
orbiter b6d57f06eb enhanced the apk parser (up to being production-ready).
10 years ago
orbiter c9e593cf78 removed warnings
10 years ago
reger e9eae45b55 simplify rssreader and improve atom feed link extraction
10 years ago
reger 8f77719091 fix "Ljava.lang.String" in crawl queue anchor name
10 years ago
Michael Peter Christen 98f45c9032 fix for image alt attachment to AnchorURLs in html parser.
10 years ago
orbiter 08409ec680 no idea why the words max was an ordered one. This change increases speed
10 years ago
Michael Peter Christen b44626e55b fixed target_alt_t in webgraph
10 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
10 years ago
Michael Peter Christen e039e78210 small bugfixes
10 years ago
Michael Peter Christen fb3dd56b02 fix for processing of noindex flag in http header
11 years ago
Michael Peter Christen f3a6b6e21e fix for bad URL decoding
11 years ago
Michael Peter Christen aee5b108e5 added linkScraperParser, a parser which ignores the text like the
11 years ago
reger 40133ba2d0 fix NPE in Condenser,
11 years ago
reger cb2c17d236 extract author and keywords in .doc and .ppt parser
11 years ago
orbiter fec673c9d1 Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
orbiter 4a66af716d added apkParser stub (work in progress)
11 years ago
reger 2d67f29244 adjust mergeDocument after parsing to
11 years ago
Michael Peter Christen 0d29b972cc Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
reger 7847a93558 fix AbstractParser.singleList not adding null strings
11 years ago
Michael Peter Christen 8acae852a0 write <em>-tagged texts also into the bold_txt field
11 years ago
reger 3b559e7846 optimize pdfParser
11 years ago
reger 09f73b790f fix pdfParser not closed warning from pdfbox
11 years ago
reger d8d318233e fix logging settings
11 years ago
orbiter 97983ba89f fixed generics warnings for generic array instantiation that appeared
11 years ago
orbiter 88f4af90da removed warnings
11 years ago
reger 8a7c68e4c7 content of surrogates/out never accessed (remove)
11 years ago
reger 2eb7682772 add html5 audio/video <source> tag to html content scraper
11 years ago
reger 0b6db04e40 fix contentscraper img height/width parsing
11 years ago
reger 121d25be38 recover sax fatal error on OAI-PMH import of xml with entity error
11 years ago
reger 86f6975edc exclude html tags in in/outboundlinks_anchortext_txt parsed text
11 years ago
Michael Peter Christen 5746aae3db add canonical links to the same crawldepth, not the next crawldepth
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen ce1d1b2fa0 fix for maximum tag length in parser
11 years ago
Michael Peter Christen 67beef657f strong redesign of html parser: object recursion is now made using a
11 years ago
reger af6ad20728 fix: remove obsolete ref to yacy.home
11 years ago
Michael Peter Christen cca851a417 introduced new solr field crawldepth_i which records the crawl depth of
11 years ago
reger 49e76a1c55 make use of detected charset in htmlParser if none is given.
11 years ago
Michael Peter Christen 8b44fcf0f4 added missing @Override annotation
11 years ago
reger 651d057e93 surrogate import translate dc:language 3-char codes
11 years ago
Michael Peter Christen 453bfd0f17 removed unused variables and warnings
11 years ago
reger 1d01672bd3 fix DCEntry.getIdentifier
11 years ago
reger 6306d28a6a OAI import get multivalued keywords (dc:subject)
11 years ago
reger 5c9dcc269d improve OAI-PMH import identifier recognition
11 years ago
Michael Peter Christen 6e59ca4ebf removed jena library and all code that depended on jena. When jena was
11 years ago
reger bd1685c94a fix not needed getFileExtension().toLower (double)
11 years ago
Michael Peter Christen 022c6d3ce1 do YaCy p2p connections using a timeout-request which covers the http
11 years ago
reger 6932aa4d7a use configured admin-username for api calls
11 years ago
orbiter 3cb6c7861f fixed shutdown authentication problem
11 years ago
Michael Peter Christen 77aeb288a2 suppress deprecation warning (for now); TODO: find alternatives
11 years ago
Michael Peter Christen 7603e879dc Merge branch 'master' into HEAD
11 years ago
orbiter 937273d4e3 added parsing of metadata to surrogate reading:
11 years ago
reger effea4bca0 Merge origin/master into jetty
11 years ago
orbiter 61409788eb less word hash computations (removing some overhead because of MD5
11 years ago
reger f111f30ace Merge origin/master into jetty
11 years ago
orbiter 19a051bec8 more monitoring for postprocessing and enhanced layout in Crawler
11 years ago
Michael Peter Christen 1a4a69c226 set more logger to 'final static'
11 years ago
orbiter 4234b0ed6c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
orbiter 909bbb49d8 added (partly commented) test code for url rewrite methods .. to be
11 years ago
reger 1437c45383 merge rc1/master
11 years ago
Michael Peter Christen 81d9e23532 fixed another memory leak in the PDF parser:
11 years ago
Michael Peter Christen a8253ca49c added missing unicode transformation in href link contents during
11 years ago
Michael Peter Christen 60187a4ec2 fix in html parser
11 years ago
reger f017066197 Merge origin/master into jetty
11 years ago
Michael Peter Christen 9bb7eab389 hacks to prevent storage of data longer than necessary during search and
11 years ago
Michael Peter Christen 1b4fa2947d - fixed a problem which occurred when a document was not recognized with
11 years ago
reger 5c4ba9b5db merge rc1 master
11 years ago
reger 70c51775ae Merge remote-tracking branch 'origin/master' into jetty
11 years ago
orbiter 6e8377b8ad do not check all words with synonym library if the library is empty
11 years ago
Michael Peter Christen 31920385f7 set anchor rel attribute of all links to "nofollow" if the html meta
11 years ago
Michael Peter Christen 57e00baf26 fix for parsing of image links inside of anchor links (image-links)
11 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
reger f7f86d8a5d update to Jetty 9 jars
11 years ago
Michael Peter Christen 35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
11 years ago
Michael Peter Christen 765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
11 years ago
Michael Peter Christen 47b1c81d08 - refactoring
11 years ago
reger b4016ff324 - remove possible double initialization of rdfa parser
11 years ago
Michael Peter Christen 58fe986cca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen cf12835f20 replaced the single-text description solr field with a multi-value
11 years ago
reger 92d3f71b16 htmlParser: closes input stream -> changed it to leave it open for a reset (used by AugmentParser - even if this is practically not used),
11 years ago
reger aa1a1f1d2c - small adjustment to make sure genericParser is tried last
11 years ago
Roland Haeder 841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
11 years ago
Michael Peter Christen 5878c1d599 - refactoring of log to ConcurrentLog:
12 years ago
Michael Peter Christen e6f361f474 adding the canonical tag to crawl queues
12 years ago
reger 83763ee4a4 jpeg parser: extract GPS location from meta data
12 years ago
Michael Peter Christen c4538d8d91 added metadata-extractor-2.6.2.jar to eclipse classpath, removed old lib
12 years ago
reger 3760e2616b bump up lib/metadata-extractor-2.6.2.jar (used for image parser) with needed code adjustments
12 years ago
Michael Peter Christen 16d1d744fa added url_file_name_s in default collection schema for the file name
12 years ago
reger 8d1c4c423d make imageparser fileextension detection case insensitive (extensions are often upper case)
12 years ago
Michael Peter Christen 3e1e358fdc calling pdf cache flush on class initialization because calling of the
12 years ago
Michael Peter Christen 5344a1c5f7 getting the trash out
12 years ago
Michael Peter Christen 8f2d3ce2f9 reduced locking situation in crawler: shifted synchronized location and
12 years ago
reger 97ab5b90e8 - odt & ooxml (office document) parser correction to add content to fulltext index
12 years ago
orbiter 7de5b9cfa0 fix for http://bugs.yacy.net/view.php?id=233
12 years ago
Michael Peter Christen 25499eead5 - added a new field for the regular expression in crawl start
12 years ago
Michael Peter Christen 50421171c3 added new schema fields:
12 years ago
Michael Peter Christen 7ab5093321 added new solr title_exact_signature_l and
12 years ago
orbiter 17ae51e741 increased number of links limitation from 1000 to 10000 for rss feeds
12 years ago
Michael Peter Christen addba047e2 changes in ranking computation
12 years ago
Michael Peter Christen 788288eb9e added the generation of 50 (!!) new solr fields in the core 'webgraph'.
12 years ago
Michael Peter Christen 6a4878940b fix in html parser and bookmark generation
12 years ago
reger 3897bb4409 added (manual) urldb migration (link on: Index Administration -> Federated Solr Index)
12 years ago
reger 168b1d130d Adding heuristic to get search results from configured systems which support opensearch specification
12 years ago
Michael Peter Christen 95712fdc8b update to pdf parser
12 years ago
Michael Peter Christen 34f8786508 removed dependency of vocabulary navigation from Jena and its
12 years ago
Michael Peter Christen 72f165d58b added a Boost class which stores solr query boost values. The class can
12 years ago
Michael Peter Christen b5ee88c6af added more logging to get info which url causes performance problems
12 years ago
Michael Peter Christen d6b82840f8 added a feature to find similarities in documents.
12 years ago
Michael Peter Christen f5ca5cea44 - added field options to all solr queries. This can be used to restrict
12 years ago
orbiter 5dfd6359cb redesign of the QueryParams class: introduced QueryGoal which holds the
12 years ago
Michael Peter Christen d88eb657fd Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1
12 years ago
Michael Peter Christen 6905182d41 - fix for number of words log message
12 years ago
Michael Peter Christen a33e2742cb - removed unnecessary synchronized and deadlock in crawler
12 years ago
reger 722a447b0d - optimize code of augmented parsing to enhance document tags
12 years ago
orbiter 276dd6452b removed warnings
12 years ago
Michael Peter Christen b991685782 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1
12 years ago
Michael Peter Christen b7ac1da6a3 gsa results shall have only one title in metadata and that should be the
12 years ago
reger 87aab9aa7c - fix: with augmented parsing = on; missing metadata in index (like title) due to overwriting metadata by adding multiple result docs from augmentparser with same url
12 years ago
Michael Peter Christen ccc3760a47 Refactoring and redesign of data architecture to make URIMetadataRow
12 years ago
Michael Peter Christen 21fe8339b4 - enhanced generation of url objects
12 years ago
Michael Peter Christen 5f0ab25382 removed the option to prevent removal of &amp; parts inside of the
12 years ago
orbiter 68d0f8de03 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1
12 years ago
reger bfb0d4c69b - add language detection from <html lang="xx"> tag
12 years ago
Michael Peter Christen 7e3e45fd04 added Open Graph Metadata default fields, see http://ogp.me/ns#
12 years ago
Michael Peter Christen c3e5f667a7 added schema.org breadcrumb counter to parser and solr schema
12 years ago
Michael Peter Christen 4b5e0c1500 added an url rewriter which can be used to remove session ids from urls
12 years ago
Michael Peter Christen 584663ae8c - redesign of solr query construction
12 years ago
Michael Peter Christen 6ab64746d7 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
12 years ago
sof 5cb244b79b Merge remote branch 'origin/master'
12 years ago
apfelmaennchen 88b062210c Added a parser for audio file tags (e.g. ID3 tags for MP3 files) based
12 years ago
Michael Peter Christen 31485a963d refactoring
12 years ago
Michael Peter Christen 3d33a5bdf6 turned the synonyms_t Text field into a multi-valued String field
12 years ago
Michael Peter Christen 3b959ee002 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
12 years ago
orbiter 3190347814 added a synonyms_t field to solr and a process to read synonym files.
12 years ago
Michael Peter Christen 411d0e839b added an underline text field to solr to record all underlined texts
12 years ago
Michael Peter Christen 24d2ee3c52 - better date ranking
12 years ago
sixcooler 6c50d016ed pdf- and zipParser should not use forced Memory-Limits
12 years ago
Michael Peter Christen 1533bfd63b refactoring
12 years ago
Michael Peter Christen 8219a445f3 refactoring
12 years ago
Michael Peter Christen 00c1c777fa refactoring
12 years ago
orbiter 63762d8f89 removed kelondro dependencies from cora
12 years ago
Michael Peter Christen e54ac38095 - some corrections in usage of getFile() and getFileName()
12 years ago
Michael Peter Christen 528d6763fa - added new solr fields:
12 years ago
Michael Peter Christen e8acd542b5 - added faceted drill-down for host and geolocation to solr queries
12 years ago
orbiter 67f2866cd0 small fixes
12 years ago
orbiter 67edfd991c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
12 years ago
orbiter d9173ba7ed added more solr fields to integrate values from URIMetadataRow. All
12 years ago
Michael Peter Christen 24d9db1613 snippet retrieval loading processes may use a smaller minimum load time
12 years ago
Michael Peter Christen 1687737771 Abstraction of HandleMap and HandleSet
12 years ago
orbiter 482afed07c reduced logging overhead (a bit)
13 years ago
orbiter bbfa497a3c replaced more size() > 0 by !isEmpty()
13 years ago
orbiter 0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
13 years ago
Michael Peter Christen 801972fe6f fix for url camel case parser and sentence reader
13 years ago
Michael Peter Christen fbc1a2030d fix for sitemap importer: can now also import very large sitemaps within
13 years ago
Michael Peter Christen 92731e5287 fix for sevenzip parser
13 years ago
Michael Peter Christen 8efc1c1078 - fixed a memory leak (or bad usage) during parsing/snippet fetch
13 years ago
Michael Peter Christen b1e7c11fba fix for pattern matcher in html parser
13 years ago
Michael Peter Christen b0c408788b made class methods static where possible
13 years ago
Michael Peter Christen 7c1ba99755 removed more unused method parameters
13 years ago
Michael Peter Christen 0301aba1e9 removed unused method parameters
13 years ago
Michael Peter Christen d3964253ae - added @SuppressWarnings to unused servlet method parameters
13 years ago
Michael Peter Christen ea10766bfd cleaned unnecessary nested code
13 years ago
orbiter fc0f9543fe More SentenceReader cleanup
13 years ago
orbiter 586bb0eb6a Simplified SentenceReader (no more Reader inside..)
13 years ago
orbiter 7f851d62a7 replaced HashARC with SizeLimited Objects which are less costly
13 years ago
orbiter 78fc3cf8f8 refactoring and new usage of SentenceReader: this class appeared as one
13 years ago
orbiter bb8dcb4911 automatically adopt size of word cache to available memory
13 years ago
Michael Peter Christen ad09b786bf clean up parser data
13 years ago
Michael Peter Christen 276a66a793 Adding a limit of 1000 links that a parser shall store during indexing.
13 years ago
Michael Peter Christen de903a53a0 parser refactoring & hacks
13 years ago
Michael Peter Christen 1825f165b8 better integration of blacklist according to use case
13 years ago
Michael Peter Christen ce8d4b87d9 fixes for new eclipse 'Juno' warning 'Resource leak'.
13 years ago
Michael Peter Christen 0c345d1559 giving threads names so it's easier to see what's happening during
13 years ago
Michael Peter Christen 508a81b86c added solr field 'refresh_s' which stores the refresh url contained in
13 years ago
Michael Peter Christen f3167def64 do not fill the keywords with title content if keywords do not exist.
13 years ago
Michael Peter Christen 77f795756c fixing redirects and status codes: storing of status code in
13 years ago
Michael Peter Christen dbdd697f4d moved RDFaParser.xsl configuration file to defaults
13 years ago
Michael Peter Christen 786be7d175 better integration of RDFaParser
13 years ago
Michael Peter Christen de3ef8ad73 removed unimportant warnings
13 years ago
Michael Peter Christen 24bbe359ca integrate also geonames library files for less cities. these are more
13 years ago
Michael Peter Christen 223a5440ab preventing that an empty pnd is inserted into the vocabularies
13 years ago
Michael Peter Christen 963f92ed9a - merged files
13 years ago
Michael Peter Christen dd88d0ace2 more logging
13 years ago
Michael Peter Christen 94d54e2d91 added recognition of multi-word terms in vocabulary matching
13 years ago
Michael Peter Christen 64c0268b2b show triplestore metadata in yacydoc and viewfile
13 years ago
Michael Peter Christen c2f0d16d2c fixed vocabulary initialization
13 years ago
Michael Peter Christen df3531f8d5 added the generation of virtual vocabularies using the pnd
13 years ago
Michael Peter Christen a0f1decd82 - added loading of the dbpedia pnd triplestore in the dictionary loader
13 years ago
Michael Peter Christen 16d8f33795 added objectlink generation to vocabulary generation and editor
13 years ago
Michael Peter Christen d45718251e refactoring (Localization -> Location)
13 years ago
Michael Peter Christen b8b3c87ba7 - renamed localization to location (that was confusing)
13 years ago
Michael Peter Christen e89747bb67 - added automated generation of vocabularies from url stubs
13 years ago
Michael Peter Christen 79464189a4 The 'Locale' vocabulary, which is generated by geo data, has now the
13 years ago
Michael Peter Christen 61bb52d55c - using http://purl.org/dc/terms/references to refer from an
13 years ago
Michael Peter Christen 50c576599b allow multiple parser options instead of printing an error
13 years ago
Michael Peter Christen 8b53771db2 changed behavior of navigation processing:
13 years ago
Michael Peter Christen 5fc6524ca8 - moved triple store to net.yacy.cora.lod (should be generalized there
13 years ago
cominch bbfc53b663 bugfix
13 years ago
cominch 65c5826d93 bugfix
13 years ago
cominch 5f8ba7f4f2 small changes
13 years ago
cominch 90512640bf Added config switches for custom parser
13 years ago
cominch bcbd8eee33 Add several parsers, for RDFa and rdf files.
13 years ago
cominch 9cbfc1a1c0 augmentedProxy, which forwards every proxy request to a
13 years ago
Michael Peter Christen cde20911bb saved a bit more ram using UTF8 String compression for OpenGeoDB and
13 years ago