Commit Graph

102 Commits (90f75c8c3db1d322aca97e5c5888d6657d26bc70)

Author SHA1 Message Date
Michael Peter Christen 90f75c8c3d added enrichment of synonyms and vocabularies for imported documents
10 years ago
reger 8a9622c31c fix string OoB on getImagelinks with long alttext
10 years ago
Michael Peter Christen 4144c7cc52 do not write frame links to webgraph
10 years ago
Michael Peter Christen d2792a43fd do not write iframe and embed links into webgraph, but use them anyway
10 years ago
Michael Peter Christen 6a1865f507 refactoring date -> lastModified
10 years ago
orbiter c9e593cf78 removed warnings
10 years ago
reger 8f77719091 fix "Ljava.lang.String" in crawl queue anchor name
10 years ago
Michael Peter Christen 98f45c9032 fix for image alt attachment to AnchorURLs in html parser.
10 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
10 years ago
Michael Peter Christen fb3dd56b02 fix for processing of noindex flag in http header
11 years ago
Michael Peter Christen aee5b108e5 added linkScraperParser, a parser which ignores the text like the
11 years ago
reger 2d67f29244 adjust mergeDocument after parsing to
11 years ago
reger d8d318233e fix logging settings
11 years ago
Michael Peter Christen 5746aae3db add canonical links to the same crawldepth, not the next crawldepth
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen cca851a417 introduced new solr field crawldepth_i which records the crawl depth of
11 years ago
orbiter 909bbb49d8 added (partly commented) test code for url rewrite methods .. to be
11 years ago
Michael Peter Christen 1b4fa2947d - fixed a problem which ocurred when a document was not recognized with
11 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
11 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
11 years ago
Michael Peter Christen 35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
11 years ago
Michael Peter Christen 47b1c81d08 - refactoring
11 years ago
Michael Peter Christen cf12835f20 replaced the single-text description solr field with a multi-value
11 years ago
Roland Haeder 841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
11 years ago
Michael Peter Christen 5878c1d599 - refactoring of log to ConcurrentLog:
12 years ago
Michael Peter Christen e6f361f474 adding the canonical tag to crawl queues
12 years ago
Michael Peter Christen 16d1d744fa added url_file_name_s in default collection schema for the file name
12 years ago
orbiter 7de5b9cfa0 fix for http://bugs.yacy.net/view.php?id=233
12 years ago
Michael Peter Christen 25499eead5 - added a new field for the regular expression in crawl start
12 years ago
Michael Peter Christen 788288eb9e added the generation of 50 (!!) new solr field in the core 'webgraph'.
12 years ago
reger 3897bb4409 added (manual) urldb migration (link on: Index Administraton -> Federated Solr Index)
12 years ago
Michael Peter Christen 34f8786508 removed dependency of vocabulary navigation from Jena and it's
12 years ago
Michael Peter Christen d6b82840f8 added a feature to find similarities in documents.
12 years ago
Michael Peter Christen d88eb657fd Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1
12 years ago
Michael Peter Christen 6905182d41 - fix for number of words log message
12 years ago
reger 722a447b0d - optimize code of augmented parsing to enhence document tags
12 years ago
orbiter 276dd6452b removed warnings
12 years ago
reger 87aab9aa7c - fix: with augmented parsing = on; missing metadata in index (like title) due to overwriting metadata by adding multiple result docs from augmentparser with same url
12 years ago
Michael Peter Christen ccc3760a47 Refactoring and redesign of data architecture to make URIMetadataRow
12 years ago
Michael Peter Christen 5f0ab25382 removed the option to prevent removal of & parts inside of the
12 years ago
Michael Peter Christen 00c1c777fa refactoring
12 years ago
Michael Peter Christen 528d6763fa - added new solr fields:
12 years ago
orbiter 0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
13 years ago
orbiter fc0f9543fe More SentenceReader cleanup
13 years ago
orbiter 78fc3cf8f8 refactoring and new usage of SentenceReader: this class appeared as one
13 years ago
Michael Peter Christen ad09b786bf clean up parser data
13 years ago
Michael Peter Christen 786be7d175 better integration of RDFaParser
13 years ago
Michael Peter Christen 9264d8b4af removed old navigation practice using subject tags in favor of
13 years ago
Michael Peter Christen 16d8f33795 added objectlink generation to vocabulary generation and editor
13 years ago
Michael Peter Christen 61bb52d55c - using http://purl.org/dc/terms/references to refer from an
13 years ago