Commit Graph

129 Commits (7525594315126b13bb390c0cfe5a71e7d51ff040)

Author SHA1 Message Date
luccioman fb3032c530 Added a crawl filtering possibility on documents Media Type (MIME)
7 years ago
luccioman 9a7a353d0e Removed some unnecessary intermediate list creation on array copy.
7 years ago
luccioman 9412881230 Added basic support for autotagging microdata annotated item types.
7 years ago
luccioman 651fad6da5 Added RSS parser support for maximum content bytes parsing limit
8 years ago
luccioman bf55f1d6e5 Started support of partial parsing on large streamed resources.
8 years ago
luccioman 8da3174867 Ensure lower case conversion consistency with any default locale.
8 years ago
luccioman d98c04853d Ensure proper closing of file input streams.
8 years ago
reger 077d062be3 Adjust mergeDocuments to keep youngest last-modified date of document
8 years ago
luccioman 654801523e Fixed StringIndexOutOfBoundsException case.
8 years ago
reger c77e43a391 Take out mailto collect in internal parsed document
8 years ago
luccioman 6a4d51d8f9 Cleaned up some Javadoc warnings.
8 years ago
reger 4c9be29a55 fix concurrency issue with htmlParser using not current scraper data
8 years ago
reger 8fe28a83f2 harmonize used lastmodified date for rwi and fulltext in storeDocument
8 years ago
luccioman 6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
8 years ago
reger 4c7a77662a eleminate dependency on file-extension in storeDocument but use supported mime-type
9 years ago
reger 27163af0e1 improve detection of referenced links by taking http and https link protocol
9 years ago
luc 9f712146df Display icons in ViewFile "links" mode.
9 years ago
luc 3cc5619d93 Improved HTML icons indexing and rendering in search results.
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger 45b9bd8403 adjust MultiProtocolURL.protocol detection to handle mailto with "://" in parameters,
9 years ago
reger 0c5548a7ff fix (todo) remove redundant holding of email link nameproperty in parser document
9 years ago
reger 6b7c10cef8 fix dc:date in mediawikiimporter/document.writexml to use lastmodified
9 years ago
reger 4d2b934487 prevent mailto links getting into parser result document's in/outbound link collection
9 years ago
reger 52a9040ae6 Sort out double keywords (dc_subject) early in parsed documents
9 years ago
reger e76a90837b update zip and tar parser process,
9 years ago
reger 1e8369e18b use a parsed date in Document.toString
10 years ago
Michael Peter Christen df3314ac1a added a new facet type based on a probabilistic classifier using
10 years ago
Michael Peter Christen 90f75c8c3d added enrichment of synonyms and vocabularies for imported documents
10 years ago
reger 8a9622c31c fix string OoB on getImagelinks with long alttext
10 years ago
Michael Peter Christen 4144c7cc52 do not write frame links to webgraph
10 years ago
Michael Peter Christen d2792a43fd do not write iframe and embed links into webgraph, but use them anyway
10 years ago
Michael Peter Christen 6a1865f507 refactoring date -> lastModified
10 years ago
orbiter c9e593cf78 removed warnings
11 years ago
reger 8f77719091 fix "Ljava.lang.String" in crawl queue anchor name
11 years ago
Michael Peter Christen 98f45c9032 fix for image alt attachment to AnchorURLs in html parser.
11 years ago
Michael Peter Christen 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
11 years ago
Michael Peter Christen fb3dd56b02 fix for processing of noindex flag in http header
11 years ago
Michael Peter Christen aee5b108e5 added linkScraperParser, a parser which ignores the text like the
11 years ago
reger 2d67f29244 adjust mergeDocument after parsing to
11 years ago
reger d8d318233e fix logging settings
11 years ago
Michael Peter Christen 5746aae3db add canonical links to the same crawldepth, not the next crawldepth
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen cca851a417 introduced new solr field crawldepth_i which records the crawl depth of
11 years ago
orbiter 909bbb49d8 added (partly commented) test code for url rewrite methods .. to be
11 years ago
Michael Peter Christen 1b4fa2947d - fixed a problem which ocurred when a document was not recognized with
11 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
12 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
12 years ago
Michael Peter Christen 35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
12 years ago
Michael Peter Christen 47b1c81d08 - refactoring
12 years ago
Michael Peter Christen cf12835f20 replaced the single-text description solr field with a multi-value
12 years ago