yacy_search_server

Commit Graph

Author	SHA1	Message	Date
reger	c6495a5b62	add a log entry on parsing ajax crawling scheme snapshot (prev. commit `9252e36aeb`)	9 years ago
reger	9252e36aeb	implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content; see the freshly deprecated https://developers.google.com/webmasters/ajax-crawling/ Implementation improves parsing of the homepage (ajax page) which uses metatag "fragment" in header and parses the supplied html snapshot instead of the mostly empty ajax/scripted page. Implementation also supports hash-bang urls (url with anchor starting with ! like ...path#!hashfragment) but our crawler filters them (use of hash-bang is controversially discussed and the proposal is deprecated, so it makes no sense to adjust the crawler, but as long as it is used by some sites the minor change/improvement in htmlparser is good for some time). Quick - how does it work - if metatag fragment with content "!" is found - htmlparser tries to get content of the html snapshot (using a different url) - htmlparser returns 2 documents (original url and snapshot content - but using same original url) - after parsing, result documents are joined (and stored to index containing content also from snapshot page... as the original ajax page contains typically no parseable html content)	9 years ago
Michael Peter Christen	7d075a1d76	added log lines	9 years ago
luc	d6522fa4a2	Integrated haraldk/TwelveMonkeys library to first add TIF image format support.	9 years ago
reger	78e8c6f3e5	refactor special handling (static override) of SUPPORTED_EXTENSIONS/MIME_TYPES not used for genericImageParser	9 years ago
reger	d54c5d310a	do not automatically add links with an image extension to image links. With the widespread use of e.g. Wikimedia, urls with an image file extension often point to html.	9 years ago
reger	851e8f6c8a	check jpeg file signature in genericImageParser to fail early without further object allocation if source is not a jpeg.	9 years ago
reger	d5330391de	remove some unused var allocation in parser	9 years ago
reger	7c82cd4415	add an end condition to svgParser for wrong content (if parser chosen just by file extension)	9 years ago
reger	356d4d1301	remove rdfParser from init (current function identical with genericParser)	9 years ago
reger	c647d899e3	add svgParser to parse metadata from svg images. Reads document level included title and description and skips the graphic content to save bandwidth. svg metadata element is not interpreted - remove rdfParser from init (current function identical with genericParser)	9 years ago
reger	bad34804fe	optimize parseInt for <img> tag attribute parsing. Performs better than using Numberformat.parse or parseInt(substring())	9 years ago
reger	2f51baff4f	check for loading error (includes unsupported formats) to prevent blank thumbnail display in image search caused by unhandled sources which don't load on click. Now the cross icon indicates the problem (including unsupported formats)	9 years ago
reger	a3195d78ae	add Portuguese month names to date recognition	9 years ago
reger	d2cc11ea8f	fix html parser taking <style> content as text. Noticed some result descriptions contain css content from style tag. Added <style> to tag list to scrape its content not as text + test case included	9 years ago
reger	1e8369e18b	use a parsed date in Document.toString	9 years ago
reger	41c4eade51	extract modification date from vCard (vcfParser)	9 years ago
reger	8768896975	extract lastmodified from openoffice doc set lastmod date in office document parsers	9 years ago
sixcooler	a3dd4be749	added / corrected charset to be 1.7 compatible. @Orbiter: please check if this is ok for you	9 years ago
Michael Peter Christen	df3314ac1a	added a new facet type based on a probabilistic classifier using bayesian filters. This can be used to classify documents during indexing-time using a pre-defined bayesian filter. New wordings: - a context is a class where different categories are possible. The context name is equal to a facet name. - a category is a facet type within a facet navigation. Each context must have several categories, at least one custom name (things you want to discover) and one with the exact name "negative". To use this, you must do: - for each context, you must create a directory within DATA/CLASSIFICATION with the name of the context (the facet name) - within each context directory, you must create text files with one document each per line for every category. One of these categories MUST have the name 'negative.txt'. Then, each new document is classified to match within one of the given categories for each context.	9 years ago
Michael Peter Christen	7b412e8c07	added msg (text emails) format; should be handled by html parser.	9 years ago
Michael Peter Christen	90f75c8c3d	added enrichment of synonyms and vocabularies for imported documents during surrogate reading: those attributes from the dump are removed during the import process and replaced by new detected attributes according to the setting of the YaCy peer. This may cause that all such attributes are removed if the importing peer has no synonyms and/or no vocabularies defined.	10 years ago
Michael Peter Christen	7829480b82	refactoring: separated condenser and tokenizer	10 years ago
Michael Peter Christen	593de05922	enhanced surrogate import process speed (dramatically!)	10 years ago
reger	7478338a40	remove augmented parsing activation from frontend experimental implementation not used and based on error prone experimental rdfaparser	10 years ago
reger	11aa2edfe1	remove RDFa parser activation from frontend reason: experimental implementation of RDFa parser not executed (limited to special urls) but may cause error on normal html parsing due to an inputstream.reset	10 years ago
Michael Peter Christen	d0aff91f23	fix for index import	10 years ago
Michael Peter Christen	b43811d38c	added surrogate import process for exported solr dumps. Just throw your solr dump file into DATA/SURROGATES/in/ and it will be imported!	10 years ago
reger	8a9622c31c	fix string OoB on getImagelinks with long alttext in description calculation	10 years ago
Michael Peter Christen	ff29b0e503	added option to re-index exported xml snapshot dumps to HTCACHE/snapshots by just placing them in the SURROGATES/in path	10 years ago
Michael Peter Christen	6f4fe4b175	revert of `8a7c68e4c7` keeping surrogates after processing is essential for some users. If the space they are taking is too high, please set up an automatic deletion process (like a cronjob).	10 years ago
Michael Peter Christen	fed26f33a8	enhanced timezone management for indexed data: to support the new time parser and search functions in YaCy a high precision detection of date and time on the day is necessary. That requires that the time zone of the document content and the time zone of the user, doing a search, is detected. The time zone of the search request is detected automatically using the browser's time zone offset which is delivered to the search request automatically and invisible to the user. The time zone for the content of web pages cannot be detected automatically and must be an attribute of crawl starts. The advanced crawl start now provides an input field to set the time zone in minutes as an offset number. All parsers must get a time zone offset passed, so this required the change of the parser java api. A lot of other changes have been made which correct the wrong handling of dates in YaCy which was to add a correction based on the time zone of the server. Now no correction is added and all dates in YaCy are UTC/GMT time zone, a normalized time zone for all peers.	10 years ago
Michael Peter Christen	b060ba900d	added parsing of contentprop attribute in html tags for content='startDate' and content='endDate'. The value of these field is now written to new solr fields startDates_dts and endDates_dts.	10 years ago
Michael Peter Christen	4cb4f67f38	added parsing of dd, dt and article html fields. The parsed result is written to special solr fields which are deactivated by default.	10 years ago
Michael Peter Christen	4d00175157	<experimental> added parsing of <article> html element. Whenever such an element occurs, the complete content of all article elements replaces the parsed <content> part of documents.	10 years ago
reger	2e8c24e02a	fix link to DeReWo download file	10 years ago
Michael Peter Christen	893889bc7b	added special terms for on: - Date modifier: tomorrow, today; i.e.: search for: "Berlin on:tomorrow" to find events happening tomorrow in Berlin	10 years ago
Michael Peter Christen	535f1ebe3b	added a new way of content browsing in search results: - date navigation The date is taken from the CONTENT of the documents / web pages, NOT from a date submitted in the context of metadata (i.e. http header or html head form). This makes it possible to search for documents in the future, i.e. when documents contain event descriptions for future events. The date is written to an index field which is now enabled by default. All documents are scanned for contained date mentions. To visualize the dates for a specific search result, a histogram showing the number of documents for each day is displayed. To render these histograms the morris.js library is used. Morris.js also requires raphael.js which is now also integrated in YaCy. The histogram is now also displayed in the index browser by default. To select a specific range from a search result, the following modifiers have been introduced: from:<date> to:<date> These modifiers can be used separately (i.e. only 'from' or only 'to') to describe an open interval or combined to have a closed interval. Both dates are inclusive. To select a specific single date only, use the 'to:' - modifier. The histogram shows blue and green lines; the green lines denote weekend days (saturday and sunday). Clicking on bars in the histogram has the following reaction: 1st click: add a from:<date> modifier for the date of the bar 2nd click: add a to:<date> modifier for the date of the bar 3rd click: remove from and date modifier and set a on:<date> for the bar When the on:<date> modifier is used, the histogram shows an unlimited time period. This makes it possible to click again (4th click) which is then interpreted as a 1st click again (sets a from modifier). The display feature is NOT switched on by default; to switch it on use the /ConfigSearchPage_p.html servlet.	10 years ago
reger	2d2299f484	fix mimetype of rss items in rss parser - remove self reference as anchor for items	10 years ago
Michael Peter Christen	b432049d59	enhanced date parsing time	10 years ago
reger	a0f04db9ea	add extracted description/subject to pptParser	10 years ago
reger	7e35518787	add extracted description/subject to docParser	10 years ago
Michael Peter Christen	1f5b5c0111	npe fix for latest scraper feature	10 years ago
Michael Peter Christen	ee97302a23	hack to make date detection faster (while it becomes a bit incomplete regarding language alternatives)	10 years ago
Michael Peter Christen	b5ac29c9a5	added a html field scraper which reads text from html entities of a given css class and extends a given vocabulary with a term consisting of the text content of the html class tag. Additionally, the term is included into the semantic facet of the document. This allows the creation of faceted search to documents without the pre-creation of vocabularies; instead, the vocabulary is created on-the-fly, possibly for use in other crawls. If any of the term scraping for a specific vocabulary is successful on a document, this vocabulary is excluded for auto-annotation on the page. To use this feature, do the following: - create a vocabulary on /Vocabulary_p.html (if not existent) - in /CrawlStartExpert.html you will now see the vocabularies as column in a table. The second column provides text fields where you can name the class of html entities where the literal of the corresponding vocabulary shall be scraped out - when doing a search, you will see the content of the scraped fields in a navigation facet for the given vocabulary	10 years ago
Michael Peter Christen	de3e373913	using precompiled CommonPattern.TAB for split	10 years ago
Michael Peter Christen	1f5047b15f	using precompiled pattern CommonPattern.SEMICOLON for splits	10 years ago
Michael Peter Christen	69eacdf4eb	applying precompiled CommonPattern.COMMA.split to all places where split(",") was used	10 years ago
reger	5ca0762179	fix: OOM on parsing ico file by genericImageParser trace: java.lang.OutOfMemoryError: Java heap space at java.awt.image.DataBufferInt.<init>(DataBufferInt.java:75) at java.awt.image.Raster.createPackedRaster(Raster.java:467) at java.awt.image.DirectColorModel.createCompatibleWritableRaster(DirectColorModel.java:1032) at java.awt.image.BufferedImage.<init>(BufferedImage.java:331) at net.yacy.document.parser.images.bmpParser$IMAGEMAP.<init>(bmpParser.java:149) at net.yacy.document.parser.images.bmpParser.parse(bmpParser.java:69) at net.yacy.document.parser.images.genericImageParser.parse(genericImageParser.java:116)	10 years ago
Michael Peter Christen	4144c7cc52	do not write frame links to webgraph	10 years ago
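
Note on commit 9252e36aeb: the "different url" used to fetch the html snapshot is not spelled out in the commit message. The sketch below shows how the (deprecated) ajax crawling scheme derives such a url, as an illustration only. The class and method names (AjaxSnapshotUrl, snapshotUrl, escapeFragment) are hypothetical and are not taken from the YaCy source; the escaping rule follows the published scheme as far as recalled and should be treated as an assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of YaCy: maps a page url to the
// "_escaped_fragment_" snapshot url of the ajax crawling scheme.
final class AjaxSnapshotUrl {

    // Percent-escape the bytes the scheme asks for (assumed set:
    // 0x00-0x20, '#', '%', '&', '+', and 0x7F-0xFF).
    static String escapeFragment(String fragment) {
        StringBuilder sb = new StringBuilder();
        for (byte b : fragment.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (c <= 0x20 || c == 0x23 || c == 0x25 || c == 0x26 || c == 0x2B || c >= 0x7F) {
                sb.append(String.format("%%%02X", c));
            } else {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Returns the snapshot url, or null if the page does not opt in.
    // hasFragmentMeta: true if the page carries <meta name="fragment" content="!">.
    static String snapshotUrl(String url, boolean hasFragmentMeta) {
        int bang = url.indexOf("#!");
        if (bang >= 0) {
            // hash-bang url: move the fragment into the query string
            String base = url.substring(0, bang);
            String fragment = url.substring(bang + 2);
            String sep = base.contains("?") ? "&" : "?";
            return base + sep + "_escaped_fragment_=" + escapeFragment(fragment);
        }
        if (hasFragmentMeta) {
            // metatag case: request the snapshot with an empty fragment value
            int hash = url.indexOf('#');
            String base = hash >= 0 ? url.substring(0, hash) : url;
            String sep = base.contains("?") ? "&" : "?";
            return base + sep + "_escaped_fragment_=";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(snapshotUrl("http://example.com/ajax.html#!key=value", false));
        System.out.println(snapshotUrl("http://example.com/", true));
    }
}
```

In this sketch the snapshot url is only used for fetching; per the commit message, the parsed snapshot content is still stored under the original url.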

573 Commits (5445f38070af55ed56a5c826e20838925cfc2519)