yacy_search_server

Commit Graph

Author	SHA1	Message	Date
orbiter	d86d2be5c3	automatically removed Places autotagging if no location library is wanted	12 years ago
orbiter	214a087cdf	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	96ed0c980e	- added hosthash to all documents (also fail documents which is needed there for deletion), this fixes a problem for the deletion of old documents for new crawl starts - added clickdepth and citation computation for fail documents	12 years ago
Michael Peter Christen	179ad281f9	close include byte buffer after usage	12 years ago
reger	6b9a624808	remove double declaration of TLD_any_zone_filter	12 years ago
orbiter	d2effd21db	fix for npe during location search	12 years ago
orbiter	828603e4f1	fix for 100%CPU problem in error cache cleaning process	12 years ago
orbiter	c64b51134e	hack to add all tokens from the url to text_t. This was working for the RWI index (and still is working) but not for solr-only search indexes. Maybe we should find a solution using a separate search field instead.	12 years ago
orbiter	6e8377b8ad	do not check all words with synonym library if the library is empty	12 years ago
orbiter	70ba74b23a	disabled ipv4 preference to enable ipv6-only networks like freifunk	12 years ago
orbiter	f3be1930cb	CPU problem when pushing to the error cache; wrong class, ConcurrentHashMap needed for concurrency	12 years ago
Michael Peter Christen	e40671ddb7	better and consistent deletions for error urls	12 years ago
Michael Peter Christen	2602be8d1e	- removed ZURL data structure; removed also the ZURL data file - replaced load failure logging by information which is stored in Solr - fixed a bug with crawling of feeds: added must-match pattern application to feed urls to filter out such urls which shall not be in a wanted domain - delegatedURLs, which also used ZURLs are now temporary objects in memory	12 years ago
Michael Peter Christen	31920385f7	set anchor rel attribute of all links to "nofollow" if the html meta contains a robots:nofollow or if the http header contains a "X-Robots-Tag: nofollow"	12 years ago
Michael Peter Christen	57e00baf26	fix for parsing of image links inside of anchor links (image-links)	12 years ago
Michael Peter Christen	61c5e40687	- replaced the properties object in AnchorURL with distinct variables for anchor attributes. - this caused that large portions of the parser code had to be adapted as well - added a counter target_order_i for anchor links in webgraph computation	12 years ago
Michael Peter Christen	3ea9bb4427	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	5e31bad711	- the webgraph shall store all links which appear on a web page and not all unique links! This made it necessary, that a large portion of the parser and link processing classes must be adapted to carry a different type of link collection which carry a property attribute which are attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI had been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring	12 years ago
reger	603368fc3e	remove redundant declaration of USER_AGENT	12 years ago
Michael Peter Christen	1a8c64117f	decreased the responseHeaderDB database which is now flushed more frequently. This will preserve more documents in the cache in case of a crash.	12 years ago
Michael Peter Christen	3e22d05290	added option for daterange properties in GSA interface to use a left- or right-open date range; i.e. using daterange=..2013-09-09 or daterange=2013-09-02.. additional to daterange=2013-09-02..2013-09-09	12 years ago
reger	fe87fb638a	adjust test/ParserTest to dc_description data type	12 years ago
Michael Peter Christen	35ab2cef7b	added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in html meta fields to get a correct (or: better) date timestamp. The http:last-modified mostly does not work because it is set to the current date from most CMS.	12 years ago
Michael Peter Christen	9cc8468b30	added tools to visualize image generation (i.e. during testing)	12 years ago
Michael Peter Christen	dbef8ccfcb	forced deletion of ZURL entries for a specific host for each host that appears in the crawl url list	12 years ago
Michael Peter Christen	e137ff4171	refactoring (in preparation for new removeHost method)	12 years ago
Michael Peter Christen	7a5574cd51	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	85456f46b2	added two new fields, exact_signature_copycount_i and fuzzy_signature_copycount_i, which count the number of copies of non-unique documents and assign this to each document. Thus, each document is assigned a number which shows how many copies of this document exist. These fields are disabled by default.	12 years ago
orbiter	26366596d9	fix for a problem which occurs when a site is crawled where the start url is redirected.	12 years ago
Michael Peter Christen	a2511b5600	turned images_alt_txt back to images_alt_sxt because it is not necessary to index the alt text. Indexed image Text is in images_text_t	12 years ago
Michael Peter Christen	85b1922244	activated image type navigation for image search	12 years ago
Michael Peter Christen	9e12fdff23	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	ab1201fdfd	fixed wrong facet count	12 years ago
Michael Peter Christen	049c3b3f2e	added an option to exclude image search results from text search. This is on by default.	12 years ago
Michael Peter Christen	69f85265e1	added an option to put image links to the crawl queue and handle these like normal documents. Using this option (by default on at this moment; this might change soon) it is possible to get the exif data into the search index to be used in image search.	12 years ago
Michael Peter Christen	e8e558a9b7	fix for content domain classification in URIMetadataNode	12 years ago
Michael Peter Christen	a8c5bfcf58	avoid to create unnecessary objects	12 years ago
Michael Peter Christen	5a0de1b77d	moving image description text to image text field	12 years ago
Michael Peter Christen	dc179bd61f	fix for catchall query goal for image search	12 years ago
Michael Peter Christen	5d71a4c8bc	fix for dc:description field	12 years ago
reger	392174de8c	remove all_words, all_strings lists from QueryGoal - only used for text highlighting in parser text (ViewFile.html) which can be done with include_strings only	12 years ago
Michael Peter Christen	169ef8963d	one more fix for image search	12 years ago
Michael Peter Christen	cb85b22725	redesign of the image search process (with much better results, unfortunately the index schema has changed and p2p image search will not be much better until many people update)	12 years ago
Michael Peter Christen	6184fd9d9a	fix for solr/gsa result logging	12 years ago
reger	29967102a2	optimized QueryGoal (reducing mem and computation by removing all_hashes) - all_hashes used for text highlighting and word distance computation which can be done with include_hashes only	12 years ago
orbiter	f106345eef	link strings should not be tokenized	12 years ago
orbiter	deadeb406e	image alt tag strings should be tokenized	12 years ago
orbiter	5b14bdfffd	npe fix	12 years ago
orbiter	3e5f8e29e2	next development release step to reflect the extension of the solr api with javabin format capability	12 years ago
orbiter	1ca4b9612c	added special handling of the BinaryResponseWriter in the solr interface which makes it possible to use solrj with the javabin format which is much better (compressed, no xml overhead, java object streams) and faster. Furthermore, this enables the 'shards' option in the solr interface which connects one solr (YaCy) to another solr (YaCy) ad-hoc.	12 years ago


10058 Commits (69829fc41792af71f9a7b45524cde3f8b959ceec)
