yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	70f03f7c8e	do not cache search requests to Solr if the result is used for doublechecking. If a double-check comes from cached results the doublecheck fails.	10 years ago
Michael Peter Christen	c67c5c0709	added new solr schema fields which record the occurrences of vocabulary matchings. These matches can be used for result boosting, i.e. if a document contains words from a specific vocabulary, boost it.	10 years ago
Michael Peter Christen	0550b54d56	added fix to postprocessing: avoid caching of postprocessing collection to always get fresh lists of documents. This is necessary since the postprocessing changes the same documents which the postprocessing-collection query selects.	10 years ago
Michael Peter Christen	0a879c98e7	added new 'firstSeen' database table and necessary data structures which hold a date for each URL to record when a url was first seen. This is then used to overwrite the modification date for urls upon recrawl in case that the first-seen date is before the latest document date. This behaviour is necessary due to the common behaviour of content management systems which attach always the current date to all documents. Using the firstSeen database it is possible to approximate a real first document creation date in case that the crawler starts frequently for the same domain. As a result the search results ordered by date have a much better quality and the usage of YaCy as search agent for latest news has a better quality.	10 years ago
Michael Peter Christen	95d87f00b3	fix for bad query generation in doublecheck in postprocessing	10 years ago
Michael Peter Christen	92007e5d2d	more enhancements to postprocessing speed	10 years ago
Michael Peter Christen	9a7fe9e0d1	fix for bad timing computation in postprocessing	10 years ago
Michael Peter Christen	bd16119a00	another fix for postprocessing (the query for "" on numeric field did not work in external solr)	10 years ago
Michael Peter Christen	327e83bfe7	more fixes in postprocessing: partitioning of the complete queue to enable smaller queries	10 years ago
orbiter	71758f0d62	enhanced postprocessing by usage of a field-list generation to prevent lazy initialization of the documents. This is useful because the documents must be read completely anyway.	10 years ago
Michael Peter Christen	fe537679de	fix for exact_signature_unique_b, exact_signature_copycount_i, fuzzy_signature_unique_b and fuzzy_signature_copycount_i: apply same criteria for 'valid document' as for title and description uniqueness test.	10 years ago
Michael Peter Christen	2e5214eb21	added field postprocessing.partialUpdate to settings which can be used to switch on or off partial updates. Both options should cause the same result. Default is on.	10 years ago
Michael Peter Christen	2e09da9832	npe fix	10 years ago
Michael Peter Christen	d80418f1b1	added partial updates to solr during postprocessing: during postprocessing the solr documents are now not completely retrieved. Instead, only the fields needed for the postprocessing are extracted. When Solr documents are written, this is done using partial updates. This increases postprocessing speed by about 50% for embedded Solr configurations. For external Solr configurations the enhancement should be much higher because the postprocessing with remote Solr is very slow. When doing partial updates to a remote Solr, this method should perform much better than before; it is expected that the gain is even much higher than the increase with local Solr.	10 years ago
Michael Peter Christen	b1cfbc4a04	added new solr field url_paths_count_i which can be used to enhance the index browser and maybe also for ranking; possibly also for SEO-with-YaCy applications.	10 years ago
Michael Peter Christen	30d4402cd1	fixed location search	10 years ago
orbiter	f3a12801f0	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	10 years ago
orbiter	d93325a578	lazy handling of process_sxt field (part of postprocessing)	10 years ago
reger	b5ca20de15	preserve content_type (mime) if supplied, in preference to constructing it from the file type. (this eventually can benefit image search by using mime only) reduce redundant field assignment for Solrdocuments created from URIMetadataNode (URIMetadataNode = SolrDocument with partially assigned fields)	10 years ago
reger	fb1fcc2b03	handle noarchive tag, skip writing page to cache http://mantis.tokeek.de/view.php?id=44	10 years ago
Michael Peter Christen	2645dc816a	added warning for not well-formed postprocessing queries	11 years ago
Michael Peter Christen	6d3d4c4ea6	changed the concurrent enumeration of query results in such a way that it is now possible to get the results in two steps: first retrieve all IDs as given for a query, then retrieve each document individually. This was necessary for very large result sets where a query may run for hours and is possibly terminated by a solr-internal timeout. This occurs regularly during postprocessing and therefore this commit may fix unwanted postprocessing terminations.	11 years ago
Michael Peter Christen	e87dc08c0d	set the correct fail time in error docs	11 years ago
Michael Peter Christen	a7dd89c4de	changed method to write the citation index: do not catch up references during document parsing; instead use the same references that would also be written into the webgraph. That should cause that the webgraph and the citation index express the exact same semantic.	11 years ago
orbiter	d68438c3d9	make sure that the postprocessing background thread never dies by any exception	11 years ago
orbiter	927aaa95a6	concurrency bugfix	11 years ago
reger	f9db5dd6c5	reduce doublecontent check document (prevent out of memory) see http://mantis.tokeek.de/view.php?id=437 test result (concurrency=7) 2000 docs = eom always 1000 docs = eom always 100 docs = eom never chosen -> 200 docs (eom not encountered during test with 1GB mem setting)	11 years ago
reger	a8508417d1	catch NPE during crawl (OAI import) - condenseDocument mime=null (allowed) - collectionconfiguration responseheader = null (allowed)	11 years ago
Michael Peter Christen	6344718f8b	reducing the concurrent query stack size and reduced concurrency of postprocessing to avoid OOM situations	11 years ago
Michael Peter Christen	191ec8c82a	added concurrency to postprocess rewrite process	11 years ago
Michael Peter Christen	a1e8bdd5e9	log ppm instead of docs/second	11 years ago
Michael Peter Christen	338f574bdc	no sorting if http/www unique fields are not demanded (makes query faster) and some code restructuring	11 years ago
Michael Peter Christen	0ceeceb35e	more logic on Solr queries; usage of the query terms in postprocessing, saving one query for double document detection now per document	11 years ago
orbiter	4099296b45	added new classes which shall reduce call overhead to Solr (stub)	11 years ago
orbiter	3491ab4c38	removed unused images from webgraph edge computation	11 years ago
orbiter	1027f3d04a	fix for the usage of ready-prepared solr queries, some queries are formulated as edismax query but this was not set as query attribute. The defType=edismax property needs a qf-field, so this was added as well. Do not remove that field again! This fixes also a problem with title-unique computation.	11 years ago
Michael Peter Christen	504327b15c	fix for condition for writing the webgraph	11 years ago
Michael Peter Christen	4eec1a7452	refactoring (change Metadata name of load time data structure to avoid confusion with Node data which is also called metadata)	11 years ago
reger	f96cfdc84d	prevent array out of bound exception on getRankingProfile(x) on faulty &profileNr= query parameter	11 years ago
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method than the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls, just use toNormalform(false) instead.	11 years ago
Michael Peter Christen	bf1b6b93e7	do not write CR values to webgraph if no CR values are computed	11 years ago
Michael Peter Christen	8514bffc22	enhanced postprocessing status report	11 years ago
Michael Peter Christen	b5d78ba156	reduced number of solr queries during crawling	11 years ago
Michael Peter Christen	fb3dd56b02	fix for processing of noindex flag in http header	11 years ago
Michael Peter Christen	b0d941626f	fixed bugs in canonical, robots and title/description unique calculation	11 years ago
Michael Peter Christen	1092e798a5	fixed double content postprocessing	11 years ago
Michael Peter Christen	36e623d8bf	enhanced metadata enrichment for media file type search: - Web servers may now deliver YaCy-specific http header fields with a title and keywords. The new http header fields are: X-YaCy-Media-Title - to be used for media (image, audio, video) titles; X-YaCy-Media-Keywords - to be used for media (image, audio, video) keywords - both fields are written to document fields title and keywords and are searched also during image search. - to make the usage of arbitrary http header fields (including these new fields) possible in the /api/push_p.json servlet, a new POST argument is also introduced to push http header fields. The new POST attribute is named "responseHeader-X" (where X is the counter). It is allowed to use this attribute as multi-attribute several times, each can be filled with a http header line. - see /api/push_p.html for examples	11 years ago
Michael Peter Christen	922979aae1	added option to prefer http over https in unique-protocol ranking	11 years ago
Michael Peter Christen	b3b174e2b8	fixed webgraph postprocessing and status display in Crawler_p servlet	11 years ago
Michael Peter Christen	8ad41a882c	fixed several problems with postprocessing: - unique-postprocessing was destroying results from other postprocessings; removed cross-updates as they were not necessary - unique-postprocessing did not restrict on same protocol - inefficient concurrent update cache was redesigned completely - increased limits for concurrent blocking queues to prevent early time-out	11 years ago

161 Commits (a34f8375925222c1af4e296bf9d457b21a019910)