yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	a8c64b1af2	added an artificial snippet [Synonym Match] in case that there is only a match in the synonyms]	2 months ago
Michael Peter Christen	910a496c9f	replaced http links with https	4 months ago
Michael Peter Christen	8eb0d490aa	migrated solr to 9.0 This is a major step because solr removed support for embedded solr instances in 9.0 and we want to keep it because we want to ship YaCy with an embedded solr. It was necessary to add parts of solr code into YaCy to make this migration possible. Further on with Solr 9.1 they removed even more parts which are required for embedded operation, therefore we cannot migrate yet further without big changes. If you are running a YaCy instance with Solr 8.x, the migration should be done automatically. If not you require to first migrate to a YaCy version 1.93 with Solr 8.x to migrate to Solr 8 data.	6 months ago
Michael Peter Christen	7db0534d8a	Added a zim parser to the surrogate import option. You can now import zim files into YaCy by simply moving them to the DATA/SURROGATE/IN folder. They will be fetched and after parsing moved to DATA/SURROGATE/OUT. There are exceptions where the parser is not able to identify the original URL of the documents in the zim file. In that case the file is simply ignored. This commit also carries an important fix to the pdf parser and an increase of the maximum parsing speed to 60000 PPM which should make it possible to index up to 1000 files in one second.	1 year ago
Michael Peter Christen	8285fe715a	tab to spaces for classes supporting the condenser. This is a preparation step to make changes in condenser and parser more visible; no functional changes so far.	1 year ago
Daleth Darko	3ced06c731	Various javadoc fixes	3 years ago
Michael Peter Christen	bd3f2483a1	replaced url and date retrieval by only url retrieval This should prevent that the search index is used for freshnes of the index entry.	3 years ago
luccioman	e97580dfc7	Fixed unsafe conccurent access to generic SimpleDateFormat instances SimpleDateFormat must not be used by concurrent threads without synchronization for parsing or formating dates as it is not thread-safe (internally holds a calendar instance that is not synchronized). Prefer now DateTimeFormatter when possible as it is thread-safe without concurrent access performance bottleneck (does not internally use synchronization locks).	6 years ago
luccioman	addd18c993	Removed some remaining uses of deprecated Seed.getIP()	7 years ago
luccioman	dd9cb06d25	Fixed RWI distance calculation on multi words search queries. Distance was lost when storing/retrieving references to intermediate result container. Now all JUnit tests are again successfully passing!	7 years ago
luccioman	5d3ceb31b7	Improved search navigators counters accuracy and consistency. - added some missing increments from RWI results - decrement relevant navigator counts when solr or RWI results are evicted because duplicates detection or constraints checked belatedly - do not compute facets when unnecessary to avoid unwanted CPU load - do not increment from facets when already done - do not rely on facets on remote solr peers requests, as most of the time only a limited part of their total results if fetched (thus also preventing unnecessary load on remote peers) - use a concurrency friendly score map for the dates navigators to prevent unwanted ConcurrentModificationExceptions This improves the situation for the most obvious inconsistencies in search navigators counts, but more has to be done for a true accuracy (notably when query modifiers constraints are applied belatedly - after the solr or RWI retrieval request - such as the content domain constraint)	7 years ago
luccioman	b23a563065	Prevent search result failure on incomplete images information. Complements the recent modification related to images in commit `7f395ef`. Unfortunately many documents metadata fetched from the freeworld p2p network have only partial information about embedded images. Without proper error handling, this made many searches in p2p mode to fail completely.	7 years ago
Michael Peter Christen	7f395ef937	added image link in search results This should be a help to make a preview of search results. The image is computed from the list of embedded images, it is always the first image in that list. In rss-type results the image is presented like <media:content medium="image" url="https://abc.xyz/logo.png"/> as defined in http://www.rssboard.org/media-rss#media-content	7 years ago
luccioman	8da3174867	Ensure lower case conversion consistency with any default locale. Especially for Turkish speaking users using "tr" as their system default locale : strings for technical stuff (URLs, tag names, constants...) must not be lower cased with the default locale, as 'I' doesn't becomes 'i' like in other locales such as "en", but becomes 'ı'.	7 years ago
luccioman	0da1e6ba16	Factored code re-implementing DigestURL.hosthash() method. This ensure consistent implementation of the url host hash generation and easier usage finding in source code. Also added a unit test for this function.	8 years ago
reger	eedee6eabb	fix exception on URIMetadataNote instantiation with corrected id hash on host_id_s. Use Solr setField instead of addField to prevent java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String at net.yacy.kelondro.data.meta.URIMetadataNode.hosthash(URIMetadataNode.java:247) at net.yacy.search.query.SearchEvent.addNodes(SearchEvent.java:966) at net.yacy.peers.Protocol.solrQuery(Protocol.java:1242) at net.yacy.peers.RemoteSearch$2.run(RemoteSearch.java:349)	8 years ago
reger	20a1b29ed3	add simple test case for ReferenceContainer helpful for debugging calculated ranking parameter	8 years ago
reger	3c7220bc7b	Refacture rwi reference word position and word distance calculation used for rwi ranking. Main changes: - introduce a posintext() to access the stored value. This reduces also mem alloc of position array for WordReferenceRow (index access) - use the positions() array for joined references on multi-word queries if needed (otherwise allow positions() to be null - adjust assignments and the min() max() and distance() calculation accordingly	8 years ago
luccioman	f0639d810c	Customized name for Threads still using the default "Thread-n" pattern. This makes threads monitoring easier to read.	8 years ago
reger	8b74a6bf57	fix min/max calculation of WordReferenceVars.distance() Issue was the calculation in AbstractReference with positions.clear() call, this made distance result always 0 (distance needs min 2 positions) and created concurrency issues. + unit test of changes	8 years ago
luccioman	6e1959f469	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git Conflicts: htroot/yacysearchitem.java source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java source/net/yacy/search/schema/CollectionConfiguration.java source/net/yacy/server/serverObjects.java	8 years ago
reger	685d8e86bf	Avoid frequent data type casting (float/long) for rwi score refactor to using long in URIMetadataNode too (and related call parameters) As remote rwi score's are not used (since v1.83) skip reading float-score , but keep in toString() for communication with older versions.	8 years ago
reger	681a61dafb	adjust rwi index result word position handling used for rwi ranking - correct WordReferenceVars.toRowEntry posintext parameter to set expected min posintext (the difference is on multi-word queries, while positions are ordered by search word order). - modified posofphrase/posinphrase join operation - to set min posofphrase - and keep posinphrase if not same posofphrase (was set to 0, no differentiation during ranking) + fix compiler msg (missing type declaration)	8 years ago
reger	ff6589fc0f	test case: simulating multi word query for local rwi index Purpose of the test case is to be able to (controlled) analyse the rwi ranking for multi word searches (with focus on posintext and word-distance ranking)	8 years ago
reger	3b694b3935	add some javadoc to rwi wordreference distance, position to remember facts for http://mantis.tokeek.de/view.php?id=683 Init missing word position to 0 like in other non text body words	8 years ago
reger	96467c5467	remove not needed counter in Tokeninzer (completing last changes) including a small change, word posintext counting. We remember/store 1st posintext. Previously following words got a handle (posintext) excluding found. Now it just counts and assigns true posintext as handle (posintext)	8 years ago
reger	7efb66ee10	adjust the WordReference.join wordsintext calc to take the max (instead of sum) The reference is for the same url (add same for title and phrases). + del redundant join() procedure	8 years ago
reger	120bf7e6e2	implemented RWI WordReference to return the word position value (was always left empty) This is needed and enables existing word position ranking for RWI. The upcoming concurrency issue in word position min/max calculation were eliminated by iterator.hasHext check before next() access.	8 years ago
luc	26f1ead57c	Created ViewFavicon class specialized in favicon viewing. Main image processing is now in ImageViewer, used by both ViewImage and ViewFavicon. Fixed URIMetadataNode.getFavicon to use non-standard icons with no size ass fallback.	9 years ago
luc	07222b3e1a	Added favicon url transmission in RWI chunks.	9 years ago
luc	3cc5619d93	Improved HTML icons indexing and rendering in search results. See http://mantis.tokeek.de/view.php?id=629	9 years ago
reger	b4b6910d60	fix (todo): correct doc.id of remote search result if no match with newly calculated doc hash if different. Testing showed that in some cases delivered url doesn't match the local calculated hash. In this case replace doc.id (and host_id_s) with calculation from url.	9 years ago
reger	cb83e65f89	drop returning document language "en" if unknown (fix todo) which also harmonizes handling of query.modifier for rwi and solr results (to result must match a given language filter)	9 years ago
reger	cdb8f3b10d	make current ranking score value avail. to search interface / api Update the result score result field with the result queue ranking value to reflect the actual calculated/used score, for rwi & solr stack results. (calc. etc. is unchanged, it's just that result entry carries the latest val as api retrieves the number from it)	9 years ago
reger	1160b13172	remove unused md5 from ViewFile servlet params	9 years ago
reger	b2c8bc0ae6	remove md5_s from default index fields it is not assigned a value / not used Due to above also excluded from transfer protocol.	9 years ago
reger	ca3d26a401	harmonize wordsintitle & CollectionSchema.title_words_val calculation, remove obsolete partial init of wordreference from urimetadata	9 years ago
luc	c38d6c1f37	Correction for mantis 535: inurl: parameter doesn't work on URLs with upper-case letters	9 years ago
reger	3f2b8ab5e5	optionally include mime in p2p url exchange string if doctype decodes to ambiguous mime and default conversion is not equal to original	9 years ago
reger	e37a4f0b3d	prevent metadata records in index w/o valid url by throwing MalformedURL exception on URIMetadataNode creation	9 years ago
reger	0e4ba0360b	fix NPE on .yacyh result url of disconnected peer (cleanup yacyshare remaining)	9 years ago
Michael Peter Christen	dbbad23e12	removed warnings	9 years ago
Michael Peter Christen	b94bd7f20a	a collection of search query enhancements: - fixed superfluous space in query field list - fixed filter query logic - removed look-ahead query which caused that each new search page submitted two solr queries - fixed random solr result orders in case that the solr score was equal: this was then re-ordered by YaCy using the document hash which came from the solr object and that appeared to be random. Now the hash of the url is used and the score is additionally modified by the url length to prevent that this particular case appears at all.	9 years ago
reger	8b35656007	remove hard throw exception in makeResultEntry remove not used "share." peername.yacy url rewrite	10 years ago
reger	000dde9511	Eleminate duplication of values for search ResultEntry by instatiation from URIMetadataNode, by eleminating differentiation of ResultEntry/URIMetadataNode. - moved remaining ResultEntry functionallity to URIMetadataNode - for 1:1 functionallity added a function makeResultEntry() - removed ResultEntry - refactored related code Main difference is after makeResultEntry the text_t content is removed and alternative title/url strings for display are calculated. Main difference left is, that	10 years ago
Michael Peter Christen	fed26f33a8	enhanced timezone managament for indexed data: to support the new time parser and search functions in YaCy a high precision detection of date and time on the day is necessary. That requires that the time zone of the document content and the time zone of the user, doing a search, is detected. The time zone of the search request is done automatically using the browsers time zone offset which is delivered to the search request automatically and invisible to the user. The time zone for the content of web pages cannot be detected automatically and must be an attribute of crawl starts. The advanced crawl start now provides an input field to set the time zone in minutes as an offset number. All parsers must get a time zone offset passed, so this required the change of the parser java api. A lot of other changes had been made which corrects the wrong handling of dates in YaCy which was to add a correction based on the time zone of the server. Now no correction is added and all dates in YaCy are UTC/GMT time zone, a normalized time zone for all peers.	10 years ago
Michael Peter Christen	9bf0d7ecb9	added a new collection type 'dht' to all documents from the peer-to-peer interface to distinguish rich and poor document data. This also reverts some changes from commit `796770e070` because the firstSeen database is the wrong method to distinguish these types of data	10 years ago
reger	431311df42	fix get fresh_date_dt to allow returned value to be date in future	10 years ago
Michael Peter Christen	fd4e2c809a	Show dates in the content of a document in the search result: - if an eventDate is given in the search result, replace the document date with the event date and prefix it with the string "on ". - the document date is omitted if a date from the cent is shown Added also the date as fields in the json and rss result sets.	10 years ago
Michael Peter Christen	3d717b749a	fix for urlmaskfilter	10 years ago

1 2 3 4 5 ...

270 Commits (master)