yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Daleth Darko	3ced06c731	Various javadoc fixes	3 years ago
Michael Peter Christen	bd3f2483a1	replaced url and date retrieval by only url retrieval. This should prevent the search index from being used for freshness of the index entry.	3 years ago
Michael Peter Christen	e9c5e78868	replaced new Number(Number) with Number.instanceOf to remove deprecation warnings for Java 9	3 years ago
Michael Peter Christen	c0d9a3e9a7	turned HostBrowser into a admin-only page, now called IndexBrowser This was required because spiders and bots crawled through this page and created load on the peer without use for the user or the YaCy network.	4 years ago
sgaebel	3431f91db9	removes unused 'unused' tokens	4 years ago
luccioman	e85f231bdf	Fixed termination of Host browser and link structure Solr query threads On some conditions (especially when reaching timeout), concurrent Solr query tasks used by the /HostBrowser.html and /api/linkstructure.json never terminated, thus leaking resources, as reported by @Vort in issue #246	6 years ago
sgaebel	4b79851e12	corrected icons_sizes_sxt to SolrType.string	7 years ago
reger	a8234b7ea7	Make sure for image resource url enabled index image pixel size fields are filled if at least one of the image size fields is enabled in index (images_height_val, images_width_val, images_pixel_val). Previously all fields were required to be enabled (hint: default setting is height + width enabled)	7 years ago
reger	c31d94664a	Update deprecated SolrInputDocument.addField() with boost value remove unused SchemaConfiguration.getDate (as it is designed to return only past dates which might be unexpected for general configuration schema)	7 years ago
reger	e918ec199e	Replace deprecated ConcurrentHashSet with recommended Java8 ConcurrentHashMap.newKeySet() in postprocessDocuments()	7 years ago
reger	275d65fffe	Patch last_modified date with internal FirstSeenTime() if no date provided to make sure updated documents are indexed with their last-modified date as provided in current crawl. (to patch moddate always with firstseen might bear the risk of missing actual updates).	7 years ago
luccioman	8da3174867	Ensure lower case conversion consistency with any default locale. Especially for Turkish speaking users using "tr" as their system default locale : strings for technical stuff (URLs, tag names, constants...) must not be lower cased with the default locale, as 'I' doesn't become 'i' like in other locales such as "en", but becomes 'ı'.	7 years ago
Michael Peter Christen	3b1d640a3c	enhanced debugging	8 years ago
luccioman	654801523e	Fixed StringIndexOutOfBoundsException case. Revealed by commit `c77e43a` : the exception was then thrown when indexing pages containing mailto: scheme URL links with the Solr Webgraph core enabled. Fixed the error case and restored filtering on mailto links in Document.resortLinks() as these URLs still should not appear in Document.hyperlinks.	8 years ago
Michael Peter Christen	973d74712f	added yacy grid flatjson surrogate parser	8 years ago
luccioman	ac766327d3	Switched a few more Solr fields from strictly mandatory to optional	8 years ago
luccioman	cdc7f3e431	Switched some Solr fields from mandatory to optional These fields are default enabled but with no doubt not strictly mandatory with the current code base. As reported by @reger24, splitting between essential mandatory and optional fields is still to be improved to reflect the current YaCy needs.	8 years ago
luccioman	c68a8be2d9	Refactored and enforced Solr mandatory fields for proper operation - Added a new method to check activation of mandatory fields on Collection Configuration commit, consistently with checks previously performed in Switchboard startup and with mandatory fields in the default schema. - Reorganized default schema and CollectionConfiguration enumeration : moved no more mandatory fields in a specific section, and moved fields enabled at startup to the mandatory section. - Marked mandatory fields as required and with stronger font in the IndexSchema_p.html page	8 years ago
reger	5e8879beb7	Reduce self generated content for text_t (visible text index field) to avoid repeat of tokenized url as description, continuation of `7e09bff4a1` `1409cabe8b` Add some javadoc, and not needed remove of omitted fields in postprocessing.	8 years ago
reger	1f497ccad5	Add consistency check for related index fields upon load and save of index schema. To assemble the original link url for out-/inboundlinks, icons and pictures the _protocol_sxt and _urlstub_sxt is needed (due to the used data-reduced storage method). Auto-enable _protocol_sxt if _urlstub_sxt is enabled, to be able to correctly assemble the original link url.	8 years ago
reger	581b00cc20	remove obsolete lastmodified calculation in WebgraphConfig	8 years ago
luccioman	6a4d51d8f9	Cleaned up some Javadoc warnings.	8 years ago
reger	4c9be29a55	fix concurrency issue with htmlParser using not current scraper data resulting in incorrect data for some html index metadata. Details see http://mantis.tokeek.de/view.php?id=717	8 years ago
reger	b522d540b9	Include itemprop latitude/longitude (see schema.org) in attribute parsing for lat/lon. Harmonize number parsing for lat/lon to parseDouble. Fix endDate_dts value assignment.	8 years ago
reger	9db68acb4f	remove obsolete X_YACY... header declarations not in use (no writes, only remove and try to read). Obsolete parameter setupHttpClient	8 years ago
luccioman	6e1959f469	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git Conflicts: htroot/yacysearchitem.java source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java source/net/yacy/search/schema/CollectionConfiguration.java source/net/yacy/server/serverObjects.java	8 years ago
luccioman	8c49a755da	Postprocessing refactoring Added Javadocs to refactored methods. Added log warnings instead of silently failing some errors. Only fill collection1hosts when required ( shallComputeCR true).	8 years ago
luccioman	42f45760ed	Refactored postprocessing For easier understanding and performances profiling.	8 years ago
reger	4c7a77662a	eliminate dependency on file-extension in storeDocument but use supported mime-type to also support handling of urls w/o corresponding file-extension. For this refactor use of document.getParserObject() to always return a Parser (for clean logic) and define/move the scraperObject as local var of AbstractParser. Adjust related calls to getParserObject (where actually a scraperObject is wanted). Additionally skip appending url token to parsed text for dht metadata entries (by default returned as result by rwi index).	8 years ago
luccioman	6e96c7341a	Merge remote-tracking branch 'origin/master' Conflicts: htroot/Load_MediawikiWiki.java htroot/Load_PHPBB3.java htroot/ViewImage.java	8 years ago
reger	caf9e98f09	put metadata dc_publisher in corresponding schema field	9 years ago
luc	3f338777f7	Also check and index eventual icon url information from metadata.	9 years ago
luc	3cc5619d93	Improved HTML icons indexing and rendering in search results. See http://mantis.tokeek.de/view.php?id=629	9 years ago
luc	571bc55937	Refactoring : use StandardCharsets constants instead of hard-coded charset names.	9 years ago
reger	45b9bd8403	adjust MultiProtocolURL.protocol detection to handle mailto with "://" in parameters, and feeding hyperlinks to webgraph processing.	9 years ago
sixcooler	646afe9183	do not store subfield *_coordinate + make all num-fields being docvalues	9 years ago
reger	a58ee49307	Optimize internal imagequery focus on using content_type to select images (in favor of url file extension)	9 years ago
Michael Peter Christen	151ccd50a9	fix for image size field values (must be multi-valued)	9 years ago
reger	802ccaead6	fix init of error cache, use latest faildates => load_date_dt	9 years ago
sixcooler	87e4abe393	fight the fieldcache by using DocValues: in Solr-5.x the fieldcache has moved and was not cleared anymore. This results in a huge fieldcache. (http://lucene.apache.org/#highlights-of-the-lucene-release-include https://issues.apache.org/jira/browse/LUCENE-5666) Here I try to use DocValues where it is possible. For this I used the Api-Scheme as new basis for the Solr-Schema. This needs at least a complete optimization of the Solr-Index to get a smaller FieldCache. Everything that is indexed with these settings will not use the Fieldcache at all.	9 years ago
reger	eaf0e8ff2c	start recording/indexing pixel size for image document as for linked images	9 years ago
reger	c33229fc0c	check mime prior to ext for metadata modification for images	9 years ago
Michael Peter Christen	8028410ab7	Merge branch 'master' of git@github.com:yacy/yacy_search_server.git	9 years ago
Michael Peter Christen	df3314ac1a	added a new facet type based on a probabilistic classifier using bayesian filters. This can be used to classify documents during indexing-time using a pre-defined bayesian filter. New wordings: - a context is a class where different categories are possible. The context name is equal to a facet name. - a category is a facet type within a facet navigation. Each context must have several categories, at least one custom name (things you want to discover) and one with the exact name "negative". To use this, you must do: - for each context, you must create a directory within DATA/CLASSIFICATION with the name of the context (the facet name) - within each context directory, you must create text files with one document each per line for every category. One of these categories MUST have the name 'negative.txt'. Then, each new document is classified to match within one of the given categories for each context.	9 years ago
reger	1409cabe8b	exclude more default search fields from text copy to text_t for metadata index documents	9 years ago
Michael Peter Christen	0aa6fcf259	remove old vocabularies and synonyms before adding new	9 years ago
reger	f91298d3b6	fix one implicit Integer/Long type conversion -> causes Java 1.8 compile error	9 years ago
reger	821262a179	add CommonPattern for multiple spaces to eliminate empty split words on following spaces	10 years ago
Michael Peter Christen	90f75c8c3d	added enrichment of synonyms and vocabularies for imported documents during surrogate reading: those attributes from the dump are removed during the import process and replaced by new detected attributes according to the setting of the YaCy peer. This may cause that all such attributes are removed if the importing peer has no synonyms and/or no vocabularies defined.	10 years ago
reger	f3ce99bfb8	fix extract of inboundlinks_protocol_sxt url counter maybe > 999	10 years ago

255 Commits (d23dea26422fb685b470fabc619dd4386e03dc85)