yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	535f1ebe3b	added a new way of content browsing in search results: - date navigation The date is taken from the CONTENT of the documents / web pages, NOT from a date submitted in the context of metadata (i.e. http header or html head form). This makes it possible to search for documents in the future, i.e. when documents contain event descriptions for future events. The date is written to an index field which is now enabled by default. All documents are scanned for contained date mentions. To visualize the dates for a specific search result, a histogram showing the number of documents for each day is displayed. To render these histograms the morris.js library is used. Morris.js requires also raphael.js which is now also integrated in YaCy. The histogram is now also displayed in the index browser by default. To select a specific range from a search result, the following modifiers have been introduced: from:<date> to:<date> These modifiers can be used separately (i.e. only 'from' or only 'to') to describe an open interval or combined to have a closed interval. Both dates are inclusive. To select a specific single date only, use the 'to:' - modifier. The histogram shows blue and green lines; the green lines denote weekend days (Saturday and Sunday). Clicking on bars in the histogram has the following reaction: 1st click: add a from:<date> modifier for the date of the bar 2nd click: add a to:<date> modifier for the date of the bar 3rd click: remove from and date modifier and set an on:<date> for the bar When the on:<date> modifier is used, the histogram shows an unlimited time period. This makes it possible to click again (4th click) which is then interpreted as a 1st click again (sets a from modifier). The display feature is NOT switched on by default; to switch it on use the /ConfigSearchPage_p.html servlet.	10 years ago
reger	ba276d3e64	add description_txt to default query fields, Dublin Core Metadata field extracted by most parsers.	10 years ago
reger	fe6f5a395d	fix Umlaut handling in blekko heuristic search term http://mantis.tokeek.de/view.php?id=169 observation: blekko seems to block xxxbot agents (=0 results)	10 years ago
Michael Peter Christen	97ba5ddbb7	configuration option for maxload limit for remote search	10 years ago
Michael Peter Christen	ac19690d30	refactoring with CommonPattern.COMMA	10 years ago
Michael Peter Christen	cf9b22ca5c	do not reindex based on vocabulary fields (there are meanwhile many of them) and some default settings	10 years ago
reger	24f68a4eb7	refactor opensearch heuristic introduce FederateSearchManager handling search heuristic to external systems via specific FederateSearchConnectors, which provide the query() functionality, the translation to YaCy schema .toYaCySchema() and the search() routine to deliver results to searchevents, which is generally implemented in Abstract connector. The manager enforces now a min 15s delay between calls to external systems. Besides the OpensearchConnector a SolrFederateSearchConnector is available. It uses an additional config file for fieldname translation. default heuristicopensearch.conf: - openbdb.com removed - seems no longer to deliver results - config via solrconnector to datacite.org added (large technical library archive)	10 years ago
reger	4eb89d7f15	revert clickservlet (default was indeed a mistake)	10 years ago
Michael Peter Christen	61ae9d2d11	do not use the clickservlet by default. From my personal view, this technique should not be used at all! This project is about privacy, the existence of a click servlet is one example why people should NOT use a search portal if such exists.	10 years ago
sixcooler	5594c43d2e	bump to Solr-/Lucene-4.10.3	10 years ago
reger	d44d8996d0	Added a “don't store remote search results” option This is intended for peers who want to participate in the P2P network but don't wish to load/fill-up their index with metadata of every received search result. The DHT transfer is not affected by this option (and will work as usual, so that a peer disabling the new store to index switch still receives and holds the metadata according to DHT rules). Downside for the local peer is that search speed will not improve if search terms are only avail. remote or by quick hits in local index. To be able to improve the local index a Click-Servlet option was added additionally. If switched on, all search result links point to this servlet, which forwards the user's browser (by html header) to the desired page and feeds the page to the fulltext-index. The servlet accepts a parameter defining the action to perform (see defaults/web.xml, index, crawl, crawllinks) The option check-boxes are placed in ConfigPortal.html	10 years ago
reger	e177d69387	remove obsolete config footer option (ConfigPortal user.login) no footer or footer-option in use remove unused yacy.init item allowUnlimitedReceiveIndexFrom	10 years ago
reger	6a04563578	Init Jetty using setDefaultDescriptor (web.xml) to defaults/web.xml so web.xml in defaults dir is applied first and optional DATA/SETTINGS/web.xml loaded on top. By using this Jetty feature (default web.xml) we assure that changes to the default are applied to existing installations and individual addition/changes are still respected.	10 years ago
Michael Peter Christen	eb78388a98	changed prefer strategy for http unique in such a way that http is preferred over https. While this is a bad idea from the standpoint of security it is more commonly applicable for environments where http and https mix and for some domains https is not available. Then the double-check is possible even if no postprocessing is performed.	10 years ago
Michael Peter Christen	d14114697c	the miss cache does not seem to work, it sometimes contains urlhashes from documents which actually are inside the index. This can be reproduced using the crawl result table at http://localhost:8090/CrawlResults.html?process=5 The cache is temporarily disabled to remove the bad behaviour, however a later reactivation of that feature may be possible.	10 years ago
reger	446f374ba9	fix yacy.init comment http://mantis.tokeek.de/view.php?id=513	10 years ago
Michael Peter Christen	66b5a56976	Added and integrated new date detection class which can identify date notions within the fulltext of a document. This class attempts to identify also dates given abbreviated or with missing year or described with names for special days, like 'Halloween'. In case that a date has no year given, the current year and following years are considered. This process is therefore able to identify a large set of dates to a document, either because there are several dates given in the document or the date is ambiguous. Four new Solr fields are used to store the parsing result: dates_in_content_sxt: if date expressions can be found in the content, these dates are listed here in order of the appearances dates_in_content_count_i: the number of entries in dates_in_content_sxt date_in_content_min_dt: if dates_in_content_sxt is filled, this contains the oldest date from the list of available dates #date_in_content_max_dt: if dates_in_content_sxt is filled, this contains the youngest date from the list of available dates, that may also be possibly in the future These fields are deactivated by default because the evaluation of regular expressions to detect the date is yet too CPU intensive. Maybe future enhancements will cause that this is switched on by default. The purpose of these fields is the creation of calendar-like search facets, to be implemented next.	10 years ago
Michael Peter Christen	114f0afc1e	enable sku as anchor in html response writer	10 years ago
Michael Peter Christen	60f27bdf49	added the property timeoutrequests to configuration to disable TimeoutRequests. The purpose is to test if YaCy runs better on VMs where there is a limitation of concurrent processes; see /proc/user_beancounters in row numproc; this value is limited and should be low. Try to set timeoutrequests to keep this low. (works only after restart)	10 years ago
Michael Peter Christen	1d45d9405a	security bugfix	10 years ago
Michael Peter Christen	c94c24638f	disabled postprocessing by default. If you read this: please disable postprocessing in your peer as well: open /IndexSchema_p.html, then deselect field process_sxt	10 years ago
Michael Peter Christen	c0f9f6ac66	added option to change the navbar-default, i.e. usable for dark skins	10 years ago
Michael Peter Christen	84763126e0	added option to make the YaCy proxy act as if the cache is never stale. If set to 'Always Fresh' the cache is always used if the entry in the cache exists. This is a good way to archive web content and access it without going online again in case the documents exist. To do so, open /Settings_p.html?page=ProxyAccess and check the "Always Fresh" checkbox. This is set to false, which behaves as before. If you set this to true, then you have your web archive in DATA/HTCACHE. Copy this to carry around your private copy of the internet!	10 years ago
reger	ee277b9b3e	allow for local yacy.stopwords and yacy.badwords list (in DATA/SETTINGS/) if file in DATA/SETTINGS it is loaded otherwise file in ./defaults is loaded (if locale ./defaults/stopwords.xx doesn't exist take solr/lang/stopwords_xx.txt as default) move yacy.stopwords, yacy.stopwords.de and yacy.badwords.example out of root directory to ./defaults directory	10 years ago
Michael Peter Christen	c67c5c0709	added new solr schema fields which record the occurrences of vocabulary matchings. These matches can be used for result boosting, i.e. if a document contains words from a specific vocabulary, boost it.	10 years ago
Michael Peter Christen	68e8039fd1	added high-precision scheduler for API processes. This allows also to make the execution in dependency of available RAM or CPU load. The default value for CPU load is 4.0 and the check runs once a minute.	10 years ago
sixcooler	725b206fb4	update to solr-/lucene-4.10.2	10 years ago
Michael Peter Christen	26279b0993	added debug code for statistics about document attributes related to domains	10 years ago
Michael Peter Christen	2e5214eb21	added field postprocessing.partialUpdate to settings which can be used to switch on or off partial updates. Both options should cause the same result. Default is on.	10 years ago
Michael Peter Christen	b1cfbc4a04	added new solr field url_paths_count_i which can be used to enhance the index browser and maybe also for ranking; possibly also for SEO-with-YaCy applications.	10 years ago
Michael Peter Christen	8c1a89cb34	added another decoration flag to switch off network graphics in crawler monitor and index browser: decoration.grafics.linkstructure Please set this to false to remove the graphics from the interface.	10 years ago
Michael Peter Christen	bc221a0f9c	less load and more ram prerequisite for crawl steps	10 years ago
Michael Peter Christen	2a052f446a	Added an experimental audio feedback system. This is the first element of a new 'decoration' component which may hold switches for different external appearance parameters. The first switch in that context is decoration.audio (as usual in yacy.init). This value is set to false by default, that means the audio feedback element is switched off by default. To switch it on, set decoration.audio = true (using /ConfigProperties_p.html). You will then hear sounds for the following events: - remote searches - incoming dht transmissions - new documents from the crawler Sound clips are stored in htroot/env/soundclips/ which is done so because a future implementation will read these files using the http client and with configurable urls which will make it very easy for the user to replace the given sounds with own sounds.	10 years ago
Michael Peter Christen	f03dd0df24	updated seedlist	10 years ago
Michael Peter Christen	2b1cf26828	removed solr warning during startup	10 years ago
Michael Peter Christen	57ce7eeff3	fixed localhost authorization and replaced the adminRealm with an info string which is visible in the browser. That makes it possible that the browser instructs the user how to change a forgotten admin password (during runtime).	10 years ago
orbiter	f318d7c285	enhanced date-ordered ranking	10 years ago
orbiter	b3ebd38079	removed the HTDOCS repository concept because the concept to host files on the YaCy http server is obsolete; YaCy can index file:// and smb:// paths	10 years ago
reger	ec5b1d9e33	let NETWORK_WHITELIST take precedence over NETWORK_BLACKLIST this makes it easier to config exception (for private networks), like blacklist= .* whitelist= 10\..,127\.. ..... allows only listed ip pattern	10 years ago
orbiter	2371d6b8db	target linktexts must be string to enable search facets on these fields	10 years ago
orbiter	161a11070c	yacystats is gone :(	10 years ago
reger	7328c2883b	fix type in .init description http://mantis.tokeek.de/view.php?id=430	10 years ago
reger	94819f0797	set .ini default boost fields to same as assigned by button "reset to default" (in RankingSolr_p) - fix typo http://mantis.tokeek.de/view.php?id=430	10 years ago
reger	a2cb366b25	Combine /heuristic search modifier with opensearch configured targets - with search modifier /heuristic a request is sent to all configured opensearch target systems (old /heuristic/blekko modifier no longer valid) - this allows to use opensearch heuristic on individual search request (in contrast to configuration HEURISTIC_OPENSEARCH=true which sends an osd request on all global searches) - the index.html searchoption text adjusted to be displayed only if option configured - add Archive-It to predefined systems	10 years ago
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method than the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls, just use toNormalform(false) instead.	10 years ago
Michael Peter Christen	1092e798a5	fixed double content postprocessing	11 years ago
Michael Peter Christen	09dcdb9b19	update to solr 4.9.0	11 years ago
orbiter	0bbb5040b8	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	9d5d86cd03	Added filter query options to the ranking servlet /RankingSolr_p.html. Filter queries are not actually related to ranking, but user requests have pointed out that specific boost queries to move results to the end of the result list are not sufficient. Such boost filters may be better executed as actual filter and therefore such a filter can now be statically applied to every search request. A typical use could be the expression "http_unique_b:true AND www_unique_b:true" which uses the recently introduced fields http_unique_b and www_unique_b which are true only for one of the alternatives with/without http(s) and with/without prefix 'www.' in host names.	11 years ago
Michael Peter Christen	d2151857f1	Added collection navigation: The collection field (can be filled i.e. in Crawl Start) can be used to add categories to YaCy index entries. The usage of that field was restricted to solr searches and post argument filters as implemented in commit `f7571386a3`. This commit extends collections to a full navigation option in the standard YaCy search interface. The field is not active by default but can be activated easily in the /ConfigSearchPage_p.html servlet (just check the 'Collection' facet field). Collections can now be used for (at least) two purposes: - to provide search tenants (through post argument collection) - to provide self-made category navigation Search requests may now have (independently from switched on or off collection facet) a "collection:<collection-name>" modifier attached; furthermore collection names may use disjunctions using the '|' pipe symbol. For example, this is a valid search request: www collection:user|proxy	11 years ago
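
The "collection:" modifier described in commit d2151857f1 can be illustrated with a small sketch. This is not YaCy's actual parser: the class name CollectionModifierSketch, the Parsed holder and the parse method are hypothetical names chosen for illustration, and the sketch assumes that modifiers are separated from search terms by whitespace. It only shows how a query such as the example from the commit message might be split into plain terms and a set of collection names joined by the pipe symbol.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not taken from the YaCy code base.
public class CollectionModifierSketch {

    /** Holds the plain search terms and the collection names found in a query. */
    public static final class Parsed {
        public final List<String> terms = new ArrayList<>();
        public final Set<String> collections = new LinkedHashSet<>();
    }

    /** Splits a query on whitespace and extracts "collection:a|b" modifiers. */
    public static Parsed parse(String query) {
        final String prefix = "collection:";
        Parsed parsed = new Parsed();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty query yields one empty token
            }
            if (token.startsWith(prefix)) {
                // the pipe symbol separates alternative collection names
                for (String name : token.substring(prefix.length()).split("\\|")) {
                    if (!name.isEmpty()) {
                        parsed.collections.add(name);
                    }
                }
            } else {
                parsed.terms.add(token);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        Parsed p = parse("www collection:user|proxy");
        System.out.println(p.terms + " " + p.collections); // prints "[www] [user, proxy]"
    }
}
```

A document would then match if it belongs to any of the listed collections, which is the disjunction the commit message describes; how YaCy translates this into a Solr query is not shown here.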

555 Commits (5789c96292ca808c48c5cd2e043040cf6d9afac6)