yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	6aabc4e5c8	reduced logging line memory, 10000 lines had filled up 450MB! grrr. (thank you, a bomb from the past)	11 years ago
Michael Peter Christen	1b4fa2947d	- fixed a problem which ocurred when a document was not recognized with the right content domain (i.e. identifying that it is an image, text etc.) because it used the file extension and not an existing mime type assignment. - fixed the new setting that images shall be loaded for a better image search. - both fixes together makes it now possible to crawl commons.wikimedia.org which makes use of 'funny' document names (i.e. ending with .jpg while the document is html)	11 years ago
reger	f46c723398	allow to choose used http server, YaCy-Anomic or Jetty - defaults to Jetty (in this branch) - add server version info & config option -> Admin Console -> Advanced Settings -> Http Networking	11 years ago
Michael Peter Christen	820b896146	Replaced the inframe loading from yacy.net for donations with the loading of this iframe from the local host. To make this more flexible, this iframe is loaded once after startup from yacy.net.	11 years ago
reger	cf32a92629	- add size check to multipart form data handling of YaCyDefaultServlet (same as in HTTPDemon.parseMultipart) - reduce Jetty logging - give build.run a bit more memory (set to YaCy.default 600m from 512m)	11 years ago
reger	a44eede8b8	merge rc1/master	11 years ago
Michael Peter Christen	90c8577840	enhanced ranking; patches to replace old ranking	11 years ago
Michael Peter Christen	1b61bd40ed	- Added new solr field url_file_name_tokens_t which stores the file name tokens. This can be used to enhance the ranking. - Added also a rating_i field as basis for later usage. - enhanced the tokenization process.	11 years ago
orbiter	5f5a97bafc	added the anchor text within web pages to the searcheable entities of a web page. This can be of benefit for the ranking if these fields are used for boosts.	11 years ago
Michael Peter Christen	21aa6a0321	migration to Solr 4.5.0	11 years ago
reger	c7c706fd9f	merge with rc1/master	11 years ago
Michael Peter Christen	b28d43decc	added two more fields source_cr_host_norm_i,target_cr_host_norm_i in webgraph and an addition to postprocessing to copy all cr ranking attributes to the link edges associated to the postprocessing documents	11 years ago
Michael Peter Christen	4f83d5f18c	added the new field harvestkey_s to the collection index and the webgraph index which is temporary filled with the crawl profile key. This is used to select a set of documents for post-processing as soon as a crawl is finished. Now the postprocessing for a specific crawl is started when that specific crawl is finished and not at the end of all post-processing steps.	11 years ago
orbiter	8ac2e8c8c9	added location navigator which causes that the image to the map search is visible whenever a location is available in the search result. To activate this, the search.navigation property in yacy.conf must be modified to the new default values.	11 years ago
reger	5111841e5b	- reduce Jetty debug logging - fix Context path initialization	11 years ago
Michael Peter Christen	61c5e40687	- replaced the properties object in AnchorURL with distinct variables for anchor attributes. - this caused that large portions of the parser code had to be adopted as well - added a counter target_order_i for anchor links in webgraph computation	11 years ago
Michael Peter Christen	85456f46b2	added two new fields, exact_signature_copycount_i and fuzzy_signature_copycount_i, which count the number of copies of non-unique documents and assigns this to each document. Thus, each document there is a number assigned which shows how many copies of this document exists. These fields are disabled by default.	11 years ago
Michael Peter Christen	a2511b5600	turned images_alt_txt back to images_alt_sxt because it is not necessary to index the alt text. Indexed image Text is in images_text_t	11 years ago
Michael Peter Christen	69f85265e1	added an option to put image links to the crawl queue and handle these like normal documents. Using this option (by default on at this moment; this might change soon) it is possible to get the exif data into the search index to be used in image search.	11 years ago
orbiter	f106345eef	link strings should not be tokenized	11 years ago
orbiter	deadeb406e	image alt tag strings should be tokenized	11 years ago
Michael Peter Christen	1a3e42eca4	index migration to lucene 4.4	11 years ago
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying to robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on per-crawl-start basis. Every crawl start can have a different user agent.	11 years ago
sixcooler	1bc6003057	rise autoCommit maxTime to 3 Minutes to reduce IO lower mergeFactor again (5) for less segments	11 years ago
orbiter	944ae5686c	added donation plea to the about box as default (you can replace this in your peer!)	11 years ago
Michael Peter Christen	58fe986cca	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	cf12835f20	replaced the single-text description solr field with a multi-value description_txt text field	11 years ago
orbiter	e7fcb81cea	we should not do too much greedylearning at this time as we don't have enough experience with it. set greedylearning.limit.doccount to a much lower limit.	11 years ago
orbiter	bf0ad04e1b	apply load limitation also to dht-in	11 years ago
orbiter	f50b596e0b	do not run dht ditribution if system load is over 2.5	11 years ago
orbiter	e24016e30a	added the property federated.service.solr.indexing.timeout to yacy.init to provide a configurable time-out for solr; see also: http://bugs.yacy.net/view.php?id=254	11 years ago
Roland Haeder	98e10f95e2	Added some cora package loggers	11 years ago
orbiter	1b43e02b86	Merge branch 'master' of git://gitorious.org/~quix0r/yacy/quix0rs-yacy-rc1	12 years ago
orbiter	a548354c71	replaced type of solr schema object sku of text_en_splitting_tight by string	12 years ago
Roland Haeder	ebbb3bc5c1	Fixed CHMOD on many files + added missing loggers (e.g. jena) and made some noisy loggers quiet	12 years ago
orbiter	e609ec388a	metager whitelist update	12 years ago
Michael Peter Christen	2716dfc46c	increase crawler speed by reduction if the busysleep time	12 years ago
Michael Peter Christen	57ffdfad4c	added a crawl option to obey html-meta-robots-noindex. This is on by default.	12 years ago
Michael Peter Christen	5a5d411ec0	new robots_i attribute fields	12 years ago
orbiter	7c6ccc426c	set crawlingQ to true by default because most webpages are dynamic and crawlingQ should only be switched off in case of crawler traps	12 years ago
Michael Peter Christen	16d1d744fa	added url_file_name_s in default collection schema for the file name without the file extension. This part of the file path is removed from the multi-field url_paths_sxt, which has now not the file name as last part of the path list. The same applies to the new fields source_file_name_s and target_file_name_s in the webgraph schema.	12 years ago
orbiter	8792e6c6e9	stub for better image indexing	12 years ago
Michael Peter Christen	570511f3c8	removed fields references_internal_id_sxt and references_internal_url_sxt because they had been shown to be superfluous. The citation of referrer in the host browser is possible without them. Therefore now the host browser does not only show internal, but also external referrer to each link.	12 years ago
Michael Peter Christen	fd1776a3b0	added a new 'Citations' function: each search result item can now be explored for citations within other documents. A click on the 'Citations' link shows an analysis with all text lines in the document each with a complete list of documents which contain the same line. A second section shows the linking documents in ascending order of number of citations from the original document. Because documents from different hosts are most interesting here, they are listed at the top of the page as possible 'copypasta' source.	12 years ago
Michael Peter Christen	7754a1263b	switching back to the merge factor 10; the solr default.	12 years ago
Michael Peter Christen	1762911f57	added synchronizations and timeouts in solr api; missing synchronizations in index modification methods causes deadlocks inside solr.	12 years ago
Michael Peter Christen	959ccc4675	increased the solr merge factor because 4 was too much IO load for frequent index receiving and re-indexing after clickdepth/cr calculation.	12 years ago
Michael Peter Christen	20fab1feb6	allip net has greedy learning disabled	12 years ago
Michael Peter Christen	6115bef335	added a 'greedy learning' mechanismn which will cause that a 'fresh' yacy will load linked web pages from search results until the total number of web pages reaches 15000. This shall give fresh peers a 'boost' to get faster a personalized search index.	12 years ago
Michael Peter Christen	856e5c42ae	the line "Web Search by the People, for the People" is more generic for P2P and portal search as default search string. Otherwise, if people switch to Portal mode, the "P2P Web Search" does not make sense.	12 years ago
Michael Peter Christen	713a6199ef	activated citation ranking by default	12 years ago
Michael Peter Christen	f7a4377812	usage of the new normalized link polularity CRn as default ranking function. This replaces the previous formula, which was bad. Before you update to this version, please check if you changed the ranking function yourself before, since it will be overwritten.	12 years ago
Michael Peter Christen	f7e77a21bf	Added a citation reference computation for intra-domain link structures. While the values for the reference evaluation are computed, also a backlink-structure can be discovered and written to the index as well. The host browser has been extended to show such backlinks to each presented links. The host browser therefore can now show an information where an document is linked. The new citation reference is computed as likelyhood for a random click path with recursive usage of previously computed likelyhood. This process is repeated until the likelyhood converges to a specific number. This number is then normalized to a ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to rank popularity within intra-domain link structures.	12 years ago
reger	8a7fcb391d	enable use of solrcore.properties for property substitution of solrconfig.xml - move setting of system property solr.directoryFactory=solr.MMapDirectoryFactory to solrcore.properties - add check of os.arch for 64bit system, if it fails use default/solrcore.x86.properties (if exists) as solrcore.properties reason: on 32bit MMapDirectoryFactory may fail with..... Caused by: java.io.IOException: Map failed at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:849) at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)	12 years ago
Michael Peter Christen	eb9d0ba5b1	ranking and boost function update, small bugfixes, better default search field for solr	12 years ago
Michael Peter Christen	a8dc4346e8	default configuration of MMapDirectoryFactory for solr, increased lock timeout, less documents from remote searches (too many results had easily blocked a peer)	12 years ago
Michael Peter Christen	0c1a018bbd	removed 'later' tactic because it used too much RAM, reduced number of soft commits, reduced caching size of search events, ensured that solr results are processed before connection is closed to keep that stuff not too long in RAM	12 years ago
Michael Peter Christen	536fd1450e	added new keys for update locations	12 years ago
orbiter	a83c2fe833	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
orbiter	4baa0d4a97	Added a default keystore for ssl encryption of the YaCy web interface. This will enable https-access to YaCy, but this feature is disabled by default using the new server.https=false attribute. This has two purposes: - make it easier for everyone to use https (just set server.https=true) - provide the basis for secure yacy-to-yacy communication in the future	12 years ago
reger	da191c839d	reduce SolrConnectorLogging setting (from default ALL to INFO)	12 years ago
Michael Peter Christen	9bd2aee180	migrated to solr 4.3.0	12 years ago
Michael Peter Christen	cca19d94d4	re-declared some fields to be of type string rather than text which makes them more efficient and less large	12 years ago
Michael Peter Christen	cc90f82dbb	increased default proxy client timeout to one minute	12 years ago
Michael Peter Christen	50421171c3	added new schema fields: hreflang_url_sxt and hreflang_cc_sxt for http://support.google.com/webmasters/bin/answer.py?hl=de&answer=189077 navigation_url_sxt and navigation_type_sxt for http://googlewebmastercentral.blogspot.de/2011/09/pagination-with-relnext-and-relprev.html publisher_url_s for http://support.google.com/plus/answer/1713826?hl=de all fields are disabled by default and not written to the index.	12 years ago
Michael Peter Christen	d05dc07cff	setting of new default values for ranking	12 years ago
Michael Peter Christen	97775fbebc	fixed ranking for add-function queries: this did not work. The option was removed. All function queries are now boosts (multiplies the score according to a function). This is also the recommended way to boost rankings based on functions as explained in http://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/	12 years ago
Michael Peter Christen	7ab5093321	added new solr title_exact_signature_l and description_exact_signature_l to be able to identify unique title and unique description fields.	12 years ago
Michael Peter Christen	27d6222880	added new field host_extent_i which, after a crawl and postprocessing, holds the number of documents for the host where the document is hosted. This is necessary for ranking and the norming of references per local host in the ranking computation.	12 years ago
Michael Peter Christen	ada3f27de7	added three new field for a better ranking: references_internal_i, references_external_i and references_exthosts_i. These can be used to count and evaluate the number of external links to every web page. An experimental ranking function can be i.e.: div(add(references_internal_i,product(references_external_i,references_exthosts_i)),add(clickdepth_i,1))	12 years ago
reger	e89491271f	- fix opensearch discover err msg - webgraph not enabled - if no opensearchdescription link found in index - remove search2.net from sample config (is down)	12 years ago
orbiter	17ae51e741	increased number of links limitation from 1000 to 10000 for rss feeds and html documents	12 years ago
Michael Peter Christen	2d36a7eaf5	- do not create a new query for all remote peers - no document search this time - adjusted banner and network to not show 'WORDS' but DHT Chunks. This is to avoid confusion for robinson peers which do not create Word Entries	12 years ago
Michael Peter Christen	4af0839be2	use appropriate ranking for each search situation: - when using the /date modifier, a date ranking profile is used - when using a site: modifier, a ranking profile supporting longer urls is used	12 years ago
Michael Peter Christen	2080fc7406	removed unused tag fields	12 years ago
orbiter	6b13dd0d3d	added clickdepth field writing for webgraph core (unfinished)	12 years ago
Michael Peter Christen	addba047e2	changes in ranking computation - an existing ranking servlet for solr was extended. It is now possible to set boost values for fields, boost functions and boost queries. - The ranking can have different instances, but currently only the first one is used - added an abstraction layer for fields which can be used for search and those fields can be edited in the solr ranking configruation - the ranking value from solr within the field score is used to combine remote search requests, which all are created using the same locally defined boost values - reduced the number of fields which are used for search (makes it faster) - replaced some text fields by string fields (makes indexing faster) - removed classes which had no use - made a large number of experiments for a better ranking and created a temporary setting which prefers hits inside titles - adjusted also the RWI-based ranking computation to 'prefer title' - made special cases like for portal search where no post-processing and post-ranking is wanted: this keeps the original ranking order as done by Solr - fixed many bugs with old settings for ranking	12 years ago
Michael Peter Christen	25300913fa	fixes to search debugging after testing with the different search debugging options	12 years ago
orbiter	b1140e3d82	added debug switches for detailed search testing	12 years ago
Michael Peter Christen	0d7b4bc891	better protection against OOM during search flush and fixed missing result push	12 years ago
Michael Peter Christen	3b1d9dc884	made index storage from DHT search result concurrently. This prevents blocking by high CPU usage during search. Also: removed query from Solr for DHT search results; results are taken from the pending queue.	12 years ago
orbiter	0f7ea7ad9f	- enhanced solr.add procedure for mass adds - removed unused solr access classes - made snippet generation for documents aus YaCy RWI/DHT concurrent (as it was before the search process removation) - reduced the number of remote results in settings file because the processing of such mass documents add is too CPU-intensive (in Solr)	12 years ago
Michael Peter Christen	089dee1770	- generalized SchemaConfiguration into super-class Configuration and adopted other classes which used the configuration-only access for that class - removed many warnings - adjusted logging	12 years ago
Michael Peter Christen	56d5946a59	- added flags in IndexFederated_p.html to switch on or off the webgraph index (new solr core webgraph) .. this is now off by default - completely redesigned this servlet - added description how to attach a remote solr - adjusted naming of servlet and menues - moved 'lazy initialization' attribut from IndexSchema to IndexFederated (this is a general option) back again.	12 years ago
Michael Peter Christen	461d46101d	- Removed log4j from libraries. This can be removed because the package log4j-over-slf4j is there. From slf4j all loggings are routed to the jdk logger. Now all loggings are consistently done to the jdk logger. - added some lines to the logging properties to suppress many solr logging statements. The number of the logging entries had already become a performance issue, therefore removing these from the log should increase performance.	12 years ago
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr field in the core 'webgraph'. The default schema uses only some of them and the resting search index has now the following properties: - webgraph size will have about 40 times as much entries as default index - the complete index size will increase and may be about the double size of current amount As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index will cause that some old parts in YaCy can be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also avaiable for p2p operation but client access is not yet implemented.	12 years ago
Michael Peter Christen	91a0401d59	introduced a second core named 'webgraph'. This core will hold the link structure, but is not filled yet. To have the opportunity of a second core, multi-core functionality had to be implemented to the deep-embedded solr: - migrated the solr_40 directory content to a subdirectory 'collection1'; the previously used default core is now called collection1 - added solr_40/webgraph subdirectory as second core - added a servlet configuration for the second core 'webgraph' in /IndexSchema_p.html - added instance handling as addition to solr connections: all solr connectors are now instances of an solr 'instance' object; this required a complete re-design of the solr embedding - migrated also caching and sharding ontop of new instance handling - migrated the search apis to handle now the access to a specific core, the default core named 'collection1' - migrated the remote solr search interface to access shards of cores; for the yacy remote search the default core is now called 'solr'; using the peer address as solr address - migrated the solr backup and restore process: old backups cannot be used after this migration! - redesign of solr instance handling in all methods which access the instances: they cannot hold copies of these instances any more; the must retrieve the actuall connection object every time they want to write to it (this solves also some bugs when switching the index/network) - added another schema 'solr.webgraph.schema', the old solr.keys.list is replaced by solr.collection.schema	12 years ago
Michael Peter Christen	4111606654	removed the commitWithin attribute because that is not the way how the index is updated the right way for us. May also be be superfluous with the solr 4.0 softcommit.	12 years ago
Michael Peter Christen	d70d99fab5	added more metadata fields and facets to OpensearchResponseWriter. This should make it possible to replace the original and enriched yacy opensearch result with a solr output in opensearch format.	12 years ago
Michael Peter Christen	8651ec35fe	turned author_s into the multi-valued field author_sxt	12 years ago
Michael Peter Christen	4735bd47f4	- changed solr commit call and added an optimize option. Since Solr 4.0.0 there is a new softcommit feature which implements a near-real-time (NRT) search option. The softcommit does not do IO and does not cause performance issues. YaCy has now an extension in its solr connectors to use the softcommit feature. The softcommit call now replaces all places where a hard commit was used. Furthermore the commit strategy in when doing a search from the web interface was changed (it's done every time before a search is done). The softcommit feature was implemented because it was needed for the following changes (customer demands), which is also included in this git commit: - added a feature to identify all documents which have unique titles and/or unique descriptions. These unique flags are disabled by default. - added also a feature to set a flag when the url from a canonical tag is equal to the document url. This is also disabled by default. To support the new softcommit strategy, the commitWithinMs option was set to -1 do disable automatic commit based on document insert times. If documents are inserted permanently then also a commit would happen permanently whenever the commitWithinMs time is reached. This would conflict with the regular autocommit of 10 minutes and the new softcommit strategy.	12 years ago
Michael Peter Christen	db024a4e19	added new solr fields (unused yet; implementation will follow)	12 years ago
Michael Peter Christen	9b5bdae1b4	Reverted setting of MMapDirectoryFactory from solrconfig; see http://forum.yacy-websuche.de/viewtopic.php?p=27509#p27509 Instead, in the start script is checked if the host is a 64 host and -Dsolr.directoryFactory=solr.MMapDirectoryFactory is set as java option Reverted the ramBufferSizeMB setting (this was not enabled anyway) because that may be too much memory for small peers and embedded systems. Activated the mergeFactor 4; this was commented out by mistake	12 years ago
orbiter	eb68a30947	solr performance settings the target of these performance settings is the reduction of IO in general and during search in particual. - reduced mergeFactor to 4. This will increase the IO during indexing, but will reduce IO during search. It will also greatly reduce the number of open files which should make it possible to have overall larger indexes until the number of open files in an OS is reached. - increased ramBufferSizeMB to 256mb. This will reduce the number of commits. This change may compensate the reduction of the mergeFactor. - disabled updateLog. This is a real-time search feature which is available in YaCy anyway because a commit is forced if index.html is called. The updateLog feature causes a lot of IO during indexing and search and produced a lot of files in SEGMENTS/solr_40/data/tlog	12 years ago
Michael Peter Christen	f53703df62	using MMapDirectoryFactory as solution for ClosedChannelException given in https://issues.apache.org/jira/browse/SOLR-2247	12 years ago
Michael Peter Christen	22c694f906	activated the clickdepth_i attribute for solr again because the calculcation of that value is not as extensive as expected and furthermore the value is very useful for ranking	12 years ago
Michael Peter Christen	5a0eb1b268	clickpath should not be active by default because it needs extensive computation - partly to be implemented	12 years ago
Michael Peter Christen	5c0c56cfe1	Preparations to produce a click depth attribute in the search index. This attribute can be used for ranking and for other purpose (demand by customer) The click depth is computed in two steps: - during indexing the current fill-state of the reverse link index is used to backtrack the current page to the root page. The length of that backtrack is the clickdepth. But this does not discover the shortest click depth. To get this, a second process to check again is needed - added a process tag that can be used to do operations on the existing index after a crawl; i.e. calculation the shortest clickpath. Added a field to control this operation but not a method to operate on this. - added a visualization of the clickpath length in the host browser	12 years ago
Michael Peter Christen	295884fd54	- Merge commit '168b1d130d9d67b5e8855a0b50c4ba7ad4a416f8' - fixed conflict in htroot/yacysearch.java - removed nedres check because that causes that the remote server is not called at all in most cases (local index has already results but we want more) - fixed a regex bug (a '=' too much)	12 years ago
reger	168b1d130d	Adding heuristic to get search results from configured systems which support opensearch specification - any system supporting opensearch specification can be configured - search query is only forwarded to remote system if not enough results available on local peer - discover function provided, checking the local Solr index for links to opensearchdescription files, to add to the config - sample config file with some general search engines with opensearch support	12 years ago

1 2 3 4 5 ...

477 Commits (ca7444dbdf2f0a07853450dcc7ddec02ea3c974b)