yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	1b61bd40ed	- Added new solr field url_file_name_tokens_t which stores the file name tokens. This can be used to enhance the ranking. - Added also a rating_i field as basis for later usage. - enhanced the tokenization process.	11 years ago
orbiter	5f5a97bafc	added the anchor text within web pages to the searcheable entities of a web page. This can be of benefit for the ranking if these fields are used for boosts.	11 years ago
orbiter	705b3338ee	list more fields available for search and for ranking boosts	11 years ago
Michael Peter Christen	78e7aadb26	removed unused initialization method	11 years ago
Michael Peter Christen	4fbc4740df	removed warnings	11 years ago
Michael Peter Christen	21aa6a0321	migration to Solr 4.5.0	11 years ago
Michael Peter Christen	101a6e6e14	Patch the citation index for links with canonical tags. This shall fulfill the following requirement: If a document A links to B and B contains a 'canonical C', then the citation rank computation shall consider that A links to C and B does not link to C. To do so, we first must collect all canonical links, find all references to them, get the anchor list of the documents and patch the citation reference of these links.	11 years ago
Michael Peter Christen	b28d43decc	added two more fields source_cr_host_norm_i,target_cr_host_norm_i in webgraph and an addition to postprocessing to copy all cr ranking attributes to the link edges associated to the postprocessing documents	12 years ago
Michael Peter Christen	a52f3a597e	fix for canonical-from-http-header feature	12 years ago
Michael Peter Christen	2dd7c5be44	added parsing of http-canonical tags (untested, could not find an example page)	12 years ago
Michael Peter Christen	3bf0104199	fix for crawl domain counter limitation (limit was reached too early)	12 years ago
Michael Peter Christen	82bfd9e00a	- crawl profiles shall be deleted from active and passive stacks if they are deleted to terminate the crawl because otherwise the crawl will go on after the load-from-passive stack policy. - better check if a crawl is terminated using the loader queue.	12 years ago
Michael Peter Christen	91a875dff5	self-healing of mistakenly deactivated crawl profiles. This fixes a bug which can happen in rare cases when a crawl start and a cleanup process happen at the same time.	12 years ago
Michael Peter Christen	095053a9b4	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
sixcooler	0cae420d8e	some dns-timing changes: since httpclient uses the domain-cache it is useful not to clean the domain cache until crawling is running (domains are filled into this cache) On huge crawl-starts (eg. from file) my DNS did not follow the high rates - so I reduced the rate and give some more time(-out)	12 years ago
Michael Peter Christen	4f83d5f18c	added the new field harvestkey_s to the collection index and the webgraph index which is temporary filled with the crawl profile key. This is used to select a set of documents for post-processing as soon as a crawl is finished. Now the postprocessing for a specific crawl is started when that specific crawl is finished and not at the end of all post-processing steps.	12 years ago
orbiter	14442efa6d	when profiles are cleaned, there shall be first a callback showing which profiles are cleaned. This shall enable a profile-termination-driven postprocessing. To do this, index writings must carry the profile key which will be implemented in another (next) step.	12 years ago
orbiter	8ac2e8c8c9	added location navigator which causes that the image to the map search is visible whenever a location is available in the search result. To activate this, the search.navigation property in yacy.conf must be modified to the new default values.	12 years ago
Michael Peter Christen	96ed0c980e	- added hosthash to all documents (also fail documents which is needed there for deletion), this fixes a problem for the deletion of old documents for new crawl starts - added clickdepth and citation computation for fail documents	12 years ago
orbiter	828603e4f1	fix for 100%CPU problem in error cache cleaning process	12 years ago
orbiter	c64b51134e	hack to add all tokens from the url to text_t. This was working for the RWI index (and still is working) but not for solr-only search indexes. Maybe we should find a solution using a separate search field instead.	12 years ago
orbiter	f3be1930cb	CPU problem when pushing to the error cache; wrong class, ConcurrentHashMap needed for concurrency	12 years ago
Michael Peter Christen	e40671ddb7	better and consistent deletions for error urls	12 years ago
Michael Peter Christen	2602be8d1e	- removed ZURL data structure; removed also the ZURL data file - replaced load failure logging by information which is stored in Solr - fixed a bug with crawling of feeds: added must-match pattern application to feed urls to filter out such urls which shall not be in a wanted domain - delegatedURLs, which also used ZURLs are now temporary objects in memory	12 years ago
Michael Peter Christen	31920385f7	set anchor rel attribute of all links to "nofollow" if the html meta contains a robots:nofollow or if the http header contains a "X-Robots-Tag: nofollow"	12 years ago
Michael Peter Christen	61c5e40687	- replaced the properties object in AnchorURL with distinct variables for anchor attributes. - this caused that large portions of the parser code had to be adopted as well - added a counter target_order_i for anchor links in webgraph computation	12 years ago
Michael Peter Christen	5e31bad711	- the webgraph shall store all links which appear on a web page and not all unique links! This made it necessary, that a large portion of the parser and link processing classes must be adopted to carry a different type of link collection which carry a property attribute which are attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI had been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring	12 years ago
Michael Peter Christen	35ab2cef7b	added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in html meta fields to get a correct (or: better) date timestamp. The http:last-modified mostly does not work because it is set to the current date from most CMS.	12 years ago
Michael Peter Christen	9cc8468b30	added tools to visualize image generation (i.e. during testing)	12 years ago
Michael Peter Christen	dbef8ccfcb	forced deletion of ZURL entries for a specific host for each host that appears in the crawl url list	12 years ago
Michael Peter Christen	e137ff4171	refactoring (in preparation for new removeHost method)	12 years ago
Michael Peter Christen	7a5574cd51	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	85456f46b2	added two new fields, exact_signature_copycount_i and fuzzy_signature_copycount_i, which count the number of copies of non-unique documents and assigns this to each document. Thus, each document there is a number assigned which shows how many copies of this document exists. These fields are disabled by default.	12 years ago
orbiter	26366596d9	fix for a problem which occurs when a site is crawled where the start url is redirected.	12 years ago
Michael Peter Christen	a2511b5600	turned images_alt_txt back to images_alt_sxt because it is not necessary to index the alt text. Indexed image Text is in images_text_t	12 years ago
Michael Peter Christen	85b1922244	activated image type navigation for image search	12 years ago
Michael Peter Christen	9e12fdff23	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	ab1201fdfd	fixed wrong facet count	12 years ago
Michael Peter Christen	049c3b3f2e	added an option to exclude image search results from text search. This is on by default.	12 years ago
Michael Peter Christen	69f85265e1	added an option to put image links to the crawl queue and handle these like normal documents. Using this option (by default on at this moment; this might change soon) it is possible to get the exif data into the search index to be used in image search.	12 years ago
Michael Peter Christen	a8c5bfcf58	avoid to create unnecessary objects	12 years ago
Michael Peter Christen	5a0de1b77d	moving image description text to image text field	12 years ago
Michael Peter Christen	dc179bd61f	fix for catchall query goal for image search	12 years ago
reger	392174de8c	remove all_words, all_strings lists from QueryGoal - only used for text highlighting in parser text (ViewFile.html) which can be done with include_strings only	12 years ago
Michael Peter Christen	169ef8963d	one more fix for image search	12 years ago
Michael Peter Christen	cb85b22725	redesign of the image search process (with much better results, unfortunately the index schema has changed and p2p image search will not be much better until many people update)	12 years ago
reger	29967102a2	optimized QueryGoal (reducing mem and computation by removing all_hashes) - all_hashes used for text highlighting and word distance computation which can be done with include_hashes only	12 years ago
orbiter	f106345eef	link strings should not be tokenized	12 years ago
orbiter	deadeb406e	image alt tag strings should be tokenized	12 years ago
Michael Peter Christen	1a3e42eca4	index migration to lucene 4.4	12 years ago
Michael Peter Christen	a88a62f7aa	added a feature to set a collection for a crawl result based on a regular expression on the url: the collection attribute for a crawl start may now be either a token or a list of tokens, separated by ',' where a token is either a string or a pair <string,pattern> where the string is separated from the pattern with a ':' and the string is assigned to the document as collection only if the pattern matches the url.	12 years ago
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying to robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on per-crawl-start basis. Every crawl start can have a different user agent.	12 years ago
Michael Peter Christen	47b1c81d08	- refactoring - generalized writing of url attributes to solr documents - added more url attributes to error documents	12 years ago
Michael Peter Christen	697613170d	less logging for postprocessing (this was a debugging logging with high CPU load)	12 years ago
reger	a5019bc470	make Vocabulary Navigator tags a hard result entry filter by checking vocabulary tags also for rwi results (currently a filter is applied to the solr query) TODO: as vocabularies are only locally valid, auto-switch to Searchdom.LOCAL could be considered.	12 years ago
reger	a67a4b7d86	improve tld: query modifier filter pattern (to prevent tld:net accepting www.abcinet.org)	12 years ago
reger	02fe8b43ba	Field Re-Indexing: display list of fields in reindex queue change servlet to display statistic on 1st click (instead after refresh)	12 years ago
sixcooler	7f501b7c38	clear some caches before reporting low Memory do not break lines in Network-table-rows	12 years ago
Michael Peter Christen	2857499467	fix to collection schema; bug appeared for _txt fields with empty String as content	12 years ago
Michael Peter Christen	58fe986cca	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	cf12835f20	replaced the single-text description solr field with a multi-value description_txt text field	12 years ago
reger	f2d99053ed	Field Re-Indexing: prevent endless error loop in ReindexSolrBusyThread on Solr exception (by skipping query causing the exception) (occurred during testing while working on q=store:[* TO *])	12 years ago
orbiter	d05e0c5368	wait a bit longer before doing the first peer ping	12 years ago
orbiter	b8f57f7703	don't be noisy when doing background tasks that may be allowed to fail	12 years ago
Roland Haeder	0343f0668c	Fix for NPE: E 2013/07/26 20:29:29 BUSYTHREAD Runtime Error in serverInstantThread.job, thread 'net.yacy.search.Switchboard.cleanupJob': null; target exception: null java.lang.NullPointerException at net.yacy.search.schema.CollectionConfiguration.convergenceStep(CollectionConfiguration.java:1116) at net.yacy.search.schema.CollectionConfiguration.postprocessing(CollectionConfiguration.java:897) at net.yacy.search.Switchboard.cleanupJob(Switchboard.java:2296) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at net.yacy.kelondro.workflow.InstantBusyThread.job(InstantBusyThread.java:107) at net.yacy.kelondro.workflow.AbstractBusyThread.run(AbstractBusyThread.java:165) Conflicts: source/net/yacy/search/schema/CollectionConfiguration.java	12 years ago
Roland Haeder	b58ca8622d	Some cleanups: - added SKINS_PATH_DEFAULT as same as LISTS_PATH_DEFAULT was added - Added 'final' keyword to a string	12 years ago
Roland Haeder	7263bb82fb	Fix for NPE on shutdown: java.lang.NullPointerException at net.yacy.search.Switchboard.storeDocumentIndex(Switchboard.java:2732) at net.yacy.search.Switchboard.access00(Switchboard.java:207) at net.yacy.search.Switchboard.run(Switchboard.java:3049)	12 years ago
orbiter	080d80c9de	do not write an empty failreason in case that there is no fail. Because of the lazy instantiation rule this value was not actually written, but if lazy instantiation is switched on, then this causes that all crawl starts delete all crawl-start-hosts completely because this looks for filled error reasons.	12 years ago
Michael Peter Christen	61e015268b	fix in forced deletion: forced commit needed	12 years ago
Michael Peter Christen	c3b2301b2f	fix for http://bugs.yacy.net/view.php?id=268	12 years ago
orbiter	3e901dcb06	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
orbiter	f50b596e0b	do not run dht distribution if system load is over 2.5	12 years ago
orbiter	056b42f5aa	- added information about segment count to status_p.xml - also moved this information from the old index structure, which is still in use for the RWI/DHT index to that front-end	12 years ago
orbiter	6fb2811e68	fixes for problems with remote solr and non-activated webgraph index	12 years ago
sixcooler	af740f3058	changed optimization to a segment-size of index-size/5.000.000 + one if not idle + one (and force) if postprocessing	12 years ago
orbiter	5364c4dcc9	delayed first peer-ping to send the first ping out after the http got up; if the ping comes before the http is up, it cannot be recognized as senior peer (if at all). See also: http://bugs.yacy.net/view.php?id=266	12 years ago
orbiter	e24016e30a	added the property federated.service.solr.indexing.timeout to yacy.init to provide a configurable time-out for solr; see also: http://bugs.yacy.net/view.php?id=254	12 years ago
orbiter	c124037f19	removed forced non-soft commits to prevent index fragmentation	12 years ago
Michael Peter Christen	c15aa758dc	removed failreason_t removal patch because that causes too much confusion using an external solr. to clean up the index after a schema change, use the index cleaner function from the online servlet	12 years ago
Roland Haeder	be0ff6018f	Removed trailing spaces + some more final	12 years ago
Roland Haeder	841a28ae76	Added 'final' for all exception blocks as this helps the Java compiler to optimize memory usage Conflicts: source/net/yacy/search/Switchboard.java	12 years ago
Michael Peter Christen	89c0aa0e74	added collection_sxt to error documents	12 years ago
Michael Peter Christen	0df5195cb0	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	1fd006cc56	fixes using the embedded connector	12 years ago
orbiter	d0dc86cf3d	logging of deadlocks (if any) during cleanup process	12 years ago
Michael Peter Christen	c6a6f159e8	fix for crawl stack domain counter	12 years ago
Michael Peter Christen	93d1bac140	do a more frequent optimization, reduces IO after optimization	12 years ago
orbiter	290e24564b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
orbiter	5533fc8e01	fix for bug 260	12 years ago
Michael Peter Christen	b79471ee67	grr	12 years ago
Michael Peter Christen	a79f288ac1	automatically running optimize on solr if user/search is idle for some time	12 years ago
orbiter	a9c8046c87	do a light optimization at the end of a crawl postprocessing	12 years ago
orbiter	a548354c71	replaced type of solr schema object sku of text_en_splitting_tight by string	12 years ago
orbiter	2f1ec8d4a2	npe fix	12 years ago
Michael Peter Christen	bcc623a843	refactoring of load_delay: this is a matter of client identification	12 years ago
orbiter	0d0b3a30f5	activate api actions after postprocessing of crawls	12 years ago
orbiter	2be456e7fb	added a postprocessing field into api/status_p.xml to show if the postprocessing task is running at that time (status: busy) or not (status:idle)	12 years ago
Michael Peter Christen	5878c1d599	- refactoring of log to ConcurrentLog: jdk-based logger tend to block at java.util.logging.Logger.log(Logger.java:476) in concurrent environments. This makes logging a main performance issue. To overcome this problem, this is a add-on to jdk logging to put log entries on a concurrent message queue and log the messages one by one using a separate process. - FTPClient uses the concurrent logging instead of the log4j logger	12 years ago
Michael Peter Christen	a2c8116a8f	accept (but ignore) a '+' sign in front of search words	12 years ago
sixcooler	d5d8936f9d	For indexes that are changing rapidly in NRT situations, fcs (stands for Field Cache per Segment) may be a better choice than the default fc. (saves memory) see: http://wiki.apache.org/solr/SimpleFacetParameters#facet.method	12 years ago
Michael Peter Christen	57ffdfad4c	added a crawl option to obey html-meta-robots-noindex. This is on by default.	12 years ago
Michael Peter Christen	5a5d411ec0	new robots_i attribute fields	12 years ago
Michael Peter Christen	f1c5338210	preparation for greedy crawl profiles and refactoring	12 years ago
Michael Peter Christen	e6f361f474	adding the canonical tag to crawl queues	12 years ago
Michael Peter Christen	203921006a	redesign of citation index storage	12 years ago
Michael Peter Christen	32aa1d4569	removed unused option for queries	12 years ago
sixcooler	e5abccdfe4	added optimize-option	12 years ago
Michael Peter Christen	8caaf6203a	fixed false multiple-generation of remote facet search which caused high cpu usage on remote side.	12 years ago
Michael Peter Christen	823ae4d6a7	added url_protocol_s to error documents	12 years ago
Michael Peter Christen	9a6fcdf597	npe fix	12 years ago
Michael Peter Christen	16d1d744fa	added url_file_name_s in default collection schema for the file name without the file extension. This part of the file path is removed from the multi-field url_paths_sxt, which has now not the file name as last part of the path list. The same applies to the new fields source_file_name_s and target_file_name_s in the webgraph schema.	12 years ago
Michael Peter Christen	f9d859f5dc	now writing image alt texts and (camelcase-)parsed urls into a text search field for a better image retrieval	12 years ago
orbiter	8792e6c6e9	stub for better image indexing	12 years ago
Michael Peter Christen	bdf306e0a7	increased time-out for loading of seed-lists	12 years ago
Michael Peter Christen	570511f3c8	removed fields references_internal_id_sxt and references_internal_url_sxt because they had been shown to be superfluous. The citation of referrer in the host browser is possible without them. Therefore now the host browser does not only show internal, but also external referrer to each link.	12 years ago
Michael Peter Christen	1762911f57	added synchronizations and timeouts in solr api; missing synchronizations in index modification methods causes deadlocks inside solr.	12 years ago
Michael Peter Christen	ffc570f95f	removed forced soft commit since this may be the cause for a performance problem	12 years ago
Michael Peter Christen	6115bef335	added a 'greedy learning' mechanism which will cause a 'fresh' yacy to load linked web pages from search results until the total number of web pages reaches 15000. This shall give fresh peers a 'boost' to get a personalized search index faster.	12 years ago
Michael Peter Christen	8e965ffd16	fix for host compare in case that the host is null. This happens when doing a search in the intranet for file resources (they don't have a host).	12 years ago
Michael Peter Christen	f7a4377812	usage of the new normalized link popularity CRn as default ranking function. This replaces the previous formula, which was bad. Before you update to this version, please check if you changed the ranking function yourself before, since it will be overwritten.	12 years ago
Michael Peter Christen	f7e77a21bf	Added a citation reference computation for intra-domain link structures. While the values for the reference evaluation are computed, a backlink structure can also be discovered and written to the index. The host browser has been extended to show such backlinks for each presented link, so it can now show where a document is linked from. The new citation reference is computed as the likelihood of a random click path, with recursive usage of the previously computed likelihood. This process is repeated until the likelihood converges to a specific number. This number is then normalized to a ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to rank popularity within intra-domain link structures.	12 years ago
reger	d367b1f4d9	add null pointer check to stopword fix	12 years ago
reger	7480e87386	- fix stopword handling for RWI see example http://bugs.yacy.net/view.php?id=247 - append language setting specific stopword list - remove unused OVERHANG stack type	12 years ago
Michael Peter Christen	9fc0c4df98	fix for bad exists 'enhancement'; see bug: http://bugs.yacy.net/view.php?id=245	12 years ago
reger	8a7fcb391d	enable use of solrcore.properties for property substitution of solrconfig.xml - move setting of system property solr.directoryFactory=solr.MMapDirectoryFactory to solrcore.properties - add check of os.arch for 64bit system, if it fails use default/solrcore.x86.properties (if exists) as solrcore.properties reason: on 32bit MMapDirectoryFactory may fail with..... Caused by: java.io.IOException: Map failed at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:849) at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)	12 years ago
Michael Peter Christen	f7e887bf49	added missing class	12 years ago
Michael Peter Christen	5f92c68f1f	removed block rank ranking and all YBR files in /ranking	12 years ago
Michael Peter Christen	164603b946	cleanup	12 years ago
Michael Peter Christen	409d6edf53	Store node/solr search threads to be able to send them an interrupt signal in case that a cleanup process wants to remove the search process. Added also a new cleanup process which can reduce the number of stored searches to a specific number which can be higher or lower according to the remaining RAM. The cleanup process is called every time a search is started.	12 years ago
Michael Peter Christen	2a8b99ea82	remove text_t in search result after snippet has been computed to save space in search result cache	12 years ago
Michael Peter Christen	a1644ca0fd	new workflow processor in Segment to enqueue indexing documents to solr	12 years ago
Michael Peter Christen	0c1a018bbd	removed 'later' tactic because it used too much RAM, reduced number of soft commits, reduced caching size of search events, ensured that solr results are processed before connection is closed to keep that stuff not too long in RAM	12 years ago
Michael Peter Christen	5344a1c5f7	getting the trash out	12 years ago
Michael Peter Christen	709e9b8ce7	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	281959a2d7	added option to re-boot the embedded solr during run-time. Added also API recording for this method so it can be repeated automatically. The index dump generation is now also available for API recording. Added some synchronization in backend which was necessary for this.	12 years ago
orbiter	da621e827e	prevent NPE in case RWI is disabled	12 years ago
Michael Peter Christen	c2b1075dcf	activating pollImmediately in case that DHT receive is off. This will cause a much faster search result when running in public robinson mode.	12 years ago
Michael Peter Christen	2b563debbf	javadoc of new multiple-exist test	12 years ago
Michael Peter Christen	8f2d3ce2f9	reduced locking situation in crawler: shifted synchronized location and reduced time-out of robots.txt load limit	12 years ago
Michael Peter Christen	b68fbe7d21	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/migration.java	12 years ago
Michael Peter Christen	06d3063dc9	- no downcase when using collection modifier - removed warnings	12 years ago
Michael Peter Christen	8dbc80da70	redesign of index.exist-test: this shall now not be done using a single id to be tested, but with a collection of ids. This will cause only a single call to solr instead of many. The result is a much better performance when testing the existence of many urls. The effect should cause very much less IO during index transmission, both on sender and receiver side.	12 years ago
reger	7f63d3747d	more generic field selection for reindex option of documents with disabled fields using Luke request to compare config with actual fields in index	12 years ago
Michael Peter Christen	44e363f37f	refactoring of WorkflowProcessor, added process counter, update of process counter if an blocking thread dies. Added also a new column in PerformanceConcurrency_p servlet to show the actual number of concurrent processes.	12 years ago
Michael Peter Christen	4058369288	fixed query expressions for collection selection (added quotes)	12 years ago
reger	79401cb938	added reindex option for documents with disabled or obsolete fields to Solr Schema Editor page (IndexSchema_p.html) this allows to remove obsolete fields from the index (according to current schema config) by selecting all documents containing disabled fields.	12 years ago
orbiter	cf36c1614f	prevent that concurrent deletion process causes wrong double-check in crawl start	12 years ago
Michael Peter Christen	b24d1d18e4	removed synchronization and concurrency in Fulltext class, concurrent deletions are now handled in ConcurrentUpdateSolrConnector	12 years ago
Michael Peter Christen	b9b446bca6	- added ssl configuration sign (a lock) to network statistic/table - fixed a bug in bitfield	12 years ago
reger	4fc6837690	- fix monitor url of crawl job in PerformanceQueues_p.html - reduce logging of every index add (switch embeddedsolr.add from info to debug)	12 years ago
Michael Peter Christen	ad050ec88d	- upgraded httpclient, httpcore and httpmime - removed httpclient 3.1 which has been used by solrj < 4.x.x and is now not used any more - fixed some parts in YaCy which used methods from httpclient 3.1	12 years ago
orbiter	a1c989002b	fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4652 generate dht data even if dht receive and dht transmission is switched off	12 years ago
Michael Peter Christen	e26bdd4a52	fixes to deletion methods (removed unnecessary concurrency and added removal of crawl queue entries)	12 years ago
Michael Peter Christen	f7f3e28c5e	prevent that the size of the index is computed too many times. Because the index size is now provided by solr, and the only way to do that is a match for [* TO *], a size computation is quite complex and time-consuming. Therefore this patch prevents that the method is called at all and if necessary puts a DOS-preventing barrier in front of it.	12 years ago
Michael Peter Christen	cca19d94d4	re-declared some fields to be of type string rather than text which makes them more efficient and less large	12 years ago
Michael Peter Christen	3841854c97	abstraction of catchall term	12 years ago
Michael Peter Christen	ea85674be2	added the date to error documents	12 years ago
orbiter	7de5b9cfa0	fix for http://bugs.yacy.net/view.php?id=233 - check geolocation coordinates and accept only those, which are well-formed - the solr push process does not stop crawling any more if after 20 requests to Solr Solr does not accept the record. Instead, a severe log entry asks the user to create a bug request	12 years ago
Michael Peter Christen	bb4bf3d8fd	infinity timeout bug protection patch	12 years ago
Michael Peter Christen	d1be4127e7	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	f36a7da5f6	- re-introduced existById in solr connector. - introduced raw-queries for the re-introduced byId-Queries (they are hopefully faster than full edismax queries) - removed the cached solr connector (testing this) to rely only on the solr built-in search caches. That should save some RAM (also). We will see if this is usable.	12 years ago
reger	46fa800bc7	added httpstatus_i to automatically switched on fields (used in all search queries)	12 years ago
Michael Peter Christen	3502b4c697	refactoring (renaming) of yacy-solr api	12 years ago
Michael Peter Christen	3a0fcfbeda	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	25499eead5	- added a new field for the regular expression in crawl start - added the field in crawl profile - adopted logging end error management - adopted duplicate document detection - added a new rule to the indexing process to reject non-matching content - full redesign of the expert crawl start servlet The new filter field can now be seen in /CrawlStartExpert_p.html at Section "Document Filter", subsection item "Filter on Content of Document"	12 years ago
orbiter	e1bfe9d07a	- reduction of the concurrently running processes to make YaCy more adjusted to smaller and 1-core devices. - the workflow processor now starts no process at all. these are started as soon as parser/condenser/indexing queues are filled. - better abstraction	12 years ago
Michael Peter Christen	c091000165	added collection attribute also to the rss feed reader	12 years ago
orbiter	f7571386a3	added a 'collection' property attribute in yacysearch.html which can be used to select between different collections as defined during a crawl start with the 'collection' attribute. This actually implements the ability to prepare search tenants which restrict their search results to a specific collection. The main use for this is to provide tenants to the yaml4 interface (at this time).	12 years ago
Michael Peter Christen	d937c55204	extended limitation of dom export size from 100000 to 100000000	12 years ago
Michael Peter Christen	50421171c3	added new schema fields: hreflang_url_sxt and hreflang_cc_sxt for http://support.google.com/webmasters/bin/answer.py?hl=de&answer=189077 navigation_url_sxt and navigation_type_sxt for http://googlewebmastercentral.blogspot.de/2011/09/pagination-with-relnext-and-relprev.html publisher_url_s for http://support.google.com/plus/answer/1713826?hl=de all fields are disabled by default and not written to the index.	12 years ago
Michael Peter Christen	566d6c980c	checking of document signature for a double-document check now refers only to documents within the same domain	12 years ago
Michael Peter Christen	d05dc07cff	setting of new default values for ranking	12 years ago
Michael Peter Christen	97775fbebc	fixed ranking for add-function queries: this did not work. The option was removed. All function queries are now boosts (multiplies the score according to a function). This is also the recommended way to boost rankings based on functions as explained in http://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/	12 years ago
Michael Peter Christen	7ab5093321	added new solr title_exact_signature_l and description_exact_signature_l to be able to identify unique title and unique description fields.	12 years ago
Michael Peter Christen	f24ac518e6	redesign of exists()-query (can now be called with query) and the CachedSolrConnector which based its cache on the key value. This will be used to correct the title_unique_b and description_unique_b field.	12 years ago
Michael Peter Christen	27d6222880	added new field host_extent_i which, after a crawl and postprocessing, holds the number of documents for the host where the document is hosted. This is necessary for ranking and the norming of references per local host in the ranking computation.	12 years ago
reger	518b20147c	skip postprocessing during document.store if no citation index connected (prevent null pointer exception)	12 years ago
Michael Peter Christen	ada3f27de7	added three new fields for a better ranking: references_internal_i, references_external_i and references_exthosts_i. These can be used to count and evaluate the number of external links to every web page. An experimental ranking function can be, e.g.: div(add(references_internal_i,product(references_external_i,references_exthosts_i)),add(clickdepth_i,1))	12 years ago
Michael Peter Christen	082e3274d6	- setting the same default ranking in the solr interface as for YaCy search interfaces if no other ranking attributes are given - using the YaCy ranking in the GSA interface only if no GSA-style sort attribute was given - to avoid confusion about correct ranking attributes, only the default '0'-ranking profile is used and not scenario-adapted (site, date) because that should be configurable in the web interface before it is actually used for ranking.	12 years ago
Michael Peter Christen	a20941c067	resume paused crawls on startup; user expects that restarts 'heal' everything	12 years ago
Michael Peter Christen	edc0b33f6d	- showing references count and clickdepth in host browser - fixed generation and presentation of both values	12 years ago
reger	566a3b0294	fix: Index Administration > Reverse Word Index (IndexControlRWIs_p) corrected use of word search to word-hash search - removed duplicate QueryParams.hashes2Handles , redundant with .hashes2Set	12 years ago
Michael Peter Christen	cf0acd2cb4	upgrade to solr 4.2.1	12 years ago
orbiter	e4d26d1cb4	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
orbiter	940c6849ee	enhanced did-you-mean (a bit): can now remember previously searched words (plus small enhancements)	12 years ago
reger	d57b221921	add: reset Solr schema field selection to default button in IndexSchema_p	12 years ago
Michael Peter Christen	9406a2e438	fixed NPE during index abstract computation	12 years ago
Michael Peter Christen	2d36a7eaf5	- do not create a new query for all remote peers - no document search this time - adjusted banner and network to not show 'WORDS' but DHT Chunks. This is to avoid confusion for robinson peers which do not create Word Entries	12 years ago
Michael Peter Christen	4af0839be2	use appropriate ranking for each search situation: - when using the /date modifier, a date ranking profile is used - when using a site: modifier, a ranking profile supporting longer urls is used	12 years ago
Michael Peter Christen	b8ed66a55d	added all clickdepth computations for source and target paths in webstructure core	12 years ago
Michael Peter Christen	6300730d7f	refactoring of clickdepth computation as preparation for clickdepth computation of webgraph links	12 years ago
Michael Peter Christen	2080fc7406	removed unused tag fields	12 years ago
orbiter	6b13dd0d3d	added clickdepth field writing for webgraph core (unfinished)	12 years ago
orbiter	47114910d5	fix for possible memory leaks	12 years ago
Michael Peter Christen	addba047e2	changes in ranking computation - an existing ranking servlet for solr was extended. It is now possible to set boost values for fields, boost functions and boost queries. - The ranking can have different instances, but currently only the first one is used - added an abstraction layer for fields which can be used for search and those fields can be edited in the solr ranking configuration - the ranking value from solr within the field score is used to combine remote search requests, which all are created using the same locally defined boost values - reduced the number of fields which are used for search (makes it faster) - replaced some text fields by string fields (makes indexing faster) - removed classes which had no use - made a large number of experiments for a better ranking and created a temporary setting which prefers hits inside titles - adjusted also the RWI-based ranking computation to 'prefer title' - made special cases like for portal search where no post-processing and post-ranking is wanted: this keeps the original ranking order as done by Solr - fixed many bugs with old settings for ranking	12 years ago
orbiter	ab74d559fb	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	4490133909	removed target_tag_s (superfluous)	12 years ago
orbiter	cd197bb555	fix for NPE if surrogates do not exist	12 years ago
Michael Peter Christen	25300913fa	fixes to search debugging after testing with the different search debugging options	12 years ago
Michael Peter Christen	81380ae5c8	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago

833 Commits (39b641d6cd82a9ca3bd2ab584435544a940e9093)