notions within the fulltext of a document. This class also attempts to
identify dates that are abbreviated, given without a year, or described
with names of special days, like 'Halloween'. If a date has no year, the
current year and the following years are considered. This process can
therefore identify a large set of dates for a document, either because
several dates are given in the document or because a date is ambiguous.
Four new Solr fields are used to store the parsing result:
dates_in_content_sxt:
if date expressions are found in the content, these dates are listed
here in order of their appearance
dates_in_content_count_i:
the number of entries in dates_in_content_sxt
date_in_content_min_dt:
if dates_in_content_sxt is filled, this contains the oldest date from
the list of available dates
date_in_content_max_dt:
if dates_in_content_sxt is filled, this contains the youngest date from
the list of available dates, which may possibly also be in the future
These fields are deactivated by default because the evaluation of
regular expressions to detect the dates is still too CPU-intensive.
Future enhancements may allow this to be switched on by default.
The purpose of these fields is the creation of calendar-like search
facets, to be implemented next.
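As a rough illustration of how these fields could be used once they are
activated, the following sketch uses plain SolrJ against a hypothetical
standalone core (not YaCy's own connector classes); the core URL is an
assumption, the field names are the ones listed above:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DateFacetExample {
        public static void main(String[] args) throws Exception {
            // hypothetical local core; in YaCy the index is usually embedded
            SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/collection1").build();
            SolrQuery query = new SolrQuery("*:*");
            // only documents whose earliest detected content date is not in the past
            query.addFilterQuery("date_in_content_min_dt:[NOW TO *]");
            // facet on the individual date expressions to build a calendar-like view
            query.addFacetField("dates_in_content_sxt");
            query.setFacetMinCount(1);
            QueryResponse response = solr.query(query);
            response.getFacetField("dates_in_content_sxt").getValues()
                    .forEach(c -> System.out.println(c.getName() + " -> " + c.getCount()));
            solr.close();
        }
    }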
- snapshots can now also be xml files which are extracted from the solr
index and stored as individual xml files in the snapshot directory
alongside the pdf and jpg images
- a transaction layer was placed above the snapshot directory to
distinguish snapshots into 'inventory' and 'archive'. This may be used
to do transactions of index fragments, using archived solr search
results, between peers. This is currently unfinished; a protocol to move
snapshots from inventory to archive is still needed
- the SNAPSHOT directory was renamed to snapshot and now contains two
snapshot subdirectories: inventory and archive
- snapshots may now be generated by everyone, not only by peers running
on a server with xkhtml2pdf installed. The expert crawl start provides
the snapshot option to everyone. PDF snapshots are now optional and the
option is only shown if xkhtml2pdf is installed.
- the snapshot api now provides requests for historised xml files, i.e. call:
http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ
The result of such a call is identical to a solr search result with
only one hit.
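A minimal sketch of such a call from Java, assuming only that the
servlet answers a plain HTTP GET with an XML body (the url hash is the
example from above):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class SnapshotFetch {
        public static void main(String[] args) throws Exception {
            // request the historised xml snapshot for one url hash
            URL api = new URL("http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(api.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // the body has the same shape as a solr search result with a single hit
                    System.out.println(line);
                }
            }
        }
    }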
The pdf generation has been moved from the http loading process to the
solr document storage process. This may slow down that process
considerably, and a different version of the process may be needed.
hold a date for each URL to record when a url was first seen. This date
is then used to overwrite the modification date of a url upon recrawl if
the first-seen date is before the latest document date. This behaviour
is necessary because content management systems commonly attach the
current date to all documents. Using the firstSeen database it is
possible to approximate a real first document creation date if the
crawler runs frequently for the same domain. As a result, search results
ordered by date have much better quality, and the use of YaCy as a
search agent for latest news improves as well.
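The rule described above can be summarized in a small sketch; the class
and method names are illustrative and not YaCy's actual API:

    import java.util.Date;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Sketch of the first-seen rule: if a CMS reports a document date that is
     *  newer than the date at which the crawler first saw the url, the
     *  first-seen date wins as an approximation of the real creation date. */
    public class FirstSeenClamp {
        private final Map<String, Date> firstSeen = new ConcurrentHashMap<>();

        public void recordFirstSeen(String urlHash, Date when) {
            firstSeen.putIfAbsent(urlHash, when); // only the first sighting is kept
        }

        public Date effectiveDate(String urlHash, Date reportedModificationDate) {
            Date first = firstSeen.get(urlHash);
            if (first != null && first.before(reportedModificationDate)) {
                return first; // first-seen date is older than what the CMS claims
            }
            return reportedModificationDate;
        }
    }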
solr to the YaCy built-in solr search servlet. It is not complete and
not fully correct (there is still a utf8 encoding problem), but it is an
easy way to get requests forwarded through YaCy to an external Solr.
it is now possible to get the results in two steps:
- first retrieve all IDs as given for a query
- then retrieve each document individually
This was necessary for very large result sets where a query may run for
hours and is possibly terminated by a solr-internal timeout. This occurs
regularly during postprocessing and therefore this commit may fix
unwanted postprocessing terminations.
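A hedged sketch of the two-step pattern with plain SolrJ (the field name
'id' and the query handling are assumptions for illustration; YaCy's own
connector methods are not shown here):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;
    import java.util.ArrayList;
    import java.util.List;

    public class TwoStepRetrieval {
        /** Step 1: fetch only the ids, which is cheap even for very large result sets. */
        static List<String> collectIds(SolrClient solr, String q) throws Exception {
            SolrQuery idQuery = new SolrQuery(q);
            idQuery.setFields("id");
            idQuery.setRows(Integer.MAX_VALUE); // in practice this would be paged
            List<String> ids = new ArrayList<>();
            for (SolrDocument doc : solr.query(idQuery).getResults()) {
                ids.add((String) doc.getFieldValue("id"));
            }
            return ids;
        }

        /** Step 2: load each document individually, so no single request can hit a timeout. */
        static List<SolrDocument> loadDocuments(SolrClient solr, List<String> ids) throws Exception {
            List<SolrDocument> docs = new ArrayList<>();
            for (String id : ids) {
                SolrDocumentList hit = solr.query(new SolrQuery("id:\"" + id + "\"")).getResults();
                if (!hit.isEmpty()) docs.add(hit.get(0));
            }
            return docs;
        }
    }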
during document parsing; instead, use the same references that would
also be written into the webgraph. This should ensure that the webgraph
and the citation index express exactly the same semantics.
attribute in the <a> tag for each crawl. This introduces a lot of
changes because it extends the usage of the AnchorURL object type, which
now also has a toString method different from the underlying
DigestURL.toString. It is therefore not advised to use .toString at all
for urls; use toNormalform(false) instead.
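A short illustration of that advice; the import path is assumed from
current YaCy sources and only toNormalform(false) is taken from the note
above:

    import net.yacy.cora.document.id.AnchorURL;

    public class UrlKeyExample {
        /** Sketch: prefer the normal form over toString() when a url is used as a key,
         *  because AnchorURL.toString() differs from DigestURL.toString(). */
        static String urlKey(AnchorURL anchor) {
            return anchor.toNormalform(false); // stable representation, independent of the runtime type
        }
    }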
with metadata retrieval from connectors directly. This should result in
better usage of the cache. The metadata cache is automatically increased
if more memory is available.
- unique-postprocessing was destroying results from other
postprocessings; removed cross-updates as they were not necessary
- unique-postprocessing did not restrict to the same protocol
- the inefficient concurrent update cache was completely redesigned
- increased limits for concurrent blocking queues to prevent early
time-outs
This organizes all urls to be loaded in separate queues for each host.
For each host, every crawl depth gets its own queue. The primary rule
for urls taken from any queue is that the crawl depth is minimal. This
produces a crawl depth which is identical to the clickdepth.
Furthermore, the crawl is able to create a much better balancing over
all hosts, which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: a directory for each host and, inside that directory,
a file for each crawl depth. A crawl with maxdepth = 4 can create tens
of thousands of files. To be able to use that many file readers, it was
necessary to implement a new index data structure which opens the file
only when an access is wanted (OnDemandOpenFileIndex). The usage of such
an on-demand file reader shall prevent the number of open file pointers
from exceeding the system limit, which is usually about 10,000 open
files. Some parts of YaCy had to be adapted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adapted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.
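The idea behind the on-demand index can be sketched as follows; the
names are illustrative and this is not the actual OnDemandOpenFileIndex
implementation. The file is opened only for the duration of a single
access, so thousands of queue files never hold open file handles at the
same time:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.file.Path;

    /** Sketch: the backing file is opened per access and closed again immediately,
     *  so an arbitrary number of such indexes can exist without exhausting file handles. */
    public class OnDemandFileReader {
        private final Path file;

        public OnDemandFileReader(Path file) {
            this.file = file;
        }

        /** Read one fixed-size record; the file handle lives only inside this method. */
        public byte[] readRecord(long position, int recordSize) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
                byte[] record = new byte[recordSize];
                raf.seek(position);
                raf.readFully(record);
                return record;
            }
        }
    }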
- added an order option to solr queries to be able to retrieve document
lists in a specific order, here: link length (see the sketch after this
list)
- added HyperlinkEdge class which manages the link structure
- integrated the HyperlinkEdge class into clickdepth computation
- extended the linkstructure.json servlet to also show the clickdepth
and other statistical information
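As a sketch of the order option with plain SolrJ (the numeric
link-length field name url_chars_i is an assumption for illustration):

    import org.apache.solr.client.solrj.SolrQuery;

    public class OrderedQueryExample {
        /** Build a query whose result list is ordered by ascending link length. */
        static SolrQuery byLinkLength(String q) {
            SolrQuery query = new SolrQuery(q);
            // sort on a numeric link-length field; url_chars_i is assumed here
            query.setSort("url_chars_i", SolrQuery.ORDER.asc);
            return query;
        }
    }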
instead of TreeMaps)
- reduced the memory footprint of database indexes (by introduction of
optimize calls)
- optimize calls shrink the amount of memory used for index sets if they
are not changed any more afterwards
different from normal requests. This happens if the remote solr is
actually a solrCloud; in such cases the luke request returns only the
result of the single solr peer, not the whole cloud.
also done: some refactoring.
url along with the load date. While this takes much more memory, it
eliminates database lookups for getURL() requests, which happen equally
often. This speeds up remote solr configurations.
The resource observer is now able to recognize both free disk space AND
the space available for YaCy. The amount of space which is assigned for
YaCy is defined by new settings in the configuration file.
Furthermore, there is now a cleanup process which deletes files if
autodelete is activated. The autodelete is now BY DEFAULT ON if the disk
space is low, which means that YaCy starts to delete documents when the
disk is full!
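The decision logic can be sketched roughly as follows; the threshold
fields and the class name are illustrative, since the actual keys live
in YaCy's configuration file:

    import java.io.File;

    /** Sketch of the cleanup decision: delete documents only when autodelete
     *  is enabled and the free space drops below the configured minimum. */
    public class DiskSpaceGuard {
        private final long minFreeBytes;   // assumed setting: minimum free disk space for YaCy
        private final boolean autodelete;  // assumed setting: on by default when space is low

        public DiskSpaceGuard(long minFreeBytes, boolean autodelete) {
            this.minFreeBytes = minFreeBytes;
            this.autodelete = autodelete;
        }

        public boolean shouldDelete(File dataDir) {
            long free = dataDir.getUsableSpace();
            return autodelete && free < minFreeBytes;
        }
    }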
- redesigned the instance mirror class (which was a mess)
- added a final method to close a searcher (which otherwise keeps a cache)
- changed the cache clear method, which iterates over resources and
calls clear on all caches in the searcher resources
- refactored all code which uses URIMetadataRow as the standard for word
hash length and word hash ordering and moved that to the class 'Word',
because the class URIMetadataRow defines the old metadata data structure
and should become superfluous in the future
- removed unused methods from URIMetadataRow in preparation for further
removal of that class
- all non-dht targets (previously separated into 'robinson' for dht-like
queries and 'node' for solr queries) are now 'extra' peers, which are
queried using solr
- these extra peers are now selected using a ranking on last-seen,
peer-tag matches, node-peer flags, peer age, and link count. The ranking
is done using a weight and a random factor (see the sketch after this
list).
- the number of extra peers is 50% of the dht peers
- the dht peers now exclude peers that are too young, to prevent bad
results during strong growth of the network
- the number of dht peers (and therefore extra peers) is reduced when
the memory of the peer is low and/or documents still appear in the
indexing queue. This shall prevent deadlocks on a peer when p2p queries
are made in fast sequence on weak hardware.
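The selection of extra peers can be sketched as a weighted score plus a
random component; all weights, fields, and class names here are
illustrative, not the actual YaCy seed attributes:

    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import java.util.stream.Collectors;

    /** Sketch: each candidate peer gets a weighted score plus a random factor,
     *  and the best-scoring peers become the extra (solr-queried) targets. */
    public class ExtraPeerSelection {
        public static final class Candidate {
            final long lastSeenMinutesAgo;
            final boolean tagMatch;
            final boolean nodeFlag;
            final long ageDays;
            final long linkCount;
            public Candidate(long lastSeenMinutesAgo, boolean tagMatch,
                             boolean nodeFlag, long ageDays, long linkCount) {
                this.lastSeenMinutesAgo = lastSeenMinutesAgo;
                this.tagMatch = tagMatch;
                this.nodeFlag = nodeFlag;
                this.ageDays = ageDays;
                this.linkCount = linkCount;
            }
        }

        // weighted score; all weights are made up for illustration
        static double score(Candidate c, Random random) {
            double s = 0.0;
            s -= c.lastSeenMinutesAgo;            // recently seen peers rank higher
            s += c.tagMatch ? 1000.0 : 0.0;       // peer-tag matches are strongly preferred
            s += c.nodeFlag ? 500.0 : 0.0;        // node-peer flag
            s += Math.min(c.ageDays, 365);        // peer age, capped
            s += Math.log1p(c.linkCount);         // link count, dampened
            s += random.nextDouble() * 100.0;     // random factor spreads the load
            return s;
        }

        // select the best-scoring candidates as extra peers
        static List<Candidate> select(List<Candidate> candidates, int count, Random random) {
            // compute each score once so the random part stays fixed while sorting
            Map<Candidate, Double> scores = candidates.stream()
                    .collect(Collectors.toMap(c -> c, c -> score(c, random)));
            return candidates.stream()
                    .sorted(Comparator.<Candidate>comparingDouble(scores::get).reversed())
                    .limit(count)
                    .collect(Collectors.toList());
        }
    }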
causes Solr error (and wordindex likely finds suggestion)
org.apache.solr.core.SolrCore org.apache.solr.common.SolrException: org.apache.solr.search.SyntaxError: Cannot parse 'text_t:""d"': Lexical error at line 1, column 12. Encountered: <EOF> after : ""
at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:171)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:187)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at net.yacy.cora.federate.solr.connector.EmbeddedSolrConnector.query(EmbeddedSolrConnector.java:179)
at net.yacy.cora.federate.solr.connector.EmbeddedSolrConnector$DocListSearcher.<init>(EmbeddedSolrConnector.java:345)
at net.yacy.cora.federate.solr.connector.EmbeddedSolrConnector.getCountByQuery(EmbeddedSolrConnector.java:364)
at net.yacy.cora.federate.solr.connector.MirrorSolrConnector.getCountByQuery(MirrorSolrConnector.java:326)
at net.yacy.cora.federate.solr.connector.ConcurrentUpdateSolrConnector.getCountByQuery(ConcurrentUpdateSolrConnector.java:440)
at net.yacy.search.index.Segment.getWordCountGuess(Segment.java:464)
at net.yacy.data.DidYouMean.getSuggestions(DidYouMean.java:181)
at suggest.respond(suggest.java:73)
- the admin user name can be configured; in apiExec calls the default
"admin" username is used.
TODO: the bin/apicall.sh script should likely take that into account.
as path for solr index dumps (instead of the SEGMENTS path). This will
make maintenance of index backups easier. It will also provide a tool
to migrate from a freeworld index to a webportal index.
work around the unfolding process in Solr's BinaryResponseWriter.
This was a huge performance bottleneck in the embedded solr connector;
the problem is actually on the Solr side, but we now have a workaround.
- This made it possible to abstract a high-performance index access
method, implemented as getDocumentListByParams. That method is also
implemented in the SolrServerConnector and provides very efficient
access to a solr index if the index is embedded.
- a popular use of the document list retrieval is a result count, which
can now also make use of the new method via getDocumentCountByParams.
- enhanced the Error cache, which now does not store error documents
in the ram cache if the document is also written to solr. When
documents are retrieved from the cache, they are partly read from the
ram cache and, if not present there, from the Solr index.
servlet since YaCy 1.63. This is much more performant for the client
than using the XMLResponseWriter because parsing of XML data is very CPU
intensive. Older YaCy peers are still requested using the
XMLResponseWriter but the majority of YaCy peers already respond with
the binary writer. This makes remote searches much faster and less CPU
intensive.
not-flushed Solr cache is now handled in this way:
- it is smaller by default
- a Solr-internal process is started to flush the cache periodically
(this does NOT clean the cache, it just removes old objects)
- a Solr-external process (the standard YaCy cleanup process) now has
direct access to the solr-internal caches and flushes them completely.
The time frame for such a flush is defined by the cleanup-process
frequency, by default 10 minutes.
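A rough sketch of the external flush scheduling; the flush action itself
is passed in as a plain Runnable because YaCy's actual connector method
is not shown here:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    /** Sketch: schedule a complete cache flush in the rhythm of the cleanup process. */
    public class CacheFlushScheduler {
        public static void schedule(Runnable flushSolrCaches, long minutes) {
            ScheduledExecutorService cleanup = Executors.newSingleThreadScheduledExecutor();
            // by default every 10 minutes, matching the cleanup-process frequency above
            cleanup.scheduleAtFixedRate(flushSolrCaches, minutes, minutes, TimeUnit.MINUTES);
        }
    }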
the embedded Solr (the default). This was achieved by circumventing the
solrj search encapsulation and implementing direct index access methods
for Solr.
The effect will not only be seen during search; it also has a strong
effect on suggestions (much more so) and reduces CPU usage during index
distribution (which needs many search requests).