yacy_search_server

Commit Graph

Author	SHA1	Message	Date
orbiter	39e1913585	next development step: migration to java 1.7 This includes also a small code change to test generic type inference, a java 1.7 feature	11 years ago
Michael Peter Christen	4e734815e8	enhanced snippets: remove lines which are identical to the title and choose longer versions if possible. Prefer the description part.	11 years ago
sixcooler	390f03e041	do not check for segments-count on optimize: this is also done in Solr and our getSegmentsCount() does not return up-to-date values	11 years ago
reger	78d08998db	throw MalformedURLException on unknown protocol on other than the supported http https ftp file smb \\ mailto	11 years ago
reger	bb8181b2be	fix: resolve url without path but searchpart e.g. http://yacy.net?q=test was resolved as host "yacy.net?q=test" now host="yacy.net" path="/" fixes http://mantis.tokeek.de/view.php?id=47 added test case for getHost	11 years ago
reger	81dc2aa536	add current css to HTMLResponseWriter to fix metadata view (using css from metas.template except js links)	11 years ago
orbiter	0c88a32c36	do not apply lazy value instantiation for numeric or boolean values because that is misleading and confusing in case of 0- or false-values and may cause NPEs in retrieval functions.	11 years ago
reger	79e7947442	- remove empty http0_9 status text array and unused default_charset = ISO-8859-1	11 years ago
Michael Peter Christen	9a5ab4e2c1	removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy.	11 years ago
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10,000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader shall prevent the number of file pointers from exceeding the system limit, which is usually about 10,000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	11 years ago
Michael Peter Christen	b21c208b4d	enhanced hashcode computation for MultiProtocolURL	11 years ago
Michael Peter Christen	bd886054cb	new structure and enhancements for link graph computation: - added order option to solr queries to be able to retrieve document lists in specific order, here: link length - added HyperlinkEdge class which manages the link structure - integrated the HyperlinkEdge class into clickdepth computation - extended the linkstructure.json servlet to show also the clickdepth and other statistic information	11 years ago
Michael Peter Christen	74ab094587	fix for solr query size; too many documents had been retrieved in case that less than _pagesize_ had been requested.	11 years ago
orbiter	429a874222	- added COLS field in GSA response (non-gsa standard by customer request) - updated document link in GSA response writer	11 years ago
Michael Peter Christen	63c9fcf3e0	free configuration of postprocessing clickdepth maximum depth and time	11 years ago
Michael Peter Christen	8b44fcf0f4	added missing @Override annotation	11 years ago
reger	b9056ef2db	remove unused private header entries (HeaderFramework) X_YACY_ORIGINAL_REQUEST_LINE X_YACY_KEEP_ALIVE_REQUEST_COUNT CONNECTION_PROP_REQUESTLINE	11 years ago
Michael Peter Christen	61ad194065	fix for source and target clickdepth in webgraph index	11 years ago
Marc Nause	809b4e1fd9	Team added support for URLs with unicode characters in host part to blacklist. Punycode is used to handle unicode characters.	11 years ago
reger	b126b9ba17	add some InputFileStream close at end of reads to make sure file is released	11 years ago
orbiter	01989f6af9	restrict write buffer size to a limit	11 years ago
Michael Peter Christen	7a6658abec	removed synchronization in embedded solr connection (that was probably a mistake?)	11 years ago
Michael Peter Christen	a7d4379ef9	fixed shutdown of solr cores in case that more than one local core is to be closed (this happens if webgraph is enabled and the index is dumped using /IndexControlURLs_p.html)	11 years ago
reger	82dc815af9	cleanup: remove unrelated and unused code	11 years ago
Michael Peter Christen	b08375da33	fix for bad/missing values of size_i	11 years ago
Michael Peter Christen	51800007c4	- added concurrency to postprocessing of webgraph document - bundled separate webgraph postprocessing steps into one	11 years ago
Michael Peter Christen	0e7d249a69	fixed another shutdown problem (only occurs if webgraph core is enabled)	11 years ago
Michael Peter Christen	e485fbd0ce	- let crawl loader jobs die after 10 seconds without new jobs - corrected shutdown order to prevent a deadlock during shutdown	11 years ago
reger	6878c90f99	fix: IPv6 INTRANET_PATTERNS for local ip (see http://bugs.yacy.net/view.php?id=378 ) requiring following ":" for fc and fd prefix and made pattern match case insensitive - add some more ipv6 test cases to MultiProtocolURLTest.java	11 years ago
Michael Peter Christen	6ed9c0164e	attaching names to all Threads to get a better view in profiling tools like VisualVM	11 years ago
Michael Peter Christen	fdaeac374a	- enhanced postprocessing speed and memory footprint (by using HashMaps instead of TreeMaps) - enhanced memory footprint of database indexes (by introduction of optimize calls) - optimize calls shrink the amount of used memory for index sets if they are not changed afterwards any more	11 years ago
Michael Peter Christen	7c1b968378	another fix for the shutdown exceptions	11 years ago
orbiter	133d41386c	(again) full redesign of ConcurrentUpdateSolrConnector to remove out-of-order transactions regarding add and delete operations. Now all operations (add and delete) are executed concurrently in-order.	11 years ago
Michael Peter Christen	a632b0d2a4	added a forced commit to index deletion to enable synchronized index updates	11 years ago
Michael Peter Christen	3cc5c0ffdd	a concurrency enhancement which was not used because tests showed worse indexing speed. I leave the code there since it may be useful in SolrCloud environments.	11 years ago
Michael Peter Christen	90b47e83e6	fixed shutdown error when closing solr connectors	11 years ago
Michael Peter Christen	7640834b37	removed double concurrency to put Solr documents into the index. The writings to the solr index are also buffered in ConcurrentUpdateSolrConnector	11 years ago
Michael Peter Christen	0f6b72f24b	do not use luke requests for remote solr servers if the result is different from normal requests. This happens if the remote solr is actually a solrCloud; in such cases the luke request returns only the result of the single solr peer, not the whole cloud. also done: some refactoring.	11 years ago
Michael Peter Christen	c57026e242	recover from OOM	11 years ago
Michael Peter Christen	907db8b7a6	fix for bad query shortcut hack	11 years ago
orbiter	cfb647db6e	- introduced a miss cache in ConcurrentUpdateSolrConnector - better usage of cache - bugfix for postprocessing	11 years ago
orbiter	a87d8e4a8e	changed caching of ConcurrentUpdateSolrConnector: it caches now also the url along with the load date. While this takes much more memory, it eliminates database lookups for getURL() requests, which happen equally often. This speeds up remote solr configurations.	11 years ago
orbiter	d3a88eaecb	introducing ConcurrentUpdateSolrServer for remote solr servers. Scaling of write buffers and update queue size is made according to assigned memory.	11 years ago
Michael Peter Christen	254a7ac66c	fixed cleaning of index	11 years ago
Michael Peter Christen	28a7b42e6b	removed warning "sun.misc.BASE64Encoder is internal proprietary API and may be removed in a future release"	11 years ago
Michael Peter Christen	046f5a03cb	one more SolrIndexSearcher bugfix	11 years ago
sixcooler	78c01b3eff	fix for 'AlreadyClosedException: this IndexReader is closed'	11 years ago
Michael Peter Christen	1b5e3d523a	better control over close-state of remote solr connections	11 years ago
Michael Peter Christen	1a364572a5	fix for "org.apache.solr.core.SolrCore Too many close [count:-1] on org.apache.solr.core.SolrCore@51af7c57" -error	11 years ago
Michael Peter Christen	69391e5d9e	changed strategy to test existence of documents in Solr: using the update time. The reason for that is a better caching for the crawler double-check, which needs the update time for crawler steering.	11 years ago

849 Commits (1432a817dd2fd94bea3a6a3f145d7efe968ee727)