yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	f86fe90eda	enhanced mass storage speed to remote solr servers	11 years ago
Michael Peter Christen	6ed9821209	fixed several problems in solr connectors	11 years ago
Michael Peter Christen	191fd3d7e7	added an optimization option to HandleSet mass data storage structure	11 years ago
Michael Peter Christen	94b565ea0d	fixed keepalive min value	11 years ago
Michael Peter Christen	24a052ecb9	removed debug code for existsByIds	11 years ago
Michael Peter Christen	1a4a69c226	set more logger to 'final static'	11 years ago
orbiter	b085cb522b	replaced old existsByIds for embedded Solr with obviously much faster new selection method (including still existing debug code to test that this is in fact better)	11 years ago
Michael Peter Christen	899e7e92b0	added debug code	11 years ago
Michael Peter Christen	81bb50118e	found and fixed a huge memory leak in solr caching (inside Solr). The not-flushed Solr cache is now handled in this way: - it is smaller by default - a Solr-internal process is started to flush the cache periodically (this does NOT clean the cache, just removes old objects) - a Solr-external process (the standard YaCy cleanup-process) now has direct access to the solr internal cache and flushes them completely. The time frame for such a flush is defined by the cleanup-process frequency, by default 10 minutes.	12 years ago
Michael Peter Christen	b2c329929f	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	60187a4ec2	fix in html parser	12 years ago
Michael Peter Christen	e1c1e57877	less overhead calling exist() with only one hash	12 years ago
reger	3d5d366f1c	fix html header in Solr HTMLResponseWriter - move 1st body content after </head> tag - add closing <span> tag	12 years ago
Michael Peter Christen	5a02d650ee	avoid cloning	12 years ago
Michael Peter Christen	cc39667399	Speed enhancements and less CPU usage during Solr searches when using the embedded Solr (the default). This was obtained by circumventing solrj search encapsulation and the implementation of direct index access methods to Solr. The effect will not only be seen during search, but this has also a strong effect on suggestions (much more) and less CPU power usage during index distribution (which needs many search requests)	12 years ago
Michael Peter Christen	9bb7eab389	hacks to prevent storage of data longer than necessary during search and some speed enhancements. This should reduce the memory usage during heavy-load search a bit.	12 years ago
Michael Peter Christen	1a8783147b	enhanced computation of number of solr documents.	12 years ago
Michael Peter Christen	4948c39e48	added concurrency for mass crawl check	12 years ago
Michael Peter Christen	1b4fa2947d	- fixed a problem which occurred when a document was not recognized with the right content domain (i.e. identifying that it is an image, text etc.) because it used the file extension and not an existing mime type assignment. - fixed the new setting that images shall be loaded for a better image search. - both fixes together makes it now possible to crawl commons.wikimedia.org which makes use of 'funny' document names (i.e. ending with .jpg while the document is html)	12 years ago
Michael Peter Christen	74d0256e93	enhanced postprocessing: fixed bugs, enable proper postprocessing also without the harvestingkey, remove crawl profiles after postprocessing, speed-up for clickdepth computation.	12 years ago
sixcooler	d9a02ed277	NPE fix for my last commit	12 years ago
sixcooler	61f627eb85	fix for ssl-connections from proxy-usage staying in close-wait-state + some extra 'close' in HttpClient	12 years ago
Michael Peter Christen	1b61bd40ed	- Added new solr field url_file_name_tokens_t which stores the file name tokens. This can be used to enhance the ranking. - Added also a rating_i field as basis for later usage. - enhanced the tokenization process.	12 years ago
sixcooler	d536092fe4	fix false fill NAME_CACHE_MISS-DNS-Cache in case of a timeout for eg. caused by massive requests when crawl from file	12 years ago
Michael Peter Christen	ef31d0f279	fix for rss reader, see http://bugs.yacy.net/view.php?id=294	12 years ago
Michael Peter Christen	b28d43decc	added two more fields source_cr_host_norm_i,target_cr_host_norm_i in webgraph and an addition to postprocessing to copy all cr ranking attributes to the link edges associated to the postprocessing documents	12 years ago
Michael Peter Christen	4476dea5ba	do not fail if a wrong boost key is used; instead, print only a warning See also: http://bugs.yacy.net/view.php?id=293	12 years ago
Michael Peter Christen	1b3d26dd23	hack to remove most of the warning: deprecated messages (but not all, one is left)	12 years ago
sixcooler	3c48fc65fd	reverted RemoteInstance to deprecated methods of httpClient-4.2 this should work with current remote-Solr-Instances	12 years ago
sixcooler	0cae420d8e	some dns-timing changes: since httpclient uses the domain-cache it is useful not to clean the domain cache until crawling is running (domains are filled into this cache) On huge crawl-starts (eg. from file) my DNS did not follow the high rates - so I reduced the rate and give some more time(-out)	12 years ago
sixcooler	15b1bb2513	bump to httpClient-4.3	12 years ago
orbiter	d86d2be5c3	automatically removed Places autotagging if no location library is wanted	12 years ago
reger	6b9a624808	remove double declaration of TLD_any_zone_filter	12 years ago
orbiter	6e8377b8ad	do not check all words with synonym library if the library is empty	12 years ago
Michael Peter Christen	2602be8d1e	- removed ZURL data structure; removed also the ZURL data file - replaced load failure logging by information which is stored in Solr - fixed a bug with crawling of feeds: added must-match pattern application to feed urls to filter out such urls which shall not be in a wanted domain - delegatedURLs, which also used ZURLs are now temporary objects in memory	12 years ago
Michael Peter Christen	61c5e40687	- replaced the properties object in AnchorURL with distinct variables for anchor attributes. - this caused that large portions of the parser code had to be adapted as well - added a counter target_order_i for anchor links in webgraph computation	12 years ago
Michael Peter Christen	3ea9bb4427	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	5e31bad711	- the webgraph shall store all links which appear on a web page and not all unique links! This made it necessary, that a large portion of the parser and link processing classes must be adapted to carry a different type of link collection which carry a property attribute which are attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI had been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring	12 years ago
reger	603368fc3e	remove redundant declaration of USER_AGENT	12 years ago
Michael Peter Christen	dbef8ccfcb	forced deletion of ZURL entries for a specific host for each host that appears in the crawl url list	12 years ago
Michael Peter Christen	85456f46b2	added two new fields, exact_signature_copycount_i and fuzzy_signature_copycount_i, which count the number of copies of non-unique documents and assigns this to each document. Thus, each document there is a number assigned which shows how many copies of this document exists. These fields are disabled by default.	12 years ago
reger	d0e78082d1	return field names in index instead of in schema for SolrServerConnector.getFields	12 years ago
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying to robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on per-crawl-start basis. Every crawl start can have a different user agent.	12 years ago
Michael Peter Christen	6d5fefe060	added missing files :(	12 years ago
Michael Peter Christen	554c0351dd	fix for http://bugs.yacy.net/view.php?id=286	12 years ago
Michael Peter Christen	1c62fa7698	fix for bad snippets in gsa api	12 years ago
orbiter	252c525709	fixed feed api servlet and enhanced RSSReader class	12 years ago
orbiter	d38c3c14d8	fix for CGI test	12 years ago
Michael Peter Christen	f13df9dbb6	migration to solr 4.4.0	12 years ago
Michael Peter Christen	cf12835f20	replaced the single-text description solr field with a multi-value description_txt text field	12 years ago

723 Commits (f86fe90edae0d40316771f4eb3138e0056f56cee)