yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	ef31d0f279	fix for rss reader, see http://bugs.yacy.net/view.php?id=294	11 years ago
Michael Peter Christen	b28d43decc	added two more fields source_cr_host_norm_i,target_cr_host_norm_i in webgraph and an addition to postprocessing to copy all cr ranking attributes to the link edges associated to the postprocessing documents	11 years ago
Michael Peter Christen	4476dea5ba	do not fail if a wrong boost key is used; instead, print only a warning See also: http://bugs.yacy.net/view.php?id=293	11 years ago
Michael Peter Christen	1b3d26dd23	hack to remove most of the warning: deprecated messages (but not all, one is left)	11 years ago
sixcooler	3c48fc65fd	reverted RemoteInstance to deprecated methods of httpClient-4.2 this should work with current remote-Solr-Instances	11 years ago
sixcooler	0cae420d8e	some dns-timing changes: since httpclient uses the domain-cache it is useful not to clean the domain cache until crawling is running (domains are filled into this cache) On huge crawl-starts (eg. from file) my DNS did not follow the high rates - so I reduced the rate and give some more time(-out)	11 years ago
sixcooler	15b1bb2513	bump to httpClient-4.3	11 years ago
orbiter	d86d2be5c3	automatically removed Places autotagging if no location library is wanted	11 years ago
reger	6b9a624808	remove double declaration of TLD_any_zone_filter	11 years ago
orbiter	6e8377b8ad	do not check all words with synonym library if the library is empty	11 years ago
Michael Peter Christen	2602be8d1e	- removed ZURL data structure; removed also the ZURL data file - replaced load failure logging by information which is stored in Solr - fixed a bug with crawling of feeds: added must-match pattern application to feed urls to filter out such urls which shall not be in a wanted domain - delegatedURLs, which also used ZURLs are now temporary objects in memory	11 years ago
Michael Peter Christen	61c5e40687	- replaced the properties object in AnchorURL with distinct variables for anchor attributes. - as a result, large portions of the parser code had to be adapted as well - added a counter target_order_i for anchor links in webgraph computation	11 years ago
Michael Peter Christen	3ea9bb4427	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	5e31bad711	- the webgraph shall store all links which appear on a web page and not all unique links! This made it necessary that a large portion of the parser and link processing classes be adapted to carry a different type of link collection, which carries property attributes that are attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI, have been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring	11 years ago
reger	603368fc3e	remove redundant declaration of USER_AGENT	11 years ago
Michael Peter Christen	dbef8ccfcb	forced deletion of ZURL entries for a specific host for each host that appears in the crawl url list	11 years ago
Michael Peter Christen	85456f46b2	added two new fields, exact_signature_copycount_i and fuzzy_signature_copycount_i, which count the number of copies of non-unique documents and assign this count to each document. Thus, each document is assigned a number which shows how many copies of this document exist. These fields are disabled by default.	11 years ago
reger	d0e78082d1	return field names in index instead of in schema for SolrServerConnector.getFields	11 years ago
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying to robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on per-crawl-start basis. Every crawl start can have a different user agent.	11 years ago
Michael Peter Christen	6d5fefe060	added missing files :(	11 years ago
Michael Peter Christen	554c0351dd	fix for http://bugs.yacy.net/view.php?id=286	11 years ago
Michael Peter Christen	1c62fa7698	fix for bad snippets in gsa api	11 years ago
orbiter	252c525709	fixed feed api servlet and enhanced RSSReader class	11 years ago
orbiter	d38c3c14d8	fix for CGI test	11 years ago
Michael Peter Christen	f13df9dbb6	migration to solr 4.4.0	11 years ago
Michael Peter Christen	cf12835f20	replaced the single-text description solr field with a multi-value description_txt text field	11 years ago
Michael Peter Christen	83e2921b39	new test case for http://bugs.yacy.net/view.php?id=141	11 years ago
Michael Peter Christen	304aacb2cc	fix for http://bugs.yacy.net/view.php?id=267	11 years ago
orbiter	056b42f5aa	- added information about segment count to status_p.xml - also moved this information from the old index structure, which is still in use for the RWI/DHT index to that front-end	11 years ago
orbiter	6fb2811e68	fixes for problems with remote solr and non-activated webgraph index	11 years ago
Michael Peter Christen	336f86394c	replaced StringBuffer with StringBuilder	11 years ago
Michael Peter Christen	31483c47e1	fixed problem with remote luke requests	11 years ago
Michael Peter Christen	ac1aad5064	added a getSegmentCount method and use it to disable optimize if wanted current segment count is below optimization level	11 years ago
Michael Peter Christen	36035e0a0a	- used reger's LukeRequest to generalize the index info in SolrServerConnector - used the LukeRequest in SolrServerConnector to replace the index size method by a getNumDocs request to a LukeRequest result	11 years ago
Michael Peter Christen	39fceb5ccf	fix for NPE & bug #264	11 years ago
Roland Haeder	aaedc0405d	Fixes and avoid of catching bad exceptions (some): - Rewrote usage of HashMap/Map to concurrent versions (to avoid a CME=ConcurrentModificationException) - Rewrote ConnectionInfo (as an example) to use a synchronized iterator instead of synchronizing an already synced HashSet (see Collections call) - This avoids catching CMEs again - Commented out noisy ConcurrentLog.logException() call Conflicts: source/net/yacy/repository/LoaderDispatcher.java	11 years ago
Roland Haeder	841a28ae76	Added 'final' for all exception blocks as this helps the Java compiler to optimize memory usage Conflicts: source/net/yacy/search/Switchboard.java	11 years ago
orbiter	b71d13a014	added load and deadlock detector in Memory util	12 years ago
orbiter	5533fc8e01	fix for bug 260	12 years ago
Michael Peter Christen	bcc623a843	refactoring of load_delay: this is a matter of client identification	12 years ago
orbiter	dac88561ae	minimum access time has a tight connection to ClientIdentification, therefore it is defined there.	12 years ago
Michael Peter Christen	9a29ab469e	another patch to prevent CLOSE_WAIT status on solr connections	12 years ago
Michael Peter Christen	87e9052081	added Connection:close to all http requests in our http client to prevent CLOSE_WAIT states (as seen in lsof)	12 years ago
Michael Peter Christen	5c6946dd5f	replaced usage of log4j by ConcurrentLog where possible	12 years ago
Michael Peter Christen	5878c1d599	- refactoring of log to ConcurrentLog: jdk-based logger tend to block at java.util.logging.Logger.log(Logger.java:476) in concurrent environments. This makes logging a main performance issue. To overcome this problem, this is a add-on to jdk logging to put log entries on a concurrent message queue and log the messages one by one using a separate process. - FTPClient uses the concurrent logging instead of the log4j logger	12 years ago
orbiter	f4f6551c66	better handling of time-out at solrj in case that a commit is done in a fail-over case during add	12 years ago
Michael Peter Christen	dea71851d2	- better concurrency for network scanner - network scanner can now start from the list of all hosts in the search index	12 years ago
orbiter	9f0cc9b401	enhanced network scanner - textarea input field can now be used to paste in a large list of hosts - /31er subnet is possible (only one host) - auto-detect subdomains for ftp and www subdomains	12 years ago
sixcooler	308d73f855	do not use remote proxy if not switched on - regardless of the proto	12 years ago
sixcooler	69906b1d2e	Revert "do not use remote proxy if not switched on - regardless of the proto" This reverts commit `20f452d228`.	12 years ago

699 Commits (78e7aadb26ad38c30daa1a845b2d9cee3843c853)