yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	5b3acc12cd	Pattern.quote() replaces \\Q and \\E according to publication in http://www.cs.washington.edu/homes/mernst/pubs/regex-types-ftfjp2012.pdf	13 years ago
Michael Peter Christen	e7e381d110	added configuration to switch off redirection following in crawler	13 years ago
Michael Peter Christen	70505107ca	enhanced crawler/balancer: better remaining waiting-time guessing	13 years ago
Michael Peter Christen	f150bc218b	fixed bug in solr error document	13 years ago
Roland 'Quix0r' Haeder	a093ccf5eb	Now used synchronization in all close() methods to make sure all objects are 'closed' in an ordered way Conflicts: source/de/anomic/http/server/ChunkedInputStream.java source/de/anomic/http/server/ChunkedOutputStream.java source/de/anomic/http/server/ContentLengthInputStream.java source/net/yacy/cora/protocol/Domains.java source/net/yacy/cora/services/federated/solr/SolrShardingConnector.java source/net/yacy/cora/services/federated/solr/SolrSingleConnector.java source/net/yacy/document/content/dao/PhpBB3Dao.java source/net/yacy/document/parser/html/AbstractTransformer.java source/net/yacy/kelondro/blob/BEncodedHeap.java source/net/yacy/kelondro/blob/HeapReader.java source/net/yacy/kelondro/index/RAMIndexCluster.java source/net/yacy/kelondro/io/ByteCountInputStream.java source/net/yacy/kelondro/logging/ConsoleOutErrHandler.java source/net/yacy/kelondro/table/SQLTable.java	13 years ago
Michael Peter Christen	ba6aaabc51	refactoring + parser bugfixes	13 years ago
Michael Peter Christen	659178942f	- Redesigned crawler and parser to accept embedded links from the NOLOAD queue and not from virtual documents generated by the parser. - The parser now generates nice description texts for NOLOAD entries which shall make it possible to find media content using the search index and not using the media prefetch algorithm during search (which was costly) - Removed the media-search prefetch process from image search	13 years ago
Michael Peter Christen	f5efdb21fd	refactoring	13 years ago
Michael Peter Christen	f8cd57c92f	new indexing strategy: ALL links that appear anywhere are indexed, not only links where the content can be parsed. All non-parseable links are placed into the noload queue. The search process must therefore be able to filter out non-text search results. - This fixes the problem that image search results appeared in the text search. - The interactive search can retrieve now ALL types of links - The p2p interface is now extended to retrieve only certain types of links (text, image, video, apps) - The search process has an extension to filter the right document type according to the search query	13 years ago
Michael Peter Christen	a1a5b015d8	refactoring: moved document Classification to cora package	13 years ago
Michael Peter Christen	a5d7da68a0	refactoring: removed dependency from switchboard in Balancer/CrawlQueues	13 years ago
Michael Peter Christen	33d1062c79	refactoring: the cache belongs to the crawler	13 years ago
Michael Peter Christen	046f3a7e8d	check if httpc has decompressed the release file and rename the file from .tar.gz to .tar if that happened	13 years ago
Michael Christen	22f05c83ff	fixed default must-match filter for full domain crawls - the old filter was to restrictive and did not allow intranet crawls	13 years ago
Michael Peter Christen	0cc0290978	bugfix for a must-not-match pattern check. This bug did not make the check semantically wrong, but a trick that prevented an IP lookup in case that the filter was not used did not work. That bugfix causes that crawling gets a huge speed boost for noload urls!	13 years ago
Michael Peter Christen	2fc8ecee36	ConcurrentLinkedQueue has a VERY long return time on the .size() method. See http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html and the following test programm: public class QueueLengthTimeTest { public static long countTest(Queue<Integer> q, int c) { long t = System.currentTimeMillis(); for (int i = 0; i < c; i++) { q.add(q.size()); } return System.currentTimeMillis() - t; } public static void main(String[] args) { int c = 1; for (int i = 0; i < 100; i++) { Runtime.getRuntime().gc(); long t1 = countTest(new ArrayBlockingQueue<Integer>(c), c); Runtime.getRuntime().gc(); long t2 = countTest(new LinkedBlockingQueue<Integer>(), c); Runtime.getRuntime().gc(); long t3 = countTest(new ConcurrentLinkedQueue<Integer>(), c); System.out.println("count = " + c + ": ArrayBlockingQueue = " + t1 + ", LinkedBlockingQueue = " + t2 + ", ConcurrentLinkedQueue = " + t3); c = c * 2; } } }	13 years ago
Michael Peter Christen	8aba045ba1	if a new pop-up page is set in config portal, then this page applies also to the default page configuration for the httpd if no path is given.	13 years ago
Michael Peter Christen	c6c61be3f0	fix for http://bugs.yacy.net/view.php?id=148	13 years ago
Michael Peter Christen	0d148c3353	more logging in resource observer	13 years ago
Michael Peter Christen	2fa037ae1d	enhanced crawler	13 years ago
low012	2120db289a	*) Small change which should solve problem with cgitb module in Python CGI scripts.	13 years ago
Lotus	ee89cf5ae5	fix must match filter for full domain crawl allow: http://www.example.com http://www.example.com/ http://www.example.com/abc.html?xyz=q block: http://www.example.com.cn http://www.example.com.cn/dsf	13 years ago
Michael Peter Christen	9ad1d8dde2	complete redesign of crawl queue monitoring: do not look at a ready-prepared crawl list but at the stacks of the domains that are stored for balanced crawling. This affects also the balancer since that does not need to prepare the pre-selected crawl list for monitoring. As a effect: - it is no more possible to see the correct order of next to-be-crawled links, since that depends on the actual state of the balancer stack the next time another url is requested for loading - the balancer works better since the next url can be selected according to the current situation and not according to a pre-selected order.	13 years ago
Michael Peter Christen	4540174fe0	memory hacks	13 years ago
Michael Peter Christen	9ebcae2fbc	enhanced url parser to understand urls with & instead of & in post urls	13 years ago
Michael Peter Christen	1f4f60654a	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/document/parser/pdfParser.java	13 years ago
Michael Peter Christen	e6d26a023f	fix for bookmark crash with possible side-effects on crawl start after the crash	13 years ago
Michael Peter Christen	190b77c55e	added Ukrainian translation	13 years ago
Marek Otahal	72adbeae90	!Important: move from Hashtable to HashMap Hashtable is an obsolete collection v1, now since v2 offers HashMap with same or better functionality. Please review, almost all code was already moved, so only a few changes. That is not the issue, but I found notices that some (ugly big) helper classes had to be created in past to compensate missing Hashtable's functionality. I'd like input if we can remove some of them. look for //FIX: if these commits Signed-off-by: Marek Otahal <markotahal@gmail.com>	13 years ago
Marek Otahal	c1af123ddd	just a little faster toString Signed-off-by: Marek Otahal <markotahal@gmail.com>	13 years ago
Marek Otahal	64e4bcee82	serverSwitch get(App/Data)Path() use common helper method Signed-off-by: Marek Otahal <markotahal@gmail.com>	13 years ago
Marek Otahal	371fbb4deb	just comment + shorter code in serverSwitch Signed-off-by: Marek Otahal <markotahal@gmail.com>	13 years ago
Marek Otahal	ed253b7aff	update javadoc, does not throw IOException Signed-off-by: Marek Otahal <markotahal@gmail.com>	13 years ago
Michael Peter Christen	2ee8cbeb2c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/search/Switchboard.java	13 years ago
Michael Peter Christen	992dbdf4bb	added noload statistic to servlets	13 years ago
Michael Christen	354b976110	fix for concurrency problem and endless loop in /suggest.json	13 years ago
Michael Christen	c21966bb43	fix	13 years ago
Michael Christen	13b05f9c08	fix	13 years ago
Michael Christen	e5d878c59e	Merge branch 'master' of ssh://gitorious.org/yacy/rc1 Conflicts: source/de/anomic/crawler/CrawlQueues.java	13 years ago
Michael Christen	ec26b2bea4	Merge commit 'fa08ed5ae5d72bddc3cc6a662b23103579e86109' into quix0r Conflicts: source/de/anomic/crawler/CrawlQueues.java	13 years ago
Michael Christen	216a287a85	Merge commit '6d4e08ed06c5cd28c45981b2ebe31c7f7ec6fd83' into quix0r Conflicts: source/de/anomic/crawler/CrawlQueues.java	13 years ago
stbrumm	d18095dc48	Patch fuer Issue 0000102 and fixes to Patch (private peer status is a property of a peer, not a status)	13 years ago
Roland 'Quix0r' Haeder	901f37d608	Also this ... :( #2	13 years ago
Roland 'Quix0r' Haeder	a985717ed2	Also this ... :(	13 years ago
Roland 'Quix0r' Haeder	5f490de554	Fix for ported fix from my old days ...	13 years ago
Roland 'Quix0r' Haeder	fa08ed5ae5	Fixed a lot CHMOD rights (no need for execute flag on .java/.html) and introduced local/remote crawl size ratio based check	13 years ago
Michael Christen	9e5894c784	Removed handling of components objects for URIMetadataRows. This is a preparation to replace this rows with nodes from the node store.	13 years ago
Michael Christen	c04bfaa51b	refactoring	13 years ago
Michael Christen	17f962fceb	translator updates: - config string for chinese - do not copy the language file to DATA/LOCALE any more (and do not use them there, this is really confusing for new translators)	13 years ago
Michael Christen	752b092b8a	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	13 years ago

1 2 3 4 5 ...

4744 Commits (5b3acc12cd4b4343c4e7d7f0a20a1da8ea8d5f6a)