yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	22e1f68c0b	solrj user authentication patch	13 years ago
Michael Peter Christen	09484955dc	added new entry class for embed tags	13 years ago
Michael Peter Christen	62f2554a01	- fixed build problems (deprecated methods using httpclient 3.1) - removed httpclient 3.1 lib which was used by solrj (solrj now uses httpclient 4)	13 years ago
Michael Peter Christen	a6d60fc21f	concurrency enhancement in ConfigurationSet	13 years ago
Michael Peter Christen	453010bd68	- solved problems with backpath normalization - redesigned in/outbound link handover - removed iframe links from inbound/outbound in solr scheme	13 years ago
Michael Peter Christen	5f5ed33ed8	patch for media search (audio, video apps)	13 years ago
Michael Peter Christen	7860c1df80	fix needed for new solrj library	13 years ago
Michael Peter Christen	0e13022147	- enhanced solr field documentation - added xml api button to IndexFederated_p - the solr schema.xml file can be generated by YaCy	13 years ago
Michael Peter Christen	19efbf1b0f	- apply directDocByURL to NOLOAD Queue - choose pushing to NOLOAD as default for site crawl	13 years ago
Michael Peter Christen	659178942f	- Redesigned crawler and parser to accept embedded links from the NOLOAD queue and not from virtual documents generated by the parser. - The parser now generates nice description texts for NOLOAD entries which shall make it possible to find media content using the search index and not using the media prefetch algorithm during search (which was costly) - Removed the media-search prefetch process from image search	13 years ago
Michael Peter Christen	a3badd3205	changed search process for images: no more media snippet load process, show only links from index which had been on the text search page before. This creates a superfast search process for images!	13 years ago
reger	c1f6b4fb52	lookupByIP: prevent comparing of port parameter if called with port -1 (=unknown)	13 years ago
Michael Peter Christen	f8cd57c92f	new indexing strategy: ALL links that appear anywhere are indexed, not only links where the content can be parsed. All non-parseable links are placed into the noload queue. The search process must therefore be able to filter out non-text search results. - This fixes the problem that image search results appeared in the text search. - The interactive search can retrieve now ALL types of links - The p2p interface is now extended to retrieve only certain types of links (text, image, video, apps) - The search process has an extension to filter the right document type according to the search query	13 years ago
Michael Peter Christen	14f67f217c	refactoring of ContentDomain: now subclass of Classification	13 years ago
Michael Peter Christen	8a08c96a82	removed dependency from logging	13 years ago
Michael Peter Christen	a1a5b015d8	refactoring: moved document Classification to cora package	13 years ago
Michael Peter Christen	33d1062c79	refactoring: the cache belongs to the crawler	13 years ago
Michael Peter Christen	4d5da75814	fix for parser problem if a <a>-tag is 'within' html tags with unclosed tags. That prevented the <a> tags from being recognized. This is a fix for http://forum.yacy-websuche.de/viewtopic.php?p=25516#p25516	13 years ago
Michael Peter Christen	91a86f0b06	fixed to network graph testing	13 years ago
Michael Peter Christen	7b5b9baee0	added citation rank to ranking profile	13 years ago
Michael Peter Christen	046f3a7e8d	check if httpc has decompressed the release file and rename the file from .tar.gz to .tar if that happened	13 years ago
Michael Christen	02e4dedff2	fix to url citation collection	13 years ago
Michael Christen	e32055aa15	added stub classes for - a new database for url reference data ('seen links') - a new database extending the references to the full url metadata attributes set which shall replace the old metadata database if it is finished - migration help classes stub to use old and new metadata databases simultaneously	13 years ago
Michael Christen	ac5d124ee0	experimental implementation of a citation ranking as post-ranking method. (ranking coefficient fixed, need to be made configurable)	13 years ago
Michael Christen	8fc86fe397	added storage of full anchor link structure: the links between all pages are now stored. The same index structure as used for the word index is used to make a reverse link index. The new file(s) in SEGMENT/default/citation.index.*.blob store the citation index. This will be used to create much more detailed link structures for the YaCy apis and to create a better ranking. A ranking using the citation.index should provide better results especially for portal indexes and intranets.	13 years ago
Lotus	0b3f39136e	allow custom ppm lower than minimum button on /Crawler_p.html fixes http://bugs.yacy.net/view.php?id=166	13 years ago
Michael Peter Christen	532c7cf827	added physics experiment to the graph plotter. not active by default	13 years ago
Michael Peter Christen	aba9b1bfa0	better names for elements of a linked graph	13 years ago
Michael Peter Christen	2fc8ecee36	ConcurrentLinkedQueue has a VERY long return time on the .size() method. See http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html and the following test programm: public class QueueLengthTimeTest { public static long countTest(Queue<Integer> q, int c) { long t = System.currentTimeMillis(); for (int i = 0; i < c; i++) { q.add(q.size()); } return System.currentTimeMillis() - t; } public static void main(String[] args) { int c = 1; for (int i = 0; i < 100; i++) { Runtime.getRuntime().gc(); long t1 = countTest(new ArrayBlockingQueue<Integer>(c), c); Runtime.getRuntime().gc(); long t2 = countTest(new LinkedBlockingQueue<Integer>(), c); Runtime.getRuntime().gc(); long t3 = countTest(new ConcurrentLinkedQueue<Integer>(), c); System.out.println("count = " + c + ": ArrayBlockingQueue = " + t1 + ", LinkedBlockingQueue = " + t2 + ", ConcurrentLinkedQueue = " + t3); c = c * 2; } } }	13 years ago
Michael Peter Christen	8aba045ba1	if a new pop-up page is set in config portal, then this page applies also to the default page configuration for the httpd if no path is given.	13 years ago
Michael Peter Christen	8c06925984	animation of the web structure picture	13 years ago
Michael Peter Christen	898fa7c3f3	use tld heuristic to check if a domain is local or global	13 years ago
Michael Peter Christen	213c8d97f2	use fewer processes in process pool	13 years ago
Michael Peter Christen	c639248c23	protection against strange answers from remote peers during search	13 years ago
Michael Peter Christen	36e4d82b27	changed ranking	13 years ago
Michael Peter Christen	096c17e7cd	added test code	13 years ago
Michael Peter Christen	665626a51b	catch OOM errors during scanning	13 years ago
Michael Peter Christen	1cd711d005	added classes for citation references (for new citation ranking)	13 years ago
Michael Peter Christen	33a405dab8	ipv6 bugfix	13 years ago
Michael Peter Christen	c6c61be3f0	fix for http://bugs.yacy.net/view.php?id=148	13 years ago
Michael Peter Christen	e0f1e7d904	added new citation reference data structure that shall be used for a citation ranking	13 years ago
Michael Peter Christen	e18a4f6b74	more tolerant merge iterator	13 years ago
Michael Peter Christen	e101c2e0e2	added changes from copperdust (submitted by email): 1. Improved and fixed language detection: 1.1 Identificator.java - recognition fix (improved) 1.2 DCEntry.java - fix (changed detection order due to detection from tld in many cases is incorrect) 1.3 MultiProtocolURI.java - fixed and enhanced language from tld detection (all currently used top-level domains; ccTLD added but not tested). 2. Ukrainian language update. 3. Main Slavic languages langstats (tested and works fine).	13 years ago
Michael Peter Christen	8d63a5887c	bugfixes	13 years ago
Michael Peter Christen	9ad1d8dde2	complete redesign of crawl queue monitoring: do not look at a ready-prepared crawl list but at the stacks of the domains that are stored for balanced crawling. This affects also the balancer since that does not need to prepare the pre-selected crawl list for monitoring. As a effect: - it is no more possible to see the correct order of next to-be-crawled links, since that depends on the actual state of the balancer stack the next time another url is requested for loading - the balancer works better since the next url can be selected according to the current situation and not according to a pre-selected order.	13 years ago
Michael Peter Christen	7e4e3fe5b6	free some memory after parsing html	13 years ago
Michael Peter Christen	4540174fe0	memory hacks	13 years ago
Michael Peter Christen	b4409cc803	small redesign of blob column index and usage	13 years ago
Michael Peter Christen	d5c1f2746e	performance hack	13 years ago
Michael Peter Christen	803963aebd	performance hack: better space grow in CharBuffer (speeds up html parser)	13 years ago
Michael Peter Christen	8b0920b0b5	tried to fix the ipv6 problem as reported in bug but this did not solve all problems because a bug in the apache http client prevented that it worked. Thread dump: Caused by: java.lang.NumberFormatException: For input string: "1450:400c:c01:0:0:0:69" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48) at java.lang.Integer.parseInt(Integer.java:458) at java.lang.Integer.parseInt(Integer.java:499) at org.apache.http.client.utils.URIUtils.extractHost(URIUtils.java:310) at org.apache.http.impl.client.AbstractHttpClient.determineTarget(AbstractHttpClient.java:764) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754) at net.yacy.cora.protocol.http.HTTPClient.execute(HTTPClient.java:597) at net.yacy.cora.protocol.http.HTTPClient.getContentBytes(HTTPClient.java:558) at net.yacy.cora.protocol.http.HTTPClient.GETbytes(HTTPClient.java:341) at de.anomic.crawler.retrieval.HTTPLoader.load(HTTPLoader.java:131) at de.anomic.crawler.retrieval.HTTPLoader.load(HTTPLoader.java:74) at net.yacy.repository.LoaderDispatcher.loadInternal(LoaderDispatcher.java:274) at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:164) at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:150) at net.yacy.repository.LoaderDispatcher.loadDocument(LoaderDispatcher.java:355) at getpageinfo_p.respond(getpageinfo_p.java:97)	13 years ago
Michael Peter Christen	e2f8f263e8	changed storage of search words: keep order	13 years ago
Michael Peter Christen	ed39ef2890	changed generation of protocol information	13 years ago
Michael Peter Christen	0b67a0a5d8	added a column index for tables in blob files. This is heavily used during receiving of DHT submissions and when answering remote search requests. Both events together may have caused IO-deadlocking and this commit shall fix that.	13 years ago
Michael Peter Christen	2e5cd6a1b2	fixed parser extension deny list generation and usage	13 years ago
Michael Peter Christen	8bee1472c9	there is no noindex, only nofollow in links	13 years ago
Michael Peter Christen	3cd6dcd352	do not add new solr fields as activated fields	13 years ago
Michael Peter Christen	e3bb73c3d6	serialized some database access methods	13 years ago
Michael Peter Christen	7e728867e5	added a synchronization around iterations to prevent IO-deadlocking during concurrent remote search requests	13 years ago
Michael Peter Christen	355ecf330f	reduced target file size to 64mb	13 years ago
Michael Peter Christen	10ae6d94a1	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	13 years ago
Michael Peter Christen	2ea585d616	fix for host navigator	13 years ago
Michael Peter Christen	2f6dde92e2	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	13 years ago
Michael Peter Christen	c560a582ac	fix for single-word vocabulary lines	13 years ago
Michael Peter Christen	4c5edab1ec	added option to have exception search result windows	13 years ago
Michael Peter Christen	046d7de95b	Merge remote branch 'reger/master'	13 years ago
reger	a95f645a61	Bugfix class repository.Loaddispatcher fixed download file limit of 10000 line 355: final Response response = this.load(request, cachePolicy, 10000, true);	13 years ago
Michael Peter Christen	ef78f22ee1	performance hack	13 years ago
Michael Peter Christen	41536eb4a2	performance hack	13 years ago
Michael Peter Christen	f91487fc50	added delete-button for host navigation	13 years ago
Michael Peter Christen	e8d24fd802	author navigator can be switched off	13 years ago
Michael Peter Christen	558ab7bd4e	made the protocol navigator reversible	13 years ago
Michael Peter Christen	96cb75f1d4	made the filetype navigator be able to deselect the search constraint	13 years ago
Michael Peter Christen	1f4f60654a	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/document/parser/pdfParser.java	13 years ago
reger	32104360ce	PDFParser - return at least first 3 pages of PDF fix for pdf parsing without returning parsed text due to interruption by time out.	13 years ago
Michael Peter Christen	ef5192f8c9	using the generic document parser for crawl starts instead of the html parser. This makes it possible that every type of document can be a crawl start point, not only text documents or html documents. Tested this with a pdf document.	13 years ago
Michael Peter Christen	a02fdf8625	better error messages	13 years ago
Michael Peter Christen	eadb58dd87	small enhancements in pdf parser	13 years ago
Michael Peter Christen	c6ba44468e	timeout = 5000 instead 3000	13 years ago
reger	b616de5973	PDFParser - return at least first 3 pages of PDF fix for pdf parsing without returning parsed text due to interruption by time out.	13 years ago
Lotus	c73af39e54	refactoring of tray icon class, now uses Java 6 methods natively	13 years ago
Michael Peter Christen	4eff0e26f1	npe bugfix	13 years ago
low012	8776b84c10	*) small fix to make password change function of reconfigureYACY.sh work again	13 years ago
Michael Peter Christen	1a0b6b3913	get more navigation details to search results	13 years ago
Michael Peter Christen	7f9b6b7a0c	added switches to ConfigParser to accept/deny documents by their extension	13 years ago
Michael Peter Christen	4901cee3cc	suppress auto-tagged subject entries when sending out or receiving metadata from other peers	13 years ago
Michael Peter Christen	83009d86f7	added the vocabulary navigator. It can be very simply tested by switching on the locale dictionaries.	13 years ago
sixcooler	985b78cf89	correct 'avaiable()' to use max of young / eden	13 years ago
sixcooler	4da8746275	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	13 years ago
sixcooler	c9aaa9e00a	respect non-reserved Memory in GenerationMemoryStrategy and enable it again	13 years ago
Michael Peter Christen	37f2d1b3e9	replaced Thread initialization with ExecutorService pool for delete method. This is much faster and produces less blocking when using the Compressor class which is used by the HTCache. I.e. picture search is much faster now.	13 years ago
Michael Peter Christen	a58dc4a91f	added autotagging to document condenser: - tags that are automatically generated now enrich the dc:subject - auto-generated tags have a '$' at the beginning of the tag - auto-generated tags lead the tag name with a vocabulary name each tag has the form $<vocabulary-name>:<tag-printname-space-replaced-by-'_'>	13 years ago
Michael Peter Christen	0d6176804b	emergency disabling of GenerationMemoryStrategy because of non-working available-method	13 years ago
Lotus	411aab02e3	Windows installer now detects reliably whether YaCy runs. A file lock on the yacy.running file has been implemented.	13 years ago
Michael Peter Christen	87f0210480	enriched log output to find NPE in HeapReader	13 years ago
Michael Peter Christen	987b412491	updated solr scheme: generic declaration of solr schemes	13 years ago
Michael Peter Christen	254adea51c	small fixes	13 years ago
Michael Peter Christen	49be60a7c8	WorkflowProcess is forced to make small pauses if shortMemoryStatus is reached.	13 years ago
Michael Peter Christen	b7bb84c0bb	set a limit to CharBuffer object size to fight against bad/too large content	13 years ago
Michael Peter Christen	c602eaaf46	enhanced search process	13 years ago
Michael Peter Christen	087f97d4c0	less noise if a browser cannot be opened	13 years ago
Michael Christen	eff966f396	fix for search process (it was aborted too early during remote search)	13 years ago
Michael Christen	e6d51363ee	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	13 years ago
Marek Otahal	a231d0eeb9	Run from Java the whole app YACY start for java webStart allow for better integration with IDE Conflicts: source/net/yacy/gui/framework/Browser.java	13 years ago
Marek Otahal	72adbeae90	!Important: move from Hashtable to HashMap Hashtable is an obsolete collection v1, now since v2 offers HashMap with same or better functionality. Please review, almost all code was already moved, so only a few changes. That is not the issue, but I found notices that some (ugly big) helper classes had to be created in past to compensate missing Hashtable's functionality. I'd like input if we can remove some of them. look for //FIX: if these commits Signed-off-by: Marek Otahal <markotahal@gmail.com>	13 years ago
Marek Otahal	f40efb39af	Blacklist loadList() remove duplicates by using Set Signed-off-by: Marek Otahal <markotahal@gmail.com>	13 years ago
Marek Otahal	f75b5e40e0	little fix in copy() Signed-off-by: Marek Otahal <markotahal@gmail.com>	13 years ago
Marek Otahal	1dc5d9f0f3	make ConnectionInfo comparable and sort list of connections in Connections_p ConnectionInfo compare by initTime Connections_p implement wish to sort connections, descending Signed-off-by: Marek Otahal <markotahal@gmail.com>	13 years ago
Michael Christen	fa8da7f89d	vocabularies are now also used as source for a did-you-mean computation	13 years ago
Michael Christen	eaec14ecc4	Dictionaries from words caches can now be used as autotagging vocabulary	13 years ago
Michael Peter Christen	91940fdf56	redesign of WordCache to be prepared to hold multiple independent dictionaries. Such dictionaries can then be also used as simplified vocabularies.	13 years ago
Michael Christen	bd40a10230	added autotagging stub .. only reading and parsing of vocabularies at this time	13 years ago
Michael Peter Christen	2ee8cbeb2c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/search/Switchboard.java	13 years ago
Michael Peter Christen	992dbdf4bb	added noload statistic to servlets	13 years ago
Michael Christen	eebc02f5c1	fix	13 years ago
Michael Christen	216a287a85	Merge commit '6d4e08ed06c5cd28c45981b2ebe31c7f7ec6fd83' into quix0r Conflicts: source/de/anomic/crawler/CrawlQueues.java	13 years ago
stbrumm	d18095dc48	Patch for Issue 0000102 and fixes to Patch (private peer status is a property of a peer, not a status)	13 years ago
stbrumm	9f1b1b4604	Type for Robinson-Mode/Private Peer added	13 years ago
Michael Christen	20962a4ed7	added metadata node stub for metadata from blobs	13 years ago
Michael Christen	575dbbaa93	enhancements in Blob retrieval: try to use less CPU resources by testing a blob first that most certainly has wanted entries.	13 years ago
Michael Christen	585a8f3c44	fixed a bug in search sequence (caused empty results)	13 years ago
Michael Christen	361146dd7a	better error handling for file loader	13 years ago
Roland 'Quix0r' Haeder	6d4e08ed06	Rewrote filesize() to (hopefully) avoid a NPE, rewrote Blacklist class to concurrent classes to avoid a CME	13 years ago
Roland 'Quix0r' Haeder	fa08ed5ae5	Fixed a lot of CHMOD rights (no need for execute flag on .java/.html) and introduced local/remote crawl size ratio based check	13 years ago
Roland Haeder	319fd1f4aa	A concurrent access can happen on the blacklist (with latest introduced blacklist check in media snippet computation)	13 years ago
Roland 'Quix0r' Haeder	a3083d13bf	Blacklist checks are now always turned on, in media searches (e.g. image search) images matching blacklist entries are no longer shown to the user	13 years ago
Michael Christen	52184a1170	fix for search process	13 years ago
Michael Christen	85bd4cc8bc	better lookup for peer names	13 years ago
Michael Christen	20e3084bd4	redesign of finding of peers by ip: more lightweight method to read the seed databases	13 years ago
Michael Christen	0797b0de99	new handling of remote search processes: looking for seeds will now not block the whole search process any more. A deadlock with a DHT selection process may have been the cause for interface lockings in the past.	13 years ago
Michael Christen	ee9aae5cc0	more about CreativeCommons license vocabulary	13 years ago
Michael Christen	ecd74fe34f	less dramatic upnp failures	13 years ago
Michael Christen	c75e1a3125	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	13 years ago
Michael Christen	13f5b5f80d	the component part in the YaCy Metadata is filled using the Dublin Core vocabulary	13 years ago
Michael Peter Christen	8d2cbfb685	more vocabularies and more semantics for lod data structures	13 years ago
Michael Christen	9cd36b4c44	added vocabulary for geolocalization as used in georss	13 years ago
Michael Christen	9e5894c784	Removed handling of components objects for URIMetadataRows. This is a preparation to replace this rows with nodes from the node store.	13 years ago
Michael Christen	66ab51f89d	added rdf vocabulary	13 years ago
Michael Christen	c04bfaa51b	refactoring	13 years ago
Michael Peter Christen	136b514f52	added a Triple Store based on Nodes that fit to the new storage classes. Added also a first Vocabulary for the node store - Dublin Core.	13 years ago
Michael Peter Christen	613ab6a69d	added BEncodedHeapBag and BEncodedHeapShard which are storage container for a new metadata store. An abstraction of the content for this storage is defined with MapStore. A MapStore is an abstraction of a RDF Node store.	13 years ago
Michael Christen	6fecd0db88	one more performance hack to prevent costly md5 computation	13 years ago
Michael Christen	e13441b069	better digest pool size (smaller by default but unlimited)	13 years ago
Michael Christen	1f4afb4dc0	performance hacks	13 years ago
Michael Christen	675d557e88	removed debug logging	13 years ago
Michael Christen	e9dc99fe15	added rules to set specific RWIs as private RWIs which are not transmitted to remote peers. This will be used for private index copies and phonetic indexes.	13 years ago
Michael Peter Christen	4243ace863	added phonetic classes	13 years ago
Michael Peter Christen	0bcef2d156	added feature as requested in http://forum.yacy-websuche.de/viewtopic.php?f=18&t=3461 The search can now be configured with a non-display host list. The search will always exclude the given list of hosts unless they are requested directly using the host navigation	13 years ago
Michael Christen	204c29f010	small bugfixes for search result display and cache display	13 years ago
Michael Christen	17f962fceb	translator updates: - config string for chinese - do not copy the language file to DATA/LOCALE any more (and do not use them there, this is really confusing for new translators)	13 years ago

1105 Commits (2280a7b276df243b687fb7a3ed10c939fd25c9c4)