yacy_search_server

Commit Graph

Author	SHA1	Message	Date
orbiter	e816b88b55	changed behaviour of metadata storage: in case that any solr is attached, the metadata is not written to the metadata-db, even if it is enabled but instead to solr. This prevents that metadata is written in two store systems at the same time. It is also the next step to migrate the current metadata-db to solr.	12 years ago
orbiter	2571e0d47a	removed unused classes	12 years ago
Michael Peter Christen	f9c0e6e950	- Implemented and integrated the URIMetadataNode object which is a metadata representation from the solr index. This shall replace metadata from the built-in database in the future. - added the Solr-driven metadata into the search index of YaCy which makes it now possible to run YaCy without the old metadata index. This is a major step forward to a full migration to Solr.	12 years ago
Michael Peter Christen	b2b480fff2	more abstraction of the YaCySchema -> Opensearch matching process	12 years ago
Michael Peter Christen	24462e9baa	set the title every time, it is possible that it has changed	12 years ago
Michael Peter Christen	dcc72799c4	better abstraction for result writers using controlled vocabularies and URIRefs	12 years ago
Michael Peter Christen	136fcb1ad9	refactoring	12 years ago
Michael Peter Christen	a12f693ec9	added two response writers for embedded solr interface: a rss/opensearch writer and an enhanced solr xml writer. The enhanced solr writer has less configuration overhead than the original writer and should be slightly faster. The rss/opensearch writer is at this time slightly incomplete compared with the already existing rss search result from YaCy and also snippets are missing at this time. To test the new interface, open for example: http://localhost:8090/solr/select?wt=rss&q=olympia The wt-codes for the new result writers are: wt=rss for opensearch, wt=exml for the enhanced solr xml writer. Additionally, the SRU search parameters had been added to the solr interface which can now also be used for a normal solr/xml search.	12 years ago
Michael Peter Christen	bca4a16603	replaced the multivalue generic string field name suffix _ss by _txt because _ss is not part of the standard solr example schema.	12 years ago
orbiter	67edfd991c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
orbiter	d9173ba7ed	added more solr fields to integrate values from URIMetadataRow. All writings to the Metadata-DB are now also done to solr. This includes metadata transfer during search and rwi transfer. The new/added solr fields are: ## time when resource was loaded load_date_dt ## date until resource shall be considered as fresh fresh_date_dt ## id of the host, a 6-byte hash that is part of the document id host_id_s ## ids of referrer to this document referrer_id_ss ## the md5 of the raw source md5_s ## the name of the publisher of the document publisher_t ## the language used in the document; starts with primary language language_ss ## an external ranking value ranking_i ## the size of the raw source size_i ## number of links to audio resources audiolinkscount_i ## number of links to video resources videolinkscount_i ## number of links to application resources applinkscount_i	12 years ago
Michael Peter Christen	3276508d1b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	3ce04cecf3	bad hack to prevent a bug appearing in solr	12 years ago
sixcooler	f32aa9a49c	prevent merge of blobs that can't be handled in memory	12 years ago
Michael Peter Christen	bbd242afb4	fix for a NPE	12 years ago
Michael Peter Christen	24d9db1613	snippet retrieval loading processes may use a smaller minimum load time value than crawling processes. This speeds up the search result preparation dramatically.	12 years ago
Michael Peter Christen	ef488a15f7	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	1687737771	Abstraction of HandleMap and HandleSet	12 years ago
sixcooler	76b037a20a	check content domain fix: search image/media should not show pages containing image/media search text should show all/text but image/media	12 years ago
sixcooler	9cd409682f	close augmented stream if filled from cache to get its content use augmented stream if proxyAugmentation is set only	12 years ago
Michael Peter Christen	e432bb9cd9	better calculation of possible saving in HeapReader index data structure	12 years ago
Michael Peter Christen	9549984c65	documentation/comments	12 years ago
Michael Peter Christen	3bcd9d622b	cleaned up classes and methods which are either superfluous at this time or will be superfluous or subject of complete redesign after the migration to solr. Removing these things now will make the transition to solr more simple.	12 years ago
Michael Peter Christen	6f1ddb2519	Moved solr index-add method to the same method where the YaCy index is written. Also done some code-cleanup.	12 years ago
Michael Peter Christen	315d83cfa0	cleanup	12 years ago
Michael Peter Christen	1f41d9c6f5	bugfix for a NPE	12 years ago
Michael Peter Christen	76202f068e	extended abstraction of local and remote solr index using one front-end for index administration and querying.	12 years ago
Michael Peter Christen	d3f243e2e1	fixed node type calculation for principal peers	12 years ago
Michael Peter Christen	826967513b	changed options in IndexFederated_p to switch on/off parts of the index individually. The settings are experimental and the values of the settings will be overwritten when an index migration from urldb to solr starts.	12 years ago
Michael Peter Christen	cba4ab862e	fix for http://bugs.yacy.net/view.php?id=202	12 years ago
orbiter	69e743d9e3	- more abstraction for the RWI index as preparation for solr integration - added options in search index to switch parts of the index on or off	12 years ago
orbiter	05a3ffd03a	patches to ensure that solr connectors are active only if they have a solr object assigned and vice versa	13 years ago
orbiter	5a3c829872	embedded solr is only initiated if it is activated with IndexFederated_p.html	13 years ago
Michael Peter Christen	97b7bcf2a6	added a solr search index - by default, a (empty) solr storage instance is created at SEGMENTS/solr_36 - the index is written if in /IndexFederated_p.html the flag "embedded solr search index" is switched on - a standard solr query interface is available now with a new servlet at http://127.0.0.1:8090/solr/select To test this, do the following: - switch to webportal mode - switch on the feature as described - do a crawl. this fills the solr index. The normal YaCy search will NOT work now! - do a solr query, like: http://127.0.0.1:8090/solr/select?q=*:* http://127.0.0.1:8090/solr/select?q=text_t:Help play with different search fields as you can see in /IndexFederated_p.html You can use the standard solr query attributes as described in http://wiki.apache.org/solr/SearchHandler	13 years ago
Michael Peter Christen	f0a079ac9f	allow larger log entries	13 years ago
Michael Peter Christen	9b48c9fe2e	removed a crawler overhead (terminated loop which searches greatest stack that has zero-waiting urls). This should cause a slightly faster crawl for crawl stacks with many different domains in the crawl queue.	13 years ago
Michael Peter Christen	784a4abb18	enhancement in internal data organization which should generate less synchronizations in database access	13 years ago
Michael Peter Christen	f78ce93a80	collection of speed and memory saving hacks	13 years ago
orbiter	c00a3cf74d	less usage of generic logger to avoid logger generation overhead	13 years ago
orbiter	a196f24f60	prevent enqueueing of non-loggeable logging entries	13 years ago
orbiter	482afed07c	reduced logging overhead (a bit)	13 years ago
orbiter	e76159040b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	13 years ago
orbiter	bbfa497a3c	replaced more size() > 0 by !isEmpty()	13 years ago
Michael Peter Christen	58e7d1952f	reduction of logging to prevent too much IO caused by logging	13 years ago
Michael Peter Christen	83da68c4c1	fixed a memory leak inside the logger which appeared if the log was written faster than the logger is able to print this out to its out stream. A very large collection of unwritten log outputs had been seen during strong crawling. The new ArrayBlockingQueue is limited to prevent this case.	13 years ago
orbiter	0cbda0b2b8	- replaced all length() == 0 and size() == 0 with isEmpty() - replaced some length() > 0 and size() > 0 with !isEmpty() - cannot be done automatically - implemented some isEmpty() methods	13 years ago
orbiter	28b30231c3	fix for url matcher of multiple amp& in an url, see: http://forum.yacy-websuche.de/viewtopic.php?f=8&t=4439&p=26650#p26650	13 years ago
Roland 'Quix0r' Haeder	aef9dd0350	- removed cleaning of blacklist cache on startup - added cleaning of blacklist cache if cache is modified in interface - extended cache saving to all cache types - moved cache location to DATA/LISTS - fixed static file path which was relative to the application path but should be relative to data path - which is different in debian and mac implementations	13 years ago
orbiter	c7afa8bc48	using SwitchboardConstants for solr attributes	13 years ago
orbiter	c6d8950651	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	13 years ago
orbiter	5f3b8dc040	fix for RSS reader	13 years ago
orbiter	62202e2d71	refactoring of query attribute variable names for better consistency with (next) stored query words	13 years ago
Michael Peter Christen	1addbc792c	use less memory for md5 cache	13 years ago
Michael Peter Christen	f32de94723	more logging	13 years ago
Michael Peter Christen	d09d9f2364	filter old peers from bootstrap (now stronger: 60 minutes instead of 240).	13 years ago
Michael Peter Christen	434ee90c59	added classification for control file types which shall not be loaded but placed onto the noload-queue	13 years ago
Michael Peter Christen	a90bcb48f6	added webm	13 years ago
Michael Peter Christen	801972fe6f	fix for url camel case parser and sentence reader	13 years ago
Michael Peter Christen	fbc1a2030d	fix for sitemap importer: can now also import very large sitemaps within small memory configurations	13 years ago
Michael Peter Christen	92731e5287	fix for sevenzip parser	13 years ago
Michael Peter Christen	45641b0c23	catch and log a warning in RasterPlotter	13 years ago
Michael Peter Christen	8efc1c1078	- fixed a memory leak (or bad usage) during parsing/snippet fetch - more logging for errors	13 years ago
Michael Peter Christen	c3db015410	prevent loading of content from the cache when retrieval with IFFRESH is used and cache is stale. Should speed up snippet generation when cache strategy is IFFRESH.	13 years ago
Michael Peter Christen	b1e7c11fba	fix for pattern matcher in html parser	13 years ago
Michael Peter Christen	8a6edc0031	fix for solr shutdown	13 years ago
Michael Peter Christen	b8bcc06283	fix for urls beginning with "//"	13 years ago
Michael Peter Christen	b0c408788b	made class methods static where possible	13 years ago
Michael Peter Christen	5bd3c90907	- removed unnecessary semicolons - added default case for switch	13 years ago
Michael Peter Christen	132afaf687	removed unaccessible code	13 years ago
Michael Peter Christen	7c1ba99755	removed more unused method parameters	13 years ago
Michael Peter Christen	83701a1b4c	removed unused ImageReference package	13 years ago
Michael Peter Christen	0301aba1e9	removed unused method parameters	13 years ago
Michael Peter Christen	241dd8410a	removed snippet pattern filter - it was not used	13 years ago
Michael Peter Christen	d3964253ae	- added @SuppressWarnings to unused servlet method parameters - removed unnecessary casts - removed unnecessary throw statements	13 years ago
Michael Peter Christen	ea10766bfd	cleaned unnecessary nested code	13 years ago
Michael Peter Christen	1481037820	replaced non-generic array with collection	13 years ago
orbiter	fc0f9543fe	More SentenceReader cleanup	13 years ago
orbiter	586bb0eb6a	Simplified SentenceReader (no more Reader inside..)	13 years ago
orbiter	7f851d62a7	replaced HashARC with SizeLimited Objects which are less costly	13 years ago
orbiter	d4291ac1f3	more tolerance when creating solr document	13 years ago
orbiter	78fc3cf8f8	refactoring and new usage of SentenceReader: this class appeared as one of the major CPU users during snippet verification. The class was not efficient for two reasons: - it used a too complex input stream; generated from sources and UTF8 byte-conversions. The BufferedReader applied a strong overhead. - to feed data into the SentenceReader, multiple toString/getBytes had been applied until a buffered Reader from an input stream was possible. These superfluous conversions had been removed. - the best source for the Sentence Reader is a String. Therefore the production of Strings had been forced inside the Document class.	13 years ago
orbiter	bb8dcb4911	automatically adopt size of word cache to available memory	13 years ago
Michael Peter Christen	ad09b786bf	clean up parser data	13 years ago
Michael Peter Christen	276a66a793	Adding a limit of 1000 links that a parser shall store during indexing. A limit was necessary because some web pages have such huge numbers of links that it can easily cause an OOM just by the number of links. The question if the number of 1000 links is sufficient or too weak must be answered with the result of testing this feature.	13 years ago
Michael Peter Christen	613b45f604	- better data structures in secondary search - fixed a big memory leak in secondary search	13 years ago
Michael Peter Christen	de903a53a0	parser refactoring & hacks	13 years ago
Michael Peter Christen	8a82609360	- smaller caches to save memory - close cloneable iterators to free memory	13 years ago
Michael Peter Christen	7249d9c9de	bugfix for concurrent seed loader	13 years ago
Michael Peter Christen	c72d3b12cd	concurrently initialize the seed list during p2p network bootstrap	13 years ago
Michael Peter Christen	1825f165b8	better integration of blacklist according to use case	13 years ago
Michael Peter Christen	c18fa9fa75	Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1	13 years ago
Michael Peter Christen	ce8d4b87d9	fixes for new eclipse 'Juno' warning 'Resource leak'.	13 years ago
Michael Peter Christen	0c345d1559	giving threads names so it's easier to see what's happening during debugging and within a thread dump	13 years ago
reger	067728bccc	add search result heuristic. adding a crawl job with depth-1 for every displayed search result (crawling every external linked page of displayed search result pages)	13 years ago
Michael Peter Christen	03280fb161	removed segments-concept and the Segments class: the segments had been there to create a tenant-infrastructure but were never used since that was all much too complex. There will be a replacement using a solr navigation using a segment field in the search index.	13 years ago
Michael Peter Christen	508a81b86c	added solr field 'refresh_s' which stores the refresh url contained in the meta-refresh html header field.	13 years ago
Michael Peter Christen	f3167def64	do not fill the keywords with title content if keywords do not exist.	13 years ago
Michael Peter Christen	9116013c64	- allow lazy initialization of solr value (if using 'lazy', then no 0-values and no empty strings are written). This may save a lot of memory (in ram and on disc) if excessive 0-values or empty strings appear - do not allow default boolean values for checkboxes because that does not make sense: browsers may omit the checkbox attribute name if the box is not checked. A default value 'true' would not comply with the semantics of the browser's response. - add a checkbox in IndexFederated_p for the lazy initialization of solr fields.	13 years ago
sixcooler	97f60010d8	fix crawl start from file	13 years ago
Michael Peter Christen	0294a53459	- add canonical field only if requested by solr schema - remove canonical url from in/outbound urls if present	13 years ago


5771 Commits (0ad52ac4c34254c06efa0bc921d4c1f8b3eba592)