yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr field in the core 'webgraph'. The default schema uses only some of them and the resting search index has now the following properties: - webgraph size will have about 40 times as much entries as default index - the complete index size will increase and may be about the double size of current amount As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index will cause that some old parts in YaCy can be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also avaiable for p2p operation but client access is not yet implemented.	12 years ago
Michael Peter Christen	91a0401d59	introduced a second core named 'webgraph'. This core will hold the link structure, but is not filled yet. To have the opportunity of a second core, multi-core functionality had to be implemented to the deep-embedded solr: - migrated the solr_40 directory content to a subdirectory 'collection1'; the previously used default core is now called collection1 - added solr_40/webgraph subdirectory as second core - added a servlet configuration for the second core 'webgraph' in /IndexSchema_p.html - added instance handling as addition to solr connections: all solr connectors are now instances of an solr 'instance' object; this required a complete re-design of the solr embedding - migrated also caching and sharding ontop of new instance handling - migrated the search apis to handle now the access to a specific core, the default core named 'collection1' - migrated the remote solr search interface to access shards of cores; for the yacy remote search the default core is now called 'solr'; using the peer address as solr address - migrated the solr backup and restore process: old backups cannot be used after this migration! - redesign of solr instance handling in all methods which access the instances: they cannot hold copies of these instances any more; the must retrieve the actuall connection object every time they want to write to it (this solves also some bugs when switching the index/network) - added another schema 'solr.webgraph.schema', the old solr.keys.list is replaced by solr.collection.schema	12 years ago
Michael Peter Christen	4111606654	removed the commitWithin attribute because that is not the way how the index is updated the right way for us. May also be be superfluous with the solr 4.0 softcommit.	12 years ago
Michael Peter Christen	d70d99fab5	added more metadata fields and facets to OpensearchResponseWriter. This should make it possible to replace the original and enriched yacy opensearch result with a solr output in opensearch format.	12 years ago
Michael Peter Christen	8651ec35fe	turned author_s into the multi-valued field author_sxt	12 years ago
Michael Peter Christen	4735bd47f4	- changed solr commit call and added an optimize option. Since Solr 4.0.0 there is a new softcommit feature which implements a near-real-time (NRT) search option. The softcommit does not do IO and does not cause performance issues. YaCy has now an extension in its solr connectors to use the softcommit feature. The softcommit call now replaces all places where a hard commit was used. Furthermore the commit strategy in when doing a search from the web interface was changed (it's done every time before a search is done). The softcommit feature was implemented because it was needed for the following changes (customer demands), which is also included in this git commit: - added a feature to identify all documents which have unique titles and/or unique descriptions. These unique flags are disabled by default. - added also a feature to set a flag when the url from a canonical tag is equal to the document url. This is also disabled by default. To support the new softcommit strategy, the commitWithinMs option was set to -1 do disable automatic commit based on document insert times. If documents are inserted permanently then also a commit would happen permanently whenever the commitWithinMs time is reached. This would conflict with the regular autocommit of 10 minutes and the new softcommit strategy.	12 years ago
Michael Peter Christen	db024a4e19	added new solr fields (unused yet; implementation will follow)	12 years ago
Michael Peter Christen	9b5bdae1b4	Reverted setting of MMapDirectoryFactory from solrconfig; see http://forum.yacy-websuche.de/viewtopic.php?p=27509#p27509 Instead, in the start script is checked if the host is a 64 host and -Dsolr.directoryFactory=solr.MMapDirectoryFactory is set as java option Reverted the ramBufferSizeMB setting (this was not enabled anyway) because that may be too much memory for small peers and embedded systems. Activated the mergeFactor 4; this was commented out by mistake	12 years ago
orbiter	eb68a30947	solr performance settings the target of these performance settings is the reduction of IO in general and during search in particual. - reduced mergeFactor to 4. This will increase the IO during indexing, but will reduce IO during search. It will also greatly reduce the number of open files which should make it possible to have overall larger indexes until the number of open files in an OS is reached. - increased ramBufferSizeMB to 256mb. This will reduce the number of commits. This change may compensate the reduction of the mergeFactor. - disabled updateLog. This is a real-time search feature which is available in YaCy anyway because a commit is forced if index.html is called. The updateLog feature causes a lot of IO during indexing and search and produced a lot of files in SEGMENTS/solr_40/data/tlog	12 years ago
Michael Peter Christen	f53703df62	using MMapDirectoryFactory as solution for ClosedChannelException given in https://issues.apache.org/jira/browse/SOLR-2247	12 years ago
Michael Peter Christen	22c694f906	activated the clickdepth_i attribute for solr again because the calculcation of that value is not as extensive as expected and furthermore the value is very useful for ranking	12 years ago
Michael Peter Christen	5a0eb1b268	clickpath should not be active by default because it needs extensive computation - partly to be implemented	12 years ago
Michael Peter Christen	5c0c56cfe1	Preparations to produce a click depth attribute in the search index. This attribute can be used for ranking and for other purpose (demand by customer) The click depth is computed in two steps: - during indexing the current fill-state of the reverse link index is used to backtrack the current page to the root page. The length of that backtrack is the clickdepth. But this does not discover the shortest click depth. To get this, a second process to check again is needed - added a process tag that can be used to do operations on the existing index after a crawl; i.e. calculation the shortest clickpath. Added a field to control this operation but not a method to operate on this. - added a visualization of the clickpath length in the host browser	12 years ago
Michael Peter Christen	295884fd54	- Merge commit '168b1d130d9d67b5e8855a0b50c4ba7ad4a416f8' - fixed conflict in htroot/yacysearch.java - removed nedres check because that causes that the remote server is not called at all in most cases (local index has already results but we want more) - fixed a regex bug (a '=' too much)	12 years ago
reger	168b1d130d	Adding heuristic to get search results from configured systems which support opensearch specification - any system supporting opensearch specification can be configured - search query is only forwarded to remote system if not enough results available on local peer - discover function provided, checking the local Solr index for links to opensearchdescription files, to add to the config - sample config file with some general search engines with opensearch support	12 years ago
reger	7761b60325	fix: Broken Link on Crawler_p.html - issue 218 http://bugs.yacy.net/view.php?id=218 - reduced Solr logging (/select)	12 years ago
reger	e9e0d63897	Add config option to show HostBrowser link in search result - ConfigPortal: added checkbox Host Browser - yacy.init: added search.result.show.hostbrowser as default = on (true) - fix HostBrowser: broken link to protected WebStructurePicture for public user	12 years ago
Michael Peter Christen	98819ec3d9	use solr boost configuration to select search fields. At this time it is possible to enter a negative boost value to switch that value off. This might be different in the future with a better input interface.	12 years ago
Michael Peter Christen	01200f06cc	using the author field as solr-native facet. this makes it necessary to introduce a copy-field for the author field to be copied to a string field. This field is then used to generate facets. Without this field, the facet would consist only of the words of the author names, not of the full author string.	12 years ago
Michael Peter Christen	eac9650b31	added another solr field clickdepth_i which reflects the number of clicks which are necessary to get from the portal of a host to a specific document. At this time, only the start document is flagged with clickdepth '0', all other with '-1'. To get the actual clickdepth, a process must use crawled information to collect the actual number of clicks. This will be added in another/next step.	12 years ago
Michael Peter Christen	1052263af3	- added a new solr field references_i which stores the number of INCOMING links to the corresponding web page. This information is taken from the reverse link index (a 'little sister' of the RWI index). - this field can be of use to enhance the ranking because a web page with more incoming links can be more more important than others. But this is not true for typical link pages like menues. Therefore the number of outgoing links is needed. - added a new solr attribute 'bf' to solr queries which is a boost function extension. this field can contain a formula which comuptes the boost according to given field values. After some experiments the following forumla is now default: div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4 This takes the number of references and the inbound links. Further experiments are needed to enhance that forumula.	12 years ago
Michael Peter Christen	72f165d58b	added a Boost class which stores solr query boost values. The class can be configured using the yacy.init file. The boost information is taken from the configuration each time when a query to solr is done.	12 years ago
Michael Peter Christen	ea033f8f8e	added number of characters in url to default index to be able to use this field for ranking	12 years ago
Michael Peter Christen	efd2c4622d	added a new fail type attribute for the index to distinguish two separate fail types: network fail and forced exclusion (i.e. by robots or forwarding rules).	12 years ago
Michael Peter Christen	d6b82840f8	added a feature to find similarities in documents. This uses an enhanced version of the Nutch/Solr TextProfileSignatue. As a result, a signature of the document is written to the solr search index. Additionally for each time when a signature is written, it is checked if the singature exists already in the index. If the signature does not exist, the document is marked as unique. The unique attribute can now be used to sort document lists and bring duplicates to the end of a result list. To enable this, a large portion of the search api to Solr had to be changed. This affected mainly caching of 'exists' searches to enhance the check for existing signatures and do this without actually doing a solr query. Because here the first time a long number is used as value in the Solr store, also the value naming in the YaCySchema had to be adopted and normalized. This caused that many files had to be changed.	12 years ago
reger	328ce0b297	fix: remove fixed individual testing IP (85.25.151.30 = server4you.de) from default/yacy.network.freeworld.unit	12 years ago
Michael Peter Christen	e2c4c3c7d3	migration to solr 4.0.0	12 years ago
sixcooler	2d972f289a	rise commitWithinMs to default-value from SwitchBoard (result in lower hd-io) no dots in memory-graph (there are to much of them)	12 years ago
Michael Peter Christen	1baf498d59	- show more lines in online log - reverse order is default now	12 years ago
sixcooler	206e7bcf94	whitelist yacyportalsearch aka search.yacy.net	12 years ago
Michael Peter Christen	43f3345c90	- removed dependencies from URIMetadataRow and made direct access to URIMetadataNode which creates the opportunity to access Solr objects directly and use their information richness - lazy initialization of the URIMetadataNode object - should cause less computation and memory usage during search. - removed dead code	12 years ago
Michael Peter Christen	7e3e45fd04	added Open Graph Metadata default fields, see http://ogp.me/ns#	12 years ago
Michael Peter Christen	c3e5f667a7	added schema.org breadcrumb counter to parser and solr schema	12 years ago
Michael Peter Christen	42e525ca9a	enhanced the host browser	12 years ago
sof	5cb244b79b	Merge remote branch 'origin/master'	12 years ago
apfelmaennchen	88b062210c	Added a parser for audio file tags (e.g. ID3 tags for MP3 files) based on the jaudiotagger library. The parser is disabled by default as it needs to store temporary files for non file:// protocols, which might be disliked. For your local MP3-collection it loads nicely Artist, Title, Album etc. from the audio files meta data.	12 years ago
Michael Peter Christen	3d33a5bdf6	turned the synonyms_t Text field into a multi-valued String field synonyms_sxt	12 years ago
Michael Peter Christen	3b959ee002	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
orbiter	3190347814	added a synonyms_t field to solr and a process to read synonym files. This can be used to add another stemming to solr using stemming files that are expressed as synonyms for grammatical alternatives. The synonym/stemming files must have the following form: - each line is a comma-separated list of synonyms - the list of synonyms may be enclosed with {} (like the GSA synonyms file) - the file may contain comments which are lines starting with a '#' The synonym file(s) must be placed in DATA/DICTIONARIES/synonyms/ and are activated by default whenever a synonym file is in place. Then, for each word that is found in a document all synonyms are added to a long text field which is stored into synonyms_t. Processes using the synonyms must query with that field as optional matcher.	12 years ago
Michael Peter Christen	411d0e839b	added an underline text field to solr to record all underlined texts	12 years ago
Michael Peter Christen	f45f7fc12e	added new Host Browser to main menu: this new search interface is something completely new for search, but completely common on desktops: browser a web space like one would browse a file system in a file browser. The file listing is created using the search index and a faceted restriction to specific domains.	12 years ago
Michael Peter Christen	80edd8ecd7	some more after-refactoring fixes	12 years ago
Michael Peter Christen	562183932b	- removed ip_s from default profile since that needs a DNS lookup to create an document entry. This makes remote search much slower. - removed synchronization of add method if ip_s is activated to prevent that a user configuration causes bad behavior. The disadvantage of that is, that a index dump can cause data loss if an indexing is running during index dump - catched more exceptions and more NPE - better abstraction in MirrorSolrConnector - slight performance enhancement when only the index count is requested (rows=0 is sufficient to get a total count)	12 years ago
Michael Peter Christen	0504b01bdc	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
orbiter	9413f77b65	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
orbiter	a55e77a115	added twitter search heuristic	12 years ago
Michael Peter Christen	62add1d564	added the protocol and the file name extension to the solr fields since these fields are probably facets in file search	12 years ago
Michael Peter Christen	9db032664e	activate two solr fields which will be used by administration interface (later)	12 years ago
Michael Peter Christen	10b911eed4	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	be67c70a47	added Solr fields: inboundlinks_text_chars_val inboundlinks_text_words_val inboundlinks_alttag_txt outboundlinks_text_chars_val outboundlinks_text_words_val outboundlinks_alttag_txt	12 years ago
orbiter	d73fff0e0e	added solr field images_withalt_i	12 years ago
Michael Peter Christen	ee23fc7a32	added h1..h6 counter fields	12 years ago
Michael Peter Christen	b2b516cc3e	added a collection attribute to crawls and searches: - a solr field collection_sxt can be used to store a set of crawl tags - when this field is activated, a crawl tag can be assigned when crawls are started - the content of the collection field can be comma-separated, all of them are assigned to the documents when they are indexed as result of such a crawl start - a search result can be drilled down to a specific collection; this is currently only available in the solr interface and also in the gsa interface using the 'site' option - this adds a mandatory field for gsa queries (the google api demands that field all the time)	12 years ago
Michael Peter Christen	528d6763fa	- added new solr fields: title_count_i, title_chars_val, title_words_val description_count_i, description_chars_val, description_words_val - added many asserts to ensure data type correctness from YaCy to Solr and vice versa - made many fixes according to new findings from these asserts (!)	12 years ago
Michael Peter Christen	2ddc33646a	added new field for solr: url_paths_sxt url_parameter_i url_parameter_key_sxt url_parameter_value_sxt url_chars_i	12 years ago
Michael Peter Christen	75d5e3475d	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
cominch	dc468dad01	add content control features for custom filter lists	12 years ago
Michael Peter Christen	316b5fe116	- added a solr type definition verifier - fixed type definition found by the verifier - added multivalue-string fields for solr with extension 'sxt' - added multivalue-integer fields for solr with extension 'val' - renamed some solr attributes from txt to sxt - changed solr query line to an explicit AND/OR structure - added a country code second level domain list to Domains class; with parser - added a host string parser to get domain class name, country-code second-level domain and subdomain out of it - removed old coordinate attributes	12 years ago
Michael Peter Christen	4c79ddb91e	switched off some solr logging	12 years ago
Michael Peter Christen	e8acd542b5	- added faceted drill-down for host and geolocation to solr queries - added a new geolocation field to index schema, the old values are migrated if possible	12 years ago
Michael Peter Christen	af764c106c	re-activated audio and video search because they obviously work (!)	12 years ago
orbiter	716ea0cfe2	sorted the solr schema into mandatory and optional fields; reduced number of used field to reduce solr index size	12 years ago
orbiter	db6863db77	reduced solr cache sizes to check if that solves memory problems a bit	12 years ago
Michael Peter Christen	23226676c6	FOR THE BRAVE.. this is a forced migration to solr which is now ready for production as a replacement of the metadata-db. This intermediate release 1.041 will switch on the previously optional solr index and the old metadata-db will still work as it did before. Solr+metadata are accessed in mixed mode, no migration is done yet. If this causes not a catastrophe until the end of the weekend, we will do a YaCy 1.1 main release containing this as default.	12 years ago
Michael Peter Christen	a1b2c9a67d	doctype2mime fix, influences metadata conversion between old metadata and solr	12 years ago
Michael Peter Christen	703f427303	fixed some peer-ping connection details - larger time-out - removed too old seedlist - fixed a bug in connection test	12 years ago
Michael Peter Christen	ea49a8aa8c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	aab0b680c3	- added xslt support for solr result formats. try i.e. http://localhost:8090/solr/select?q=:&start=0&rows=10&wt=xslt&tr=json.xsl - added servlet-side mime-type configuration for streamed servlets. this is used for the result formatters in solr result formats	12 years ago
cominch	e2119f4e76	augmented browsing: replace htmlparser by jsoup, which is more stable and reliable	12 years ago
Michael Peter Christen	b51df6c7e8	- added coordinate storage in solr schema - fixed shutdown process - fixed some solr-to-metadata reading - added a large number of metadata attributes in ViewFile.html	12 years ago
Michael Peter Christen	f9c0e6e950	- Implemented and integrated the URIMetadataNode object which is a metadata representation from the solr index. This shall replace metadata from the built-in database in the future. - added the Solr-driven metadata into the search index of YaCy which makes it now possible to run YaCy without the old metadata index. This is a major stept forward to a full migration to Solr.	12 years ago
Michael Peter Christen	bca4a16603	replaced the multivalue generic string field name suffix _ss by _txt because _ss is not part of the standard solr example schema.	12 years ago
orbiter	67edfd991c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
orbiter	d9173ba7ed	added more solr fields to integrate values from URIMetadataRow. All writings to the Metadata-DB are now also done to solr. This includes metadata transfer during search and rwi transfer. The new/added solr fields are: ## time when resource was loaded load_date_dt ## date until resource shall be considered as fresh fresh_date_dt ## id of the host, a 6-byte hash that is part of the document id host_id_s ## ids of referrer to this document referrer_id_ss ## the md5 of the raw source md5_s ## the name of the publisher of the document publisher_t ## the language used in the document; starts with primary language language_ss ## an external ranking value ranking_i ## the size of the raw source size_i ## number of links to audio resources audiolinkscount_i ## number of links to video resources videolinkscount_i ## number of links to application resources applinkscount_i	12 years ago
Michael Peter Christen	3ce04cecf3	bad hack to prevent a bug appearing in solr	12 years ago
Michael Peter Christen	826967513b	changed options in IndexFederated_p to switch on/off parts of the index individually. The settings are experimental and the values of the settings will be overwritten when an index migration from urldb to solr starts.	12 years ago
Michael Peter Christen	1517a3b7b9	added webm mime-type	13 years ago
Michael Peter Christen	0301aba1e9	removed unused method parameters	13 years ago
Michael Peter Christen	4de50fe808	adding more principal peers for bootstraping	13 years ago
reger	067728bccc	add search result heuristic. adding a crawl job with depth-1 for every displayed search result (crawling every external linked page of displayed search result pages)	13 years ago
Michael Peter Christen	508a81b86c	added solr field 'refresh_s' which stores the refresh url contained in the meta-refresh html header field.	13 years ago
Michael Peter Christen	9116013c64	- allow lazy initialization of solr value (if using 'lazy', then no 0-values and no empty strings are written). This may save a lot of memory (in ram and on disc) if excessive 0-values or empty strings appear) - do not allow default boolean values for checkboxes because that does not make sense: browsers may omit the checkbox attribute name if the box is not checked. A default value 'true' would not comply with the semantic of the browsers response. - add a checkbox in IndexFederated_p for the lazy initialization of solr fields.	13 years ago
Michael Peter Christen	c03d306afa	shorter autocommit time (now: 1 second) to prevent that user cannot see results in solr the first time they try it out. The value can now be easily set to a higher number using the IndexFederated_p interface.	13 years ago
Michael Peter Christen	3fd4a01286	added option to record urls that are forwarded to the solr index	13 years ago
Michael Peter Christen	8dd469b9dd	added option to configure the autocommit delay time of solr on-the-fly	13 years ago
Michael Peter Christen	b9dfca4b0a	- fixed IndexFederated Servlet / a embedded Solr can now be selected - added code stub for an embedded Solr but generation of Solr store is still commented out (it works but is not yet ready for usage)	13 years ago
Michael Peter Christen	1be0025a9c	- added test for EmbeddedSolrConnector - added needed libraries for this test this includes most (all) files needed for an embedded solr	13 years ago
Michael Peter Christen	dbdd697f4d	moved RDFaParser.xsl configuration file to defaults	13 years ago
Michael Peter Christen	8738336408	set Xms lower than Xmx	13 years ago
Michael Peter Christen	96f6a5869f	more robust OAI-PMH client (large time-out, three re-tries). OAI-PMH server appeart to be very slow sometimes	13 years ago
Michael Peter Christen	6d17686258	made triplestore persistent by default added a size display in triplestore servlet	13 years ago
cominch	3c255c025b	Show tags in search results (if activated in ConfigPortal_p.html)	13 years ago
Michael Peter Christen	a5cdfb91de	- fixed Cache link (below snippet) - added 'Augmented Proxy' link below snippet - added configuration options for augmented proxy	13 years ago
Roland 'Quix0r' Haeder	af5a597e47	Scroogle is not comming back, remove dead code Conflicts: source/net/yacy/search/Switchboard.java	13 years ago
cominch	90512640bf	Added config switches for custom parser Conflicts: source/net/yacy/document/TextParser.java	13 years ago
cominch	5d20cd324a	Add Triplestore and RDF query interface Conflicts: build.xml defaults/yacy.init source/net/yacy/interaction/AugmentHtmlStream.java	13 years ago
cominch	a32943b382	add json mimetype	13 years ago
Michael Peter Christen	41c02cb10e	- less restrictions for usage of Table RAM copy - new limit to use the table copy (instead of flag): 400MB available. If less is available, then a copy is never used. If more is available, then it can be used if there is a remaining space of at least 200MB - flush caches more often: flush the Digest cache	13 years ago
Michael Peter Christen	8002fd2578	use less cache space since a large cache would cause more memory usage in index files.	13 years ago
Michael Peter Christen	5aee19daa4	added show from cache in search results (not yet finished)	13 years ago

1 2 3 4 5 ...

392 Commits (549338957612d869516e33985d157eb890ab42e8)