yacy_search_server

Commit Graph

Author	SHA1	Message	Date
orbiter	f106345eef	link strings should not be tokenized	12 years ago
Michael Peter Christen	a88a62f7aa	added a feature to set a collection for a crawl result based on a regular expression on the url: the collection attribute for a crawl start may be now either a token or a list of tokens, separated by ',' where a token is either a string or a pair <string,pattern> where the string is separated from the pattern with a ':' and the string is assigned to the document as collection only if the pattern matches with the url.	12 years ago
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying to robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on per-crawl-start basis. Every crawl start can have a different user agent.	12 years ago
Michael Peter Christen	58fe986cca	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	cf12835f20	replaced the single-text description solr field with a multi-value description_txt text field	12 years ago
orbiter	d05e0c5368	wait a bit longer before doing the first peer ping	12 years ago
orbiter	b8f57f7703	don't be noisy when doing background tasks that may be allowed to fail	12 years ago
Roland Haeder	7263bb82fb	Fix for NPE on shutdown: java.lang.NullPointerException at net.yacy.search.Switchboard.storeDocumentIndex(Switchboard.java:2732) at net.yacy.search.Switchboard.access00(Switchboard.java:207) at net.yacy.search.Switchboard.run(Switchboard.java:3049)	12 years ago
Michael Peter Christen	61e015268b	fix in forced deletion: forced commit needed	12 years ago
orbiter	3e901dcb06	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
orbiter	f50b596e0b	do not run dht distribution if system load is over 2.5	12 years ago
orbiter	6fb2811e68	fixes for problems with remote solr and non-activated webgraph index	12 years ago
sixcooler	af740f3058	changed optimization to a segment-size of index-size/5.000.000 + one if not idle + one (and force) if postprocessing	12 years ago
orbiter	5364c4dcc9	delayed first peer-ping to send the first ping out after the http got up; if the ping comes before the http is up, it cannot be recognized as senior peer (if at all). See also: http://bugs.yacy.net/view.php?id=266	12 years ago
orbiter	e24016e30a	added the property federated.service.solr.indexing.timeout to yacy.init to provide a configurable time-out for solr; see also: http://bugs.yacy.net/view.php?id=254	12 years ago
orbiter	c124037f19	removed forced non-soft commits to prevent index fragmentation	12 years ago
Michael Peter Christen	c15aa758dc	removed failreason_t removal patch because that causes too much confusion using an external solr. to clean up the index after a schema change, use the index cleaner function from the online servlet	12 years ago
Roland Haeder	841a28ae76	Added 'final' for all exception blocks as this helps the Java compiler to optimize memory usage Conflicts: source/net/yacy/search/Switchboard.java	12 years ago
Michael Peter Christen	89c0aa0e74	added collection_sxt to error documents	12 years ago
orbiter	d0dc86cf3d	logging of deadlocks (if any) during cleanup process	12 years ago
Michael Peter Christen	c6a6f159e8	fix for crawl stack domain counter	12 years ago
Michael Peter Christen	93d1bac140	do a more frequent optimization, reduces IO after optimization	12 years ago
Michael Peter Christen	b79471ee67	grr	12 years ago
Michael Peter Christen	a79f288ac1	automatically running optimize on solr if user/search is idle for some time	12 years ago
orbiter	a9c8046c87	do a light optimization at the end of a crawl postprocessing	12 years ago
Michael Peter Christen	bcc623a843	refactoring of load_delay: this is a matter of client identification	12 years ago
orbiter	0d0b3a30f5	activate api actions after postprocessing of crawls	12 years ago
orbiter	2be456e7fb	added a postprocessing field into api/status_p.xml to show if the postprocessing task is running at that time (status: busy) or not (status:idle)	12 years ago
Michael Peter Christen	5878c1d599	- refactoring of log to ConcurrentLog: jdk-based loggers tend to block at java.util.logging.Logger.log(Logger.java:476) in concurrent environments. This makes logging a main performance issue. To overcome this problem, this is an add-on to jdk logging to put log entries on a concurrent message queue and log the messages one by one using a separate process. - FTPClient uses the concurrent logging instead of the log4j logger	12 years ago
Michael Peter Christen	57ffdfad4c	added a crawl option to obey html-meta-robots-noindex. This is on by default.	12 years ago
Michael Peter Christen	f1c5338210	preparation for greedy crawl profiles and refactoring	12 years ago
Michael Peter Christen	f9d859f5dc	now writing image alt texts and (camelcase-)parsed urls into a text search field for a better image retrieval	12 years ago
Michael Peter Christen	bdf306e0a7	increased time-out for loading of seed-lists	12 years ago
Michael Peter Christen	1762911f57	added synchronizations and timeouts in solr api; missing synchronizations in index modification methods causes deadlocks inside solr.	12 years ago
Michael Peter Christen	6115bef335	added a 'greedy learning' mechanism which will cause that a 'fresh' yacy will load linked web pages from search results until the total number of web pages reaches 15000. This shall give fresh peers a 'boost' to get a personalized search index faster.	12 years ago
Michael Peter Christen	8e965ffd16	fix for host compare in case that the host is null. This happens when doing a search in the intranet for file resources (they don't have a host).	12 years ago
reger	7480e87386	- fix stopword handling for RWI see example http://bugs.yacy.net/view.php?id=247 - append language setting specific stopword list - remove unused OVERHANG stack type	12 years ago
Michael Peter Christen	5f92c68f1f	removed block rank ranking and all YBR files in /ranking	12 years ago
Michael Peter Christen	0c1a018bbd	removed 'later' tactic because it used too much RAM, reduced number of soft commits, reduced caching size of search events, ensured that solr results are processed before connection is closed to keep that stuff not too long in RAM	12 years ago
Michael Peter Christen	5344a1c5f7	getting the trash out	12 years ago
Michael Peter Christen	8f2d3ce2f9	reduced locking situation in crawler: shifted synchronized location and reduced time-out of robots.txt load limit	12 years ago
Michael Peter Christen	8dbc80da70	redesign of index.exist-test: this shall now not be done using a single id to be tested, but with a collection of ids. This will cause only a single call to solr instead of many. The result is a much better performance when testing the existence of many urls. The effect should cause very much less IO during index transmission, both on sender and receiver side.	12 years ago
Michael Peter Christen	44e363f37f	refactoring of WorkflowProcessor, added process counter, update of process counter if a blocking thread dies. Added also a new column in PerformanceConcurrency_p servlet to show the actual number of concurrent processes.	12 years ago
orbiter	cf36c1614f	prevent that concurrent deletion process causes wrong double-check in crawl start	12 years ago
Michael Peter Christen	b9b446bca6	- added ssl configuration sign (a lock) to network statistic/table - fixed a bug in bitfield	12 years ago
reger	4fc6837690	- fix monitor url of crawl job in PerformanceQueues_p.html - reduce logging of every index add (switch embeddedsolr.add from info to debug)	12 years ago
orbiter	a1c989002b	fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4652 generate dht data even if dht receive and dht transmission is switched off	12 years ago
Michael Peter Christen	e26bdd4a52	fixes to deletion methods (removed unnecessary concurrency and added removal of crawl queue entries)	12 years ago
Michael Peter Christen	f7f3e28c5e	prevent that the size of the index is computed too many times. Because the index size is now provided by solr, and the only way to do that is a match for [* TO *], a size computation is quite complex and time-consuming. Therefore this patch prevents that the method is called at all and if necessary puts a DOS-preventing barrier in front of it.	12 years ago
Michael Peter Christen	cca19d94d4	re-declared some fields to be of type string rather than text which makes them more efficient and less large	12 years ago
reger	46fa800bc7	added httpstatus_i to automatically switched on fields (used in all search queries)	12 years ago
Michael Peter Christen	3a0fcfbeda	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	25499eead5	- added a new field for the regular expression in crawl start - added the field in crawl profile - adapted logging and error management - adapted duplicate document detection - added a new rule to the indexing process to reject non-matching content - full redesign of the expert crawl start servlet The new filter field can now be seen in /CrawlStartExpert_p.html at Section "Document Filter", subsection item "Filter on Content of Document"	12 years ago
orbiter	e1bfe9d07a	- reduction of the concurrently running processes to make YaCy more adjusted to smaller and 1-core devices. - the workflow processor now starts no process at all. these are started as soon as parser/condenser/indexing queues are filled. - better abstraction	12 years ago
Michael Peter Christen	c091000165	added collection attribute also to the rss feed reader	12 years ago
Michael Peter Christen	97775fbebc	fixed ranking for add-function queries: this did not work. The option was removed. All function queries are now boosts (multiplies the score according to a function). This is also the recommended way to boost rankings based on functions as explained in http://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/	12 years ago
Michael Peter Christen	a20941c067	resume paused crawls on startup; user expects that restarts 'heal' everything	12 years ago
orbiter	e4d26d1cb4	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
orbiter	940c6849ee	enhanced did-you-mean (a bit): can now remember previously searched words (plus small enhancements)	12 years ago
reger	d57b221921	add: reset Solr schema field selection to default button in IndexSchema_p	12 years ago
Michael Peter Christen	b8ed66a55d	added all clickdepth computations for source and target paths in webstructure core	12 years ago
Michael Peter Christen	6300730d7f	refactoring of clickdepth computation as preparation for clickdepth computation of webgraph links	12 years ago
orbiter	47114910d5	fix for possible memory leaks	12 years ago
Michael Peter Christen	addba047e2	changes in ranking computation - an existing ranking servlet for solr was extended. It is now possible to set boost values for fields, boost functions and boost queries. - The ranking can have different instances, but currently only the first one is used - added an abstraction layer for fields which can be used for search and those fields can be edited in the solr ranking configuration - the ranking value from solr within the field score is used to combine remote search requests, which all are created using the same locally defined boost values - reduced the number of fields which are used for search (makes it faster) - replaced some text fields by string fields (makes indexing faster) - removed classes which had no use - made a large number of experiments for a better ranking and created a temporary setting which prefers hits inside titles - adjusted also the RWI-based ranking computation to 'prefer title' - made special cases like for portal search where no post-processing and post-ranking is wanted: this keeps the original ranking order as done by Solr - fixed many bugs with old settings for ranking	12 years ago
orbiter	cd197bb555	fix for NPE if surrogates do not exist	12 years ago
Michael Peter Christen	3b1d9dc884	made index storage from DHT search result concurrently. This prevents blocking by high CPU usage during search. Also: removed query from Solr for DHT search results; results are taken from the pending queue.	12 years ago
orbiter	0f7ea7ad9f	- enhanced solr.add procedure for mass adds - removed unused solr access classes - made snippet generation for documents from YaCy RWI/DHT concurrent (as it was before the search process removation) - reduced the number of remote results in settings file because the processing of such mass documents add is too CPU-intensive (in Solr)	12 years ago
Michael Peter Christen	840fa22135	disabled clickdepth computation during crawling since that is repeated during clean-up phase.	12 years ago
Michael Peter Christen	d957739441	removed size request	12 years ago
Michael Peter Christen	c95a84103a	complete redesign of search process: - removed 'worker' processes - no internal time-out behaviour: methods either are successful or return null - waiting is only done on top-level - removed snippet-production; this is replaced by solr snippets - removed statistics based on solr size queries (they had been VERY long); the statistics (like suggestions or tag cloud) are now again based on the old but very fast RWI index. In portal or intranet mode the RWI index is usually switched off; if you like to have statistics again then you must switch on the rwis again in this mode. - fixed many bugs regarding correct page counter	12 years ago
Michael Peter Christen	089dee1770	- generalized SchemaConfiguration into super-class Configuration and adapted other classes which used the configuration-only access for that class - removed many warnings - adjusted logging	12 years ago
Michael Peter Christen	56d5946a59	- added flags in IndexFederated_p.html to switch on or off the webgraph index (new solr core webgraph) .. this is now off by default - completely redesigned this servlet - added description how to attach a remote solr - adjusted naming of servlet and menus - moved 'lazy initialization' attribute from IndexSchema to IndexFederated (this is a general option) back again.	12 years ago
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr fields in the core 'webgraph'. The default schema uses only some of them and the resulting search index has now the following properties: - webgraph size will have about 40 times as many entries as default index - the complete index size will increase and may be about the double size of current amount As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index will cause that some old parts in YaCy can be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also available for p2p operation but client access is not yet implemented.	12 years ago
Michael Peter Christen	91a0401d59	introduced a second core named 'webgraph'. This core will hold the link structure, but is not filled yet. To have the opportunity of a second core, multi-core functionality had to be implemented to the deep-embedded solr: - migrated the solr_40 directory content to a subdirectory 'collection1'; the previously used default core is now called collection1 - added solr_40/webgraph subdirectory as second core - added a servlet configuration for the second core 'webgraph' in /IndexSchema_p.html - added instance handling as addition to solr connections: all solr connectors are now instances of a solr 'instance' object; this required a complete re-design of the solr embedding - migrated also caching and sharding on top of new instance handling - migrated the search apis to handle now the access to a specific core, the default core named 'collection1' - migrated the remote solr search interface to access shards of cores; for the yacy remote search the default core is now called 'solr'; using the peer address as solr address - migrated the solr backup and restore process: old backups cannot be used after this migration! - redesign of solr instance handling in all methods which access the instances: they cannot hold copies of these instances any more; they must retrieve the actual connection object every time they want to write to it (this solves also some bugs when switching the index/network) - added another schema 'solr.webgraph.schema', the old solr.keys.list is replaced by solr.collection.schema	12 years ago
Michael Peter Christen	33bc255e85	prevent that crawl starts with very large url lists cause a time-out in the user front-end	12 years ago
Michael Peter Christen	b6de1f42dc	Full redesign of solr connection architecture. This was done to support multiple solr cores instead of just one. Therefore it is now necessary to distinguish between solr server connections (called an 'Instance') and a connection to a single solr core. One Instance may now have multiple connector classes assigned to it, each connecting to a single core. To support multiple cores it is also necessary to distinguish between the connection configuration and the configuration of the index schema. We will have multiple schema configurations in the future, each for every solr core. This caused that the IndexFederated servlet had to be split into two parts, the new Servlet for the Schema editor is now in the IndexSchema Servlet.	12 years ago
Michael Peter Christen	4111606654	removed the commitWithin attribute because that is not the way how the index is updated the right way for us. May also be superfluous with the solr 4.0 softcommit.	12 years ago
Michael Peter Christen	de58043205	Added image license generation for solr image search results when results are generated within yjson result writer. This makes it possible to view images in yacyinteractive from solr.	12 years ago
Michael Peter Christen	6f6ddaf7e7	A robinson peer does not need to write RWI data if such peers are only searched using the solr interface. Searching public robinsons will be done with solr only in the future.	12 years ago
Michael Peter Christen	7806680ab8	fixed a problem with re-feeding of already indexed documents with coordinates attached.	12 years ago
Michael Peter Christen	eb80405a16	added a disable function in RemoteCrawl_p servlet which prevents setting of remote crawl if peer is not a senior or principal peer	12 years ago
Michael Peter Christen	4735bd47f4	- changed solr commit call and added an optimize option. Since Solr 4.0.0 there is a new softcommit feature which implements a near-real-time (NRT) search option. The softcommit does not do IO and does not cause performance issues. YaCy has now an extension in its solr connectors to use the softcommit feature. The softcommit call now replaces all places where a hard commit was used. Furthermore the commit strategy when doing a search from the web interface was changed (it's done every time before a search is done). The softcommit feature was implemented because it was needed for the following changes (customer demands), which are also included in this git commit: - added a feature to identify all documents which have unique titles and/or unique descriptions. These unique flags are disabled by default. - added also a feature to set a flag when the url from a canonical tag is equal to the document url. This is also disabled by default. To support the new softcommit strategy, the commitWithinMs option was set to -1 to disable automatic commit based on document insert times. If documents are inserted permanently then also a commit would happen permanently whenever the commitWithinMs time is reached. This would conflict with the regular autocommit of 10 minutes and the new softcommit strategy.	12 years ago
Michael Peter Christen	becd52a984	added also a re-calculation of reference counts during the post-processing of clickcount calculations. This is a really nice thing to have because the reference count affects ranking.	12 years ago
Michael Peter Christen	6f0baaa309	added the clickdepth post-processing: some links may have 'shortcuts' to already calculated click depths. These are then calculated if the crawl buffer is empty and therefore no new 'shortcuts' can be discovered. The status of the clickdepth stack (to-be-processed) can be seen using a solr search command like this: http://localhost:8090/solr/select?q=process_sxt:[*%20TO%20*]&start=0&rows=30&fl=sku,clickdepth_i,process_sxt	12 years ago
reger	0148f1bb8c	fix: exception if default work files don't exist	12 years ago
Michael Peter Christen	9e4033f229	fix for event starter: delete start time when event is removed	12 years ago
Michael Peter Christen	99271ffd13	copy work tables from defaults/data/work if exist there and not in DATA/WORK This can be used to create start-up behavior work scripts in the api.bheap table	12 years ago
Michael Peter Christen	24c9bb35f7	extended the Scheduler: introduced scheduled events - an event type (once, regular) can be selected - for this event type, a fixed time can be selected. This may be either directly after startup or at one of the full hours at a day (==25 options) The main point about this feature is the opportunity to start an action directly after startup. That makes it possible to create YaCy distributions which, after started at the first time, start to index parts of the intranet/internet by itself.	12 years ago
Michael Peter Christen	cb5cbec14d	distinguishing modified query string and original query string	12 years ago
orbiter	1f33c30d7b	re-integrating useForHost method (lost sometime?) to get the noProxy pattern working again. Without using this method all remote urls including the localhost had been accessed through the configured proxy	12 years ago
Michael Peter Christen	10527e28ae	fix for wrong display of error urls in HostBrowser	12 years ago
Michael Peter Christen	8aa08261a7	update to Solr Boost handling	12 years ago
Michael Peter Christen	72f165d58b	added a Boost class which stores solr query boost values. The class can be configured using the yacy.init file. The boost information is taken from the configuration each time when a query to solr is done.	12 years ago
Michael Peter Christen	4eab3aae60	removed overhead by preventing generation of full search results when only the url is requested	12 years ago
Michael Peter Christen	d6b82840f8	added a feature to find similarities in documents. This uses an enhanced version of the Nutch/Solr TextProfileSignature. As a result, a signature of the document is written to the solr search index. Additionally for each time when a signature is written, it is checked if the signature exists already in the index. If the signature does not exist, the document is marked as unique. The unique attribute can now be used to sort document lists and bring duplicates to the end of a result list. To enable this, a large portion of the search api to Solr had to be changed. This affected mainly caching of 'exists' searches to enhance the check for existing signatures and do this without actually doing a solr query. Because here the first time a long number is used as value in the Solr store, also the value naming in the YaCySchema had to be adapted and normalized. This caused that many files had to be changed.	12 years ago
Michael Peter Christen	f5ca5cea44	- added field options to all solr queries. This can be used to restrict the actual data which is fetched from solr. - used the new field options to reduce generic options like getting the load date or the count of search results. should increase overall speed - used the new field options to reduce overhead in the host browser during acquisition of links. - used the field options to make checking of links in crawler faster - if the crawler is paused, the crawl queue is not cleaned	12 years ago
cominch	2bb8f045cc	content control: use up-to-date definitions	12 years ago
cominch	d2a94cc55e	refactor package	12 years ago
cominch	21df1ad9e0	update and generalization of the SMW import and content control routines	12 years ago
Michael Peter Christen	71ed8e5e07	bugfixes for crawler	12 years ago
Michael Peter Christen	158732af37	automatically delete entries from the crawl profile list if crawl is terminated.	12 years ago
Michael Peter Christen	15d1460b40	added information about the reason of pausing of crawls	12 years ago
Michael Peter Christen	791e1dcfdf	when a new crawl is started, delete all entries about error-urls for crawl-start domains	12 years ago
Michael Peter Christen	8fb370d9f8	renovated the way how search results are counted. should be correct now...	12 years ago
Michael Peter Christen	6629e37685	tried to clean up the search process mess	12 years ago
Michael Peter Christen	f8f05ecba7	- added a delete button in host browser to delete a complete subpath - removed storage of default collection name - default is now "user" - made stacking of crawl start points concurrently	12 years ago
Michael Peter Christen	4a14122ba7	in case that a crawl profile has a collection assigned, use the collection to show a name in the web interface. This should prevent that much too long names make the interface unusable.	12 years ago
Michael Peter Christen	0833937c1c	better balancing and duetime-computation also for no-delay intranet hosts	12 years ago
Michael Peter Christen	c25d7bcb80	- added concurrency for robots.txt loading - changed data model for domain counter	12 years ago
Michael Peter Christen	2d9e577ad0	replaced the custom robots.txt loader by the standard http loader	12 years ago
Michael Peter Christen	799d71bc67	enhanced solr caching: - increased cache size which is needed for longer solr commit time - speed hacks on cache write code	12 years ago
Michael Peter Christen	a33e2742cb	- removed unnecessary synchronized and deadlock in crawler - removed problem with monitoring object on Balancer.wait - added missing user agent settings	12 years ago
sixcooler	47ae7e322e	smaller dhtDispatcher.cloudSize @Orbiter: we talked about this times ago - please revert if I'm wrong	12 years ago
Michael Peter Christen	ccc3760a47	Refactoring and redesign of data architecture to make URIMetadataRow superfluous. The target is to make a solr document as the core of YaCy documents which would cause that many conversions can be removed. On the way to this target the Equivalence of URIMetadataRow and URIMetadataNode had to be removed to expose the usage of the old URIMetadataRow data structure. This refactoring already removes unnecessary conversions and should make memory usage during indexing lower.	12 years ago
Michael Peter Christen	e5b3c172ff	removed hack which translated Solr documents to virtual RWI entries which had been then mixed with remote RWIs. Now these Solr documents are fed into the result set as they appear during local and remote search. That makes the search much faster.	12 years ago
Michael Peter Christen	43f3345c90	- removed dependencies from URIMetadataRow and made direct access to URIMetadataNode which creates the opportunity to access Solr objects directly and use their information richness - lazy initialization of the URIMetadataNode object - should cause less computation and memory usage during search. - removed dead code	12 years ago
Michael Peter Christen	21fe8339b4	- enhanced generation of url objects - enhanced computation of link structure graphics - enhanced collection of data for link structures	12 years ago
Michael Peter Christen	5f0ab25382	removed the option to prevent removal of & parts inside of the MultiProtocolURI during normalform computation because that should always be done and also be done during initialization of the MultiProtocolURI Object. The new normalform method takes only one argument which should be 'true' unless you know exactly what you are doing.	12 years ago
Michael Peter Christen	bd769de604	since the solr index is now used for all pages that are indexed locally, there is no need for the RWI index if the index is not transferred to another peer. Therefore the creation of RWI index data is now suppressed if DHT is disabled. This applies for all intranet and portal mode configurations, but not for public robinson modes. A robinson may switch back to public mode and then transmit its data. That means if someone never wants to switch to DHT mode, it would be more appropriate to choose the portal mode.	12 years ago
Michael Peter Christen	4b5e0c1500	added an url rewriter which can be used to remove session ids from urls	12 years ago
Michael Peter Christen	76d218fbef	fixes to crawl profiles	12 years ago
sof	5cb244b79b	Merge remote branch 'origin/master'	13 years ago
apfelmaennchen	88b062210c	Added a parser for audio file tags (e.g. ID3 tags for MP3 files) based on the jaudiotagger library. The parser is disabled by default as it needs to store temporary files for non file:// protocols, which might be disliked. For your local MP3-collection it loads nicely Artist, Title, Album etc. from the audio files meta data.	13 years ago
orbiter	3190347814	added a synonyms_t field to solr and a process to read synonym files. This can be used to add another stemming to solr using stemming files that are expressed as synonyms for grammatical alternatives. The synonym/stemming files must have the following form: - each line is a comma-separated list of synonyms - the list of synonyms may be enclosed with {} (like the GSA synonyms file) - the file may contain comments which are lines starting with a '#' The synonym file(s) must be placed in DATA/DICTIONARIES/synonyms/ and are activated by default whenever a synonym file is in place. Then, for each word that is found in a document all synonyms are added to a long text field which is stored into synonyms_t. Processes using the synonyms must query with that field as optional matcher.	13 years ago
Michael Peter Christen	f45f7fc12e	added new Host Browser to main menu: this new search interface is something completely new for search, but completely common on desktops: browse a web space like one would browse a file system in a file browser. The file listing is created using the search index and a faceted restriction to specific domains.	13 years ago
Michael Peter Christen	8556a3d521	extended solr connector with a method to retrieve a single facet.	13 years ago
Michael Peter Christen	23f68f2a69	force usage of default faceting mechanisms for search	13 years ago
Michael Peter Christen	a4214694df	We assert that no other metadata storage than solr is used now. Therefore a property like solrConnected() must be true all the time. Removal of this method causes removal of all write operations to the old metadata index.	13 years ago
Michael Peter Christen	1533bfd63b	refactoring	13 years ago
Michael Peter Christen	872f83ebe0	refactoring	13 years ago
Michael Peter Christen	8219a445f3	refactoring	13 years ago
Michael Peter Christen	00c1c777fa	refactoring	13 years ago
orbiter	63762d8f89	removed kelondro dependencies from cora	13 years ago
orbiter	60b1e23f05	added new crawl options: - indexUrlMustMatch and indexUrlMustNotMatch which can be used to select loaded pages for indexing. The default patterns are set so that all loaded pages are also indexed (as before), but when doing an expert crawl start, the user may select only specific urls to be indexed. - crawlerNoDepthLimitMatch is a new pattern that can be used to remove the crawl depth limitation. This filter is a never-match by default (so the crawl depth limit applies), but the user can select paths which will be loaded completely even if the crawl depth is reached.	13 years ago
Michael Peter Christen	6ec02deec6	added new crawl attributes in crawl profile (not active yet)	13 years ago
orbiter	a55e77a115	added twitter search heuristic	13 years ago
Michael Peter Christen	b2b516cc3e	added a collection attribute to crawls and searches: - a solr field collection_sxt can be used to store a set of crawl tags - when this field is activated, a crawl tag can be assigned when crawls are started - the content of the collection field can be comma-separated, all of them are assigned to the documents when they are indexed as result of such a crawl start - a search result can be drilled down to a specific collection; this is currently only available in the solr interface and also in the gsa interface using the 'site' option - this adds a mandatory field for gsa queries (the google api demands that field all the time)	13 years ago
Michael Peter Christen	31d4d38804	- extended the solr interface by a references-by-word-count method - reduced danger that a non-existing RWI database causes NPEs - added Solr queries to did-you-mean: this makes it possible that our did-you-mean algorithm works together with only Solr and without RWIs	13 years ago
cominch	dc468dad01	add content control features for custom filter lists	13 years ago
reger	65d49df865	security fix: clear automatic password only if adminAccountForLocalhost=false to prevent remote access to protected pages after restart. If adminAccountForLocalhost=true, leave the automatic password unchanged so access from localhost is granted but remote access is prevented from the first second.	13 years ago
Michael Peter Christen	48a82bc705	log queries from gsa+solr requests anonymously	13 years ago
Michael Peter Christen	0cab06c47c	refactoring	13 years ago
Michael Peter Christen	06a78eecb7	code simplification	13 years ago
Michael Peter Christen	18f989dfb1	- refactoring (load -> getMetadata) - added getDocument to retrieve Solr documents which shall replace getMetadata	13 years ago
Michael Peter Christen	23226676c6	FOR THE BRAVE.. this is a forced migration to solr which is now ready for production as a replacement of the metadata-db. This intermediate release 1.041 will switch on the previously optional solr index and the old metadata-db will still work as it did before. Solr+metadata are accessed in mixed mode, no migration is done yet. If this does not cause a catastrophe by the end of the weekend, we will do a YaCy 1.1 main release containing this as default.	13 years ago
Michael Peter Christen	b51df6c7e8	- added coordinate storage in solr schema - fixed shutdown process - fixed some solr-to-metadata reading - added a large number of metadata attributes in ViewFile.html	13 years ago
orbiter	39f8eb60c3	tried to prevent calls to bad-hack getSize() method and reduced overhead of that method a bit.	13 years ago
orbiter	67edfd991c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	13 years ago
orbiter	d9173ba7ed	added more solr fields to integrate values from URIMetadataRow. All writings to the Metadata-DB are now also done to solr. This includes metadata transfer during search and rwi transfer. The new/added solr fields are: ## time when resource was loaded load_date_dt ## date until resource shall be considered as fresh fresh_date_dt ## id of the host, a 6-byte hash that is part of the document id host_id_s ## ids of referrer to this document referrer_id_ss ## the md5 of the raw source md5_s ## the name of the publisher of the document publisher_t ## the language used in the document; starts with primary language language_ss ## an external ranking value ranking_i ## the size of the raw source size_i ## number of links to audio resources audiolinkscount_i ## number of links to video resources videolinkscount_i ## number of links to application resources applinkscount_i	13 years ago
Michael Peter Christen	24d9db1613	snippet retrieval loading processes may use a smaller minimum load time value than crawling processes. This speeds up the search result preparation dramatically.	13 years ago


344 Commits (0d29b972ccee9f5bffbb72dd3b0954ac57958443)