yacy_search_server

Commit Graph

Author	SHA1	Message	Date
orbiter	056b42f5aa	- added information about segment count to status_p.xml - also moved this information from the old index structure, which is still in use for the RWI/DHT index to that front-end	11 years ago
orbiter	6fb2811e68	fixes for problems with remote solr and non-activated webgraph index	11 years ago
sixcooler	af740f3058	changed optimization to a segment-size of index-size/5.000.000 + one if not idle + one (and force) if postprocessing	11 years ago
Michael Peter Christen	336f86394c	replaced StringBuffer with StringBuilder	11 years ago
Michael Peter Christen	aeac2fb763	replaced more containsKey() -> get() usages by a simple get(), followed by a test for NULL. This should increase the application speed and reduce the lookup time for the affected methods by 50%	11 years ago
orbiter	5364c4dcc9	delayed first peer-ping to send the first ping out after the http got up; if the ping comes before the http is up, it cannot be recognized as senior peer (if at all). See also: http://bugs.yacy.net/view.php?id=266	11 years ago
orbiter	e24016e30a	added the property federated.service.solr.indexing.timeout to yacy.init to provide a configurable time-out for solr; see also: http://bugs.yacy.net/view.php?id=254	11 years ago
orbiter	c124037f19	removed forced non-soft commits to prevent index fragmentation	11 years ago
Michael Peter Christen	31483c47e1	fixed problem with remote luke requests	11 years ago
Michael Peter Christen	c15aa758dc	removed failreason_t removal patch because that causes too much confusion using an external solr. to clean up the index after a schema change, use the index cleaner function from the online servlet	11 years ago
reger	2b7a38640a	extend content type detection on file extension for .tif .tiff .htm	11 years ago
Michael Peter Christen	ac1aad5064	added a getSegmentCount method and use it to disable optimize if wanted current segment count is below optimization level	11 years ago
Michael Peter Christen	36035e0a0a	- used reger's LukeRequest to generalize the index info in SolrServerConnector - used the LukeRequest in SolrServerConnector to replace the index size method by a getNumDocs request to a LukeRequest result	11 years ago
Michael Peter Christen	39fceb5ccf	fix for NPE & bug #264	11 years ago
Michael Peter Christen	735a66eff3	enhancements to crawler	11 years ago
Roland Haeder	be0ff6018f	Removed trailing spaces + some more final	11 years ago
Roland Haeder	aaedc0405d	Fixes and avoid of catching bad exceptions (some): - Rewrote usage of HashMap/Map to concurrent versions (to avoid a CME=ConcurrentModificationException) - Rewrote ConnectionInfo (as an example) to use a synchronized iterator instead of synchronizing an already synced HashSet (see Collections call) - This avoids catching CMEs again - Commented out noisy ConcurrentLog.logException() call Conflicts: source/net/yacy/repository/LoaderDispatcher.java	11 years ago
Roland Haeder	841a28ae76	Added 'final' for all exception blocks as this helps the Java compiler to optimize memory usage Conflicts: source/net/yacy/search/Switchboard.java	11 years ago
Felix Ableitner	03044589dd	Fixed (?i) appearing in entries, fixed multiple equal lines in file.	11 years ago
Michael Peter Christen	89c0aa0e74	added collection_sxt to error documents	11 years ago
Michael Peter Christen	0df5195cb0	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	1fd006cc56	fixes using the embedded connector	11 years ago
orbiter	d0dc86cf3d	logging of deadlocks (if any) during cleanup process	11 years ago
Michael Peter Christen	c6a6f159e8	fix for crawl stack domain counter	12 years ago
Michael Peter Christen	93d1bac140	do a more frequent optimization, reduces IO after optimization	12 years ago
orbiter	b71d13a014	added load and deadlock detector in Memory util	12 years ago
orbiter	290e24564b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
orbiter	5533fc8e01	fix for bug 260	12 years ago
Michael Peter Christen	b79471ee67	grr	12 years ago
Michael Peter Christen	a79f288ac1	automatically running optimize on solr if user/search is idle for some time	12 years ago
orbiter	a9c8046c87	do a light optimization at the end of a crawl postprocessing	12 years ago
orbiter	a548354c71	replaced type of solr schema object sku of text_en_splitting_tight by string	12 years ago
orbiter	2f1ec8d4a2	npe fix	12 years ago
Michael Peter Christen	bcc623a843	refactoring of load_delay: this is a matter of client identification	12 years ago
orbiter	0d0b3a30f5	activate api actions after postprocessing of crawls	12 years ago
orbiter	3978c5ca5d	fix for http://bugs.yacy.net/view.php?id=255	12 years ago
orbiter	2be456e7fb	added a postprocessing field into api/status_p.xml to show if the postprocessing task is running at that time (status: busy) or not (status:idle)	12 years ago
orbiter	dac88561ae	minimum access time has a tight connection to ClientIdentification, therefore it is defined there.	12 years ago
Michael Peter Christen	9a29ab469e	another patch to prevent CLOSE_WAIT status on solr connections	12 years ago
Michael Peter Christen	5091d627bc	fixed parsing of peer flags	12 years ago
Michael Peter Christen	87e9052081	added Connection:close to all http requests in our http client to prevent CLOSE_WAIT states (as seen in lsof)	12 years ago
Michael Peter Christen	5c6946dd5f	replaced usage of log4j by ConcurrentLog where possible	12 years ago
Michael Peter Christen	5878c1d599	- refactoring of log to ConcurrentLog: jdk-based loggers tend to block at java.util.logging.Logger.log(Logger.java:476) in concurrent environments. This makes logging a main performance issue. To overcome this problem, this is an add-on to jdk logging to put log entries on a concurrent message queue and log the messages one by one using a separate process. - FTPClient uses the concurrent logging instead of the log4j logger	12 years ago
orbiter	f4f6551c66	better handling of time-out at solrj in case that a commit is done in a fail-over case during add	12 years ago
Michael Peter Christen	07261fe274	Merge remote-tracking branch 'nutomics/blacklist_structure'	12 years ago
Michael Peter Christen	dea71851d2	- better concurrency for network scanner - network scanner can now start from the list of all hosts in the search index	12 years ago
Michael Peter Christen	a34e137e27	fix for citation index generation in case that entry.referrerhash() is null. This is especially the case if ftp sites are crawled	12 years ago
Michael Peter Christen	a2c8116a8f	accept (but ignore) a '+' sign in front of search words	12 years ago
orbiter	9f0cc9b401	enhanced network scanner - textarea input field can now be used to paste in a large list of hosts - /31er subnet is possible (only one host) - auto-detect subdomains for ftp and www subdomains	12 years ago
sixcooler	308d73f855	do not use remote proxy if not switched on - regardless of the proto	12 years ago
sixcooler	69906b1d2e	Revert "do not use remote proxy if not switched on - regardless of the proto" This reverts commit `20f452d228`.	12 years ago
sixcooler	20f452d228	do not use remote proxy if not switched on - regardless of the proto	12 years ago
sixcooler	9551720d5c	re-enable saved setting for proxy-crawl-profile	12 years ago
sixcooler	d5d8936f9d	For indexes that are changing rapidly in NRT situations, fcs (stands for Field Cache per Segment) may be a better choice than the default fc. (saves memory) see: http://wiki.apache.org/solr/SimpleFacetParameters#facet.method	12 years ago
Felix Ableitner	44f8fcf62e	Changed class structure of Blacklist.	12 years ago
Michael Peter Christen	57ffdfad4c	added a crawl option to obey html-meta-robots-noindex. This is on by default.	12 years ago
Michael Peter Christen	5a5d411ec0	new robots_i attribute fields	12 years ago
Michael Peter Christen	fa08bd9d5a	hack to prevent long waiting times in crawler	12 years ago
Michael Peter Christen	f1c5338210	preparation for greedy crawl profiles and refactoring	12 years ago
Michael Peter Christen	e6f361f474	adding the canonical tag to crawl queues	12 years ago
reger	a6bf44212e	bugfix: location (lat/lon) meta data retrieval (Double.NaN check)	12 years ago
Michael Peter Christen	203921006a	redesign of citation index storage	12 years ago
reger	83763ee4a4	jpeg parser: extract GPS location from meta data	12 years ago
Michael Peter Christen	32aa1d4569	removed unused option for queries	12 years ago
Michael Peter Christen	9d291764d1	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
sixcooler	e5abccdfe4	added optimize-option	12 years ago
Michael Peter Christen	64140f35cd	fix for solr requests if no query part is given (prevent npe)	12 years ago
Michael Peter Christen	8caaf6203a	fixed false multiple-generation of remote facet search which caused high cpu usage on remote side.	12 years ago
Michael Peter Christen	823ae4d6a7	added url_protocol_s to error documents	12 years ago
Michael Peter Christen	660a196989	refactoring	12 years ago
Michael Peter Christen	c4538d8d91	added metadata-extractor-2.6.2.jar to eclipse classpath, removed old lib	12 years ago
reger	3760e2616b	bump up lib/metadata-extractor-2.6.2.jar (used for image parser) with needed code adjustments	12 years ago
Michael Peter Christen	9a6fcdf597	npe fix	12 years ago
Michael Peter Christen	16d1d744fa	added url_file_name_s in default collection schema for the file name without the file extension. This part of the file path is removed from the multi-field url_paths_sxt, which has now not the file name as last part of the path list. The same applies to the new fields source_file_name_s and target_file_name_s in the webgraph schema.	12 years ago
reger	8d1c4c423d	make imageparser fileextension detection case insensitive (extensions are often upper case)	12 years ago
Michael Peter Christen	f9d859f5dc	now writing image alt texts and (camelcase-)parsed urls into a text search field for a better image retrieval	12 years ago
Michael Peter Christen	e441a9d4c8	to avoid confusion, the gsa api is available at /search? and /searchresult?	12 years ago
orbiter	8792e6c6e9	stub for better image indexing	12 years ago
orbiter	97f2ac9091	added hint to gsa response writer that the result comes from a yacy peer	12 years ago
Michael Peter Christen	14186e815e	npe fix	12 years ago
Michael Peter Christen	bdf306e0a7	increased time-out for loading of seed-lists	12 years ago
Michael Peter Christen	374d2e2a52	removed warning message during crawling	12 years ago
Michael Peter Christen	570511f3c8	removed fields references_internal_id_sxt and references_internal_url_sxt because they had been shown to be superfluous. The citation of referrer in the host browser is possible without them. Therefore now the host browser does not only show internal, but also external referrer to each link.	12 years ago
Michael Peter Christen	fd1776a3b0	added a new 'Citations' function: each search result item can now be explored for citations within other documents. A click on the 'Citations' link shows an analysis with all text lines in the document each with a complete list of documents which contain the same line. A second section shows the linking documents in ascending order of number of citations from the original document. Because documents from different hosts are most interesting here, they are listed at the top of the page as possible 'copypasta' source.	12 years ago
Michael Peter Christen	fc3ff92c69	npe fix	12 years ago
Michael Peter Christen	1762911f57	added synchronizations and timeouts in solr api; missing synchronizations in index modification methods causes deadlocks inside solr.	12 years ago
Michael Peter Christen	3e1e358fdc	calling pdf cache flush on class initialization because calling of the methods during runtime can conflict with dynamic solr class loader and cause a deadlock (seriously!)	12 years ago
Michael Peter Christen	291912ee52	removed misleading http accessGranted message (this is only for debugging)	12 years ago
Michael Peter Christen	2fd7bbb450	reduced load on solr; no seed update in Status and no exists-check in HTTPLoader in case of redirects, that can be done using the htcache.	12 years ago
Michael Peter Christen	2648b42b27	added fixed clear method as public method	12 years ago
Michael Peter Christen	ffc570f95f	removed forced soft commit since this may be the cause for a performance problem	12 years ago
Michael Peter Christen	6115bef335	added a 'greedy learning' mechanism which will cause that a 'fresh' yacy will load linked web pages from search results until the total number of web pages reaches 15000. This shall give fresh peers a 'boost' to get faster a personalized search index.	12 years ago
Michael Peter Christen	f24574b3da	use a greeting line which does not sound so beta	12 years ago
Michael Peter Christen	b85db72a73	added another response writer which can present search result with texts, separated by sentences. Then, these sentences can be used to search again in the index for the same sentence. This can be used to provide a tool for plagiarism-search. (not finished yet). Try the following: http://localhost:8090/solr/select?q=text_t:flut&grep=wasser&defType=edismax&start=0&rows=3&core=collection1&wt=grephtml .. to search for 'flut' and show only sentences in the result documents which contain the word 'wasser'. Consider this like using a grep-tool on documents: you select the documents by a search query and you grep sentences inside the found documents with the 'grep' attribute.	12 years ago
Michael Peter Christen	8e965ffd16	fix for host compare in case that the host is null. This happens when doing a search in the intranet for file resources (they don't have a host).	12 years ago
orbiter	2b320313d9	replaced yacydoc servlet usage by a solr result output using an html output writer. This made the creation of an html result writer necessary which is included in this commit. The yacydoc servlet was used to present all metadata to a document, but the solr interface can serve for this purpose in a much better way. All usages (except one) of yacydoc were replaced by a solr call. This affects also the 'metadata' link attached to search results.	12 years ago
Michael Peter Christen	f7a4377812	usage of the new normalized link popularity CRn as default ranking function. This replaces the previous formula, which was bad. Before you update to this version, please check if you changed the ranking function yourself before, since it will be overwritten.	12 years ago
Michael Peter Christen	f7e77a21bf	Added a citation reference computation for intra-domain link structures. While the values for the reference evaluation are computed, also a backlink-structure can be discovered and written to the index as well. The host browser has been extended to show such backlinks to each presented link. The host browser therefore can now show where a document is linked. The new citation reference is computed as likelihood for a random click path with recursive usage of previously computed likelihood. This process is repeated until the likelihood converges to a specific number. This number is then normalized to a ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to rank popularity within intra-domain link structures.	12 years ago
Michael Peter Christen	e20450e798	patch in HTCache and CitationIndex loading in case that a file is broken: do not crash; instead ignore the file and delete it.	12 years ago
reger	d367b1f4d9	add null pointer check to stopword fix	12 years ago


6470 Commits (deadeb406eed1425d0afdf3aef794eec13e92756)