yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying to robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on per-crawl-start basis. Every crawl start can have a different user agent.	11 years ago
Michael Peter Christen	47b1c81d08	- refactoring - generalized writing of url attributes to solr documents - added more url attributes to error documents	11 years ago
Roland Haeder	841a28ae76	Added 'final' for all exception blocks as this helps the Java compiler to optimize memory usage Conflicts: source/net/yacy/search/Switchboard.java	11 years ago
Michael Peter Christen	5878c1d599	- refactoring of log to ConcurrentLog: jdk-based loggers tend to block at java.util.logging.Logger.log(Logger.java:476) in concurrent environments. This makes logging a main performance issue. To overcome this problem, this is an add-on to jdk logging to put log entries on a concurrent message queue and log the messages one by one using a separate process. - FTPClient uses the concurrent logging instead of the log4j logger	12 years ago
Marc Nause	8fb1b1e290	*) simplified banner creation code	12 years ago
Michael Peter Christen	8f2d3ce2f9	reduced locking situation in crawler: shifted synchronized location and reduced time-out of robots.txt load limit	12 years ago
Michael Peter Christen	ed1d5bace6	draw the names of other peers which receive/send dht into the network graphic	12 years ago
Michael Peter Christen	b528448332	enlarge network graph circle according to image height and reduce the image height in the Network servlet. Overall, the image is now larger but takes less space on the web page.	12 years ago
Michael Peter Christen	bb4bf3d8fd	infinity timeout bug protection patch	12 years ago
Michael Peter Christen	fc2095ac67	some extensions to raster plotter to transform a RGB picture to an indexed color scheme. This is needed for gif animations	12 years ago
Michael Peter Christen	c1a2175fbc	added transparency to gif image animation and the integration to the YaCy httpd for on-the-fly generated gifs (including animated gifs)	12 years ago
Michael Peter Christen	2d36a7eaf5	- do not create a new query for all remote peers - no document search this time - adjusted banner and network to not show 'WORDS' but DHT Chunks. This is to avoid confusion for robinson peers which do not create Word Entries	12 years ago
orbiter	d74472f562	corrected result counter	12 years ago
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr fields in the core 'webgraph'. The default schema uses only some of them and the resulting search index has now the following properties: - webgraph size will have about 40 times as many entries as the default index - the complete index size will increase and may be about double the current size. As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index means that some old parts in YaCy can be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also available for p2p operation but client access is not yet implemented.	12 years ago
orbiter	5dfd6359cb	redesign of the QueryParams class: introduced QueryGoal which holds the query string parser. This shall be used to create a proper full-string matching which is handled then by QueryGoal.	12 years ago
Michael Peter Christen	8fb370d9f8	renovated the way search results are counted. should be correct now...	12 years ago
Michael Peter Christen	6629e37685	tried to clean up the search process mess	12 years ago
Michael Peter Christen	c5f67a5d6d	fixed a problem with local search from solr results: now all results from solr are shown (again)	12 years ago
sixcooler	2d972f289a	raise commitWithinMs to default-value from SwitchBoard (results in lower hd-io); no dots in memory-graph (there are too many of them)	12 years ago
Michael Peter Christen	f2d0418218	because the new PngEncoder had a problem with the PixelGrabber which is caused by a JRE bug, the PixelGrabber had to be circumvented using an own frame buffer which can be read without a PixelGrabber. This resulted in ultra-fast and much less memory-consuming transformation. YaCy images are now generated really fast!	12 years ago
Michael Peter Christen	d5d64019e5	- added a method for the RasterPlotter to draw arrow endings to lines - replaced the dot in the NetworkGraph with arrows - enhanced the image drawing speed using pre-computed color values - added more attention for OOM cases during very large image painting	12 years ago
Michael Peter Christen	ae6feb5610	showing the web structure graph as animation in the crawl monitor	12 years ago
Michael Peter Christen	39317a6c66	enhanced webstructure image: introduced - multiple hosts can be listed (comma-separated) as host argument - new 'bf'-attribute (branch factor): the maximum number of edges per node - the bf-value is computed automatically - ordering of nodes when the graphic is drawn: mostly the drawing ends with a limitation, e.g. number of nodes. When this happens, it should be ensured that more 'interesting' nodes are painted in advance. This is now done by sorting all nodes by the number of links they have in the distant sub-graph.	12 years ago
Michael Peter Christen	43f3345c90	- removed dependencies from URIMetadataRow and made direct access to URIMetadataNode which creates the opportunity to access Solr objects directly and use their information richness - lazy initialization of the URIMetadataNode object - should cause less computation and memory usage during search. - removed dead code	12 years ago
Michael Peter Christen	21fe8339b4	- enhanced generation of url objects - enhanced computation of link structure graphics - enhanced collection of data for link structures	12 years ago
Michael Peter Christen	016ffa7434	increased strength of crawling waves in network image	12 years ago
Michael Peter Christen	1533bfd63b	refactoring	12 years ago
Michael Peter Christen	5683162bd3	simplifications in DHT Distribution class and more documentation	12 years ago
Michael Peter Christen	e57bf2ca39	simplified DHT classes	12 years ago
Michael Peter Christen	8219a445f3	refactoring	12 years ago
Michael Peter Christen	00c1c777fa	refactoring	12 years ago
orbiter	563d584420	removed more dependencies in cora from kelondro	12 years ago
Michael Peter Christen	a06123aec6	more abstraction and less parameter overhead for remote search	12 years ago
Michael Peter Christen	24d9db1613	snippet retrieval loading processes may use a smaller minimum load time value than crawling processes. This speeds up the search result preparation dramatically.	12 years ago
Michael Peter Christen	1687737771	Abstraction of HandleMap and HandleSet	12 years ago
orbiter	0cbda0b2b8	- replaced all length() == 0 and size() == 0 with isEmpty() - replaced some length() > 0 and size() > 0 with !isEmpty() - cannot be done automatically - implemented some isEmpty() methods	13 years ago
orbiter	62202e2d71	refactoring of query attribute variable names for better consistency with (next) stored query words	13 years ago
Michael Peter Christen	0301aba1e9	removed unused method parameters	13 years ago
Michael Peter Christen	1825f165b8	better integration of blacklist according to use case	13 years ago
Michael Peter Christen	3b992e6b00	using utf8 String compression in Webstructure database	13 years ago
Michael Peter Christen	f294f2e295	bugfix to http://bugs.yacy.net/view.php?id=181 tried to make a bit less 'noise' to dns server also included: less processes in snippet fetch to reduce load during search on small computers	13 years ago
Michael Peter Christen	89142d1e8d	removed (not all) warnings	13 years ago
Michael Peter Christen	15db703808	added missing serialization to remove all warnings	13 years ago
Roland 'Quix0r' Haeder	a093ccf5eb	Now used synchronization in all close() methods to make sure all objects are 'closed' in an ordered way Conflicts: source/de/anomic/http/server/ChunkedInputStream.java source/de/anomic/http/server/ChunkedOutputStream.java source/de/anomic/http/server/ContentLengthInputStream.java source/net/yacy/cora/protocol/Domains.java source/net/yacy/cora/services/federated/solr/SolrShardingConnector.java source/net/yacy/cora/services/federated/solr/SolrSingleConnector.java source/net/yacy/document/content/dao/PhpBB3Dao.java source/net/yacy/document/parser/html/AbstractTransformer.java source/net/yacy/kelondro/blob/BEncodedHeap.java source/net/yacy/kelondro/blob/HeapReader.java source/net/yacy/kelondro/index/RAMIndexCluster.java source/net/yacy/kelondro/io/ByteCountInputStream.java source/net/yacy/kelondro/logging/ConsoleOutErrHandler.java source/net/yacy/kelondro/table/SQLTable.java	13 years ago
Marc Nause	a691023d04	*) better formatting for network QPM *) refactoring	13 years ago
Michael Peter Christen	ba6aaabc51	refactoring + parser bugfixes	13 years ago
Michael Peter Christen	33d1062c79	refactoring: the cache belongs to the crawler	13 years ago
Michael Peter Christen	8c06925984	animation of the web structure picture	13 years ago
Michael Christen	0797b0de99	new handling of remote search processes: looking for seeds will now not block the whole search process any more. A deadlock with a DHT selection process may have been the cause for interface lockings in the past.	13 years ago
Michael Christen	044f83feed	added some pauses into the search process which shall produce better-ranked search results. Without those pauses the result page will only contain links from the peer that answers first, which is not a good average picture of all the peers that provided results	13 years ago

57 Commits (9e12fdff23344fb1bf84089c111a95404bdaa5ac)