yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	030d0776ff	Enhanced crawl start for very, very large crawl lists (i.e. > 5000) which had a problem because of badly used concurrency. This fix also caused a redesign of the whole host deletion process. This should fix bug http://bugs.yacy.net/view.php?id=250	11 years ago
Michael Peter Christen	4948c39e48	added concurrency for mass crawl check	11 years ago
Michael Peter Christen	1b4fa2947d	- fixed a problem which occurred when a document was not recognized with the right content domain (i.e. identifying that it is an image, text etc.) because it used the file extension and not an existing mime type assignment. - fixed the new setting that images shall be loaded for a better image search. - both fixes together now make it possible to crawl commons.wikimedia.org which makes use of 'funny' document names (i.e. ending with .jpg while the document is html)	11 years ago
orbiter	20bbde8665	fix for mustmatch regex computation: result had correct semantics, but may have contained multiple same expressions within the disjunction of domain-restrictions. This fix removes the redundant restrictions and makes the regex shorter.	11 years ago
Michael Peter Christen	74d0256e93	enhanced postprocessing: fixed bugs, enable proper postprocessing also without the harvestingkey, remove crawl profiles after postprocessing, speed-up for clickdepth computation.	11 years ago
Michael Peter Christen	101a6e6e14	Patch the citation index for links with canonical tags. This shall fulfill the following requirement: If a document A links to B and B contains a 'canonical C', then the citation rank computation shall consider that A links to C and B does not link to C. To do so, we first must collect all canonical links, find all references to them, get the anchor list of the documents and patch the citation reference of these links.	11 years ago
reger	fd119deb00	fix NPE on modified since check ( Response.requestHeader allowed to be null)	11 years ago
Michael Peter Christen	b28d43decc	added two more fields source_cr_host_norm_i,target_cr_host_norm_i in webgraph and an addition to postprocessing to copy all cr ranking attributes to the link edges associated to the postprocessing documents	11 years ago
Michael Peter Christen	3bf0104199	fix for crawl domain counter limitation (limit was reached too early)	11 years ago
Michael Peter Christen	82bfd9e00a	- crawl profiles shall be deleted from active and passive stacks if they are deleted to terminate the crawl because otherwise the crawl will go on after the load-from-passive stack policy. - better check if a crawl is terminated using the loader queue.	11 years ago
Michael Peter Christen	91a875dff5	self-healing of mistakenly deactivated crawl profiles. This fixes a bug which can happen in rare cases when a crawl start and a cleanup process happen at the same time.	11 years ago
Michael Peter Christen	4f83d5f18c	added the new field harvestkey_s to the collection index and the webgraph index which is temporarily filled with the crawl profile key. This is used to select a set of documents for post-processing as soon as a crawl is finished. Now the postprocessing for a specific crawl is started when that specific crawl is finished and not at the end of all post-processing steps.	11 years ago
orbiter	14442efa6d	when profiles are cleaned, there shall be first a callback showing which profiles are cleaned. This shall enable a profile-termination-driven postprocessing. To do this, index writings must carry the profile key which will be implemented in another (next) step.	11 years ago
orbiter	0013d0d0bb	removed superfluous class	11 years ago
orbiter	f90d5296cb	Added new data structure to be used by the balancer (not used yet). These data structures will enable the balancer to store the crawl queue into individual queues, one each for a single host.	11 years ago
orbiter	0e8d752462	refactoring	11 years ago
Michael Peter Christen	e40671ddb7	better and consistent deletions for error urls	11 years ago
Michael Peter Christen	2602be8d1e	- removed ZURL data structure; removed also the ZURL data file - replaced load failure logging by information which is stored in Solr - fixed a bug with crawling of feeds: added must-match pattern application to feed urls to filter out such urls which shall not be in a wanted domain - delegatedURLs, which also used ZURLs, are now temporary objects in memory	11 years ago
Michael Peter Christen	61c5e40687	- replaced the properties object in AnchorURL with distinct variables for anchor attributes. - as a consequence, large portions of the parser code had to be adapted as well - added a counter target_order_i for anchor links in webgraph computation	11 years ago
Michael Peter Christen	5e31bad711	- the webgraph shall store all links which appear on a web page and not all unique links! This made it necessary to adapt a large portion of the parser and link processing classes to carry a different type of link collection whose entries carry property attributes attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI, have been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring	11 years ago
Michael Peter Christen	1a8c64117f	decreased the responseHeaderDB database which is now flushed more frequently. This will preserve more documents in the cache in case of a crash.	11 years ago
Michael Peter Christen	dbef8ccfcb	forced deletion of ZURL entries for a specific host for each host that appears in the crawl url list	11 years ago
Michael Peter Christen	e137ff4171	refactoring (in preparation for new removeHost method)	11 years ago
orbiter	26366596d9	fix for a problem which occurs when a site is crawled where the start url is redirected.	11 years ago
Michael Peter Christen	69f85265e1	added an option to put image links to the crawl queue and handle these like normal documents. Using this option (by default on at this moment; this might change soon) it is possible to get the exif data into the search index to be used in image search.	11 years ago
Michael Peter Christen	a88a62f7aa	added a feature to set a collection for a crawl result based on a regular expression on the url: the collection attribute for a crawl start may now be either a token or a list of tokens, separated by ',', where a token is either a string or a pair <string,pattern> where the string is separated from the pattern by a ':' and the string is assigned to the document as collection only if the pattern matches the url.	11 years ago
Michael Peter Christen	e4cbe9232d	fixed a crawler bug where a double-occurring url was not re-crawled because the double-check error was written to the error-db and never deleted. Now the error-db is cleared on every start and these double-messages are not written to the error-db any more.	11 years ago
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on a per-crawl-start basis. Every crawl start can have a different user agent.	11 years ago
Michael Peter Christen	47b1c81d08	- refactoring - generalized writing of url attributes to solr documents - added more url attributes to error documents	11 years ago
Michael Peter Christen	dbfa865700	added a stub of a class for crawler redesign	11 years ago
Michael Peter Christen	58fe986cca	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	cf12835f20	replaced the single-text description solr field with a multi-value description_txt text field	11 years ago
orbiter	268a36aaff	emergency fix for crawler: this will otherwise cause loss of complete crawl queue if latency of remote system is too low	11 years ago
reger	2b7a38640a	extend content type detection on file extension for .tif .tiff .htm	11 years ago
Michael Peter Christen	735a66eff3	enhancements to crawler	11 years ago
Roland Haeder	841a28ae76	Added 'final' for all exception blocks as this helps the Java compiler to optimize memory usage Conflicts: source/net/yacy/search/Switchboard.java	11 years ago
Michael Peter Christen	89c0aa0e74	added collection_sxt to error documents	11 years ago
Michael Peter Christen	c6a6f159e8	fix for crawl stack domain counter	12 years ago
Michael Peter Christen	bcc623a843	refactoring of load_delay: this is a matter of client identification	12 years ago
orbiter	3978c5ca5d	fix for http://bugs.yacy.net/view.php?id=255	12 years ago
orbiter	dac88561ae	minimum access time has a tight connection to ClientIdentification, therefore it is defined there.	12 years ago
Michael Peter Christen	5c6946dd5f	replaced usage of log4j by ConcurrentLog where possible	12 years ago
Michael Peter Christen	5878c1d599	- refactoring of log to ConcurrentLog: jdk-based loggers tend to block at java.util.logging.Logger.log(Logger.java:476) in concurrent environments. This makes logging a main performance issue. To overcome this problem, this is an add-on to jdk logging to put log entries on a concurrent message queue and log the messages one by one using a separate process. - FTPClient uses the concurrent logging instead of the log4j logger	12 years ago
Michael Peter Christen	a34e137e27	fix for citation index generation in case that entry.referrerhash() is null. This is especially the case if ftp sites are crawled	12 years ago
sixcooler	9551720d5c	re-enable saved setting for proxy-crawl-profile	12 years ago
Michael Peter Christen	57ffdfad4c	added a crawl option to obey html-meta-robots-noindex. This is on by default.	12 years ago
Michael Peter Christen	fa08bd9d5a	hack to prevent long waiting times in crawler	12 years ago
Michael Peter Christen	f1c5338210	preparation for greedy crawl profiles and refactoring	12 years ago
Michael Peter Christen	203921006a	redesign of citation index storage	12 years ago
Michael Peter Christen	16d1d744fa	added url_file_name_s in default collection schema for the file name without the file extension. This part of the file path is removed from the multi-field url_paths_sxt, which now no longer has the file name as last part of the path list. The same applies to the new fields source_file_name_s and target_file_name_s in the webgraph schema.	12 years ago

132 Commits (434e13b46d77af182a0eb05e449ad00ecd9acf13)