yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	8df8ffbb6d	enhanced the snapshot functionality: - snapshots can now also be xml files which are extracted from the solr index and stored as individual xml files in the snapshot directory along the pdf and jpg images - a transaction layer was placed above of the snapshot directory to distinguish snapshots into 'inventory' and 'archive'. This may be used to do transactions of index fragments using archived solr search results between peers. This is currently unfinished, we need a protocol to move snapshots from inventory to archive - the SNAPSHOT directory was renamed to snapshot and contains now two snapshot subdirectories: inventory and archive - snapshots may now be generated by everyone, not only such peers running on a server with xkhtml2pdf installed. The expert crawl starts provides the option for snapshots to everyone. PDF snapshots are now optional and the option is only shown if xkhtml2pdf is installed. - the snapshot api now provides the request for historised xml files, i.e. call: http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ The result of such xml files is identical with solr search results with only one hit. The pdf generation has been moved from the http loading process to the solr document storage process. This may slow down the process a lot and a different version of the process may be needed.	10 years ago
reger	ff18129def	ViewFile servlet: update index if newer, so viewed text and metadata (stored) info is similar - to archive it, use request with profile to allow indexing (defaultglobaltext) and update index (the resource is loaded, parsed anyway, so it's not a expensive operation) Request: remove 2 unused init parameter - number of anchors of the parent - forkfactor sum of anchors of all ancestors	10 years ago
Michael Peter Christen	e586e423aa	in case that loading from the cache fails, load from wkhtmltopdf without cache using the user agent string given in the crawl profile	10 years ago
Michael Peter Christen	d5bac64421	recognize more html file types for snapshots	10 years ago
Michael Peter Christen	25a64c51b3	moved snapshot generation out of the html handler to prevent that existing cache entries cause that the handler is not executed	10 years ago
reger	48aed15c48	skip loader wait cycle on concurrent access in nocache configuration. In nocache config resource is loaded online, leaving no benefit to wait for a faster cache hit.	10 years ago
Michael Peter Christen	1735dbc9d9	enhanced image search: bugfixes and performance enhancements	10 years ago
Michael Peter Christen	ebd0be2cea	fixes and speed updates for search process	10 years ago
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method that the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls, just just toNormalform(false) instead.	10 years ago
Michael Peter Christen	fb3dd56b02	fix for processing of noindex flag in http header	11 years ago
Michael Peter Christen	10cf8215bd	added crawl depth for failed documents	11 years ago
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into it's own queue. The primary rule for urls taken from any queue is, that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermorem the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such on-demand file reader shall prevent that the number of file pointers is over the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adopted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adopted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	11 years ago
Michael Peter Christen	6bd8c6f195	fix for wrong status codes of error pages	11 years ago
Michael Peter Christen	b08375da33	fix for bad/missing values of size_i	11 years ago
Michael Peter Christen	1a4a69c226	set more logger to 'final static'	11 years ago
Michael Peter Christen	91a875dff5	self-healing of mistakenly deactivated crawl profiles. This fixes a bug which can happen in rare cases when a crawl start and a cleanup process happen at the same time.	11 years ago
Michael Peter Christen	2602be8d1e	- removed ZURL data structure; removed also the ZURL data file - replaced load failure logging by information which is stored in Solr - fixed a bug with crawling of feeds: added must-match pattern application to feed urls to filter out such urls which shall not be in a wanted domain - delegatedURLs, which also used ZURLs are now temporary objects in memory	11 years ago
Michael Peter Christen	61c5e40687	- replaced the properties object in AnchorURL with distinct variables for anchor attributes. - this caused that large portions of the parser code had to be adopted as well - added a counter target_order_i for anchor links in webgraph computation	11 years ago
Michael Peter Christen	5e31bad711	- the webgraph shall store all links which appear on a web page and not all unique links! This made it necessary, that a large portion of the parser and link processing classes must be adopted to carry a different type of link collection which carry a property attribute which are attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI had been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring	11 years ago
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying to robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on per-crawl-start basis. Every crawl start can have a different user agent.	11 years ago
Roland Haeder	aaedc0405d	Fixes and avoid of catching bad exceptions (some): - Rewrote usage of HashMap/Map to concurrent versions (to avoid a CME=ConcurrentModificationException) - Rewrote ConnectionInfo (as an example) to use a synchronized iterator instead of synchronizing an already synced HashSet (see Collections call) - This avoids catching CMEs again - Commented out noisy ConcurrentLog.logException() call Conflicts: source/net/yacy/repository/LoaderDispatcher.java	11 years ago
Michael Peter Christen	89c0aa0e74	added collection_sxt to error documents	11 years ago
Michael Peter Christen	5878c1d599	- refactoring of log to ConcurrentLog: jdk-based logger tend to block at java.util.logging.Logger.log(Logger.java:476) in concurrent environments. This makes logging a main performance issue. To overcome this problem, this is a add-on to jdk logging to put log entries on a concurrent message queue and log the messages one by one using a separate process. - FTPClient uses the concurrent logging instead of the log4j logger	12 years ago
Michael Peter Christen	8f2d3ce2f9	reduced locking situation in crawler: shifted synchronized location and reduced time-out of robots.txt load limit	12 years ago
Michael Peter Christen	77faeada4d	small memory leak patch	12 years ago
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr field in the core 'webgraph'. The default schema uses only some of them and the resting search index has now the following properties: - webgraph size will have about 40 times as much entries as default index - the complete index size will increase and may be about the double size of current amount As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index will cause that some old parts in YaCy can be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also avaiable for p2p operation but client access is not yet implemented.	12 years ago
Michael Peter Christen	84f82541e8	search process enhancements	12 years ago
Michael Peter Christen	2d9e577ad0	replaced the custom robots.txt loader by the standard http loader	12 years ago
Michael Peter Christen	5f0ab25382	removed the option to prevent removal of & parts inside of the MultiProtocolURI during normalform computation because that should always be done and also be done during initialization of the MultiProtocolURI Object. The new normalform method takes only one argument which should be 'true' unless you know exactly what you are doing.	12 years ago
Michael Peter Christen	1533bfd63b	refactoring	12 years ago
Michael Peter Christen	00c1c777fa	refactoring	12 years ago
Michael Peter Christen	24d9db1613	snippet retrieval loading processes may use a smaller minimum load time value than crawling processes. This speeds up the search result preparation dramatically.	12 years ago
orbiter	0cbda0b2b8	- replaced all length() == 0 and size() == 0 with isEmpty() - replaced some length() > 0 and size() > 0 with !isEmpty() - cannot be done automatically - implemented some isEmpty() methods	13 years ago
Michael Peter Christen	c3db015410	prevent loading of content from the cache when retrieval with IFFRESH is used and cache is stale. Should speed up snippet generation when cache strategy is IFFRESH.	13 years ago
Michael Peter Christen	b0c408788b	made class methods static where possible	13 years ago
Michael Peter Christen	7c1ba99755	removed more unused method parameters	13 years ago
Michael Peter Christen	ea10766bfd	cleaned unnecessary nested code	13 years ago
Michael Peter Christen	1825f165b8	better integration of blacklist according to use case	13 years ago
Michael Peter Christen	03280fb161	removed segments-concept and the Segments class: the segments had been there to create a tenant-infrastructure but were never be used since that was all much too complex. There will be a replacement using a solr navigation using a segment field in the search index.	13 years ago
Michael Peter Christen	77f795756c	fixing redirects and status codes: storing of status code in ResponseHeader to make it available for late evaluations, like storage in solr.	13 years ago
Michael Peter Christen	7dc59979bc	fix for npe, possibly for http://bugs.yacy.net/view.php?id=195	13 years ago
Roland 'Quix0r' Haeder	edaa09b9b1	Rewrote all String blacklist types to enum 'BlacklistType', closes bug #143 Conflicts: htroot/Supporter.java htroot/yacy/crawlReceipt.java htroot/yacy/transferRWI.java htroot/yacy/transferURL.java source/de/anomic/crawler/CrawlStacker.java source/de/anomic/data/ListManager.java source/net/yacy/peers/Protocol.java source/net/yacy/repository/Blacklist.java source/net/yacy/repository/LoaderDispatcher.java source/net/yacy/search/Switchboard.java source/net/yacy/search/index/MetadataRepository.java source/net/yacy/search/index/Segment.java source/net/yacy/search/query/RWIProcess.java source/net/yacy/search/snippet/MediaSnippet.java	13 years ago
Michael Peter Christen	7e0ddbd275	added a "fromCache" flag in Response object to omit one cache.has() check during snippet generation. This should cause less blockings	13 years ago
Michael Peter Christen	659178942f	- Redesigned crawler and parser to accept embedded links from the NOLOAD queue and not from virtual documents generated by the parser. - The parser now generates nice description texts for NOLOAD entries which shall make it possible to find media content using the search index and not using the media prefetch algorithm during search (which was costly) - Removed the media-search prefetch process from image search	13 years ago
Michael Peter Christen	33d1062c79	refactoring: the cache belongs to the crawler	13 years ago
reger	a95f645a61	Bugfix class repository.Loaddispatcher fixed download file limit of 10000 line 355: final Response response = this.load(request, cachePolicy, 10000, true);	13 years ago
Michael Peter Christen	ef5192f8c9	using the generic document parser for crawl starts instead of the html parser. This makes it possible that every type of document can be a crawl start point, not only text documents or html documents. Testet this with a pdf document.	13 years ago
Michael Christen	216a287a85	Merge commit '6d4e08ed06c5cd28c45981b2ebe31c7f7ec6fd83' into quix0r Conflicts: source/de/anomic/crawler/CrawlQueues.java	13 years ago
Michael Christen	361146dd7a	better error handling for file loader	13 years ago
Roland 'Quix0r' Haeder	a3083d13bf	Blacklist checks are now always turned on, in media searches (e.g. image search) images matching blacklist entries are no longer shown to the user	13 years ago

1 2

100 Commits (1cfddea57835e068479b5ab73d9bda4388429ccf)