<p>YACY consists mainly of four parts: the <b>p2p index exchange</b> protocol, based on http; a <b>spider/indexer</b>; a <b>caching http proxy</b> which is not only a simple <i>increase in value</i> but also an <i>informtaion provider</i> for the indexing engine and the built-in <b>database engine</b> which makes installation and maintenance of yacy very easy.</p>
<tr><tdwidth="30%"valign="top"><b>Transparent HTTP and HTTPS Proxy and Caching:</b></td><tdwidth="70%">
The proxy implementation provides a fast content-passing, since every file that the proxy reads from the targeted server is streamed directly to the accessing client while the stream is copied to a RAM cache for later processing. This ensures that the proxy mode is extremely fast and does not interrupt browsing. Whenever the Proxy idles, it processes it's RAM cache to perform indexing and storage to a local file of the cache. Every HTTP header that was passed along with the file is stored in a database and is re-used later on when a cache hit appears. The proxy function has maximum priority above other tasks, like cache management or indexing functions.
We implemented a file-based AVL tree upon a random-access-file. Tree nodes can be dynamically allocated and de-allocated and an unused-node list is maintained. For the PLASMA search algorithm, an ordered access to search results are necessary, therefore we needed an indexing mechanism which stores the index in an ordered way. The database supports such access, and the resulting database tables are stored as a single file. The database does not need any set-up or maintenance tasks that must done by an administrator. It is completely self-organizing. The AVL property ensures maximum performance in terms of algorithmic order. Any database may grow to an unthinkable number of records: with one billion records a database request needs a theoretical maximum number of only 44 comparisments.
The page indexing is done by the creation of a 'reverse word index': every page is parsed, the words are extracted and for every word a database table is maintained. The database tables are held in a file-based hash-table, so accessing a word index is extremely fast, resulting in an extremely fast search. Conjunctions of search words are easily found, because the search results for each word is ordered and can be pairwise enumerated. In terms of computability: the order of the searched access efford to the word index for a single word is O(log <number of words in database>). It is always constant fast, since the data structure provides a 'pre-calculated' result. This means, the result speed is <i>independent</i> from the number of indexed pages! It only slows down for a page-ranking, and is multiplied by the number of words that are searched simultanously. That means, the search efford for n words is O(n * log w). You can't do better (consider that n is always small, since you rarely search for more that 10 words).
This technology is the driving force behind the YACY implementation. A DHT (Distributed Hash Table) - like technique will be used to publish the word cache. The idea is, that word indexes travel along the peers <i>before</i> a search request arrives at a specific word index. A search for a specific word would be performed by computing the peer and point <i>directly</i> to the peer, that hosts the index. No peer-hopping or such, since search requests are time-critical (the user usually does not want to wait long). Redundancy must be implemented as well, to catch up the (often) occasions of disappearing peers. Privacy is ensured, since no peer can know which word index is stored, updated or passed since word indexes are stored under a word hash, not the word itself. Search mis-use is regulated by the p2p-laws of give-and-take: every peer must contribute in the crawl/proxy-and-index - process before it is allowed to search.
</td></tr>
</table></p>
<p><h3>Privacy</h3>
Sharing the index to other users may concern you about your privacy. We have made great efforts to keep and secure your privacy: