<p>When a web search is started on your local peer, it queries not only your own database, but also
the databases of other YaCy peers. Only those peers that should hold the relevant index are queried.</p><br><br>
</td></tr><tr><td valign="top"><b>Caching HTTP and transparent HTTPS Proxy with page indexing</b></td><td>
With optional pre-fetching. HTTP 1.1 with GET/HEAD/POST/CONNECT is supported. This is sufficient for nearly all public web pages.
HTTP headers are transparently forwarded. HTTPS connections through target port 443 are transparently forwarded, non-443 connections are suppressed to enhance security. Both (HTTP and HTTPS) proxies share the same proxy port, which is by default port 8080.
The proxy 'scrapes' the content that it passes and creates an index that can be shared among all YaCy peers.
You can use the indexing feature for intranet indexing:
you instantly have a search service at hand to index all intranet-served web pages.
You don't need to set up a separate search service. And the <a href="http://www.anomic.de/AnomicPlasma/index.html">PLASMA</a>
indexing it uses is not a naive quick hack but a <a href="Technology.html">properly engineered and extremely fast algorithm</a>;
it is capable of indexing a nearly unlimited number of pages, without slowing down the search process.
<p>YaCy is not only a distributed search engine, but also a caching HTTP proxy.
Both application parts benefit from each other.</p>
<h3>Can I Crawl The Web With YaCy?</h3>
<p>Yes! You can start your own crawl and you may also trigger distributed crawling,
which means that your own YaCy peer asks other peers to perform specific crawl tasks.
You can specify many parameters that focus your crawl on a limited set of web pages.</p>
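The depth-limited crawl described above can be sketched as a breadth-first traversal with a depth bound. This is an illustrative sketch, not YaCy's actual crawler code; the `linksOf` extractor is a stand-in assumption for "fetch the page and parse out its links".

```java
import java.util.*;

// Sketch of a depth-limited breadth-first crawl, as described above.
// Not YaCy's real crawler: fetching/parsing is abstracted into linksOf().
class CrawlSketch {
    // Stand-in for "fetch the page and extract its links" -- an assumption,
    // replace with a real fetcher/parser.
    static List<String> linksOf(String url) {
        return Collections.emptyList();
    }

    // Breadth-first crawl; pages are visited up to maxDepth and links are
    // only expanded while the depth bound has not been reached.
    static Set<String> crawl(String startUrl, int maxDepth) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<Map.Entry<String, Integer>> queue = new ArrayDeque<>();
        queue.add(new AbstractMap.SimpleEntry<>(startUrl, 0));
        while (!queue.isEmpty()) {
            Map.Entry<String, Integer> e = queue.poll();
            // skip already-seen URLs; stop expanding at the depth limit
            if (!visited.add(e.getKey()) || e.getValue() >= maxDepth) continue;
            for (String link : linksOf(e.getKey())) {
                queue.add(new AbstractMap.SimpleEntry<>(link, e.getValue() + 1));
            }
        }
        return visited;
    }
}
```

Further crawl parameters (URL filters, blacklists, per-host limits) would plug in as extra conditions before a link is queued.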
<h3>What do you mean with 'Global Search Engine'?</h3>
<p>The integrated indexing and search service can not only be used locally, but also <i>globally</i>.
Each YaCy peer distributes some contact information to all other peers that can be reached,
and peers exchange <i>but do not copy</i> their indexes to each other.
This is done in such a way that each <i>peer</i> knows how to address the correct other
<i>peer</i> to retrieve a specific search index.
Therefore the community of all peers spawns a <i>distributed hash table</i> (DHT)
which is used to share the <i>reverse word index</i> (RWI) to all operators and users of the YaCy peers.
The applied logic of distribution and retrieval of RWIs on the DHT combines all participating peers into
a <i>Distributed Search Engine</i>.
To point out that this is in contrast to local indexing and searching,
we call it a <i>Global Search Engine</i>.
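The routing idea behind the DHT can be sketched as follows: a search word is hashed, and the peer whose own identifier hashes closest to the word hash is the one responsible for that word's reverse word index. The hash function and distance metric below are illustrative assumptions, not YaCy's actual scheme.

```java
import java.util.List;

// Sketch of DHT-style routing: the peer whose hash is numerically closest
// to a word's hash holds that word's reverse word index (RWI).
// The hash and distance used here are illustrative, not YaCy's own.
class DhtSketch {
    static long hash(String s) {
        long h = 1125899906842597L;        // simple polynomial string hash
        for (char c : s.toCharArray()) h = 31 * h + c;
        return h & Long.MAX_VALUE;         // keep it non-negative
    }

    // Choose the responsible peer: minimal |hash(peer) - hash(word)|.
    static String responsiblePeer(String word, List<String> peerIds) {
        long target = hash(word);
        String best = null;
        long bestDist = Long.MAX_VALUE;
        for (String peer : peerIds) {
            long dist = Math.abs(hash(peer) - target);
            if (dist < bestDist) { bestDist = dist; best = peer; }
        }
        return best;
    }
}
```

Because every peer computes the same mapping, a query can be sent <i>directly</i> to the responsible peer without any peer-hopping.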
Junior peers can contribute to the network by submitting index files to senior/principal peers.
If you run YaCy, the index needs on average about the same amount of disk space as the cache.
In fact, the global index may reach terabytes in size, but not all of that on your machine!</p>
<h3>Search Engines must do crawling, don't they? Do you?</h3>
<p>No. They <i>can</i>, but we collect information simply by using the information that passes through the proxy.
If you <i>want</i> to crawl, you can do so and start your own crawl job with a certain search depth.</p>
<h3>Do I need a fast machine? Search Engines need big server farms, don't they?</h3>
<p>You don't need a fast machine to run YaCy, and you don't need a lot of disk space either:
you can configure how many megabytes you want to spend on the cache.
Any time-critical task is delayed automatically and takes place
while you surf idly (which works only if you use YaCy as an HTTP proxy).
Whenever internet pages pass through YaCy in proxy mode,
any indexing (or, if wanted, prefetch crawling) is interrupted and delayed.
YaCy can also run on a vServer.
</p>
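The idle-time scheduling described above can be sketched as a queue of indexing jobs that is only drained when no proxy request has been seen for a grace period. This is a minimal sketch, not YaCy's scheduler; the 2-second threshold is an illustrative assumption.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of idle-time scheduling: indexing jobs are queued while the
// proxy is busy and only run once no request has arrived for a grace
// period. The 2-second threshold is an illustrative assumption.
class IdleScheduler {
    private final Queue<Runnable> pending = new ArrayDeque<>();
    private long lastRequestMillis = 0;
    private static final long IDLE_AFTER_MS = 2000;

    // Proxy traffic resets the idle clock, so indexing never competes
    // with browsing.
    void onProxyRequest(long nowMillis) { lastRequestMillis = nowMillis; }

    void submit(Runnable indexingJob) { pending.add(indexingJob); }

    // Called periodically; runs queued jobs only when the proxy is idle.
    void tick(long nowMillis) {
        if (nowMillis - lastRequestMillis < IDLE_AFTER_MS) return;
        Runnable job;
        while ((job = pending.poll()) != null) job.run();
    }

    int pendingCount() { return pending.size(); }
}
```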
<h3>I don't want to wait for search results very long. How long does a search take?</h3>
Therefore we don't need to give junior peers low priority: they contribute equally.
But enough senior peers are needed to make this architecture work. Since every peer contributes almost equally, either actively or passively, you should run in Senior Mode if you can.
</p>
<h3>Why is this Search Engine also a Proxy?</h3>
<p>
We wanted to avoid a situation where you start the search service only for the short moment in which you submit a search query: that would give the search engine too little online time. So we looked for a reason for you to keep the search engine running during all the time that you are online. The added value of a caching proxy provides that reason. The built-in blacklist for the proxy (a URL filter, useful e.g. to block ads) is a further increase in value.
</p>
<h3>Why is this Proxy also a Search Engine?</h3>
<p>YaCy has a built-in <i>caching</i> proxy, which means that YaCy has a lot of indexing information
'for free' without crawling. This may not be a very usual function of a proxy, but a very useful one:
you see a lot of information when you browse the internet and maybe you would like to search exactly
only what you have seen. Besides this interesting feature, you can use YaCy to index an intranet
simply by using the proxy; you don't need to set up an additional search/indexing process or database.
YaCy gives you an 'instant' database and an 'instant' search service.</p>
<h3>My YaCy says it runs in 'Junior Mode'. How can I run it in Senior Mode?</h3>
<p>Open your firewall for port 8080 (or the port you configured) or program your router to act as a <i>virtual server</i>.</p>
<p>YaCy is written entirely in Java (Java 2 / version 1.2 and up). Any system that supports Java 2 can run YaCy, which means it runs on almost all commercial and free platforms and operating systems that are around, including of course Mac OS X, Windows (NT, W2K, XP) and Linux. For Java support on your platform, please see the <a href="Installation.html">installation</a> documentation.</p>
YaCy runs on Windows and comes with an easy-to-use installer application. Please download the Windows release flavour of YaCy instead of the generic one.
You can start YaCy in a console and monitor its actions through a log file. A wrapper shell script for easy startup is included. You can administer YaCy remotely through the built-in HTTP server with any browser.
<p>Before a huge number of web pages can be searched efficiently, the pages must be <i>indexed</i>.
This is a very demanding process which runs inside YaCy without any user action.
After indexing, a single YaCy installation is able to provide search results
from more than 10 million web pages efficiently.</p>
<p>YaCy consists mainly of four parts: the <b>p2p index exchange</b> protocol, based on HTTP; a <b>spider/indexer</b>; a <b>caching HTTP proxy</b>, which is not only a simple <i>increase in value</i> but also an <i>information provider</i> for the indexing engine; and the built-in <b>database engine</b>, which makes installation and maintenance of YaCy very easy.</p>
<tr><td width="30%" valign="top"><b>Transparent HTTP and HTTPS Proxy and Caching:</b></td><td width="70%">
The proxy implementation provides fast content-passing: every file that the proxy reads from the targeted server is streamed directly to the accessing client while the stream is copied into a RAM cache for later processing. This ensures that the proxy mode is extremely fast and does not interrupt browsing. Whenever the proxy idles, it processes its RAM cache to perform indexing and to store the files into a local cache. Every HTTP header that was passed along with a file is stored in a database and is re-used later when a cache hit occurs. The proxy function has maximum priority over other tasks, such as cache management or indexing.
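The content-passing described above, streaming to the client while keeping a RAM copy, can be sketched as a tee over the origin stream. A minimal sketch, not YaCy's proxy code:

```java
import java.io.*;

// Sketch of the proxy's content-passing: bytes read from the origin
// server are forwarded to the client immediately while a copy is kept
// in a RAM buffer for later indexing. A sketch, not YaCy's proxy code.
class TeePass {
    // Streams 'origin' to 'client'; returns the RAM copy for indexing.
    static byte[] passAndCache(InputStream origin, OutputStream client) {
        ByteArrayOutputStream ramCache = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try {
            int n;
            while ((n = origin.read(buf)) != -1) {
                client.write(buf, 0, n);   // client gets the data right away
                ramCache.write(buf, 0, n); // copy kept for later indexing
            }
            client.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ramCache.toByteArray();
    }
}
```

Because the client write happens inside the read loop, the browser never waits for caching or indexing to finish.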
We implemented a file-based AVL tree upon a random-access file. Tree nodes can be dynamically allocated and de-allocated, and an unused-node list is maintained. The PLASMA search algorithm needs ordered access to search results, so we needed an indexing mechanism that stores the index in an ordered way. The database supports such access, and the resulting database tables are stored as a single file. The database does not need any set-up or maintenance tasks that must be done by an administrator: it is completely self-organizing. The AVL property ensures maximum performance in terms of algorithmic order. Any database may grow to an unthinkable number of records: with one billion records, a database request needs a theoretical maximum of only 44 comparisons.
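The "44 comparisons for one billion records" figure follows from the classic AVL height bound of roughly 1.44 · log2(n). A quick check of the arithmetic:

```java
// Quick check of the claim above: an AVL tree's height, and so the
// worst-case number of comparisons per lookup, is bounded by about
// 1.44 * log2(n) for n records.
class AvlBound {
    static int maxComparisons(long records) {
        double log2n = Math.log(records) / Math.log(2);
        return (int) Math.ceil(1.44 * log2n);
    }
}
```

For n = 10^9, log2(n) ≈ 29.9, and 1.44 · 29.9 ≈ 43.1, which rounds up to the 44 comparisons quoted above.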
The page indexing is done by the creation of a 'reverse word index': every page is parsed, the words are extracted, and for every word a database table is maintained. The database tables are held in a file-based hash table, so accessing a word index is extremely fast, resulting in an extremely fast search. Conjunctions of search words are easily found, because the search results for each word are ordered and can be enumerated pairwise. In terms of computability: the access effort on the word index for a single word is O(log w), where w is the number of words in the database. It is constant-fast, since the data structure provides a 'pre-calculated' result. This means the result speed is <i>independent</i> of the number of indexed pages! It only slows down for page ranking, and is multiplied by the number of words that are searched simultaneously. That means the search effort for n words is O(n * log w). You can't do better (consider that n is always small, since you rarely search for more than 10 words).
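The pairwise enumeration of ordered per-word results mentioned above is the classic sorted-list intersection. A sketch, assuming each word's result list (here, page identifiers) is already sorted:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of conjunctive search over a reverse word index: because each
// word's result list is kept ordered, two lists can be intersected in
// one pairwise pass over them, independent of the total index size.
class Conjunction {
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; } // page has both words
            else if (cmp < 0) i++;                         // advance the smaller side
            else j++;
        }
        return out;
    }
}
```

For more search words, the result is folded pairwise: intersect the first two lists, then intersect that result with the third, and so on.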
This technology is the driving force behind the YaCy implementation. A DHT (Distributed Hash Table)-like technique is used to publish the word cache. The idea is that word indexes travel along the peers <i>before</i> a search request arrives at a specific word index. A search for a specific word is performed by computing the responsible peer and pointing <i>directly</i> to the peer that hosts the index. No peer-hopping or the like, since search requests are time-critical (the user usually does not want to wait long). Redundancy must be implemented as well, to cope with the (frequent) disappearance of peers. Privacy is ensured, since no peer can know which word index is stored, updated or passed: word indexes are stored under a word hash, not the word itself. Search misuse is regulated by the p2p law of give-and-take: every peer must contribute to the crawl/proxy-and-index process before it is allowed to search.
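Storing and requesting indexes under a word hash rather than the word itself, as described above, can be sketched like this. The hash choice (SHA-1, truncated to a short handle) is an illustrative assumption, not YaCy's actual word-hash scheme:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Sketch of the privacy idea above: a peer stores and requests word
// indexes under an opaque hash, never the plain word. SHA-1 truncated
// to 12 characters is illustrative, not YaCy's actual scheme.
class WordHash {
    static String hashOf(String word) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(
                word.toLowerCase().getBytes(StandardCharsets.UTF_8));
            // A short URL-safe handle is enough to address the index.
            return Base64.getUrlEncoder().withoutPadding()
                         .encodeToString(digest).substring(0, 12);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A peer that receives or stores such a handle cannot recover the word from it directly, which is the privacy property the text describes.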
</td></tr>
</table></p>
<p><h3>Privacy</h3>
Sharing the index with other users may raise concerns about your privacy. We have made great efforts to protect and secure your privacy:
<p>The following persons are involved (alphabetical order):
<ul>
<li><b>Michael Christen</b> is project founder; designed and implemented the overall architecture, is chief software architect, release management, kelondro database, yacy core protocol, indexing technique and database structure, search and ranking functionality, http client/server architecture, admin of yacy.net</li>
<li><b>Natali Christen</b> designed the YaCy logo.</li>
<li><b>Stephan Hermens</b> has made some important bugfixes.</li>
<li><b>Matthias Kempka</b> provided a Linux init start/stop script.</li>
<li><b>Timo Leise</b> suggested and implemented an extension to the blacklist feature: part-of-domain matching.</li>
and manager of the meta-search-engine <a href="http://www.metager.de">metaGer</a></li>
<li><b>Matthias Söhnholz</b> added the offline-browsing feature</li>
<li><b>slick</b> helps as packager (.rpm, .deb etc)</li>
<li><b>Martin Thelian</b> made system-wide performance enhancement by introducing thread pools; he added ICAP and SOAP support, most of external parser integration, maintains the http protocol implementation, added squid compatibility, robots protocol, better logging and many index protocol, import/export and transfer enhancements. He created a YaCy screensaver and coded major parts of the yacybar Firefox extension.</li>
<li><b>Oliver Wunder</b> provided some German translation. He also made BitTorrent releases.</li>