<p>We wanted to avoid a situation where you start a search service only for the brief moment when you submit a search query; that would give the Search Engine too little online time. So we looked for a reason for you to run the Search Engine during all the time that you are online. The additional value of a caching proxy provides that reason. The built-in blacklist for the proxy (a URL filter, useful e.g. to block ads) adds further value.</p>
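<p>As a rough sketch of how such a proxy blacklist can work (the patterns and function names here are illustrative assumptions, not YaCy's actual filter syntax):</p>

```python
import re

# Hypothetical blacklist patterns; YaCy's real filter format may differ.
BLACKLIST = [
    r".*\.doubleclick\.net/.*",   # block a known ad server
    r"ads\..*",                   # block any host starting with "ads."
]
COMPILED = [re.compile(p) for p in BLACKLIST]

def is_blocked(host_and_path: str) -> bool:
    """Return True if the URL (host/path) matches any blacklist pattern."""
    return any(p.fullmatch(host_and_path) for p in COMPILED)

print(is_blocked("ads.example.com/banner.gif"))  # True
print(is_blocked("example.com/index.html"))      # False
```

A request that matches is answered with a block page instead of being fetched, so ads never reach the browser or the cache.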
<p>Yes! You can start your own crawl and you may also trigger distributed crawling, which means that your own YaCy peer asks other peers to perform specific crawl tasks. You can specify many parameters that focus your crawl to a limited set of web pages.</p>
which is used to share the <i>reverse word index</i> (RWI) among all operators and users of the proxies.
The logic applied to distribute and retrieve RWIs on the DHT combines all participating proxies into
a <i>Distributed Search Engine</i>.
To emphasize the contrast with local indexing and searching,
we call it a <i>Global Search Engine</i>.
</p>
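<p>The core idea of a DHT can be sketched in a few lines: each word is hashed, and the hash determines which peer is responsible for that word's index entries. This is only an illustrative sketch under assumed names; YaCy's real word-hash scheme and peer-selection rule differ.</p>

```python
import hashlib

def word_hash(word: str) -> int:
    """One-way hash of a word, taken as a 64-bit integer (illustrative)."""
    return int.from_bytes(hashlib.sha1(word.lower().encode("utf-8")).digest()[:8], "big")

def responsible_peer(word: str, peer_ids: list) -> int:
    """Pick the peer whose id lies numerically closest to the word's hash."""
    h = word_hash(word)
    return min(peer_ids, key=lambda pid: abs(pid - h))

# Hypothetical peer ids spread over the hash space:
peers = [2**60, 2**61, 2**62, 2**63]
print(responsible_peer("search", peers))
```

Every peer applies the same rule, so anyone can compute which peer holds the index for a given word without asking a central server.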
<h3>Is there a central server? Does the search engine network need one?</h3>
<p>No. The network architecture does not need a central server, and there is none.
In fact there is a root server that acts as the 'first' peer, but every other peer has the same rights and performs the same tasks.
We still distinguish three different <i>classes</i> of peers:
<ul>
<li><i>junior</i> peers are peers that cannot be reached from the internet because of routing problems or firewall settings;</li>
<li><i>senior</i> peers can be accessed by other peers and</li>
<li><i>principal</i> peers are like senior peers but can additionally upload network bootstrap information to ftp/http sites; this is necessary for network bootstrapping.</li>
</ul>
Junior peers can contribute to the network by submitting index files to senior/principal peers without being asked. (This function is currently very limited.)
In contrast, the proxy does <i>caching</i> which means that repeated loading of known pages is avoided and this possibly
speeds up your internet connection. Index sharing creates some traffic, but is only performed during idle time of the proxy and of your internet usage.</p>
<p>No, it won't, because indexing is only performed while the proxy is idle. This shifts the computing work to the moments when you are reading pages and do not need the computing time yourself. Indexing is stopped automatically the next time you retrieve web pages through the proxy.</p>
<p>You don't need a fast machine to run YaCy, and you don't need a lot of disk space either. You can configure the number of megabytes you want to spend on the cache and the index. Any time-critical task is delayed automatically and takes place while you are idle. Whenever internet pages pass the proxy, any indexing (or, if enabled, prefetch crawling) is interrupted and delayed. The root server runs on a simple 500 MHz/20 GB Linux system. You don't need more.</p>
<p>No. Any file that passes the proxy is <i>streamed</i> through the filter and caching process. At a certain point the information stream is duplicated; one copy is streamed to your browser, the other into the cache. The files that pass the proxy are not delayed, because they are <i>not</i> first stored and then passed to you, but streamed to you at the same time as they are streamed into the cache. Your browser can therefore render the page while it loads, just as it would without the proxy.</p>
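<p>The duplication point can be pictured as a "tee" stream: every chunk is written to both destinations the moment it arrives. This is a minimal sketch of the idea, not YaCy's actual proxy code:</p>

```python
import io

class TeeStream:
    """Write-through stream: each chunk goes to both destinations at once,
    so the browser is never delayed by the cache (illustrative sketch)."""

    def __init__(self, browser, cache):
        self.browser = browser
        self.cache = cache

    def write(self, chunk: bytes) -> None:
        self.browser.write(chunk)  # forward to the client immediately
        self.cache.write(chunk)    # duplicate the same bytes into the cache

browser, cache = io.BytesIO(), io.BytesIO()
tee = TeeStream(browser, cache)
for chunk in (b"<html>", b"<body>hello</body>", b"</html>"):
    tee.write(chunk)
print(browser.getvalue() == cache.getvalue())  # True
```

Because no chunk is ever buffered before being forwarded, the client sees the same timing as a direct connection.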
We have a better way to stay up-to-date: the browsing results of all the people who run YaCy.
Many people look at news pages every day, and because those pages pass through the proxy, the latest news also arrives in the distributed search engine. This may well happen faster than it does with a conventional crawling search engine.</p>
<h3>I don't want to wait long for search results. How long does a search take?</h3>
<p>Our architecture does not do peer-hopping, and we do not use a TTL (time to live). We expect search results to be returned to the requester almost <i>instantly</i>. This is possible because the index-owning peer can be asked <i>directly</i>, using DHTs (distributed hash tables). Because we need some redundancy to compensate for missing peers, we ask several peers simultaneously. To collect their responses, we wait a short time of at most 10 seconds. The user may configure a search time other than 10 seconds, but this is our target for the <i>maximum</i> search time.</p>
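<p>Querying several peers at once and collecting whatever arrives within the deadline can be sketched with a thread pool. The peer names and the <code>query_peer</code> function are hypothetical stand-ins for real network requests:</p>

```python
import concurrent.futures
import time

def query_peer(peer: str, term: str) -> list:
    """Hypothetical remote query; a real peer would be asked over the network."""
    time.sleep(0.01)  # simulated network latency
    return [f"{peer}:result-for-{term}"]

def distributed_search(peers, term, timeout=10.0):
    """Ask all candidate peers simultaneously; collect answers until the deadline."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(peers)) as pool:
        futures = [pool.submit(query_peer, p, term) for p in peers]
        for f in concurrent.futures.as_completed(futures, timeout=timeout):
            results.extend(f.result())
    return results

print(distributed_search(["peer-a", "peer-b", "peer-c"], "yacy"))
```

Peers that answer within the window contribute results; a silent peer simply drops out, which is why redundancy matters.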
<p>Don't be scared. Our architecture hides your private browsing profile from others. For example: none of the words indexed from the pages you have seen is stored in clear text on your computer. Instead, a hash is used, which cannot be computed back into the original word. Because index files travel among peers, no one can tell whether a specific link was visited by you or by another peer's user, which frees you from responsibility for the index files on your machine.</p>
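<p>The principle looks like this in miniature (a sketch only; YaCy's real word-hash format and length are different):</p>

```python
import hashlib

def index_word_hash(word: str) -> str:
    """One-way hash of an indexed word. The same word always yields the same
    hash, so the index stays searchable, but the stored string does not
    reveal the word itself. (Illustrative; not YaCy's actual hash.)"""
    return hashlib.sha1(word.lower().encode("utf-8")).hexdigest()[:12]

print(index_word_hash("privacy"))
```

A search query is hashed the same way, so lookups compare hashes with hashes and the clear-text word never needs to be stored.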
<p>The database stores either tables or property lists in files structured as AVL trees (height-balanced binary trees). Such a search tree guarantees logarithmic computation time. For example, a search within an AVL tree with one million entries needs about 20 comparisons on average (log2 of one million), and no more than roughly 1.44 · log2(n), about 28, in the worst case. This database is therefore extremely fast. It lacks an API like SQL or the LDAP protocol, but it does not need one because it provides a highly specialized database structure. The missing interface pays off in very small organizational overhead, which improves speed further in comparison to databases with SQL or LDAP APIs. This database is fast enough for millions of indexed web pages, maybe also for billions.</p>
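<p>You can see the logarithmic behavior directly by counting comparisons in a binary search over one million sorted keys; a balanced-tree lookup behaves the same way. This is a demonstration of the complexity argument, not YaCy's database code:</p>

```python
def search_with_count(keys, target):
    """Binary search that also counts key comparisons."""
    lo, hi, comparisons = 0, len(keys) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if keys[mid] == target:
            return mid, comparisons
        elif keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, comparisons

keys = list(range(1_000_000))
_, steps = search_with_count(keys, 123_456)
print(steps)  # at most about 20, i.e. roughly log2(1,000,000)
```

Doubling the table size adds only one more comparison, which is why the structure scales to very large indexes.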
<p>The database structure we need is very special. One requirement is that entries can be retrieved in logarithmic time <i>and</i> can be enumerated in any order. Enumeration in a specific order is needed to compute conjunctions of tables very quickly, which is what happens when someone searches for several words. We implement the search-word conjunction by pairwise and simultaneous enumeration and comparison of index trees/sequences. This leads us to binary trees as the data structure. Another requirement is the ability to hold many index tables, maybe <i>millions of tables</i>. The tables need not be large on average, but we need many of them. This is in contrast to the organization of relational databases, where the focus is on managing very large tables, not on managing very many of them. A third requirement is ease of installation and maintenance: the user shall not be forced to install an RDBMS first and care about tablespaces and the like. The integrated database is completely service-free.</p>
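<p>The pairwise simultaneous enumeration mentioned above is the classic merge-style intersection of two sorted sequences. A minimal sketch, with made-up URL ids standing in for index entries:</p>

```python
def intersect_sorted(a, b):
    """Intersect two sorted index sequences by advancing the pointer with the
    smaller key and keeping matches. Runs in O(len(a) + len(b))."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

# Hypothetical URL ids containing word A and word B, already sorted:
word_a = [3, 7, 11, 19, 42]
word_b = [2, 7, 19, 23, 42, 99]
print(intersect_sorted(word_a, word_b))  # [7, 19, 42]
```

Because both sequences are enumerated in the same order, a multi-word query costs a single linear pass over the involved indexes rather than repeated lookups.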
If your peer is in Senior Mode, it is an access point for index sharing and distribution: it can be contacted for search requests and it collects index files from other peers. If your peer is in Junior Mode, it creates index entries from your browsing and distributes them to Senior peers, but it does not collect index files from other peers.
<p>Some p2p-based file-sharing software assigns non-contributing peers a very low priority. We think this is not always fair, since sometimes the operator does not have the option of opening the firewall or configuring the router accordingly. Our idea of 'information wares' and their exchange can also be applied to junior peers: they must contribute to the global index by submitting their index <i>actively</i>, while senior peers contribute <i>passively</i>.
But enough senior peers are needed to make this architecture functional. Since any peer contributes almost equally, either actively or passively, you should decide to run in Senior Mode if you can.
If you want to add your own code, you are welcome to; but please contact the author first and discuss your idea to see how it fits into the overall architecture.