documentation update

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2439 6c8d7289-2bf4-0310-a012-ef5d649a1542
pull/1/head
orbiter 19 years ago
parent c2264962d1
commit 57d50df858

@ -48,67 +48,28 @@ globalheader();
</NOSCRIPT>
<!-- ----- HERE STARTS CONTENT PART ----- -->
<h2>Technology Details</h2>
<h2>Details</h2>
<p><img width="480" src="grafics/YaCy_Technology_Methods.png" align="center"></p>
<p>YaCy includes a large number of document parsers.</p><br><br>
YaCy supports the following features:<br>
<table border="0" cellspacing="1" cellpadding="3" width="100%">
<p><img width="480" src="grafics/YaCy_Technology_IndexDistribution.png" align="center"></p>
<p>When YaCy has indexed a number of web pages, it starts to distribute parts of the web index
to other YaCy peers. As a result, the indexes that YaCy users generate are mixed with each other.
Each index is stored on a peer where it can be found again efficiently.</p><br><br>
</td></tr><tr><td valign="top"><b>p2p-Based Global Search Engine</b></td><td>
YaCy has an index-sharing p2p-based algorithm which creates a global distributed search engine.
This spawns a world-wide global search index.
<p><img width="480" src="grafics/YaCy_Technology_IndexSearch.png" align="center"></p>
<p>When you run a web search on your local peer, it searches not only its own database but also
the databases of other YaCy peers. Only those peers that should hold the relevant index are queried.</p><br><br>
</td></tr><tr><td valign="top"><b>Caching HTTP and transparent HTTPS Proxy with page indexing</b></td><td>
With optional pre-fetching. HTTP 1.1 with GET/HEAD/POST/CONNECT is supported. This is sufficient for nearly all public web pages.
HTTP headers are transparently forwarded. HTTPS connections through target port 443 are transparently forwarded, non-443 connections are suppressed to enhance security. Both (HTTP and HTTPS) proxies share the same proxy port, which is by default port 8080.
The proxy 'scrapes' the content that passes through it and creates an index that can be shared among all YaCy proxy daemons.
You can use the indexing feature for intranet indexing:
you instantly have a search service at hand to index all intranet-served web pages.
You don't need to set up a separate search service. The <a href="http://www.anomic.de/AnomicPlasma/index.html">PLASMA</a>
indexing it uses is not a naive quick hack but a <a href="Technology.html">properly engineered and extremely fast algorithm</a>;
it is capable of indexing a nearly unlimited number of pages without slowing down the search process.
<p><img width="480" src="grafics/YaCy_Technology_Workflow.png" align="center"></p>
<p>The workflow inside YaCy. No user action is required to steer the workflow,
but the user interface offers many monitoring pages that show status information about the queues and stacks.</p>
</td></tr><tr><td valign="top"><b>Privacy</b></td><td>
YaCy protects your privacy. Please see the
<a href="Technology.html">privacy section in the documentation</a>.
</td></tr><tr><td valign="top"><b>Security</b></td><td>
YaCy can block unwanted access by setting IP filters and http passwords.
You can also enhance security by inspecting the source code, which is completely included.
Check the code and re-build your own YaCy application.
</td></tr><tr><td valign="top"><b>Web/HTTP server</b></td><td>
The built-in HTTP server is the interface to the local and global search service;
the server may not only be used to administrate your peer, but also to serve as an intranet/internet web server.
</td></tr><tr><td valign="top"><b>Ideal Internet Cafe Proxy Solution</b></td><td>
Every Internet Cafe needs a caching proxy, rather than only a NAT, to route the cafe's client traffic and make the most of the available bandwidth.
This can only be done using a <i>caching</i> proxy. This is naturally provided by YaCy. Future versions may also include
billing support functions.
</td></tr><tr><td valign="top"><b>Terminal-Based</b></td><td>
YaCy does not need to have a window-based environment and can run on a screen-less router;
it has a user interface based on web pages using its own http server.
</td></tr><tr><td valign="top"><b>Open-Source</b></td><td>
This is a simple necessity for an application that implements a server.
Don't use any other server software that does not come with the source code.
<a href="Volunteers.html">Volunteers</a> to extend YaCy are welcome!
If you think you have a great idea how to extend/enhance/fix YaCy,
please let me know.
</td></tr><tr><td valign="top"><b>Easy Installation</b></td><td>
You just need to decompress the release container with your favourite decompressor
(zip, rar, sit, tar etc. will do) and double-click the application wrapper
for your OS. No restart necessary.
Just double-click the application wrapper.
<tr><td width="30%" valign="top"><b>Licence Model</b></td><td width="70%">
This is GPL-based freeware/open-source software!
The release comes with complete source code.
See <a href="License.html">the license</a> for details.
</td></tr></table>
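<p>The shared proxy port described in the feature list can be illustrated from the client side. The following is a minimal sketch, assuming a YaCy peer is running on the same machine with its default port 8080; the class name <code>YaCyProxyClient</code>, the helper <code>localProxy</code> and the target URL are hypothetical and not part of YaCy. It relies only on the standard <code>java.net.Proxy</code> API.</p>

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

// Hypothetical client for illustration; not part of the YaCy distribution.
public final class YaCyProxyClient {

    /** Describes an HTTP proxy on this machine at the given port. */
    public static Proxy localProxy(int port) {
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress("localhost", port));
    }

    public static void main(String[] args) {
        // 8080 is the default proxy port named in the text; adjust if reconfigured.
        Proxy proxy = localProxy(8080);
        try {
            HttpURLConnection connection = (HttpURLConnection)
                    URI.create("http://example.org/").toURL().openConnection(proxy);
            connection.setRequestMethod("HEAD");
            System.out.println("Status via proxy: " + connection.getResponseCode());
            connection.disconnect();
        } catch (IOException e) {
            // Reached when nothing is listening on that port.
            System.out.println("Proxy not reachable: " + e.getMessage());
        }
    }
}
```

<p>A browser would normally be pointed at the same host and port through its proxy settings; the programmatic form only makes the port assumption explicit.</p>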
<!-- ----- HERE ENDS CONTENT PART ----- -->
<SCRIPT LANGUAGE="JavaScript1.1"><!--

@ -49,9 +49,6 @@ globalheader();
<h2>FAQ</h2>
<p>YaCy is not only a distributed search engine, but also a caching HTTP proxy.
Both application parts benefit from each other.</p>
<h3>Can I Crawl The Web With YaCy?</h3>
<p>Yes! You can start your own crawl and you may also trigger distributed crawling,
which means that your own YaCy peer asks other peers to perform specific crawl tasks.
@ -59,13 +56,13 @@ You can specify many parameters that focus your crawl to a limited set of web pa
<h3>What do you mean by 'Global Search Engine'?</h3>
<p>The integrated indexing and search service can not only be used locally, but also <i>globally</i>.
Each YaCy peer distributes some contact information to all other proxies that can be reached in the internet,
and proxies exchange <i>but do not copy</i> their indexes to each other.
Each YaCy peer distributes some contact information to all other peers that can be reached,
and peers exchange <i>but do not copy</i> their indexes to each other.
This is done in such a way, that each <i>peer</i> knows how to address the correct other
<i>peer</i> to retrieve a special search index.
Therefore the community of all proxies spawns a <i>distributed hash table</i> (DHT)
which is used to share the <i>reverse word index</i> (RWI) to all operators and users of the proxies.
The applied logic of distribution and retrieval of RWI's on the DHT combines all participating proxies to
Therefore the community of all peers spawns a <i>distributed hash table</i> (DHT)
which is used to share the <i>reverse word index</i> (RWI) to all operators and users of the YaCy peers.
The applied logic of distribution and retrieval of RWI's on the DHT combines all participating peers to
a <i>Distributed Search Engine</i>.
To point out that this is in contrast to local indexing and searching,
we call it a <i>Global Search Engine</i>.
@ -88,17 +85,10 @@ Junior peers can contribute to the network by submitting index files to senior/p
If you run YaCy, you need an average of the same disc memory amount for the index as you need for the cache.
In fact, the global space for the index may reach the space of Terabytes, but not all of that on your machine!</p>
<h3>Search Engines must do crawling, don't they? Do you?</h3>
<p>No. They <i>can</i> do, but we collect information by simply using the information that passes the proxy.
If you <i>want</i> to crawl, you can do so and start your own crawl job with a certain search depth.</p>
<h3>Do I need a fast machine? Search Engines need big server farms, don't they?</h3>
<p>You don't need a fast machine to run YaCy. You also don't need a lot of space.
You can configure the amount of Megabytes that you want to spend for the cache.
Any time-critical task is delayed automatically and takes place
when you are idle surfing (which works only if you use YaCy as http proxy).
Whenever internet pages pass YaCy in proxy-mode,
any indexing (or if wanted: prefetch-crawling) is interrupted and delayed.
YaCy can also run on a vServer.
</p>
<h3>I don't want to wait for search results very long. How long does a search take?</h3>
@ -159,19 +149,6 @@ Therefore we don't need to give junior peers low priority: they contribute equal
But enough senior peers are needed to make this architecture functional. Since any peer contributes almost equally, either actively or passively, you should decide to run in Senior Mode if you can.
</p>
<h3>Why is this Search Engine also a Proxy?</h3>
<p>
We wanted to avoid a situation where you start the search service only at the moment you submit a search query. This would give the Search Engine too little online time. So we looked for a reason why you would want to run the Search Engine the whole time you are online. By giving you the additional value of a caching proxy, the reason was found. The built-in blacklist (URL filter, useful e.g. to block ads) for the proxy adds further value.
</p>
<h3>Why is this Proxy also a Search Engine?</h3>
<p>YaCy has a built-in <i>caching</i> proxy, which means that YaCy has a lot of indexing information
'for free' without crawling. This may not be a very usual function of a proxy, but a very useful one:
you see a lot of information when you browse the internet and maybe you would like to search exactly
only what you have seen. Beside this interesting feature, you can use YaCy to index an intranet
simply by using the proxy; you don't need to additionally set up another search/indexing process or databases.
YaCy gives you an 'instant' database and an 'instant' search service.</p>
<h3>My YaCy says it runs in 'Junior Mode'. How can I run it in Senior Mode?</h3>
<p>Open your firewall for port 8080 (or the port you configured) or program your router to act as a <i>virtual server</i>.</p>

@ -54,12 +54,12 @@ globalheader();
<p><table border="0" cellspacing="1" cellpadding="1" width="100%">
<tr><td colspan="2" width="100%" valign="top"><hr></td></tr>
<tr><td width="30%" valign="top"><center><b>Any Java2 System</b></center></td><td width="70%">
<p>YACY is written entirely in Java (Version Java2 / 1.2 and up). Any system that supports Java2 can run YACY. That means it runs on almost any commercial or free platform and operating system around. This of course includes Mac OS X, Windows (NT, W2K, XP) and Linux systems. For Java support on your platform, please see the <a href="Installation.html">installation</a> documentation.</p>
<tr><td width="30%" valign="top"><center><b>Any Java 2 System</b></center></td><td width="70%">
<p>YaCy is written entirely in Java 1.4.2.<br>GNU Classpath 0.92 can be used in place of the Sun classes (tested only with JamVM).</p>
</td></tr>
<tr><td colspan="2" width="100%" valign="top"><hr></td></tr>
<tr><td width="30%" valign="top"><center><b>Windows</b><br><img src="grafics/startupWin.gif"></center></td><td width="70%">
The Proxy runs seamlessly on any Windows System and comes with an easy-to-use installer application. Just install and use the proxy like any other Windows application. Please download the Windows Release Flavour of YACY instead of the generic one.
YaCy runs on Windows and comes with an easy-to-use installer application. Please download the Windows Release Flavour of YaCy instead of the generic one.
</td></tr>
<tr><td colspan="2" width="100%" valign="top"><hr></td></tr>
<tr><td width="30%" valign="top"><center><b>Mac OS X</b><br><img src="grafics/startupMac.gif"></center></td><td width="70%">
@ -67,7 +67,7 @@ The general distribution includes a Mac OS X wrapper shell, which is double-clic
</td></tr>
<tr><td colspan="2" width="100%" valign="top"><hr></td></tr>
<tr><td width="30%" valign="top"><center><b>Linux/Unix</b><br><img src="grafics/startupLinux.gif"></center></td><td width="70%">
The proxy environment is terminal-based, not window-based. You can start the proxy in a console, and monitor its actions through a log file. A wrapper shell script for easy startup is included. You can administrate the proxy remotely through the built-in http server with any browser.
You can start YaCy in a console, and monitor its actions through a log file. A wrapper shell script for easy startup is included. You can administrate the proxy remotely through the built-in http server with any browser.
</td></tr>
<tr><td colspan="2" width="100%" valign="top"><hr></td></tr>
</table></p>

@ -48,75 +48,32 @@ globalheader();
</NOSCRIPT>
<!-- ----- HERE STARTS CONTENT PART ----- -->
<h2>Technology</h2>
<h2>Web Search Technology</h2><br>
<p><img width="480" src="grafics/YaCy_Technology_Components.png" align="center"></p>
<p>YaCy consists mainly of four parts:
a <b>web crawler</b>, an <b>indexer</b>,
a built-in <b>database engine</b> and
the <b>p2p index exchange</b> protocol, based on http.
The YaCy search engine can be accessed through the <b>built-in http server</b>.
All parts of this architecture are included in the YACY distribution.</p><br><br>
<p><img width="480" src="grafics/YaCy_Technology_UserInterface.png" align="center"></p>
<p>YaCy has a built-in http server,
and the user interface is realized as web pages on the own web server.
A search request to YaCy is done inside your web browser.
</p><br><br>
<p><img width="480" src="grafics/YaCy_Technology_Crawler.png" align="center"></p>
<p>A web search engine can only search web pages that have been <i>crawled</i>, which means that,
beginning at a start point, all linked pages and their subpages (and so on) have been loaded. YaCy has an integrated web crawler.</p><br><br>
<p><img width="480" src="grafics/YaCy_Technology_Indexing.png" align="center"></p>
<p>Before a huge number of web pages can be searched efficiently, the pages must be <i>indexed</i>.
This is a very difficult process which runs inside YaCy without any user action.
After indexing, a single YaCy installation is able to provide search results
from more than 10 million web pages efficiently.</p>
<p>YACY consists mainly of four parts: the <b>p2p index exchange</b> protocol, based on http; a <b>spider/indexer</b>; a <b>caching http proxy</b> which is not only a simple <i>increase in value</i> but also an <i>information provider</i> for the indexing engine; and the built-in <b>database engine</b> which makes installation and maintenance of YACY very easy.</p>
<center><img src="grafics/architecture.gif"></center>
<p>All parts of this architecture are included in the YACY distribution. The YACY search engine can be accessed through the built-in http server.</p>
<p><h3>Algorithms</h3>
<p>In our software architecture we make sure that the appropriate data structure and algorithm are always used
to ensure maximum performance. <b>The right combination of structure and algorithm results in an ideal
order of complexity, which is the key to performant application design.</b> We reject the myth that
the Java language is not appropriate for time-critical software; in contrast to that myth we
believe that Java, with its clean and safe-to-use dynamic data structures, is especially well suited
to implementing highly complex algorithms.</p>
<p><table border="0" cellspacing="1" cellpadding="3" width="100%">
<tr><td width="30%" valign="top"><b>Transparent HTTP and HTTPS Proxy and Caching:</b></td><td width="70%">
The proxy implementation passes content quickly, since every file that the proxy reads from the targeted server is streamed directly to the accessing client while the stream is copied to a RAM cache for later processing. This ensures that the proxy mode is extremely fast and does not interrupt browsing. Whenever the proxy idles, it processes its RAM cache to perform indexing and storage to a local file of the cache. Every HTTP header that was passed along with the file is stored in a database and is re-used later on when a cache hit occurs. The proxy function has maximum priority over other tasks, like cache management or indexing functions.
</td></tr>
<tr><td width="30%" valign="top"><b>Fast Database Implementation:</b></td><td width="70%">
We implemented a file-based AVL tree on top of a random-access file. Tree nodes can be dynamically allocated and de-allocated, and an unused-node list is maintained. For the PLASMA search algorithm, ordered access to search results is necessary, therefore we needed an indexing mechanism which stores the index in an ordered way. The database supports such access, and the resulting database tables are stored as a single file. The database does not need any set-up or maintenance tasks that must be done by an administrator. It is completely self-organizing. The AVL property ensures maximum performance in terms of algorithmic order. Any database may grow to an unthinkable number of records: with one billion records a database request needs a theoretical maximum of only 44 comparisons.
</td></tr>
<tr><td width="30%" valign="top"><b>Sophisticated Page Indexing:</b></td><td width="70%">
Page indexing is done by creating a 'reverse word index': every page is parsed, the words are extracted and for every word a database table is maintained. The database tables are held in a file-based hash table, so accessing a word index is extremely fast, resulting in an extremely fast search. Conjunctions of search words are easily found, because the search results for each word are ordered and can be enumerated pairwise. In terms of complexity: the access effort to the word index for a single word is O(log &lt;number of words in database&gt;). It is always constantly fast, since the data structure provides a 'pre-calculated' result. This means that the result speed is <i>independent</i> of the number of indexed pages! It only slows down for page ranking, and is multiplied by the number of words that are searched simultaneously. That means the search effort for n words is O(n * log w), where w is the number of words in the database. You can't do better (consider that n is always small, since you rarely search for more than 10 words).
</td></tr>
<tr><td width="30%" valign="top"><b>Massive-Parallel Distributed Search Engine:</b></td><td width="70%">
This technology is the driving force behind the YACY implementation. A DHT (Distributed Hash Table)-like technique will be used to publish the word cache. The idea is that word indexes travel along the peers <i>before</i> a search request arrives at a specific word index. A search for a specific word would be performed by computing the peer that hosts the index and pointing <i>directly</i> to it. No peer-hopping or such, since search requests are time-critical (the user usually does not want to wait long). Redundancy must be implemented as well, to cope with the frequent disappearance of peers. Privacy is ensured, since no peer can know which word index is stored, updated or passed on: word indexes are stored under a word hash, not the word itself. Search misuse is regulated by the p2p law of give-and-take: every peer must contribute to the crawl/proxy-and-index process before it is allowed to search.
</td></tr>
</table></p>
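<p>The pairwise enumeration of ordered per-word results mentioned under page indexing can be sketched as follows. This is an illustrative example only, assuming each word's result list is an ascending list of integer page ids; the class <code>PostingIntersection</code> and its method are hypothetical names and do not reflect the actual PLASMA code.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; the real index stores richer entries than plain ids.
public final class PostingIntersection {

    /** Returns the ids present in both ascending lists, using one pairwise pass. */
    public static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> both = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) {          // page contains both words
                both.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {    // advance the list with the smaller id
                i++;
            } else {
                j++;
            }
        }
        return both;
    }

    public static void main(String[] args) {
        List<Integer> pagesWithPeer = Arrays.asList(2, 5, 9, 14);
        List<Integer> pagesWithSearch = Arrays.asList(5, 7, 14, 20);
        System.out.println(intersect(pagesWithPeer, pagesWithSearch)); // prints [5, 14]
    }
}
```

<p>Because both lists are already ordered, the conjunction needs a single pass over each list and no additional sorting.</p>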
<p><h3>Privacy</h3>
Sharing the index with other users may raise concerns about your privacy. We have made great efforts to protect it:
<table border="0" cellspacing="1" cellpadding="3" width="100%">
<tr><td width="30%" valign="top"><b>Private Index and Index Movement</b></td><td width="70%">
Your local word index does not only contain information that you created by surfing the internet, but also entries from other peers.
Word index files travel along the proxy peers to form a distributed hash table. Therefore nobody can argue that information that
is provided by your peer was also retrieved by your peer and therefore by your personal use of the internet. In fact it is very unlikely that
information that can be found on your peer was created by you, since the search process targets only peers where it is likely because
of the movement of the index to form the distributed hash table. During a test phase, all word indexes on your peer will be accessible.
The future production release will constrain searches to index entries on your peer that have been created by other peers, which will
ensure complete browsing privacy.
</td></tr><tr><td valign="top"><b>Word Index Storage and Content Responsibility</b></td><td>
The words in your local word index are stored using a word hash. That means that no word itself is stored, only its hash.
You cannot find any indexed word in clear text, and you cannot translate the word hashes back into the original words. This means that
you do not actually know which words are stored in your system. The positive effect is that you cannot be held responsible for the words that
are stored in your peer. But if you want to deny storage of specific words, you can put them into the 'bluelist' (in the file yacy.bluelist).
No word that is in the bluelist can be stored, searched or even viewed through the proxy.
</td></tr><tr><td valign="top"><b>Peer Communication Encryption</b></td><td>
Information that is passed from one peer to another is encoded. That means that no information like search words,
indexed URL's or URL descriptions is transported in clear text. Network sniffers cannot see the content that is exchanged.
We also implemented an encryption method, where a temporary key, created by the requesting peer is used to encrypt the response
(not yet active in test release, but non-ascii/base64 - encoding is in place).
</td></tr><tr><td valign="top"><b>Access Restrictions</b></td><td>
The proxy contains a two-stage access control: an IP filter check and an account/password gateway that can be configured for access to the proxy.
The default setting denies access to your proxy from the internet, but allows usage from the intranet. The proxy and its security settings
can be configured using the built-in web server for service pages; access to these service pages can itself be restricted by using
an IP filter and an account/password combination.
</td></tr></table>
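<p>The two ideas above, storing indexes under a word hash and computing the responsible peer directly from that hash, can be sketched together. This is a minimal sketch under stated assumptions: SHA-256 and the ring lookup are stand-ins chosen for illustration, and the class <code>WordHashRing</code>, its methods and the peer names are hypothetical. The text does not specify YaCy's actual hash function or peer selection rule.</p>

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of hash-based peer selection; not YaCy's implementation.
public final class WordHashRing {

    // Peers ordered by their position on the hash ring.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    /** Maps a word or peer name to a ring position (first 8 bytes of its SHA-256 digest). */
    public static long position(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
    }

    public void addPeer(String peerName) {
        ring.put(position(peerName), peerName);
    }

    /** Picks the first peer at or after the word's position, wrapping around the ring. */
    public String peerFor(String word) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no peers known");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(position(word));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        WordHashRing peers = new WordHashRing();
        peers.addPeer("peer-a");
        peers.addPeer("peer-b");
        peers.addPeer("peer-c");
        // Only the hash is used for the lookup; the clear-text word is never stored.
        System.out.println("index for 'search' lives on " + peers.peerFor("search"));
    }
}
```

<p>Every peer that knows the same peer list computes the same target for a given word, so a query can go straight to that peer without hopping.</p>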
<!-- ----- HERE ENDS CONTENT PART ----- -->
<SCRIPT LANGUAGE="JavaScript1.1"><!--

@ -52,9 +52,8 @@ globalheader();
<p>The following persons are involved (alphabetical order):
<ul>
<li><b>Michael Christen/Orbiter</b> is project founder; designed and implemented the overall architecture, is chief software architect, release management, kelondro database, yacy core protocol, indexing technique and database structure, search and ranking functionality, http client/server architecture, admin of yacy.net</li>
<li><b>Michael Christen</b> is project founder; designed and implemented the overall architecture, is chief software architect, release management, kelondro database, yacy core protocol, indexing technique and database structure, search and ranking functionality, http client/server architecture, admin of yacy.net</li>
<li><b>Natali Christen</b> designed the YaCy logo.</li>
<li><b>Oliver Wunder</b> provided some german translation. He also made bittorrent-releases</li>
<li><b>Stephan Hermens</b> has made some important bugfixes.</li>
<li><b>Matthias Kempka</b> provided a linux-init start/stop - script</li>
<li><b>Timo Leise</b> suggested and implemented an extension to the blacklist feature: part-of-domain matching.</li>
@ -67,6 +66,7 @@ and manager of the meta-search-engine <a href="http://www.metager.de">metaGer</a
<li><b>Matthias S&ouml;hnholz</b> added the offline-browsing feature</li>
<li><b>slick</b> helps as packager (.rpm, .deb etc)</li>
<li><b>Martin Thelian</b> made system-wide performance enhancement by introducing thread pools; he added ICAP and SOAP support, most of external parser integration, maintains the http protocol implementation, added squid compatibility, robots protocol, better logging and many index protocol, import/export and transfer enhancements. He created a YaCy screensaver and coded major parts of the yacybar Firefox extension.</li>
<li><b>Oliver Wunder</b> provided some German translations. He also made BitTorrent releases.</li>
</ul>
<p>Further volunteers are very welcome.

Binary files not shown: 8 images added (128 KiB, 50 KiB, 496 KiB, 419 KiB, 83 KiB, 54 KiB, 255 KiB, 321 KiB) and 1 image removed (13 KiB).

@ -93,29 +93,40 @@ globalheader();
</NOSCRIPT>
<!-- ----- HERE STARTS CONTENT PART ----- -->
<H2>YaCy<br><br>
&nbsp;<FONT SIZE="4">p2p-based distributed Web Search Engine</FONT></H2><br>
<H2><FONT SIZE="5">YaCy&nbsp;-&nbsp;Web Search Engine</FONT></H2><br>
<p><i>For German documentation, please see <a href="http://www.yacy-websuche.de/">here</a>.</i></p>
<br>
<table border="0" cellspacing="1" cellpadding="3" width="100%">
<tr><td valign="top" width="40">&nbsp;</td><td>
<tr valign="top">
<td width="20">&nbsp;</td>
The YaCy project is a new approach to build a P2P-based Web indexing network.<br><br>
<td width="50%">
<b>YaCy is a peer-to-peer application<br>for web search.</b><br><br>
<ul>
<li>Anonymous, independent, uncensored web search</li>
<li>This is your web search engine for freedom of information:
search requests are anonymous, independent and uncensored</li>
<li>No central server, no storage of user behaviour</li>
<li>You can crawl the web and feed pages that you select into the global index</li>
<li>Run your peer to support other YaCy crawlers; in return, they support your crawler</li>
<li>Host information on your peer using the built-in http-server, file-sharing zone and wiki</li>
<li>Easy installation! No additional database required!</li>
<li>GPL'ed, freeware</li>
</ul>
<br>
Start contributing to the global index today with your own YaCy peer!
</td>
<td width="40">&nbsp;</td>
<td width="50%">
<b>YaCy can be used as your own,<br>private search engine</b><br><br>
<ul>
<li>Set up a search portal</li>
<li>Use YaCy to provide a search function for your own web pages</li>
<li>Host web pages with the built-in http-server</li>
</ul>
</td>
<td width="20">&nbsp;</td>
</tr></table>
<p>YaCy is GPL licensed, free software.
<p><i>For German documentation, please see <a href="http://www.yacy-websuche.de/">here</a>.</i></p>
</td></tr></table>
<!-- ----- HERE ENDS CONTENT PART ----- -->
<SCRIPT LANGUAGE="JavaScript1.1"><!--

@ -1,6 +1,6 @@
var thismenu = new Array(
"index","FAQ","Details","Technology","Platforms","News","Demo","License","Download",
"Installation","Volunteers","Material","Links","Contact","",
"index","Technology","Details","Platforms","News","Demo","License","Download",
"Installation","FAQ","Volunteers","Material","Links","Contact","",
"Deutsches Forum@http://www.yacy-forum.de","English Forum@http://sourceforge.net/forum/?group_id=116142","",
"Impressum");
var root = "http://www.yacy.net/";
