You can define URLs as start points for Web page crawling and start that crawling here.
You can define URLs as start points for Web page crawling and start crawling here. "Crawling" means that YaCy will download the given website, extract all links in it and then download the content behind these links. This is repeated as long as specified under "Crawling Depth".
Be careful with the prefetch number. Consider a branching factor of average 20;
A prefetch-depth of 8 would index 25.600.000.000 pages, maybe the whole WWW.
This defines how often the Crawler will follow links embedded in websites.<br>
A minimum of 1 is recommended and means that the page you enter under "Starting Point" will be added to the index, but no linked content is indexed. 2-4 is good for normal indexing.
Be careful with the depth. Consider a branching factor of average 20;
A prefetch-depth of 8 would index 25.600.000.000 pages, maybe this is the whole WWW.
URLs pointing to dynamic content should usually not be crawled. However, there are sometimes web pages with static content that
is accessed with URL's containing question marks. If you are unsure, do not check this to avoid crawl loops.
A questionmark is usually a hint for a dynamic page. URLs pointing to dynamic content should usually not be crawled. However, there are sometimes web pages with static content that
is accessed with URLs containing question marks. If you are unsure, do not check this to avoid crawl loops.
</td>
</tr>
<tr valign="top" class="TableCellDark">
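The depth warning in this hunk rests on simple exponential growth. As a rough sketch (not YaCy code), the figure quoted in the help text can be reproduced as follows. The function name `estimated_pages` is hypothetical, and the model assumes every page links to exactly `branching_factor` new, distinct pages, which real sites do not.

```python
def estimated_pages(branching_factor: int, depth: int) -> int:
    """Rough number of pages reached at the given depth.

    Assumes each page contributes `branching_factor` previously unseen
    links, so the count grows as branching_factor ** depth.
    """
    return branching_factor ** depth


if __name__ == "__main__":
    # Compare the depths mentioned in the help text (2-4 "normal", 8 extreme).
    for depth in (1, 2, 4, 8):
        print(f"depth {depth}: about {estimated_pages(20, depth):,} pages")
```

With a branching factor of 20, depth 8 gives 20^8 = 25,600,000,000, which matches the number in the help text, while depth 4 stays at 160,000.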
@@ -60,7 +61,7 @@ You can define URLs as start points for Web page crawling and start that crawlin
To exclude all words given in the file <tt class=small>yacy.stopwords</tt> from indexing,
This can be useful to circumvent that extremely common words are added to the database, i.e. "the", "he", "she", "it"... To exclude all words given in the file <tt class=small>yacy.stopwords</tt> from indexing,
check this box.
</td>
</tr>
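To illustrate what the stop-word option in this hunk does, here is a minimal sketch of stop-word filtering. It is an assumption-laden illustration, not YaCy's indexer: the names `STOPWORDS` and `index_terms` are hypothetical, the word list contains only the examples from the help text, and whitespace tokenization stands in for whatever the real parser does.

```python
# Hypothetical stand-in for the contents of yacy.stopwords; only the
# examples named in the help text are included here.
STOPWORDS = {"the", "he", "she", "it"}


def index_terms(text: str, stopwords: set[str] = STOPWORDS) -> list[str]:
    """Return the lower-cased words of `text` that are not stop words."""
    return [word for word in text.lower().split() if word not in stopwords]


if __name__ == "__main__":
    print(index_terms("The crawler saw it twice"))
```

Filtering before indexing keeps extremely common words out of the database, at the cost of making those words unsearchable.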
@@ -118,8 +119,8 @@ You can define URLs as start points for Web page crawling and start that crawlin
</tr>
</table>
</td>
<td class=small colspan="3" rowspan="2">Existing start URL's are re-crawled.
Other already visited URL's are sorted out as 'double'.
<td class=small colspan="3" rowspan="2">Existing start URLs are re-crawled.
Other already visited URLs are sorted out as 'double'.
A complete re-crawl will be available soon.
</td>
</tr>
@@ -146,7 +147,7 @@ Your peer can search and index for other peers and they can search for you.</div
<b>It will take at least 30 seconds until the first result appears there. Please be patient, the crawling will pause each time you use the proxy or web server to ensure maximum availability.</b>
If you crawl any un-wanted pages, you can delete them <a href="IndexCreateWWWLocalQueue_p.html">here</a>.<br>
::
Removed #[numEntries]# entries from crawl queue. This queue may fill again if the loading and indexing queue is not empty
Removed #[numEntries]# entries from crawl queue. This queue may fill again if the loading and indexing queue is not empty.
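The first hunk of this patch describes a question mark in a URL as a hint for dynamic content. A minimal sketch of that heuristic is shown below. The helper names `looks_dynamic` and `should_crawl` and the `allow_dynamic` flag are hypothetical and only mirror the checkbox described in the help text; YaCy's actual check may differ.

```python
def looks_dynamic(url: str) -> bool:
    """Treat a URL as dynamic if it has a '?' before any '#' fragment."""
    without_fragment = url.split("#", 1)[0]
    return "?" in without_fragment


def should_crawl(url: str, allow_dynamic: bool = False) -> bool:
    """Skip dynamic-looking URLs unless the user explicitly allows them."""
    return allow_dynamic or not looks_dynamic(url)


if __name__ == "__main__":
    for url in ("http://example.org/page.html",
                "http://example.org/view?id=7"):
        print(url, should_crawl(url))
```

Defaulting `allow_dynamic` to `False` follows the advice in the help text: when unsure, leave the option unchecked to avoid crawl loops.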