You can define URLs as start points for Web page crawling and start crawling here. "Crawling" means that YaCy will download the given website, extract all links from it and then download the content behind these links. This is repeated up to the depth specified under "Crawling Depth".
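The loop behind this is simple; here is a minimal sketch in Python (fetch, index and extract_links are hypothetical placeholders for YaCy's real HTTP, indexing and HTML-parsing code):
<pre>
from collections import deque

def fetch(url): return ""            # placeholder for the HTTP download
def index(page): pass                # placeholder for the indexer
def extract_links(page): return []   # placeholder for the HTML link parser

# Sketch of the crawl loop described above: fetch a page, extract its
# links, and repeat until the configured crawling depth is reached.
def crawl(start_url, max_depth):
    frontier = deque([(start_url, 0)])
    while frontier:
        url, depth = frontier.popleft()
        page = fetch(url)
        index(page)
        if depth < max_depth:
            for link in extract_links(page):
                frontier.append((link, depth + 1))
</pre>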
This defines how far the Crawler will follow links embedded in websites.<br>
A minimum of 1 is recommended and means that the page you enter under "Starting Point" will be added to the index, but no linked content is indexed. 2-4 is good for normal indexing.
Be careful with the depth: consider an average branching factor of 20;
a crawling depth of 8 would then index up to 20<sup>8</sup> = 25,600,000,000 pages, maybe the whole WWW.
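The growth is exponential; a quick way to see it (assuming every page links to 20 new pages):
<pre>
# Upper bound on reachable pages with an average branching factor of 20:
branching = 20
pages = 1
for depth in range(1, 9):
    pages *= branching
    print(f"depth {depth}: up to {pages:,} pages")   # depth 8 -> 25,600,000,000
</pre>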
A question mark is usually a hint for a dynamic page. URLs pointing to dynamic content should usually not be crawled. However, there are sometimes web pages with static content that
is accessed with URLs containing question marks. If you are unsure, do not check this to avoid crawl loops.
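What such a filter amounts to is a simple check of the URL's query part; a hedged sketch, not YaCy's actual code:
<pre>
from urllib.parse import urlparse

# Treat URLs that carry a query string as dynamic content.
def looks_dynamic(url):
    return urlparse(url).query != ""

print(looks_dynamic("http://example.org/page.html"))        # False
print(looks_dynamic("http://example.org/page?session=42"))  # True
</pre>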
</td>
</tr>
<tr valign="top" class="TableCellDark">
This can be useful to prevent extremely common words, e.g. "the", "he", "she", "it"..., from being added to the database. To exclude all words given in the file <tt class=small>yacy.stopwords</tt> from indexing,
check this box.
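The effect of the stopword list can be sketched in a few lines (the file name <tt class=small>yacy.stopwords</tt> is real; the one-word-per-line format assumed below is a guess):
<pre>
# Filter stopwords out of the word stream before it reaches the index.
with open("yacy.stopwords") as f:
    stopwords = {line.strip().lower() for line in f if line.strip()}

words = ["the", "peer", "she", "index"]
indexed = [w for w in words if w.lower() not in stopwords]
print(indexed)   # common words from the list are dropped
</pre>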
</td>
</tr>
</tr>
</table>
</td>
<td class=small colspan="3" rowspan="2">Existing start URLs are re-crawled.
Other already visited URLs are sorted out as 'double'.
A complete re-crawl will be available soon.
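The 'double' check is essentially a lookup in a set of already visited URLs; a hypothetical sketch of the idea:
<pre>
# Sort out URLs that were already visited ('double') instead of
# fetching them a second time.
visited = set()

def accept(url):
    if url in visited:
        return False   # 'double' -- already visited
    visited.add(url)
    return True
</pre>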
</td>
</tr>
<b>It will take at least 30 seconds until the first result appears there. Please be patient; the crawling will pause each time you use the proxy or web server to ensure maximum availability.</b>
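That pause behaviour can be pictured as a simple grace-period check (a hypothetical sketch; the 30-second window is an assumption, not YaCy's actual value):
<pre>
import time

last_interactive_use = 0.0   # updated whenever the proxy or web server is used

def note_interactive_use():
    global last_interactive_use
    last_interactive_use = time.time()

def crawling_allowed():
    # crawl only if the peer has been idle for a grace period
    return time.time() - last_interactive_use > 30.0
</pre>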
If you crawl any unwanted pages, you can delete them <a href="IndexCreateWWWLocalQueue_p.html">here</a>.<br>
::
Removed #[numEntries]# entries from the crawl queue. This queue may fill again if the loading and indexing queue is not empty.