YaCy '#[clientname]#': Crawl Start

#%env/templates/metas.template%# #%env/templates/header.template%# #%env/templates/submenuIndexCreate.template%#

Crawl Start

Start Crawling Job: You can define URLs as start points for Web page crawling and start crawling here. "Crawling" means that YaCy will download the given website, extract all links in it and then download the content behind these links. This is repeated as long as specified under "Crawling Depth".

Attribute Value Description

Starting Point:

From URL:
From Sitemap:
From File:

Existing start URLs are always re-crawled. Other URLs that have already been visited are rejected as "double" unless the re-crawl option allows them.

Create Bookmark

Use: (works with "Starting Point: From URL" only)

Title:

Folder:

This option lets you create a bookmark from your crawl start URL. For automatic re-crawling you can use the following default folders:

/autoReCrawl/hourly
/autoReCrawl/daily
/autoReCrawl/weekly
/autoReCrawl/monthly

Attention: recrawl settings depend on the folder. They can be adjusted in /DATA/SETTINGS/autoReCrawl.conf.

Crawling Depth: This defines how many levels of links embedded in websites the crawler will follow.
The minimum of 0 means that the page you enter under "Starting Point" will be added to the index, but no linked content is indexed. A depth of 2-4 is good for normal indexing. Be careful with the depth: with an average branching factor of 20, a prefetch depth of 8 would index 25,600,000,000 pages, which may be the whole WWW.
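The growth behind this warning can be checked with a little arithmetic. The sketch below assumes, as the text does, that every page links to 20 pages not seen before, and counts only the pages first reached at the final depth. The function name is illustrative and not part of YaCy.

```python
def pages_at_depth(branching_factor: int, depth: int) -> int:
    """Pages first reached after exactly `depth` link hops, assuming every
    page links to `branching_factor` previously unseen pages."""
    return branching_factor ** depth


if __name__ == "__main__":
    for depth in (0, 2, 4, 8):
        print(depth, pages_at_depth(20, depth))
```

Under this assumption, depth 4 already reaches 160,000 pages, which is why a depth of 2-4 is suggested for normal indexing.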

Must-Match Filter: Use filter
Restrict to start domain
Restrict to sub-path
The filter is a regular expression that a URL must match in order to be crawled; the default is 'catch all'. Example: to allow only URLs that contain the word 'science', set the filter to '.*science.*'. You can also use an automatic domain restriction to fully crawl a single domain.
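A minimal sketch of how such a filter behaves, written in Python rather than taken from YaCy's code. It assumes the expression has to match the whole URL, which is what the surrounding '.*' in the example suggests; the function name and the example URLs are hypothetical.

```python
import re


def passes_must_match(url: str, must_match: str = ".*") -> bool:
    """Return True if the whole URL matches the must-match expression.

    The default '.*' is a 'catch all' pattern that accepts every URL.
    """
    return re.fullmatch(must_match, url) is not None


if __name__ == "__main__":
    print(passes_must_match("http://example.org/science/news.html", ".*science.*"))
    print(passes_must_match("http://example.org/sports/news.html", ".*science.*"))
```

With whole-URL matching, a filter of just 'science' would reject every real URL, so the leading and trailing '.*' are needed.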

Must-Not-Match Filter: A page is accepted for crawling only if this filter does not match its URL. The empty string is a never-match filter, which should do well for most cases. If you don't know what this means, please leave this field empty.
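The following sketch illustrates why an empty filter rejects nothing. It assumes whole-URL matching, as in the must-match example: an empty expression can then only match an empty string, and no URL is empty. The function name and the URLs are hypothetical, and this is not YaCy's implementation.

```python
import re


def rejected_by_must_not_match(url: str, must_not_match: str = "") -> bool:
    """Return True if the whole URL matches the must-not-match expression."""
    return re.fullmatch(must_not_match, url) is not None


if __name__ == "__main__":
    # Empty filter: nothing is rejected.
    print(rejected_by_must_not_match("http://example.org/index.html"))
    # A filter that excludes every URL containing 'login'.
    print(rejected_by_must_not_match("http://example.org/login.html", ".*login.*"))
```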

Re-crawl known URLs: Use: If older than: If you use this option, web pages that already exist in your database are crawled and indexed again, depending on the age of the last crawl: if the last crawl is older than the given date, the page is crawled again; otherwise it is treated as 'double' and not loaded or indexed again.
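The decision described here can be sketched as follows. This is an illustration only: it expresses the "If older than" setting as a maximum age, leaves out the special case that start URLs are always re-crawled, and uses a hypothetical function name and parameters.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_load(last_crawl: Optional[datetime],
                recrawl_if_older_than: Optional[timedelta],
                now: datetime) -> bool:
    """Decide whether a URL is loaded or treated as 'double'.

    last_crawl: time of the previous crawl, or None if the URL is unknown.
    recrawl_if_older_than: age limit, or None if the re-crawl option is off.
    """
    if last_crawl is None:
        return True   # never crawled before, so always load
    if recrawl_if_older_than is None:
        return False  # known URL and re-crawl is off: 'double'
    return now - last_crawl > recrawl_if_older_than


if __name__ == "__main__":
    now = datetime(2024, 1, 31)
    print(should_load(datetime(2024, 1, 1), timedelta(days=7), now))
    print(should_load(datetime(2024, 1, 30), timedelta(days=7), now))
```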

Auto-Dom-Filter: Use: Depth: This option automatically creates a domain filter that limits the crawl to the domains the crawler finds at the given depth. You can use this option, e.g., to crawl a page with bookmarks while restricting the crawl to only those domains that appear on the bookmark page. The adequate depth for this example would be 1.
The default value 0 gives no restrictions.
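One way to picture the bookmark-page example is the sketch below. It assumes the filter collects the host names of all URLs discovered up to and including the given depth, with the start page at depth 0. How YaCy actually builds the filter is not stated here, and all names and URLs are hypothetical.

```python
from urllib.parse import urlsplit


def build_domain_filter(discovered, filter_depth):
    """Collect the hosts of all (url, depth) pairs found up to filter_depth.

    Returns None for depth 0, which stands for 'no restrictions'.
    """
    if filter_depth == 0:
        return None
    return {urlsplit(url).hostname
            for url, depth in discovered if depth <= filter_depth}


def is_allowed(url, domain_filter):
    """Accept every URL without a filter, else only URLs on collected hosts."""
    if domain_filter is None:
        return True
    return urlsplit(url).hostname in domain_filter


if __name__ == "__main__":
    discovered = [
        ("http://bookmarks.example/list.html", 0),  # the bookmark page itself
        ("http://a.example/", 1),                   # linked from the bookmark page
        ("http://b.example/start.html", 1),
        ("http://c.example/", 2),                   # found one level deeper
    ]
    domain_filter = build_domain_filter(discovered, 1)
    print(is_allowed("http://a.example/deep/page.html", domain_filter))
    print(is_allowed("http://c.example/", domain_filter))
```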

Maximum Pages per Domain: Use: Page-Count: With this option you can limit the maximum number of pages that are fetched and indexed from a single domain. You can combine this limit with the 'Auto-Dom-Filter', so that the limit is applied to all the domains within the given depth. Domains outside the given depth are then sorted out anyway.
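A per-domain limit of this kind can be sketched with a simple counter. The class below is a hypothetical illustration, not YaCy code, and it treats the host name of a URL as its domain.

```python
from collections import Counter
from urllib.parse import urlsplit


class DomainPageLimiter:
    """Accept at most `max_pages` URLs per host."""

    def __init__(self, max_pages: int):
        self.max_pages = max_pages
        self.counts = Counter()

    def accept(self, url: str) -> bool:
        host = urlsplit(url).hostname
        if self.counts[host] >= self.max_pages:
            return False          # limit for this host is reached
        self.counts[host] += 1    # count the accepted page
        return True


if __name__ == "__main__":
    limiter = DomainPageLimiter(max_pages=2)
    for path in ("a", "b", "c"):
        print(path, limiter.accept("http://example.org/" + path))
```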

Accept URLs with '?' / dynamic URLs: A question mark is usually a hint for a dynamic page. URLs pointing to dynamic content should usually not be crawled. However, some web pages with static content are accessed through URLs containing question marks. If you are unsure, do not check this, to avoid crawl loops.

Store to Web Cache: This option is used by default for proxy prefetch, but is not needed for explicit crawling.

Policy for usage of Web Cache: no cache if fresh if exist cache only
The caching policy states when to use the cache during crawling:
no cache: never use the cache; load all content fresh from the internet source.
if fresh: use the cache if the cache entry exists and is fresh according to the proxy-fresh rules.
if exist: use the cache if the cache entry exists, without checking freshness; otherwise use the online source.
cache only: never go online; use all content from the cache. If no cache entry exists, treat the content as unavailable.
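The four policies can be summarised as a small decision function. The sketch is illustrative and not YaCy's implementation: the names are hypothetical, and it assumes that "if fresh" falls back to the online source when the cached copy is missing or stale, which the description implies but does not state.

```python
from enum import Enum


class CachePolicy(Enum):
    NO_CACHE = "no cache"
    IF_FRESH = "if fresh"
    IF_EXIST = "if exist"
    CACHE_ONLY = "cache only"


def choose_source(policy: CachePolicy, in_cache: bool, is_fresh: bool) -> str:
    """Return 'cache', 'online' or 'unavailable' for one document."""
    if policy is CachePolicy.NO_CACHE:
        return "online"
    if policy is CachePolicy.IF_FRESH:
        # Assumed fallback: go online when the entry is missing or stale.
        return "cache" if in_cache and is_fresh else "online"
    if policy is CachePolicy.IF_EXIST:
        return "cache" if in_cache else "online"
    if policy is CachePolicy.CACHE_ONLY:
        return "cache" if in_cache else "unavailable"
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    for policy in CachePolicy:
        print(policy.value, choose_source(policy, in_cache=True, is_fresh=False))
```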

Do Local Indexing: index text: index media: This enables indexing of the web pages the crawler will download. This should be switched on by default, unless you want to crawl only to fill the Document Cache without indexing.

Do Remote Indexing:

Describe your intention to start this global crawl (optional):

This message will appear in the 'Other Peer Crawl Start' table of other peers.

If checked, the crawler will contact other peers and use them as remote indexers for your crawl. If you need your crawling results locally, you should switch this off. Only senior and principal peers can initiate or receive remote crawls. A YaCyNews message will be created to inform all peers about a global crawl, so they can omit starting a crawl with the same start point.

Exclude static Stop-Words: This can be useful to prevent extremely common words, e.g. "the", "he", "she", "it", from being added to the database. To exclude all words given in the file yacy.stopwords from indexing, check this box.
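The effect of stop-word exclusion can be sketched as below. The example assumes a stop-word list with one word per line and '#' comment lines; the actual layout of yacy.stopwords is not described here, and the function names are hypothetical.

```python
def load_stopwords(lines):
    """Build a stop-word set from lines, assuming one word per line.

    Blank lines and lines starting with '#' are skipped (assumed format).
    """
    words = set()
    for line in lines:
        word = line.strip().lower()
        if word and not word.startswith("#"):
            words.add(word)
    return words


def indexable_words(text, stopwords):
    """Return the words of `text` that are not stop words."""
    return [word for word in text.lower().split() if word not in stopwords]


if __name__ == "__main__":
    stopwords = load_stopwords(["# common words", "the", "he", "she", "it"])
    print(indexable_words("The crawler found it", stopwords))
```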

#%env/templates/footer.template%#