Starting Point:
Define the start URL(s) here. You can submit more than one URL; please enter one URL per line.
Each of these URLs is the root of a crawl start; existing start URLs are always re-loaded.
Other, already visited URLs are sorted out as "double" unless re-loading them is permitted by the re-crawl option.
Crawling Depth:
also all linked non-parsable documents
Unlimited crawl depth for URLs matching with:
This defines how many levels deep the crawler will follow links (links of links, and so on) embedded in websites.
0 means that only the page you enter under "Starting Point" will be added
to the index. 2-4 is good for normal indexing. Values over 8 are not useful, since a depth-8 crawl would
index approximately 25,600,000,000 pages, which may be the whole WWW.
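(That figure presumably assumes on the order of 20 links per page, since 20^8 = 25,600,000,000.)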
Must-Match Filter:
The filter is a regular expression
that must match the URLs that are to be crawled; the default is 'catch all'.
Example: to allow only URLs that contain the word 'science', set the filter to '.*science.*'.
You can also use an automatic domain-restriction to fully crawl a single domain.
Must-Not-Match Filter:
The filter is a regular expression
that must not match a URL for the content of that URL to be indexed.
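As an illustration only (not YaCy's actual implementation), the following Java sketch shows how such a pair of filters could be applied to a candidate URL; the must-match value is the 'science' example from above, while the must-not-match value is an invented one:

    import java.util.regex.Pattern;

    public class UrlFilterSketch {
        public static void main(String[] args) {
            // Must-match: only URLs containing the word 'science' (the default would be '.*').
            Pattern mustMatch = Pattern.compile(".*science.*");
            // Must-not-match: invented example, excludes URLs containing 'print'.
            Pattern mustNotMatch = Pattern.compile(".*print.*");

            String url = "http://example.org/science/article.html";
            boolean crawl = mustMatch.matcher(url).matches()
                         && !mustNotMatch.matcher(url).matches();
            System.out.println(crawl ? "crawl " + url : "skip " + url);
        }
    }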
Document Deletion
No Deletion
Do not delete any document before the crawl is started.
Delete start host
For each host in the start URL list, delete all documents from that host.
Delete only old
Treat documents that were loaded longer ago than the selected time span (1 to 30 years, months, days, or hours) as stale and delete them before the crawl is started.
After a crawl has been done in the past, documents may become stale and may eventually also be deleted on the target host.
To remove such old files from the search index it is not sufficient to just consider them for re-loading; it may be necessary
to delete them because they simply do not exist any more. Use this in combination with re-crawl, whose time span should be longer.
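As a rough sketch of the 'Delete only old' rule (hypothetical code, not taken from YaCy), a document would be deleted before the crawl starts when its load date lies further back than the selected time span:

    import java.time.Duration;
    import java.time.Instant;

    public class StaleDeletionSketch {
        // Hypothetical helper: a document is stale if it was loaded before (now - maxAge).
        static boolean isStale(Instant loadDate, Duration maxAge) {
            return loadDate.isBefore(Instant.now().minus(maxAge));
        }

        public static void main(String[] args) {
            Instant loaded = Instant.now().minus(Duration.ofDays(400));
            // With a setting of '1 year', this document counts as stale and would be deleted.
            System.out.println(isStale(loaded, Duration.ofDays(365)));
        }
    }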
Document Double-Check
No Doubles
Never load any page that is already known. Only the start URL may be loaded again.
Re-load
Treat documents that were loaded longer ago than the selected time span (1 to 30 years, months, days, or hours) as stale and load them again. If they are younger, they are ignored.
A web crawl performs a double-check on all links found on the web against the internal database. If the same URL is found again,
the URL is treated as a double when the 'no doubles' option is checked. A URL may be loaded again when it has reached a specific age;
to use that, check the 're-load' option.
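The following hypothetical sketch summarizes that decision; the names and structure are illustrative, not YaCy's code:

    import java.time.Duration;
    import java.time.Instant;

    public class DoubleCheckSketch {
        // Load a URL if it is unknown, or if 're-load' is enabled and the stored copy
        // is older than the configured time span; otherwise treat it as a double.
        static boolean shouldLoad(Instant knownLoadDate, boolean reloadEnabled, Duration maxAge) {
            if (knownLoadDate == null) return true;   // never seen before: load it
            if (!reloadEnabled) return false;         // 'no doubles': skip it
            return knownLoadDate.isBefore(Instant.now().minus(maxAge)); // stale enough: load again
        }

        public static void main(String[] args) {
            Instant lastLoad = Instant.now().minus(Duration.ofDays(40));
            // With 're-load' set to 30 days, a copy loaded 40 days ago is fetched again.
            System.out.println(shouldLoad(lastLoad, true, Duration.ofDays(30)));
        }
    }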
Must-Match List for Country Codes:
Use filter
no country code restriction
Crawls can be restricted to specific countries. This uses the country code that can be computed from
the IP of the server that hosts the page. The filter is not a regular expression but a comma-separated list of country codes.
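For example, a list such as de,at,ch (illustrative values) would restrict the crawl to servers located in Germany, Austria or Switzerland. A minimal, hypothetical check of a page against such a list could look like this:

    import java.util.Arrays;
    import java.util.List;

    public class CountryFilterSketch {
        public static void main(String[] args) {
            // The comma-separated list as entered in the form (illustrative values).
            List<String> allowed = Arrays.asList("de,at,ch".split(","));
            // The country code computed from the hosting server's IP (assumed input).
            String pageCountry = "de";
            System.out.println(allowed.contains(pageCountry) ? "accept" : "reject");
        }
    }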
Maximum Pages per Domain:
Use:
Page-Count:
You can limit the maximum number of pages that are fetched and indexed from a single domain with this option.
You can combine this limitation with the 'Auto-Dom-Filter', so that the limit is applied to all the domains within
the given depth. Domains outside the given depth are then sorted out anyway.
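A hypothetical sketch of such a per-domain limit (illustrative only; the real crawler's balancing is more involved):

    import java.util.HashMap;
    import java.util.Map;

    public class DomainLimitSketch {
        private final Map<String, Integer> pagesPerDomain = new HashMap<>();
        private final int maxPagesPerDomain;

        DomainLimitSketch(int maxPagesPerDomain) {
            this.maxPagesPerDomain = maxPagesPerDomain;
        }

        // Count a fetched page and report whether the domain is still within its limit.
        boolean acceptPage(String domain) {
            return pagesPerDomain.merge(domain, 1, Integer::sum) <= maxPagesPerDomain;
        }

        public static void main(String[] args) {
            DomainLimitSketch limit = new DomainLimitSketch(2);
            System.out.println(limit.acceptPage("example.org")); // true
            System.out.println(limit.acceptPage("example.org")); // true
            System.out.println(limit.acceptPage("example.org")); // false, limit reached
        }
    }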
Accept URLs with '?' / dynamic URLs:
A question mark is usually a hint for a dynamic page. URLs pointing to dynamic content should usually not be crawled. However, there are sometimes web pages with static content that
is accessed with URLs containing question marks. If you are unsure, do not check this to avoid crawl loops.
Store to Web Cache:
This option is used by default for proxy prefetch, but is not needed for explicit crawling.
Policy for usage of Web Cache:
no cache
if fresh
if exist
cache only
The caching policy states when to use the cache during crawling:
no cache: never use the cache; all content comes from a fresh internet source;
if fresh: use the cache if the cached copy exists and is fresh according to the proxy-fresh rules;
if exist: use the cache if the cached copy exists, without checking freshness; otherwise use the online source;
cache only: never go online; use all content from the cache. If no cached copy exists, treat the content as unavailable.
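The four policies roughly correspond to the following decision logic (a simplified, hypothetical sketch; the freshness test itself is left abstract):

    public class CachePolicySketch {
        enum Policy { NOCACHE, IFFRESH, IFEXIST, CACHEONLY }

        // Decide where content comes from, given the policy and the cache state (assumed inputs).
        static String source(Policy policy, boolean cached, boolean fresh) {
            switch (policy) {
                case NOCACHE:   return "online";
                case IFFRESH:   return (cached && fresh) ? "cache" : "online";
                case IFEXIST:   return cached ? "cache" : "online";
                case CACHEONLY: return cached ? "cache" : "unavailable";
                default:        return "online";
            }
        }

        public static void main(String[] args) {
            // A stale cache entry under the 'if fresh' policy is fetched online again.
            System.out.println(source(Policy.IFFRESH, true, false));
        }
    }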
Do Local Indexing:
index text:
index media:
This enables indexing of the web pages the crawler will download. This should be switched on by default, unless you want to crawl only to fill the
Document Cache without indexing.
Do Remote Indexing:
If checked, the crawler will contact other peers and use them as remote indexers for your crawl.
If you need your crawling results locally, you should switch this off.
Only senior and principal peers can initiate or receive remote crawls.
A YaCyNews message will be created to inform all peers about a global crawl,
so they can avoid starting a crawl with the same start point.
Add Crawl result to collection(s):
A crawl result can be tagged with names that are candidates for a collection request. These tags can be selected with the GSA interface using the 'site' operator. To use this option, the 'collection_sxt' field must be enabled in the Solr schema.
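For example, if a crawl result is tagged with a collection name such as wikipedia (an illustrative value), it could later be selected with a standard Solr filter query on that field, e.g. fq=collection_sxt:wikipedia, or through the 'site' parameter of the GSA-style interface.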