Starting Point:
Existing start URLs are always re-crawled.
Other already-visited URLs are sorted out as "double" unless the re-crawl option allows them to be loaded again.
Crawling Depth:
also all linked non-parsable documents
This defines how many levels of links (links of links, and so on) embedded in web pages the crawler will follow.
0 means that only the page you enter under "Starting Point" will be added
to the index. 2-4 is good for normal indexing. Values over 8 are not useful, since a depth-8 crawl will
index approximately 25,600,000,000 pages, which may be the whole WWW.
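The depth-8 figure above is consistent with about 20 new links per page, since 20^8 = 25,600,000,000. That branching factor is an assumption inferred from the number, not a documented crawler constant. A minimal sketch of the arithmetic, with the hypothetical names `ASSUMED_LINKS_PER_PAGE` and `pages_at_depth`:

```python
# Assumption: every page contributes about 20 previously unseen links.
# This value is inferred from the depth-8 estimate; it is not a crawler setting.
ASSUMED_LINKS_PER_PAGE = 20


def pages_at_depth(depth: int, links_per_page: int = ASSUMED_LINKS_PER_PAGE) -> int:
    """Estimated number of pages first reached at exactly `depth` link hops."""
    return links_per_page ** depth


if __name__ == "__main__":
    for depth in (0, 2, 4, 8):
        print(depth, pages_at_depth(depth))
```

Under this assumption, each additional level multiplies the number of pages by 20, which is why small changes in depth have a large effect on crawl size.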
Scheduled re-crawl
no doubles
run this crawl once and never load any page that is already known; only the start URL may be loaded again.
re-load
run this crawl once, but do not treat URLs that have been known for longer than the selected age (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 21, 28, or 30 years, months, days, or hours) as double; load them again. No scheduled re-crawl.
scheduled
after starting this crawl, repeat the crawl automatically at the selected interval (every 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 21, 28, or 30 minutes, hours, or days).
A web crawl checks every link it finds on the internet against the internal database. If the same URL is found again,
it is treated as a double when the 'no doubles' option is checked. A URL may be loaded again once it has reached a specific age;
to use that check, select the 're-load' option. If you want this web crawl to be repeated automatically, select the 'scheduled' option.
In that case the crawl is repeated after the given time, and no URL from the previous crawl is omitted as a double.
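The three options can be read as one decision per URL. The following is a simplified sketch of that decision, not the crawler's actual code: the function `is_double` and its parameters are hypothetical, and it only models URLs known from an earlier crawl run.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_double(mode: str,
              last_loaded: Optional[datetime],
              now: datetime,
              reload_age: Optional[timedelta] = None,
              is_start_url: bool = False) -> bool:
    """Return True if the URL should be skipped as a double (illustrative model)."""
    if last_loaded is None or is_start_url:
        # Unknown URLs are new; start URLs are always re-crawled.
        return False
    if mode == "no doubles":
        return True
    if mode == "re-load":
        if reload_age is None:
            raise ValueError("re-load mode needs a reload_age")
        # Known URLs are loaded again only once they have reached the given age.
        return now - last_loaded < reload_age
    if mode == "scheduled":
        # No URL from the previous crawl run is omitted as a double.
        return False
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, with the 're-load' option and an age of 7 days, a URL last loaded 9 days ago is loaded again, while one loaded yesterday is skipped.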
Must-Match Filter for URLs :
Use filter
Restrict to start domain
Restrict to sub-path
The filter is a regular expression
that must match the URLs that are to be crawled; the default is 'catch all'.
Example: to allow only urls that contain the word 'science', set the filter to '.*science.*'.
You can also use an automatic domain-restriction to fully crawl a single domain.
Must-Not-Match Filter for URLs :
The filter is a regular expression
that must not match for the page to be accepted for crawling.
The empty string is a never-match filter which should do well for most cases.
If you don't know what this means, please leave this field empty.
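The interplay of the two URL filters can be sketched as follows. This is an illustration under assumptions: it uses Python's `re` module, whose syntax differs in places from the crawler's own regular expression engine, and it assumes the pattern must match the whole URL, which is what the '.*science.*' example suggests. The function name `url_accepted` is hypothetical.

```python
import re


def url_accepted(url: str, must_match: str = ".*", must_not_match: str = "") -> bool:
    """Accept a URL if the must-match filter matches and the must-not-match filter does not.

    Assumption: both patterns are applied to the whole URL.
    """
    if re.fullmatch(must_match, url) is None:
        return False
    # An empty pattern can only match an empty string, so for any real URL
    # the default must-not-match filter never rejects anything.
    return re.fullmatch(must_not_match, url) is None


if __name__ == "__main__":
    print(url_accepted("http://example.org/science/a.html", must_match=".*science.*"))
    print(url_accepted("http://example.org/art/", must_match=".*science.*"))
```

With both defaults left in place, every non-empty URL is accepted, which matches the 'catch all' and 'never-match' behaviour described above.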
Must-Match Filter for IPs :
Like the Must-Match Filter for URLs, this filter must match, but it is applied only to the IP of the host.
YaCy performs a DNS lookup for each host, and this filter restricts the crawl to specific IPs.
Must-Not-Match Filter for IPs :
This filter must not match on the IP of the crawled host.
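The IP filters work on the address obtained from the DNS lookup rather than on the URL text. A rough sketch, assuming whole-string matching and IPv4 addresses; `ip_accepted` and `host_accepted` are hypothetical helper names, and only `host_accepted` touches the network:

```python
import re
import socket


def ip_accepted(ip: str, must_match: str = ".*", must_not_match: str = "") -> bool:
    """Apply the must-match and must-not-match patterns to an IP address string."""
    if re.fullmatch(must_match, ip) is None:
        return False
    return re.fullmatch(must_not_match, ip) is None


def host_accepted(host: str, must_match: str = ".*", must_not_match: str = "") -> bool:
    """Resolve the host name first, then filter on the resulting IPv4 address."""
    ip = socket.gethostbyname(host)  # performs the DNS lookup
    return ip_accepted(ip, must_match, must_not_match)
```

For instance, a must-match pattern of `192\.168\..*` would restrict the crawl to hosts that resolve into that private address range.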
Must-Match List for Country Codes :
Use filter
no country code restriction
Crawls can be restricted to specific countries. This uses the country code that can be computed from
the IP of the server that hosts the page. The filter is not a regular expression but a list of country codes, separated by commas.
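Because this filter is a plain list and not a pattern, checking it amounts to a set lookup. The sketch below assumes the country code of the server has already been derived from its IP by some other means; the function names, the tolerance for spaces, and the case-insensitive comparison are assumptions made for the example.

```python
def parse_country_codes(value: str) -> set:
    """Turn a comma-separated list such as 'DE, FR,us' into a set of upper-case codes."""
    return {code.strip().upper() for code in value.split(",") if code.strip()}


def country_accepted(country_code: str, allowed: set) -> bool:
    """An empty set stands for 'no country code restriction'."""
    if not allowed:
        return True
    return country_code.upper() in allowed


if __name__ == "__main__":
    allowed = parse_country_codes("DE, FR,us")
    print(sorted(allowed))
    print(country_accepted("fr", allowed), country_accepted("CN", allowed))
```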
Maximum Pages per Domain:
Use :
Page-Count :
You can limit the maximum number of pages that are fetched and indexed from a single domain with this option.
You can combine this limitation with the 'Auto-Dom-Filter', so that the limit is applied to all the domains within
the given depth. Domains outside the given depth are then sorted-out anyway.
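A per-domain page limit can be pictured as a counter keyed by domain. This is a minimal sketch, not the crawler's implementation: the class `DomainPageLimiter` is hypothetical, and it treats the host name of the URL as the "domain", which is an assumption.

```python
from collections import Counter
from urllib.parse import urlsplit


class DomainPageLimiter:
    """Accept at most `max_pages` URLs per host name (illustrative only)."""

    def __init__(self, max_pages: int) -> None:
        self.max_pages = max_pages
        self.counts = Counter()

    def try_accept(self, url: str) -> bool:
        host = urlsplit(url).hostname or ""
        if self.counts[host] >= self.max_pages:
            return False  # limit for this domain already reached
        self.counts[host] += 1
        return True


if __name__ == "__main__":
    limiter = DomainPageLimiter(max_pages=2)
    for url in ("http://a.example/1", "http://a.example/2",
                "http://a.example/3", "http://b.example/1"):
        print(url, limiter.try_accept(url))
```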
Accept URLs with '?' / dynamic URLs :
A question mark is usually a hint for a dynamic page. URLs pointing to dynamic content should usually not be crawled. However, there are sometimes web pages with static content that
is accessed with URLs containing question marks. If you are unsure, do not check this, to avoid crawl loops.
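The check described here is simple enough to state in a few lines. The sketch assumes that "dynamic" means nothing more than "the URL contains a question mark"; the function name and the `accept_dynamic` flag are hypothetical stand-ins for the checkbox.

```python
def url_allowed(url: str, accept_dynamic: bool = False) -> bool:
    """Reject URLs containing '?' unless dynamic URLs are explicitly accepted."""
    is_dynamic = "?" in url
    return accept_dynamic or not is_dynamic
```

Leaving the flag off guards against loops where a site generates an unbounded number of URLs that differ only in their query parameters.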
Store to Web Cache :
This option is used by default for proxy prefetch, but is not needed for explicit crawling.
Policy for usage of Web Cache :
no cache
if fresh
if exist
cache only
The caching policy states when to use the cache during crawling:
no cache : never use the cache; fetch all content fresh from the internet;
if fresh : use the cache if a cached copy exists and is fresh according to the proxy-fresh rules;
if exist : use the cache if a cached copy exists, without checking freshness; otherwise use the online source;
cache only : never go online; use only content from the cache. If no cached copy exists, treat the content as unavailable.
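The four policies can be summarised as a small decision function. This is an illustrative sketch: `CachePolicy` and `choose_source` are hypothetical names, freshness is passed in as a ready-made boolean and the proxy-fresh rules are not modelled, and the fallback to the online source for 'if fresh' is inferred from the wording of 'if exist'.

```python
from enum import Enum
from typing import Optional


class CachePolicy(Enum):
    NO_CACHE = "no cache"
    IF_FRESH = "if fresh"
    IF_EXIST = "if exist"
    CACHE_ONLY = "cache only"


def choose_source(policy: CachePolicy, cached: bool, fresh: bool) -> Optional[str]:
    """Return 'cache', 'online', or None when the content counts as unavailable."""
    if policy is CachePolicy.NO_CACHE:
        return "online"
    if policy is CachePolicy.IF_FRESH:
        return "cache" if cached and fresh else "online"
    if policy is CachePolicy.IF_EXIST:
        return "cache" if cached else "online"
    if policy is CachePolicy.CACHE_ONLY:
        return "cache" if cached else None
    raise ValueError(f"unknown policy: {policy!r}")
```

Only 'cache only' can yield no content at all; every other policy falls back to loading from the internet.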
Do Local Indexing:
index text :
index media :
This enables indexing of the web pages the crawler will download. This should be switched on by default, unless you want to crawl only to fill the
Document Cache without indexing.
Do Remote Indexing :
If checked, the crawler will contact other peers and use them as remote indexers for your crawl.
If you need your crawling results locally, you should switch this off.
Only senior and principal peers can initiate or receive remote crawls.
A YaCyNews message will be created to inform all peers about a global crawl,
so they can omit starting a crawl with the same start point.
Exclude static Stop-Words :
This can be useful to prevent extremely common words, e.g. "the", "he", "she", "it", from being added to the database. To exclude all words given in the file yacy.stopwords from indexing,
check this box.
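Stop-word exclusion can be pictured as filtering the word list of a page before it reaches the index. The sketch below is an assumption-laden illustration: it takes the stop words as an in-memory list of lines instead of reading yacy.stopwords, assumes one word per line, and splits text on whitespace only. Both function names are hypothetical.

```python
from typing import Iterable, List, Set


def load_stopwords(lines: Iterable[str]) -> Set[str]:
    """Build a stop-word set from lines, assuming one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def index_terms(text: str, stopwords: Set[str]) -> List[str]:
    """Return the words of `text` that would still be indexed."""
    return [word for word in text.lower().split() if word not in stopwords]


if __name__ == "__main__":
    stopwords = load_stopwords(["the", "he", "she", "it"])
    print(index_terms("The crawler says it indexed the page", stopwords))
```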