Crawl Job
A Crawl Job consists of one or more start points, crawl limitations and document freshness rules.
Start Point
One Start URL or a list of URLs: (must start with http:// https:// ftp:// smb:// file://)
Define the start URL(s) here. You can submit more than one URL; please enter one URL per line.
Each of these URLs is the root of a crawl start; existing start URLs are always re-loaded.
Other, already visited URLs are sorted out as "double" unless re-loading is permitted by the re-crawl option.
From Link-List of URL
From Sitemap
From File (enter a path within your local file system)
Crawler Filter
These are limitations on the crawl stacker. The filters will be applied before a web page is loaded.
Crawling Depth
This defines how many levels of links (of links ...) embedded in websites the crawler will follow.
0 means that only the page you enter under "Starting Point" will be added
to the index. 2-4 is good for normal indexing. Values above 8 are rarely useful, since a depth-8 crawl will
already index approximately 25,600,000,000 pages, which may well be the whole WWW.
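That estimate assumes an average of roughly 20 outgoing links per page: 20^8 = 25,600,000,000 pages at depth 8. The actual number of reachable pages depends entirely on the link structure of the crawled sites.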
also all linked non-parsable documents
Unlimited crawl depth for URLs matching with
Maximum Pages per Domain
You can limit the maximum number of pages that are fetched and indexed from a single domain with this option.
You can combine this limitation with the 'Auto-Dom-Filter', so that the limit is applied to all the domains within
the given depth. Domains outside the given depth are then sorted out anyway.
Use:
Page-Count:
Misc. Constraints
A question mark is usually a hint for a dynamic page. URLs pointing to dynamic content should usually not be crawled.
However, there are sometimes web pages with static content that
is accessed with URLs containing question marks. If you are unsure, do not check this to avoid crawl loops.
Following frames is NOT done by Gxxg1e, but we do it by default to obtain richer content. The 'nofollow' directive in robots metadata can be overridden; this does not affect obeying of the robots.txt, which is never ignored.
Accept URLs with query-part ('?'):
Obey html-robots-noindex:
Obey html-robots-nofollow:
Load Filter on URLs
The filter is a regular expression.
Example: to allow only URLs that contain the word 'science', set the must-match filter to '.*science.*'.
You can also use an automatic domain-restriction to fully crawl a single domain.
Attention: you can test the functionality of your regular expressions using the Regular Expression Tester within YaCy.
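As an illustration only (not YaCy's internal code), the must-match check behaves like standard Java regular expression matching against the complete URL; the filter value and URLs below are hypothetical:

    import java.util.regex.Pattern;

    public class MustMatchDemo {
        public static void main(String[] args) {
            // hypothetical must-match filter from the example above
            Pattern mustMatch = Pattern.compile(".*science.*");
            String[] urls = {
                "https://example.org/science/articles.html",   // matches -> would be loaded
                "https://example.org/sports/news.html"         // does not match -> rejected by the filter
            };
            for (String url : urls) {
                System.out.println(url + " -> " + mustMatch.matcher(url).matches());
            }
        }
    }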
Load Filter on IPs
Must-Match List for Country Codes
Crawls can be restricted to specific countries. This uses the country code that can be computed from
the IP of the server that hosts the page. The filter is not a regular expression but a list of country codes, separated by commas.
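For example, a list such as 'DE,AT,CH' would restrict the crawl to servers whose IP addresses resolve to Germany, Austria or Switzerland (the list is illustrative; use the ISO country codes you need).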
no country code restriction
Use filter
Document Filter
These are limitations on the index feeder. The filters will be applied after a web page has been loaded.
Filter on URLs
The filter is a regular expression
that must not match the URL in order for the content of the URL to be indexed.
Attention: you can test the functionality of your regular expressions using the Regular Expression Tester within YaCy.
Filter on Content of Document (all visible text, including camel-case-tokenized URL and title)
Filter on Document Media Type (aka MIME type)
The filter is a regular expression
that must match the document Media Type (also known as MIME type) in order for the URL to be indexed.
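Example: a must-match filter of 'application/pdf' would restrict indexing to PDF documents, while 'text/.*' would accept any text media type (both values are illustrative).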
Standard Media Types are described in the IANA registry.
Attention: you can test the functionality of your regular expressions using the Regular Expression Tester within YaCy.
Content Filter
These are limitations on parts of a document. The filter will be applied after a web page has been loaded.
Filter div or nav class names
Clean-Up before Crawl Start
Clean up search events cache
Check this option to be sure to get fresh search results including newly crawled documents. Beware that it will also interrupt any refreshing/resorting of search results currently requested from the browser side.
No Deletion
After a crawl has been performed in the past, documents may become stale and eventually they are also deleted on the target host.
To remove old files from the search index it is not sufficient to just consider them for re-loading; it may be necessary
to delete them because they simply do not exist any more. Use this in combination with re-crawl, where that time period should be longer.
Do not delete any document before the crawl is started.
Delete sub-path
For each host in the start URL list, delete all documents (in the given subpath) from that host.
Delete only old
Treat documents that are loaded
#(deleteIfOlderSelect)#::
#{list}##[name]# #{/list}#
#(/deleteIfOlderSelect)#
#(deleteIfOlderUnitSelect)#::
#{list}##[name]# #{/list}#
#(/deleteIfOlderUnitSelect)#
ago as stale and delete them before the crawl is started.
Double-Check Rules
No Doubles
A web crawl performs a double-check on all links found on the internet against the internal database. If the same URL is found again,
the URL is treated as a double when the 'no doubles' option is checked. A URL may be loaded again when it has reached a specific age;
to use that, check the 're-load' option.
Never load any page that is already known. Only the start URL may be loaded again.
Re-load
Treat documents that are loaded
#(reloadIfOlderSelect)#::
#{list}##[name]# #{/list}#
#(/reloadIfOlderSelect)#
#(reloadIfOlderUnitSelect)#::
#{list}##[name]# #{/list}#
#(/reloadIfOlderUnitSelect)#
ago as stale and load them again. If they are younger, they are ignored.
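A minimal sketch of this staleness decision, assuming a hypothetical configuration of '7 days' (illustrative only, not YaCy's internal code):

    import java.util.Date;
    import java.util.concurrent.TimeUnit;

    public class ReloadCheck {
        // hypothetical: the number and unit selected in the form above, here 7 days
        static final long MAX_AGE_MILLIS = TimeUnit.DAYS.toMillis(7);

        // stale documents are loaded again; younger ones are ignored as doubles
        static boolean isStale(Date lastLoadDate) {
            long age = System.currentTimeMillis() - lastLoadDate.getTime();
            return age > MAX_AGE_MILLIS;
        }
    }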
Document Cache
Store to Web Cache
This option is used by default for proxy prefetch, but is not needed for explicit crawling.
Policy for usage of Web Cache
The caching policy states when to use the cache during crawling:
no cache: never use the cache; all content comes from a fresh internet source;
if fresh: use the cache if it exists and is fresh according to the proxy-fresh rules;
if exist: use the cache if it exists, without checking freshness; otherwise use the online source;
cache only: never go online; use all content from the cache. If no cache exists, treat the content as unavailable.
no cache
if fresh
if exist
cache only
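A minimal sketch of how these four policies could be evaluated (the enum and helper flags are hypothetical, not YaCy's actual API):

    public class CachePolicyDemo {
        enum CachePolicy { NOCACHE, IFFRESH, IFEXIST, CACHEONLY }

        // decide whether a cached copy may be used for a document
        static boolean useCache(CachePolicy policy, boolean cacheExists, boolean cacheIsFresh) {
            switch (policy) {
                case NOCACHE:   return false;                        // always load fresh from the internet
                case IFFRESH:   return cacheExists && cacheIsFresh;  // freshness according to the proxy-fresh rules
                case IFEXIST:   return cacheExists;                  // no freshness check; otherwise go online
                case CACHEONLY: return true;                         // never go online; a missing cache means "unavailable"
                default:        return false;
            }
        }
    }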
#(agentSelect)# ::
Robot Behaviour
Use Special User Agent and robot identification
You are running YaCy in non-p2p mode, and because YaCy can be used as a replacement for commercial search appliances
(like the GSA) the user must be able to crawl all web pages that are granted to such commercial platforms.
Not having this option would be a strong handicap for professional usage of this software. Therefore you can select
alternative user agents here which have different crawl timings, identify themselves with another user agent and obey the corresponding robots rules.
#{list}#
#[name]#
#{/list}#
#(/agentSelect)#
#(vocabularySelect)#::
Enrich Vocabulary
Scraping Fields
You can use class names to enrich the terms of a vocabulary based on the text content that appears on web pages. Please write the names of classes into the matrix.
#(/vocabularySelect)#
Snapshot Creation
Max Depth for Snapshots
Snapshots are XML metadata and pictures of web pages that can be created during crawl time.
The XML data is stored in the same way as a Solr search result with one hit, and the pictures are stored as PDF in subdirectories
of HTCACHE/snapshots/. From the PDFs the JPG thumbnails are computed. Snapshot generation can be controlled using a depth parameter; that
means a snapshot is only generated if the crawl depth of a document is smaller than or equal to the number given here. If the number is set to -1,
no snapshots are generated.
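For example, a value of 0 would create snapshots only for the start URL(s), a value of 1 also for pages linked directly from them, and -1 disables snapshot generation entirely.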
Multiple Snapshot Versions
replace old snapshots with new ones
add new versions for each crawl
must-not-match filter for snapshot generation
#(snapshotEnableImages)#
::
Image Creation
#(/snapshotEnableImages)#
Index Attributes
Indexing
This enables indexing of the webpages the crawler will download. This should be switched on by default, unless you want to crawl only to fill the
Document Cache without indexing.
index text:
index media:
#(remoteindexing)#::
Do Remote Indexing
If checked, the crawler will contact other peers and use them as remote indexers for your crawl.
If you need your crawling results locally, you should switch this off.
Only senior and principal peers can initiate or receive remote crawls.
A YaCyNews message will be created to inform all peers about a global crawl,
so they can avoid starting a crawl with the same start point.
#(/remoteindexing)#
Add Crawl result to collection(s)
A crawl result can be tagged with names which are candidates for a collection request.
These tags can be selected with the GSA interface using the 'site' operator.
To use this option, the 'collection_sxt' field must be switched on in the Solr schema.
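For example, a crawl tagged with the collection name 'projectdocs' (an illustrative name) could later be narrowed to exactly that collection through the GSA-compatible search interface by passing the name with the 'site' operator.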
Time Zone Offset
The time zone is required when the parser detects a date in a crawled web page. Content can be searched with the 'on:' modifier, which
also requires a time zone when a query is made. To normalize all given dates, dates are stored in the UTC time zone. To get the right offset
from dates without time zones to UTC, this offset must be given here. The offset is given in minutes;
time zone offsets for locations east of UTC must be negative, offsets for zones west of UTC must be positive.
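A small worked example of this sign convention (values are illustrative): a page hosted in Berlin (UTC+1) uses an offset of -60, so a parsed local time of 12:00 is normalized to 11:00 UTC by adding the offset:

    import java.util.Date;

    public class TimeZoneOffsetDemo {
        // hypothetical value: a server east of UTC (here UTC+1) uses a negative offset
        static final int OFFSET_MINUTES = -60;

        // assumes the parser produced a Date as if the local time were already UTC;
        // adding the (negative) offset shifts 12:00 local to 11:00 UTC
        static Date toUTC(Date parsedLocalDate) {
            return new Date(parsedLocalDate.getTime() + OFFSET_MINUTES * 60_000L);
        }
    }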