YaCy '#[clientname]#': Index Creation

#[metas]# #[header]# #[submenuIndexCreate]#

Index Creation

Start Crawling Job: You can define URLs as start points for Web page crawling and start that crawling here.

Crawling Depth: A minimum of 1 is recommended. Be careful with the prefetch number. Consider a branching factor of average 20; A prefetch-depth of 8 would index 25.600.000.000 pages, maybe the whole WWW.

Crawling Filter: This is an emacs-like regular expression that must match with the crawled URL. Use this i.e. to crawl a single domain. If you set this filter it would make sense to increase the crawl depth.

Accept URLs with '?' / dynamic URLs: URLs pointing to dynamic content should usually not be crawled. However, there are sometimes web pages with static content that is accessed with URL's containing question marks. If you are unsure, do not check this to avoid crawl loops.

Store to Proxy Cache: This option is used by default for proxy prefetch, but is not needed for explicit crawling. We recommend to leave this switched off unless you want to control the crawl results with the Cache Monitor.

Do Local Indexing: This should be switched on by default, unless you want to crawl only to fill the Proxy Cache without indexing.

Do Remote Indexing: Describe your intention to start this global crawl (optional):

This message will appear in the 'Other Peer Crawl Start' table of other peers. If checked, the crawl will try to assign the leaf nodes of the search tree to remote peers. If you need your crawling results locally, you must switch this off. Only senior and principal peer's can initiate or receive remote crawls. A YaCyNews message will be created to inform all peers about a global crawl, so they can ommit starting a crawl with the same start point.

Exclude static Stop-Words To exclude all words given in the file yacy.stopwords from indexing, check this box.

Starting Point:

From File:
From URL:

Existing start URL's are re-crawled. Other already visited URL's are sorted out as 'double'. A complete re-crawl will be available soon.

#(error)# :: Error with profile management. Please stop yacy, delete the File DATA/PLASMADB/crawlProfiles0.db and restart. :: Error: #[errmsg]# :: Application not yet initialized. Sorry. Please wait some seconds and repeat the request. :: ERROR: Crawl filter "#[newcrawlingfilter]#" does not match with crawl root "#[crawlingStart]#". Please try again with different filter

:: Crawling of "#[crawlingURL]#" failed. Reason: #[reasonString]#
:: Error with url input "#[crawlingStart]#": #[error]# :: Error with file input "#[crawlingStart]#": #[error]# #(/error)#
#(info)# :: Set new prefetch depth to "#[newproxyPrefetchDepth]#" :: Crawling of "#[crawlingURL]#" started. You can monitor the crawling progress either by watching the URL queues (local queue, global queue, loader queue, indexing queue) or see the fill/process count of all queues on the performance page. Please wait some seconds, because the request is enqueued and delayed until the http server is idle for a certain time. The indexing result is presented on the Index Monitor-page. It will take at least 30 seconds until the first result appears there. Please be patient, the crawling will pause each time you use the proxy or web server to ensure maximum availability. If you crawl any un-wanted pages, you can delete them here.
:: Removed #[numEntries]# entries from crawl queue. This queue may fill again if the loading and indexing queue is not empty :: Crawling paused successfully. :: Continue crawling. #(/info)#
#(refreshbutton)# ::
#(/refreshbutton)#
Crawl Profile List:
#{crawlProfiles}# #{/crawlProfiles}#

Crawl Thread	Start URL	Depth	Filter	Accept '?'	Fill Proxy Cache	Local Indexing	Remote Indexing
#[name]#	#[startURL]#	#[depth]#	#[filter]#	#(withQuery)#no::yes#(/withQuery)#	#(storeCache)#no::yes#(/storeCache)#	#(localIndexing)#no::yes#(/localIndexing)#	#(remoteIndexing)#no::yes#(/remoteIndexing)#

Recently started remote crawls in progress:
#{otherCrawlStartInProgress}# #{/otherCrawlStartInProgress}#

Start Time	Peer Name	Start URL	Intention/Description	Depth	Accept '?'
#[cre]#	#[peername]#	#[startURL]#	#[intention]#	#[generalDepth]#	#(crawlingQ)#no::yes#(/crawlingQ)#

Recently started remote crawls, finished:
#{otherCrawlStartFinished}# #{/otherCrawlStartFinished}#

Start Time	Peer Name	Start URL	Intention/Description	Depth	Accept '?'
#[cre]#	#[peername]#	#[startURL]#	#[intention]#	#[generalDepth]#	#(crawlingQ)#no::yes#(/crawlingQ)#

Remote Crawling Peers: #(remoteCrawlPeers)# No remote crawl peers availible.
:: #[num]# peers available for remote crawling.

Idle Peers	#{available}##[name]# (#[due]# seconds due) #{/available}#
Busy Peers	#{busy}##[name]# (#[due]# seconds due) #{/busy}#

#(/remoteCrawlPeers)#

#[footer]#

	Accept remote crawling requests and perform crawl at maximum load
	Accept remote crawling requests and perform crawl at maximum of Pages Per Minute (minimum is 1, low system load at PPM <= 30)
	Do not accept remote crawling requests (please set this only if you cannot accept to crawl only one page per minute; see option above)