<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
|
|
<html>
|
|
<head>
|
|
<title>YaCy '#[clientname]#': Index Creation</title>
|
|
#%env/templates/metas.template%#
|
|
<script src="/js/ajax.js"></script>
|
|
<script src="/js/IndexCreate.js"></script>
|
|
</head>
|
|
<body marginheight="0" marginwidth="0" leftmargin="0" topmargin="0">
|
|
#%env/templates/header.template%#
|
|
#%env/templates/submenuIndexCreate.template%#
|
|
<br>
|
|
<h2>Index Creation</h2>
|
|
|
|
<p>
|
|
<div class=small id="startCrawling"><b>Start Crawling Job:</b>
|
|
You can define URLs as starting points for web page crawling and start the crawl here. "Crawling" means that YaCy will download the given web page, extract all links in it and then download the content behind these links. This is repeated down to the level specified under "Crawling Depth".</p>
|
|
</div>
|
|
|
|
<table border="0" cellpadding="5" cellspacing="1" width="100%">
|
|
<form action="IndexCreate_p.html" method="post" enctype="multipart/form-data">
|
|
<tr class="TableHeader">
|
|
<td class="small"><b>Attribute</b></td>
|
|
<td class="small"><b>Value</b></td>
|
|
<td class="small"><b>Description</b></td>
|
|
</tr>
|
|
<tr valign="top" class="TableCellLight">
|
|
<td class=small>Crawling Depth:</td>
|
|
<td class=small><input name="crawlingDepth" type="text" size="2" maxlength="2" value="#[crawlingDepth]#"></td>
|
|
<td class=small>
|
|
This defines how many levels deep the crawler will follow links embedded in web pages.<br>
|
|
A minimum of 1 is recommended and means that the page you enter under "Starting Point" will be added to the index, but no linked content is indexed. 2-4 is good for normal indexing.
|
|
Be careful with the depth: assuming an average branching factor of 20,
a prefetch depth of 8 would index 20^8 = 25,600,000,000 pages, which may be the whole WWW.
|
|
</td>
|
|
</tr>
|
|
<tr valign="top" class="TableCellDark">
|
|
<td class=small>Crawling Filter:</td>
|
|
<td class=small><input name="crawlingFilter" type="text" size="20" maxlength="100" value="#[crawlingFilter]#"></td>
|
|
<td class=small >
|
|
This is an emacs-like regular expression that must match the URLs that are to be crawled.
Use this, e.g., to crawl a single domain. If you set this filter, it makes sense to increase
the crawling depth.
|
|
</td>
|
|
</tr>
|
|
<tr valign="top" class="TableCellLight">
|
|
<td class=small>Re-Crawl Option:</td>
|
|
<td class=small><input name="crawlingIfOlder" type="text" size="7" maxlength="7" value="#[crawlingIfOlder]#"></td>
|
|
<td class=small>
|
|
|
|
</td>
|
|
</tr>
|
|
<tr valign="top" class="TableCellDark">
|
|
<td class=small>Auto-Dom-Filter Depth:</td>
|
|
<td class=small><input name="crawlingDomFilterDepth" type="text" size="2" maxlength="2" value="#[crawlingDomFilterDepth]#"></td>
|
|
<td class=small>
|
|
|
|
</td>
|
|
</tr>
|
|
<tr valign="top" class="TableCellLight">
|
|
<td class=small>Maximum Pages per Domain:</td>
|
|
<td class=small><input name="crawlingDomMaxPages" type="text" size="6" maxlength="6" value="#[crawlingDomMaxPages]#"></td>
|
|
<td class=small>
|
|
|
|
</td>
|
|
</tr>
|
|
<tr valign="top" class="TableCellDark">
|
|
<td class=small>Accept URLs with '?' / dynamic URLs:</td>
|
|
<td class=small><input type="checkbox" name="crawlingQ" align="top" #(crawlingQChecked)#::checked#(/crawlingQChecked)#></td>
|
|
<td class=small>
|
|
A question mark is usually a hint for a dynamic page. URLs pointing to dynamic content should usually not be crawled. However, some web pages with static content
are accessed with URLs containing question marks. If you are unsure, do not check this, to avoid crawl loops.
|
|
</td>
|
|
</tr>
|
|
<tr valign="top" class="TableCellLight">
|
|
<td class=small>Store to Proxy Cache:</td>
|
|
<td class=small><input type="checkbox" name="storeHTCache" align="top" #(storeHTCacheChecked)#::checked#(/storeHTCacheChecked)#></td>
|
|
<td class=small>
|
|
This option is used by default for proxy prefetch, but is not needed for explicit crawling.
|
|
We recommend leaving this switched off unless you want to check the crawl results with the
|
|
<a href="CacheAdmin_p.html" class=small>Cache Monitor</a>.
|
|
</td>
|
|
</tr>
|
|
<tr valign="top" class="TableCellDark">
|
|
<td class=small>Do Local Indexing:</td>
|
|
<td class=small><input type="checkbox" name="localIndexing" align="top" #(localIndexingChecked)#::checked#(/localIndexingChecked)#></td>
|
|
<td class=small>
|
|
This enables indexing of the web pages the crawler will download. This should be switched on by default, unless you want to crawl only to fill the
|
|
<a href="CacheAdmin_p.html" class=small>Proxy Cache</a> without indexing.
|
|
</td>
|
|
</tr>
|
|
<tr valign="top" class="TableCellLight">
|
|
<td class=small>Do Remote Indexing:</td>
|
|
<td>
|
|
<table border="0" cellpadding="2" cellspacing="0">
|
|
<tr>
|
|
<td>
|
|
<input type="checkbox" name="crawlOrder" align="top" #(crawlOrderChecked)#::checked#(/crawlOrderChecked)#>
|
|
</td>
|
|
<td class=small>
|
|
Describe your intention to start this global crawl (optional):<p>
|
|
<input name="intention" type="text" size="40" maxlength="100" value=""></p>
|
|
This message will appear in the 'Other Peer Crawl Start' table of other peers.
|
|
</td>
</tr>
|
|
</table>
|
|
</td>
|
|
<td class=small >
|
|
If checked, the crawler will contact other peers and use them as remote indexers for your crawl.
|
|
If you need your crawling results locally, you should switch this off.
|
|
Only senior and principal peers can initiate or receive remote crawls.
|
|
<b>A YaCyNews message will be created to inform all peers about a global crawl</b>, so they can avoid starting a crawl with the same start point.
|
|
</td>
|
|
</tr>
|
|
<tr valign="top" class="TableCellDark">
|
|
<td class=small>Exclude <i>static</i> Stop-Words</td>
|
|
<td class=small><input type="checkbox" name="xsstopw" align="top" #(xsstopwChecked)#::checked#(/xsstopwChecked)#></td>
|
|
<td class=small>
|
|
This can be useful to prevent extremely common words, e.g. "the", "he", "she", "it", from being added to the database. To exclude all words listed in the file <tt class=small>yacy.stopwords</tt> from indexing,
check this box.
|
|
</td>
|
|
</tr>
|
|
<!--
|
|
<tr valign="top" class="TableCellDark">
|
|
<td class=small>Exclude <i>dynamic</i> Stop-Words</td>
|
|
<td class=small><input type="checkbox" name="xdstopw" align="top" #(xdstopwChecked)#::checked#(/xdstopwChecked)#></td>
|
|
<td class=small colspan="3">
|
|
Excludes all words from indexing which are listed by statistic rules.
|
|
<i>THIS IS NOT YET FUNCTIONAL</i>
|
|
</td>
|
|
</tr>
|
|
<tr valign="top" class="TableCellDark">
|
|
<td class=small>Exclude <i>parent-indexed</i> words</td>
|
|
<td class=small><input type="checkbox" name="xpstopw" align="top" #(xpstopwChecked)#::checked#(/xpstopwChecked)#></td>
|
|
<td class=small colspan="3">
|
|
Excludes all words from indexing which had been indexed in the parent web page.
|
|
<i>THIS IS NOT YET FUNCTIONAL</i>
|
|
</td>
|
|
</tr>
|
|
-->
|
|
<tr valign="top" class="TableCellLight">
|
|
<td class="small">Starting Point:</td>
|
|
<td class="small">
|
|
<table cellpadding="0" cellspacing="0">
|
|
<tr><td class="small">From File:</td>
|
|
<td class="small"><input type="radio" name="crawlingMode" value="file"></td>
|
|
<td class="small"><input type="file" name="crawlingFile" size="28"></td>
|
|
</tr>
|
|
<tr><td class="small">From URL:</td>
|
|
<td class="small"><input type="radio" name="crawlingMode" value="url" checked="checked"></td>
|
|
<td class="small">
|
|
<input name="crawlingURL" type="text" size="41" maxlength="256" value="http://" onkeypress="changed()">
|
|
<span id="robotsOK"></span>
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td colspan="2"><span id="title"></span></td>
|
|
</tr>
|
|
</table>
|
|
</td>
|
|
<td class=small colspan="3">Existing start URLs are re-crawled.
|
|
Other already visited URLs are sorted out as "double".
|
|
A complete re-crawl will be available soon.
|
|
</td>
|
|
</tr>
|
|
<tr valign="top" class="TableCellDark">
|
|
<td class=small colspan="3"><input type="submit" name="crawlingstart" value="Start New Crawl"></td>
|
|
</tr>
|
|
</form>
|
|
</table>
|
|
</p>
|
|
|
|
<p><form action="IndexCreate_p.html" method="post" enctype="multipart/form-data">
|
|
<div class=small id="distributedIndexing">
|
|
<b>Distributed Indexing: </b>
|
|
Crawling and indexing can be done by remote peers.
|
|
Your peer can search and index for other peers and they can search for you.</div></p>
|
|
|
|
<table border="0" cellpadding="5" cellspacing="1" width="100%">
|
|
<tr valign="top" class="TableCellDark">
|
|
<td class=small width="10%">
|
|
<input type="radio" name="dcr" value="acceptCrawlMax" align="top" #(acceptCrawlMaxChecked)#::checked#(/acceptCrawlMaxChecked)#>
|
|
</td><td class=small>
|
|
Accept remote crawling requests and perform crawl at maximum load
|
|
</td>
|
|
</tr><tr valign="top" class="TableCellLight">
|
|
<td class=small width="10%">
|
|
<input type="radio" name="dcr" value="acceptCrawlLimited" align="top" #(acceptCrawlLimitedChecked)#::checked#(/acceptCrawlLimitedChecked)#>
|
|
</td><td class=small>
|
|
Accept remote crawling requests and perform crawl at maximum of
|
|
<input name="acceptCrawlLimit" type="text" size="4" maxlength="4" value="#[PPM]#"> Pages Per Minute (minimum is 1; system load is usually low at PPM &lt;= 30)
|
|
</td>
|
|
</tr><tr valign="top" class="TableCellDark">
|
|
<td class=small width="10%">
|
|
<input type="radio" name="dcr" value="acceptCrawlDenied" align="top" #(acceptCrawlDeniedChecked)#::checked#(/acceptCrawlDeniedChecked)#>
|
|
</td><td class=small>
|
|
Do not accept remote crawling requests (please set this only if you cannot accept crawling even one page per minute; see option above)
</td>
|
|
</tr>
|
|
<tr valign="top" class="TableCellLight">
|
|
<td width="10%">
|
|
<input type="submit" name="distributedcrawling" value="set">
|
|
</td>
|
|
<td>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
</form></p>
|
|
|
|
|
|
<p>
|
|
#(error)#<!-- 0 -->
|
|
::<!-- 1 -->
|
|
Error with profile management. Please stop YaCy, delete the file DATA/PLASMADB/crawlProfiles0.db and restart.
|
|
::<!-- 2 -->
|
|
Error: #[errmsg]#
|
|
::<!-- 3 -->
|
|
Application not yet initialized. Sorry. Please wait a few seconds and repeat the request.
|
|
::<!-- 4 -->
|
|
<b>ERROR: Crawl filter "#[newcrawlingfilter]#" does not match crawl root "#[crawlingStart]#".</b> Please try again with a different filter.</p><br>
|
|
::<!-- 5 -->
|
|
Crawling of "#[crawlingURL]#" failed. Reason: #[reasonString]#<br>
|
|
::<!-- 6 -->
|
|
Error with URL input "#[crawlingStart]#": #[error]#
|
|
::<!-- 7 -->
|
|
Error with file input "#[crawlingStart]#": #[error]#
|
|
#(/error)#
|
|
<br>
|
|
#(info)#
|
|
::
|
|
Set new prefetch depth to "#[newproxyPrefetchDepth]#"
|
|
::
|
|
Crawling of "#[crawlingURL]#" started.
|
|
You can monitor the crawling progress either by watching the URL queues
|
|
(<a href="/IndexCreateWWWLocalQueue_p.html">local queue</a>,
|
|
<a href="/IndexCreateWWWGlobalQueue_p.html">global queue</a>,
|
|
<a href="/IndexCreateLoaderQueue_p.html">loader queue</a>,
|
|
<a href="/IndexCreateIndexingQueue_p.html">indexing queue</a>)
|
|
or see the fill/process count of all queues on the
|
|
<a href="/PerformanceQueues_p.html">performance page</a>.
|
|
<b>Please wait a few seconds; the request is enqueued and delayed until the proxy/HTTP server has been idle for a certain time.</b>
|
|
The indexing results are presented on the
|
|
<a href="IndexMonitor.html">Index Monitor</a> page.
|
|
<b>It will take at least 30 seconds until the first result appears there. Please be patient: crawling pauses each time you use the proxy or web server, to ensure maximum availability.</b>
|
|
If you crawl any unwanted pages, you can delete them <a href="IndexCreateWWWLocalQueue_p.html">here</a>.<br>
|
|
::
|
|
Removed #[numEntries]# entries from crawl queue. This queue may fill again if the loading and indexing queues are not empty.
|
|
::
|
|
Crawling paused successfully.
|
|
::
|
|
Continue crawling.
|
|
#(/info)#
|
|
<br>
|
|
#(refreshbutton)#
|
|
::
|
|
<form action="IndexCreate_p.html" method="post" enctype="multipart/form-data">
|
|
<input type="submit" name="refreshpage" value="refresh">
|
|
</form>
|
|
<br>
|
|
#(/refreshbutton)#
|
|
<br>
|
|
<form action="IndexCreate_p.html" method="post" enctype="multipart/form-data">
|
|
#(crawler-paused)#
|
|
<input type="submit" name="continuecrawlqueue" value="continue crawling">
|
|
::
|
|
<input type="submit" name="pausecrawlqueue" value="pause crawling">
|
|
#(/crawler-paused)#
|
|
</form>
|
|
<b id="crawlingProfiles">Crawl Profile List:</b><br>
|
|
<table border="0" cellpadding="2" cellspacing="1" width="100%">
|
|
<tr class="TableHeader">
|
|
<td width="120" class="small"><b>Crawl Thread</b></td>
|
|
<td class="small"><b>Start URL</b></td>
|
|
<td width="16" class="small"><b>Depth</b></td>
|
|
<td width="60" class="small"><b>Filter</b></td>
|
|
<td width="10" class="small"><b>Accept "?" URLs</b></td>
|
|
<td width="10" class="small"><b>Fill Proxy Cache</b></td>
|
|
<td width="10" class="small"><b>Local Indexing</b></td>
|
|
<td width="10" class="small"><b>Remote Indexing</b></td>
|
|
</tr>
|
|
#{crawlProfiles}#
|
|
<tr class="TableCell#(dark)#Light::Dark#(/dark)#">
|
|
<td class="small">#[name]#</td>
|
|
<td class="small"><a class="small" href="#[startURL]#">#[startURL]#</a></td>
|
|
<td class="small">#[depth]#</td>
|
|
<td class="small">#[filter]#</td>
|
|
<td class="small">#(withQuery)#no::yes#(/withQuery)#</td>
|
|
<td class="small">#(storeCache)#no::yes#(/storeCache)#</td>
|
|
<td class="small">#(localIndexing)#no::yes#(/localIndexing)#</td>
|
|
<td class="small">#(remoteIndexing)#no::yes#(/remoteIndexing)#</td>
|
|
</tr>
|
|
#{/crawlProfiles}#
|
|
</table>
|
|
<br>
|
|
<b id="crawlingStarts">Recently started remote crawls in progress:</b><br>
|
|
<table border="0" cellpadding="2" cellspacing="1" width="100%">
|
|
<tr class="TableHeader">
|
|
<td class="small"><b>Start Time</b></td>
|
|
<td class="small"><b>Peer Name</b></td>
|
|
<td class="small"><b>Start URL</b></td>
|
|
<td class="small"><b>Intention/Description</b></td>
|
|
<td class="small"><b>Depth</b></td>
|
|
<td class="small"><b>Accept '?' URLs</b></td>
|
|
</tr>
|
|
#{otherCrawlStartInProgress}#
|
|
<tr class="TableCell#(dark)#Light::Dark#(/dark)#">
|
|
<td class="small">#[cre]#</td>
|
|
<td class="small">#[peername]#</td>
|
|
<td class="small"><a class="small" href="#[startURL]#">#[startURL]#</a></td>
|
|
<td class="small">#[intention]#</td>
|
|
<td class="small">#[generalDepth]#</td>
|
|
<td class="small">#(crawlingQ)#no::yes#(/crawlingQ)#</td>
|
|
</tr>
|
|
#{/otherCrawlStartInProgress}#
|
|
</table>
|
|
<br>
|
|
<b>Recently started remote crawls, finished:</b><br>
|
|
<table border="0" cellpadding="2" cellspacing="1" width="100%">
|
|
<tr class="TableHeader">
|
|
<td class="small"><b>Start Time</b></td>
|
|
<td class="small"><b>Peer Name</b></td>
|
|
<td class="small"><b>Start URL</b></td>
|
|
<td class="small"><b>Intention/Description</b></td>
|
|
<td class="small"><b>Depth</b></td>
|
|
<td class="small"><b>Accept '?' URLs</b></td>
|
|
</tr>
|
|
#{otherCrawlStartFinished}#
|
|
<tr class="TableCell#(dark)#Light::Dark#(/dark)#">
|
|
<td class="small">#[cre]#</td>
|
|
<td class="small">#[peername]#</td>
|
|
<td class="small"><a class="small" href="#[startURL]#">#[startURL]#</a></td>
|
|
<td class="small">#[intention]#</td>
|
|
<td class="small">#[generalDepth]#</td>
|
|
<td class="small">#(crawlingQ)#no::yes#(/crawlingQ)#</td>
|
|
</tr>
|
|
#{/otherCrawlStartFinished}#
|
|
</table>
|
|
<br>
|
|
<b id="remoteCrawlPeers">Remote Crawling Peers:</b>
|
|
#(remoteCrawlPeers)#
|
|
No remote crawl peers available.<br>
|
|
::
|
|
#[num]# peers available for remote crawling.
|
|
<table border="0" cellpadding="2" cellspacing="1" width="100%">
|
|
<tr class="TableCellDark">
|
|
<th width="60" class="small">Idle Peers</th>
|
|
<td class="small">
|
|
#{available}##[name]# (#[due]# seconds due) #{/available}#
|
|
</td>
|
|
</tr>
|
|
<tr class="TableCellLight">
|
|
<th width="60" class="small">Busy Peers</th>
|
|
<td class="small">
|
|
#{busy}##[name]# (#[due]# seconds due) #{/busy}#
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
#(/remoteCrawlPeers)#
|
|
</p>
|
|
|
|
#%env/templates/footer.template%#
|
|
</body>
|
|
</html>