<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>YaCy '#[clientname]#': Crawl Start (expert)</title>
#%env/templates/metas.template%#
<script type="text/javascript" src="/js/ajax.js"></script>
<script type="text/javascript" src="/js/IndexCreate.js"></script>
</head>
<body id="IndexCreate">
#%env/templates/header.template%#
#%env/templates/submenuIndexCreate.template%#
<h2>Expert Crawl Start</h2>

<p id="startCrawling">
<strong>Start Crawling Job:</strong>
You can define URLs as start points for web page crawling and start crawling here. "Crawling" means that YaCy will download the given website, extract all links in it and then download the content behind these links. This is repeated up to the depth specified under "Crawling Depth".
</p>

<form action="WatchCrawler_p.html" method="get" enctype="multipart/form-data">
<table border="0" cellpadding="5" cellspacing="1">
<tr class="TableHeader">
<td><strong>Attribute</strong></td>
<td><strong>Value</strong></td>
<td><strong>Description</strong></td>
</tr>
<tr valign="top" class="TableCellSummary">
<td>Starting Point:</td>
<td>
<table cellpadding="0" cellspacing="0">
<tr>
<td><label for="url"><nobr>From URL</nobr></label>:</td>
<td><input type="radio" name="crawlingMode" id="url" value="url" checked="checked" /></td>
<td>
<input name="crawlingURL" type="text" size="41" maxlength="256" value="http://" onkeypress="changed()" />
<span id="robotsOK"></span>
</td>
</tr>
<tr>
<td><label for="sitemap"><nobr>From Sitemap</nobr></label>:</td>
<td><input type="radio" name="crawlingMode" id="sitemap" value="sitemap" disabled="disabled"/></td>
<td>
<input name="sitemapURL" type="text" size="41" maxlength="256" value="" readonly="readonly"/>
</td>
</tr>
<tr>
<td><label for="file"><nobr>From File</nobr></label>:</td>
<td><input type="radio" name="crawlingMode" id="file" value="file" /></td>
<td><input type="file" name="crawlingFile" size="28" /></td>
</tr>
<tr>
<td colspan="3" class="commit"><span id="title"><br/></span><img src="/env/grafics/empty.gif" name="ajax" alt="empty" /></td>
</tr>
</table>
</td>
<td colspan="3">
Existing start URLs are always re-crawled.
Other already visited URLs are rejected as "double" unless the re-crawl option allows them.
</td>
</tr>
<tr valign="top" class="TableCellLight">
<td><label for="crawlingDepth">Crawling Depth</label>:</td>
<td><input name="crawlingDepth" id="crawlingDepth" type="text" size="2" maxlength="2" value="#[crawlingDepth]#" /></td>
<td>
This defines how many levels of links embedded in web pages the crawler will follow.<br />
The minimum depth is 0, which means that the page you enter under "Starting Point" will be added
to the index, but no linked content is indexed. 2-4 is good for normal indexing.
Be careful with the depth: with an average branching factor of 20,
a crawling depth of 8 would index 20<sup>8</sup> = 25,600,000,000 pages, which may be the whole WWW.
</td>
</tr>
<tr valign="top" class="TableCellDark">
<td><label for="crawlingFilter">Crawling Filter</label>:</td>
<td>
<input type="radio" name="range" value="wide" checked="checked" />Use filter
<input name="crawlingFilter" id="crawlingFilter" type="text" size="20" maxlength="100" value="#[crawlingFilter]#" /><br />
<input type="radio" name="range" value="domain" />Restrict to start domain<br />
<input type="radio" name="range" value="subpath" />Restrict to sub-path
</td>
<td>
The filter is an Emacs-like regular expression that a URL must match in order to be crawled; the default is 'catch all'.
You can also use an automatic domain restriction to fully crawl a single domain.
</td>
</tr>
<tr valign="top" class="TableCellLight">
<td>Re-crawl known URLs:</td>
<td>
<label for="crawlingIfOlderChecked">Use</label>:
<input type="checkbox" name="crawlingIfOlderCheck" id="crawlingIfOlderChecked" #(crawlingIfOlderCheck)#::checked="checked"#(/crawlingIfOlderCheck)# />
<label for="crawlingIfOlderNumber">If older than</label>:
<input name="crawlingIfOlderNumber" id="crawlingIfOlderNumber" type="text" size="7" maxlength="7" value="#[crawlingIfOlderNumber]#" />
<select name="crawlingIfOlderUnit">
<option value="year" #(crawlingIfOlderUnitYearCheck)#::selected="selected"#(/crawlingIfOlderUnitYearCheck)#>Year(s)</option>
<option value="month" #(crawlingIfOlderUnitMonthCheck)#::selected="selected"#(/crawlingIfOlderUnitMonthCheck)#>Month(s)</option>
<option value="day" #(crawlingIfOlderUnitDayCheck)#::selected="selected"#(/crawlingIfOlderUnitDayCheck)#>Day(s)</option>
<option value="hour" #(crawlingIfOlderUnitHourCheck)#::selected="selected"#(/crawlingIfOlderUnitHourCheck)#>Hour(s)</option>
<option value="minute" #(crawlingIfOlderUnitMinuteCheck)#::selected="selected"#(/crawlingIfOlderUnitMinuteCheck)#>Minute(s)</option>
</select>
</td>
<td>
If you use this option, web pages that already exist in your database are crawled and indexed again.
Whether this happens depends on the age of the last crawl: if the last crawl is older than the given
age, the page is crawled again; otherwise it is treated as 'double' and not loaded or indexed again.
</td>
</tr>
<tr valign="top" class="TableCellDark">
<td>Auto-Dom-Filter:</td>
<td>
<label for="crawlingDomFilterCheck">Use</label>:
<input type="checkbox" name="crawlingDomFilterCheck" id="crawlingDomFilterCheck" #(crawlingDomFilterCheck)#::checked="checked"#(/crawlingDomFilterCheck)# />
<label for="crawlingDomFilterDepth">Depth</label>:
<input name="crawlingDomFilterDepth" id="crawlingDomFilterDepth" type="text" size="2" maxlength="2" value="#[crawlingDomFilterDepth]#" />
</td>
<td>
This option automatically creates a domain filter that limits the crawl to the domains the crawler
finds at the given depth. You can use this option, e.g., to crawl a page with bookmarks while
restricting the crawl to only those domains that appear on the bookmark page. The appropriate depth
for this example would be 1.<br />
The default value 0 gives no restrictions.
</td>
</tr>
<tr valign="top" class="TableCellLight">
<td>Maximum Pages per Domain:</td>
<td>
<label for="crawlingDomMaxCheck">Use</label>:
<input type="checkbox" name="crawlingDomMaxCheck" id="crawlingDomMaxCheck" #(crawlingDomMaxCheck)#::checked="checked"#(/crawlingDomMaxCheck)# />
<label for="crawlingDomMaxPages">Page-Count</label>:
<input name="crawlingDomMaxPages" id="crawlingDomMaxPages" type="text" size="6" maxlength="6" value="#[crawlingDomMaxPages]#" />
</td>
<td>
You can limit the maximum number of pages that are fetched and indexed from a single domain with this option.
You can combine this limitation with the 'Auto-Dom-Filter', so that the limit is applied to all the domains within
the given depth. Domains outside the given depth are then rejected anyway.
</td>
</tr>
<tr valign="top" class="TableCellDark">
<td><label for="crawlingQ">Accept URLs with '?' / dynamic URLs</label>:</td>
<td><input type="checkbox" name="crawlingQ" id="crawlingQ" #(crawlingQChecked)#::checked="checked"#(/crawlingQChecked)# /></td>
<td>
A question mark is usually a hint for a dynamic page. URLs pointing to dynamic content should usually not be crawled. However, there are sometimes web pages with static content that
is accessed with URLs containing question marks. If you are unsure, do not check this to avoid crawl loops.
</td>
</tr>
<tr valign="top" class="TableCellLight">
<td><label for="storeHTCache">Store to Web Cache</label>:</td>
<td><input type="checkbox" name="storeHTCache" id="storeHTCache" #(storeHTCacheChecked)#::checked="checked"#(/storeHTCacheChecked)# /></td>
<td>
This option is used by default for proxy prefetch, but is not needed for explicit crawling.
We recommend leaving this switched off unless you want to check the crawl results with the
<a href="CacheAdmin_p.html">Cache Monitor</a>.
</td>
</tr>
<tr valign="top" class="TableCellDark">
<td>Do Local Indexing:</td>
<td>
<label for="indexText">index text</label>:
<input type="checkbox" name="indexText" id="indexText" #(indexingTextChecked)#::checked="checked"#(/indexingTextChecked)# />
<label for="indexMedia">index media</label>:
<input type="checkbox" name="indexMedia" id="indexMedia" #(indexingMediaChecked)#::checked="checked"#(/indexingMediaChecked)# />
</td>
<td>
This enables indexing of the web pages the crawler will download. This should be switched on by default, unless you want to crawl only to fill the
<a href="CacheAdmin_p.html">Proxy Cache</a> without indexing.
</td>
</tr>
<tr valign="top" class="TableCellLight">
<td><label for="crawlOrder">Do Remote Indexing</label>:</td>
<td>
<table border="0" cellpadding="2" cellspacing="0">
<tr>
<td>
<input type="checkbox" name="crawlOrder" id="crawlOrder" #(crawlOrderChecked)#::checked="checked"#(/crawlOrderChecked)# />
</td>
<td>
<label for="intention">Describe your intention to start this global crawl (optional)</label>:<br />
<input name="intention" id="intention" type="text" size="40" maxlength="100" value="" /><br />
This message will appear in the 'Other Peer Crawl Start' table of other peers.
</td>
</tr>
</table>
</td>
<td>
If checked, the crawler will contact other peers and use them as remote indexers for your crawl.
If you need your crawling results locally, you should switch this off.
Only senior and principal peers can initiate or receive remote crawls.
<strong>A YaCyNews message will be created to inform all peers about a global crawl</strong>,
so they can avoid starting a crawl with the same start point.
</td>
</tr>
<tr valign="top" class="TableCellDark">
<td><label for="xsstopw">Exclude <em>static</em> Stop-Words</label>:</td>
<td><input type="checkbox" name="xsstopw" id="xsstopw" #(xsstopwChecked)#::checked="checked"#(/xsstopwChecked)# /></td>
<td>
This can be useful to keep extremely common words, e.g. "the", "he", "she", "it", out of the database. To exclude all words given in the file <tt>yacy.stopwords</tt> from indexing,
check this box.
</td>
</tr>
<!--
<tr valign="top" class="TableCellDark">
<td>Exclude <em>dynamic</em> Stop-Words</td>
<td><input type="checkbox" name="xdstopw" #(xdstopwChecked)#::checked="checked"#(/xdstopwChecked)# /></td>
<td colspan="3">
Excludes all words from indexing which are listed by statistical rules.
<em>THIS IS NOT YET FUNCTIONAL</em>
</td>
</tr>
<tr valign="top" class="TableCellDark">
<td>Exclude <em>parent-indexed</em> words</td>
<td><input type="checkbox" name="xpstopw" #(xpstopwChecked)#::checked="checked"#(/xpstopwChecked)# /></td>
<td colspan="3">
Excludes all words from indexing which had been indexed in the parent web page.
<em>THIS IS NOT YET FUNCTIONAL</em>
</td>
</tr>
-->
<tr valign="top" class="TableCellLight">
<td colspan="5"><input type="submit" name="crawlingstart" value="Start New Crawl" /></td>
</tr>
</table>
</form>

#%env/templates/footer.template%#
</body>
</html>