<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>YaCy '#[clientname]#': Crawl Start</title>
#%env/templates/metas.template%#
<script type="text/javascript" src="/js/ajax.js"></script>
<script type="text/javascript" src="/js/IndexCreate.js"></script>
<script type="text/javascript">
// Check the radio button (or checkbox) whose element id is given by 'key'.
function check(key){
document.getElementById(key).checked = true;
}
</script>
<style type="text/css">
.nobr {
white-space: nowrap;
}
</style>
</head>
<body id="IndexCreate">

<div id="api">
<a href="http://www.yacy-websuche.de/wiki/index.php/Dev:API#Managing_crawl_jobs" id="apilink"><img src="/env/grafics/api.png" width="60" height="40" alt="API"/></a>
<span>Click on this API button to see the documentation of the POST request parameters for crawl starts.</span>
</div>

#%env/templates/header.template%#
#%env/templates/submenuIndexCreate.template%#
<h2>Expert Crawl Start</h2>

<p id="startCrawling">
<strong>Start Crawling Job:</strong>
You can define URLs as start points for web page crawling and start the crawling here.
"Crawling" means that YaCy will download the given website, extract all links from it and then download the content behind these links.
This is repeated up to the depth specified under "Crawling Depth".
A crawl can also be started using wget and the <a href="http://www.yacy-websuche.de/wiki/index.php/Dev:API#Managing_crawl_jobs">POST arguments</a> for this web page.
</p>
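<!-- Example (sketch): a crawl can be started from the command line by POSTing the
     form parameters to Crawler_p.html. An equivalent curl call, with host, port and
     parameter values as placeholders (assuming a peer on the default port 8090):
     curl -s -d "crawlingstart=&crawlingMode=url&crawlingURL=http://example.org/&crawlingDepth=2" "http://localhost:8090/Crawler_p.html"
-->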

<form id="Crawler" action="Crawler_p.html" method="post" enctype="multipart/form-data" accept-charset="UTF-8">
<table border="0" cellpadding="5" cellspacing="1">
<tr class="TableHeader">
<td><strong>Attribute</strong></td>
<td><strong>Value</strong></td>
<td><strong>Description</strong></td>
</tr>
<tr valign="top" class="TableCellSummary">
<td>Starting Point:</td>
<td>
<table cellpadding="0" cellspacing="0">
<tr>
<td><label for="url"><span class="nobr">From URL (must start with<br/>http:// https:// ftp:// smb:// file://)</span></label>:</td>
<td><input type="radio" name="crawlingMode" id="url" value="url" checked="checked" /></td>
<td>
<textarea name="crawlingURL" id="crawlingURL" cols="41" rows="3" onkeypress="changed()" onfocus="check('url')">#[starturl]#</textarea>
</td>
</tr>
<tr>
<td></td>
<td></td>
<td>
<input name="bookmarkTitle" id="bookmarkTitle" type="text" size="50" maxlength="256" value="" readonly="readonly" style="background:transparent; border:0px"/>
</td>
</tr>
<tr>
<td><label for="sitelist"><span class="nobr">From Link-List of URL</span></label>:</td>
<td><input type="radio" name="crawlingMode" id="sitelist" value="sitelist" disabled="disabled" onclick="document.getElementById('Crawler').rangeDomain.checked = true;"/></td>
<td>
<div id="sitelistURLs"></div>
</td>
</tr>
<tr>
<td><label for="sitemap"><span class="nobr">From Sitemap</span></label>:</td>
<td><input type="radio" name="crawlingMode" id="sitemap" value="sitemap" disabled="disabled"/></td>
<td>
<input name="sitemapURL" type="text" size="41" maxlength="256" value="" readonly="readonly"/>
</td>
</tr>
<tr>
<td><label for="file"><span class="nobr">From File (enter a path<br/>within your local file system)</span></label>:</td>
<td><input type="radio" name="crawlingMode" id="file" value="file" onclick="document.getElementById('Crawler').rangeDomain.checked = true;"/></td>
<td><input type="text" name="crawlingFile" size="41" onfocus="check('file')"/><!--<input type="file" name="crawlingFile" size="18" onfocus="check('file')"/>--></td>
</tr>
<tr>
<td colspan="3" class="commit">
<span id="robotsOK"></span>
<span id="title"><br/></span>
<img id="ajax" src="/env/grafics/empty.gif" alt="empty" />
</td>
</tr>
</table>
</td>
<td colspan="3">
Define the start URL(s) here. You can submit more than one URL; please enter one URL per line.
Each of these URLs is the root of a crawl start; existing start URLs are always re-loaded.
Other already visited URLs are sorted out as "doubles" if they are not allowed by the re-crawl option.
</td>
</tr>
<tr valign="top" class="TableCellLight">
<td>Crawling Depth:</td>
<td>
<input name="crawlingDepth" id="crawlingDepth" type="text" size="2" maxlength="2" value="#[crawlingDepth]#" />
<input type="checkbox" name="directDocByURL" id="directDocByURL" #(directDocByURLChecked)#::checked="checked"#(/directDocByURLChecked)# />also all linked non-parsable documents<br/>
Unlimited crawl depth for URLs matching: <input name="crawlingDepthExtension" id="crawlingDepthExtension" type="text" size="30" maxlength="100" value="#[crawlingDepthExtension]#" />
</td>
<td>
This defines how many levels deep the crawler will follow links (and links of links, and so on) embedded in websites.
0 means that only the page you enter under "Starting Point" will be added
to the index. 2-4 is good for normal indexing. Values over 8 are not useful: assuming roughly 20 links
per page, a depth-8 crawl would already index about 20^8 = 25,600,000,000 pages, which may be the whole WWW.
</td>
</tr>
<tr valign="top" class="TableCellDark">
<td>Scheduled re-crawl</td>
<td>
<dl>
<dt>no doubles<input type="radio" name="recrawl" value="nodoubles" #(crawlingIfOlderCheck)#checked="checked"::#(/crawlingIfOlderCheck)#/></dt>
<dd>run this crawl once and never load any page that is already known; only the start URL may be loaded again.</dd>
<dt>re-load<input type="radio" name="recrawl" value="reload" #(crawlingIfOlderCheck)#::checked="checked"#(/crawlingIfOlderCheck)# /></dt>
<dd>run this crawl once, but treat URLs that have been known for more than<br/>
<select name="crawlingIfOlderNumber" id="crawlingIfOlderNumber">
<option value="1">1</option><option value="2">2</option><option value="3">3</option>
<option value="4">4</option><option value="5">5</option><option value="6">6</option>
<option value="7" selected="selected">7</option>
<option value="8">8</option><option value="9">9</option><option value="10">10</option>
<option value="12">12</option><option value="14">14</option><option value="21">21</option>
<option value="28">28</option><option value="30">30</option>
</select>
<select name="crawlingIfOlderUnit">
<option value="year" #(crawlingIfOlderUnitYearCheck)#::selected="selected"#(/crawlingIfOlderUnitYearCheck)#>years</option>
<option value="month" #(crawlingIfOlderUnitMonthCheck)#::selected="selected"#(/crawlingIfOlderUnitMonthCheck)#>months</option>
<option value="day" #(crawlingIfOlderUnitDayCheck)#::selected="selected"#(/crawlingIfOlderUnitDayCheck)#>days</option>
<option value="hour" #(crawlingIfOlderUnitHourCheck)#::selected="selected"#(/crawlingIfOlderUnitHourCheck)#>hours</option>
</select> not as doubles and load them again. No scheduled re-crawl.
</dd>
<dt>scheduled<input type="radio" name="recrawl" value="scheduler"/></dt>
<dd>after starting this crawl, repeat the crawl every<br/>
<select name="repeat_time">
<option value="1">1</option><option value="2">2</option><option value="3">3</option>
<option value="4">4</option><option value="5">5</option><option value="6">6</option>
<option value="7" selected="selected">7</option>
<option value="8">8</option><option value="9">9</option><option value="10">10</option>
<option value="12">12</option><option value="14">14</option><option value="21">21</option>
<option value="28">28</option><option value="30">30</option>
</select>
<select name="repeat_unit">
<option value="selminutes">minutes</option>
<option value="selhours">hours</option>
<option value="seldays" selected="selected">days</option>
</select> automatically.
</dd>
</dl>
</td>
<td>
A web crawl performs a double-check on all links found on the internet against the internal database. If the same URL is found again,
it is treated as a double when you check the 'no doubles' option. A URL may be loaded again when it has reached a specific age;
to use that, check the 're-load' option. If you want this web crawl to be repeated automatically, check the 'scheduled' option.
In this case the crawl is repeated after the given time, and no URL from the previous crawl is omitted as a double.
</td>
</tr>
<tr valign="top" class="TableCellLight">
<td><label for="mustmatch">Must-Match Filter for URLs for crawling</label>:</td>
<td>
<input type="radio" name="range" id="rangeWide" value="wide" checked="checked" onclick="document.getElementById('deleteold').checked=false;document.getElementById('deleteold').disabled=true;"/>Use filter
<input name="mustmatch" id="mustmatch" type="text" size="60" maxlength="100" value="#[mustmatch]#" /><br />
<input type="radio" name="range" id="rangeDomain" value="domain" onclick="document.getElementById('deleteold').disabled=false;document.getElementById('deleteold').checked=true;"/>Restrict to start domain<br />
<input type="radio" name="range" id="rangeSubpath" value="subpath" onclick="document.getElementById('deleteold').disabled=false;document.getElementById('deleteold').checked=true;" />Restrict to sub-path<br />
<input type="checkbox" name="deleteold" id="deleteold" disabled="disabled"/>Delete all old documents in domain/subpath
</td>
<td>
The filter is a <b><a href="http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html">regular expression</a></b>
that <b>must match</b> the URLs that are to be crawled; the default is 'catch all'.
Example: to allow only URLs that contain the word 'science', set the filter to '.*science.*'.
You can also use an automatic domain-restriction to fully crawl a single domain.
</td>
</tr>
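<!-- Example (sketch, illustrative pattern only): to restrict a crawl to a single
     host, a must-match filter such as
     https?://(www\.)?example\.org/.*
     could be entered above; the 'Restrict to start domain' option generates an
     equivalent restriction automatically. -->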
<tr valign="top" class="TableCellDark">
<td><label for="mustnotmatch">Must-Not-Match Filter for URLs for crawling</label>:</td>
<td>
<input name="mustnotmatch" id="mustnotmatch" type="text" size="60" maxlength="100" value="#[mustnotmatch]#" />
</td>
<td>
The filter is a <b><a href="http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html">regular expression</a></b>
that <b>must not match</b> a URL for the page to be accepted for crawling.
The empty string is a never-match filter, which should do well for most cases.
If you don't know what this means, please leave this field empty.
</td>
</tr>
<tr valign="top" class="TableCellLight">
<td><label for="indexmustmatch">Must-Match Filter for URLs for indexing</label>:</td>
<td>
<input name="indexmustmatch" id="indexmustmatch" type="text" size="60" maxlength="100" value="#[indexmustmatch]#" /><br />
</td>
<td>
The filter is a <b><a href="http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html">regular expression</a></b>
that <b>must match</b> a URL for the content of that URL to be indexed.
</td>
</tr>
<tr valign="top" class="TableCellDark">
<td><label for="indexmustnotmatch">Must-Not-Match Filter for URLs for indexing</label>:</td>
<td>
<input name="indexmustnotmatch" id="indexmustnotmatch" type="text" size="60" maxlength="100" value="#[indexmustnotmatch]#" />
</td>
<td>
The filter is a <b><a href="http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html">regular expression</a></b>
that <b>must not match</b> a URL for the content of that URL to be indexed.
</td>
</tr>
<tr valign="top" class="TableCellLight">
<td><label for="ipMustmatch">Must-Match Filter for IPs</label>:</td>
<td>
<input name="ipMustmatch" id="ipMustmatch" type="text" size="60" maxlength="100" value="#[ipMustmatch]#" />
</td>
<td>
Like the Must-Match Filter for URLs, this filter must match, but against the IP of the host.
YaCy performs a DNS lookup for each host, and this filter restricts the crawl to specific IPs.
</td>
</tr>
<tr valign="top" class="TableCellDark">
<td><label for="ipMustnotmatch">Must-Not-Match Filter for IPs</label>:</td>
<td>
<input name="ipMustnotmatch" id="ipMustnotmatch" type="text" size="60" maxlength="100" value="#[ipMustnotmatch]#" />
</td>
<td>
This filter must not match the IP of the crawled host.
</td>
</tr>
<tr valign="top" class="TableCellLight">
<td><label for="countryMustMatchList">Must-Match List for Country Codes</label>:</td>
<td>
<input type="radio" name="countryMustMatchSwitch" id="countryMustMatchSwitch" value="true" />Use filter
<input name="countryMustMatchList" id="countryMustMatchList" type="text" size="60" maxlength="256" value="#[countryMustMatch]#" /><br />
<input type="radio" name="countryMustMatchSwitch" value="false" checked="checked" />no country code restriction
</td>
<td>
Crawls can be restricted to specific countries. This uses the country code that can be computed from
the IP of the server that hosts the page. The filter is not a regular expression but a comma-separated list of country codes, for example 'de,at,ch'.
</td>
</tr>
<tr valign="top" class="TableCellDark">
<td>Maximum Pages per Domain:</td>
<td>
<label for="crawlingDomMaxCheck">Use</label>:
<input type="checkbox" name="crawlingDomMaxCheck" id="crawlingDomMaxCheck" #(crawlingDomMaxCheck)#::checked="checked"#(/crawlingDomMaxCheck)# />
<label for="crawlingDomMaxPages">Page-Count</label>:
<input name="crawlingDomMaxPages" id="crawlingDomMaxPages" type="text" size="6" maxlength="6" value="#[crawlingDomMaxPages]#" />
</td>
<td>
You can limit the maximum number of pages that are fetched and indexed from a single domain with this option.
You can combine this limitation with the 'Auto-Dom-Filter', so that the limit is applied to all domains within
the given depth. Domains outside the given depth are then sorted out anyway.
</td>
</tr>
<tr valign="top" class="TableCellLight">
<td><label for="crawlingQ">Accept URLs with '?' / dynamic URLs</label>:</td>
<td><input type="checkbox" name="crawlingQ" id="crawlingQ" #(crawlingQChecked)#::checked="checked"#(/crawlingQChecked)# /></td>
<td>
A question mark is usually a hint for a dynamic page. URLs pointing to dynamic content should usually not be crawled. However, there are sometimes web pages with static content that
is accessed with URLs containing question marks. If you are unsure, do not check this to avoid crawl loops.
</td>
</tr>
<tr valign="top" class="TableCellDark">
<td><label for="storeHTCache">Store to Web Cache</label>:</td>
<td><input type="checkbox" name="storeHTCache" id="storeHTCache" #(storeHTCacheChecked)#::checked="checked"#(/storeHTCacheChecked)# /></td>
<td>
This option is used by default for proxy prefetch, but is not needed for explicit crawling.
</td>
</tr>
<tr valign="top" class="TableCellLight">
<td>Policy for usage of Web Cache:</td>
<td>
<input type="radio" name="cachePolicy" value="nocache" />no cache
<input type="radio" name="cachePolicy" value="iffresh" checked="checked" />if fresh
<input type="radio" name="cachePolicy" value="ifexist" />if exists
<input type="radio" name="cachePolicy" value="cacheonly" />cache only
</td>
<td>
The caching policy states when to use the cache during crawling:
<b>no cache</b>: never use the cache; all content is taken fresh from the internet source;
<b>if fresh</b>: use the cache if the cached entry exists and is fresh according to the proxy-fresh rules;
<b>if exists</b>: use the cache if the cached entry exists, without checking freshness; otherwise use the online source;
<b>cache only</b>: never go online; use all content from the cache. If no cached entry exists, treat the content as unavailable.
</td>
</tr>
<tr valign="top" class="TableCellDark">
<td>Do Local Indexing:</td>
<td>
<label for="indexText">index text</label>:
<input type="checkbox" name="indexText" id="indexText" #(indexingTextChecked)#::checked="checked"#(/indexingTextChecked)# />
<label for="indexMedia">index media</label>:
<input type="checkbox" name="indexMedia" id="indexMedia" #(indexingMediaChecked)#::checked="checked"#(/indexingMediaChecked)# />
</td>
<td>
This enables indexing of the web pages the crawler will download. This should be switched on by default, unless you want to crawl only to fill the
Document Cache without indexing.
</td>
</tr>
<tr valign="top" class="TableCellLight">
<td><label for="crawlOrder">Do Remote Indexing</label>:</td>
<td>
<table border="0" cellpadding="2" cellspacing="0">
<tr>
<td>
<input type="checkbox" name="crawlOrder" id="crawlOrder" #(crawlOrderChecked)#::checked="checked"#(/crawlOrderChecked)# />
</td>
<td>
<label for="intention">Describe your intention to start this global crawl (optional)</label>:<br />
<input name="intention" id="intention" type="text" size="40" maxlength="100" value="" /><br />
This message will appear in the 'Other Peer Crawl Start' table of other peers.
</td>
</tr>
</table>
</td>
<td>
If checked, the crawler will contact other peers and use them as remote indexers for your crawl.
If you need your crawling results locally, you should switch this off.
Only senior and principal peers can initiate or receive remote crawls.
<strong>A YaCyNews message will be created to inform all peers about a global crawl</strong>,
so they can omit starting a crawl with the same start point.
</td>
</tr>
<tr valign="top" class="TableCellLight">
<td><label for="collection">Add Crawl result to collection(s)</label>:</td>
<td>
<input name="collection" id="collection" type="text" size="60" maxlength="100" value="#[collection]#" #(collectionEnabled)#disabled="disabled"::#(/collectionEnabled)# />
</td>
<td>
A crawl result can be tagged with names which are candidates for a collection request. These tags can be selected with the <a href="/gsa/search?q=www&amp;site=#[collection]#">GSA interface</a> using the 'site' operator. To use this option, the 'collection_sxt' field must be switched on in the <a href="/IndexFederated_p.html">Solr Schema</a>.
</td>
</tr>
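<!-- Example (sketch, hypothetical collection name 'wiki'): pages crawled with
     collection=wiki could later be retrieved through the GSA interface with a
     query such as /gsa/search?q=test&site=wiki -->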
<tr valign="top" class="TableCellSummary">
<td colspan="5"><input type="submit" name="crawlingstart" value="Start New Crawl" class="submitready"/></td>
</tr>
</table>
</form>

#%env/templates/footer.template%#
</body>
</html>