<inputtype="checkbox"name="directDocByURL"id="directDocByURL"#(directDocByURLChecked)#::checked="checked"#(/directDocByURLChecked)#/>also all linked non-parsable documents<br/>
Unlimited crawl depth for URLs matching with: <input name="crawlingDepthExtension" id="crawlingDepthExtension" type="text" size="40" maxlength="100" value="#[crawlingDepthExtension]#" />
</td>
<td>
This defines how many levels deep the crawler follows links embedded in websites (links of links, and so on).
<input type="radio" name="range" id="rangeDomain" value="domain" onclick="document.getElementById('mustmatch').disabled=true;document.getElementById('deleteoldon').disabled=false;document.getElementById('deleteoldage').disabled=false;document.getElementById('deleteoldon').checked=true;"/>Restrict to start domain(s)<br/>
<input type="radio" name="range" id="rangeSubpath" value="subpath" onclick="document.getElementById('mustmatch').disabled=true;document.getElementById('deleteoldon').disabled=false;document.getElementById('deleteoldage').disabled=false;document.getElementById('deleteoldon').checked=true;"/>Restrict to sub-path(s)<br/>
<tr><td>on IPs for Crawling:</td><td><input name="ipMustmatch" id="ipMustmatch" type="text" size="55" maxlength="100" value="#[ipMustmatch]#" /></td></tr>
<tr><td>on URLs for Indexing:</td><td><input name="indexmustmatch" id="indexmustmatch" type="text" size="55" maxlength="100" value="#[indexmustmatch]#" /></td></tr>
</table>
</td>
<td>
The filter is a <b><a href="http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html">regular expression</a></b>
that <b>must match</b> the URLs that are allowed to be crawled; the default is 'catch all'.
Example: to allow only URLs that contain the word 'science', set the filter to '.*science.*'.
You can also use an automatic domain-restriction to fully crawl a single domain.
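<!-- To make the 'must match' semantics above concrete, a small self-contained Java sketch using
     java.util.regex.Pattern (the class linked above); the URLs are invented for illustration. -->
import java.util.regex.Pattern;

final class MustMatchFilterDemo {
    public static void main(String[] args) {
        // '.*' alone would be the 'catch all' default
        Pattern mustMatch = Pattern.compile(".*science.*");
        System.out.println(mustMatch.matcher("http://example.org/science/index.html").matches()); // true: crawled
        System.out.println(mustMatch.matcher("http://example.org/arts/index.html").matches());    // false: skipped
    }
}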
<tr><tdwidth="160">on URLs for Crawling:</td><td><inputname="mustnotmatch"id="mustnotmatch"type="text"size="55"maxlength="1000"value="#[mustnotmatch]#"/></td></tr>
<tr><td>on IPs for Crawling:</td><td><inputname="ipMustnotmatch"id="ipMustnotmatch"type="text"size="55"maxlength="1000"value="#[ipMustnotmatch]#"/></td></tr>
<tr><td>on URLs for Indexing:</td><td><inputname="indexmustnotmatch"id="indexmustnotmatch"type="text"size="55"maxlength="1000"value="#[indexmustnotmatch]#"/></td></tr>
</table>
</td>
<td>
The filter is a <b><a href="http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html">regular expression</a></b>
that <b>must not match</b> a URL if the content of that URL is to be indexed.
<input type="radio" name="range" id="rangeDomain" value="domain" onclick="document.getElementById('deleteold').disabled=false;document.getElementById('deleteold').checked=true;"/>Restrict to start domain<br/>
<input type="radio" name="range" id="rangeSubpath" value="subpath" onclick="document.getElementById('deleteold').disabled=false;document.getElementById('deleteold').checked=true;"/>Restrict to sub-path<br/>
<input type="checkbox" name="deleteold" id="deleteold" disabled="disabled" />Delete all old documents in domain/subpath
</td>
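<!-- Correspondingly, a URL is only indexed when the must-not-match pattern fails to match it.
     A minimal sketch; the '.*\.pdf' pattern and the URL are invented examples. -->
import java.util.regex.Pattern;

final class MustNotMatchFilterDemo {
    public static void main(String[] args) {
        Pattern mustNotMatch = Pattern.compile(".*\\.pdf"); // invented example: exclude PDF URLs
        String url = "http://example.org/paper.pdf";
        boolean indexable = !mustNotMatch.matcher(url).matches();
        System.out.println(indexable); // false: the pattern matches, so this URL is not indexed
    }
}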
<td>
The filter is a <b><a href="http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html">regular expression</a></b>
that <b>must match</b> the URLs that are allowed to be crawled; the default is 'catch all'.
Example: to allow only URLs that contain the word 'science', set the filter to '.*science.*'.
You can also use an automatic domain-restriction to fully crawl a single domain.
</td>
</tr>
<trvalign="top"class="TableCellDark">
<td><labelfor="mustnotmatch">Must-Not-Match Filter for URLs for crawling</label>:</td>
<inputtype="checkbox"name="crawlingQ"id="crawlingQ"#(crawlingQChecked)#::checked="checked"#(/crawlingQChecked)#/> allow <ahref="http://en.wikipedia.org/wiki/Query_string">query-strings</a> (urls with a '?' in the path)
if (newcrawlingMustMatch.length() < 2) newcrawlingMustMatch = CrawlProfile.MATCH_ALL_STRING; // avoid that all URLs are filtered out if a bad value was submitted
final boolean fullDomain = "domain".equals(post.get("range", "wide")); // special property in simple crawl start
final boolean subPath = "subpath".equals(post.get("range", "wide")); // special property in simple crawl start
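// A minimal sketch, assuming a hypothetical helper (this is not YaCy's actual API), of how the
// fullDomain/subPath flags above could be translated into an automatic must-match filter for a
// given start URL, matching the "automatic domain-restriction" mentioned in the help text:
import java.net.URI;
import java.util.regex.Pattern;

final class RangeToMustMatchSketch {
    static String mustMatchFor(URI startURL, boolean fullDomain, boolean subPath) {
        if (fullDomain) {
            // restrict crawling to the start host, any path
            return Pattern.quote(startURL.getScheme() + "://" + startURL.getHost()) + ".*";
        }
        if (subPath) {
            // restrict crawling to URLs at or below the start URL's path
            return Pattern.quote(startURL.toString()) + ".*";
        }
        return ".*"; // range "wide": keep the catch-all default
    }
}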