Not loading URLs with an unsupported file extension is faster but less accurate.
Indeed, for some web resources the actual Media Type is not consistent with the URL file extension. Here are some examples:
<ul>
<li><ahref="https://en.wikipedia.org/wiki/.de"target="_blank">https://en.wikipedia.org/wiki/.de</a> : the .de extension is unknown, but the actual Media Type of this page is text/html</li>
<li><ahref="https://en.wikipedia.org/wiki/Ask.com"target="_blank">https://en.wikipedia.org/wiki/Ask.com</a> : the .com extension is not supported (executable file format), but the actual Media Type of this page is text/html</li>
<li><ahref="https://commons.wikimedia.org/wiki/File:YaCy_logo.png"target="_blank">https://commons.wikimedia.org/wiki/File:YaCy_logo.png</a> : the .png extension is a supported image format, but the actual Media Type of this page is text/html</li>
</ul>
</span>
</div>
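<!-- Illustrative note, not rendered: the mismatch above can be reproduced with a plain HTTP HEAD request,
     e.g. with curl (hypothetical shell session):
       curl -sI https://commons.wikimedia.org/wiki/File:YaCy_logo.png | grep -i "^content-type"
     The server reports "Content-Type: text/html", although the .png extension suggests an image,
     which is why the cross-check against the Content-Type header is the more accurate option. -->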
<label>
<inputtype="radio"aria-describedby="mediaTypeCheckingInfo"name="crawlerAlwaysCheckMediaType"value="false"#(crawlerAlwaysCheckMediaType)#checked="checked"::#(/crawlerAlwaysCheckMediaType)#/> Do not load URLs with an unsupported file extension
</label>
<label>
<inputtype="radio"name="crawlerAlwaysCheckMediaType"value="true"#(crawlerAlwaysCheckMediaType)#::checked="checked"#(/crawlerAlwaysCheckMediaType)#/> Always cross check file extension against Content-Type header
<tr><tdcolspan="2"><inputtype="radio"name="range"id="rangeDomain"value="domain"#(range_domain)#::checked="checked"#(/range_domain)#/><divid="rangeDomainDescription"style="display:inline">Restrict to start domain(s)</div></td></tr>
<tr><tdcolspan="2"><inputtype="radio"name="range"id="rangeSubpath"value="subpath"#(range_subpath)#::checked="checked"#(/range_subpath)#/><divid="rangeSubpathDescription"style="display:inline">Restrict to sub-path(s)</div></td></tr>
<tdstyle="vertical-align: bottom"><inputname="mustmatch"id="mustmatch"type="text"size="55"maxlength="100000"value="#[mustmatch]#"onblur="if (this.value=='') this.value='.*';"/> (must not be empty)</td></tr>
<tr><tdcolspan="2"><inputtype="radio"name="range"id="rangeDomain"value="domain"#(range_domain)#::checked="checked"#(/range_domain)#/><divid="rangeDomainDescription"style="display:inline">Restrict to start domain(s)</div></td></tr>
<tr><tdcolspan="2"><inputtype="radio"name="range"id="rangeSubpath"value="subpath"#(range_subpath)#::checked="checked"#(/range_subpath)#/><divid="rangeSubpathDescription"style="display:inline">Restrict to sub-path(s)</div></td></tr>
<tdstyle="vertical-align: bottom"><inputname="mustmatch"id="mustmatch"type="text"size="55"maxlength="100000"value="#[mustmatch]#"onblur="if (this.value=='') this.value='.*';"/> (must not be empty)</td></tr>
<td><inputname="crawlerOriginURLMustMatch"id="crawlerOriginURLMustMatch"type="text"size="55"maxlength="100000"value="#[crawlerOriginURLMustMatch]#"onblur="if (this.value=='') this.value='.*';"/> (must not be empty)</td>
<td><inputname="crawlerOriginURLMustMatch"id="crawlerOriginURLMustMatch"type="text"size="55"maxlength="100000"value="#[crawlerOriginURLMustMatch]#"onblur="if (this.value=='') this.value='.*';"/> (must not be empty)</td>
<tr><tdstyle="width:110px"><imgsrc="env/grafics/plus.gif"alt=""> must-match</td><td><inputname="ipMustmatch"id="ipMustmatch"type="text"size="55"maxlength="100000"value="#[ipMustmatch]#"onblur="if (this.value=='') this.value='.*';"/> (must not be empty)</td></tr>
Crawls can be restricted to specific countries. This uses the country code that can be computed from
the IP address of the server that hosts the page. The filter is not a regular expression but a comma-separated list of country codes.
</span></span>
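<!-- Illustrative note, not rendered: a hypothetical country-code list restricting the crawl to servers
     located in Germany, Austria or Switzerland would be entered as: de,at,ch -->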
<inputtype="radio"name="countryMustMatchSwitch"id="noCountryMustMatchSwitch"value="0"#(countryMustMatchSwitchChecked)#checked="checked"::#(/countryMustMatchSwitchChecked)#/>no country code restriction<br/>
<inputtype="radio"name="countryMustMatchSwitch"id="noCountryMustMatchSwitch"value="0"#(countryMustMatchSwitchChecked)#checked="checked"::#(/countryMustMatchSwitchChecked)#/>no country code restriction<br/>
The filter is a <b><a href="https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html" target="_blank">regular expression</a></b>
that <b>must not match</b> a URL for the content of that URL to be indexed.
Attention: you can test the functionality of your regular expressions using the <a href="RegexTest.html">Regular Expression Tester</a> within YaCy.
</span></span>
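<!-- Illustrative note, not rendered: a hypothetical must-not-match expression that excludes login and
     shopping-cart pages from indexing: .*(/login/|/cart/).* -->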
<table style="border-width: 0px">
<tr><tdstyle="width:110px"><imgsrc="env/grafics/plus.gif"alt=""> must-match</td><td><inputname="indexmustmatch"id="indexmustmatch"type="text"size="55"maxlength="100000"value="#[indexmustmatch]#"onblur="if (this.value=='') this.value='.*';"/> (must not be empty)</td></tr>
<dt>Filter on Content of Document<br/>(all visible text, including camel-case-tokenized url and title)</dt>
<dd>
<tr><tdstyle="width:110px"><imgsrc="env/grafics/plus.gif"alt=""> must-match</td><td><inputname="indexmustmatch"id="indexmustmatch"type="text"size="55"maxlength="100000"value="#[indexmustmatch]#"onblur="if (this.value=='') this.value='.*';"/> (must not be empty)</td></tr>
<dt>Filter on Content of Document<br/>(all visible text, including camel-case-tokenized url and title)</dt>
<dd>
<table style="border-width: 0px">
<tr><tdstyle="width:110px"><imgsrc="env/grafics/plus.gif"alt=""> must-match</td><td><inputname="indexcontentmustmatch"id="indexcontentmustmatch"type="text"size="55"maxlength="100000"value="#[indexcontentmustmatch]#"onblur="if (this.value=='') this.value='.*';"/> (must not be empty)</td></tr>
Each parsed document is checked against the given Solr query before being added to the index.
The query must conform to the <a href="https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html#the-standard-query-parser" target="_blank">standard</a> Solr query syntax.
<tr><tdstyle="width:110px"><imgsrc="env/grafics/plus.gif"alt=""> must-match</td><td><inputname="indexcontentmustmatch"id="indexcontentmustmatch"type="text"size="55"maxlength="100000"value="#[indexcontentmustmatch]#"onblur="if (this.value=='') this.value='.*';"/> (must not be empty)</td></tr>
Each parsed document is checked against the given Solr query before being added to the index.
The query must be written in respect to the <ahref="https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html#the-standard-query-parser"target="_blank">standard</a> Solr query syntax.
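<!-- Illustrative note, not rendered: a hypothetical query in standard Solr syntax, assuming the text_t
     (document text) and title fields of the YaCy collection schema:
       text_t:crawler AND -title:draft
     Only documents whose text contains "crawler" and whose title does not contain "draft" would be indexed. -->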
<p>These are limitations on parts of a document. The filter will be applied after a web page has been loaded.</p>
<dl>
<dt>Filter div or nav class names</dt>
<dd>
<table style="border-width: 0px">
<tr><tdstyle="width:110px">set of CSS class names</td><td><inputname="ignoreclassname"id="ignoreclassname"type="text"size="55"maxlength="100000"value="#[ignoreclassname]#"onblur="if (this.value=='') this.value='';"/></td><td>comma-separated list of <div> or <nav> element class names which should be filtered out</td></tr>
</table>
</dd>
</dl>
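<!-- Illustrative note, not rendered: a hypothetical value filtering out typical navigation containers:
       menu,sidebar,footer
     Text inside <div> or <nav> elements carrying one of these class names would then be ignored by the parser. -->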
<tr><tdstyle="width:110px">set of CSS class names</td><td><inputname="ignoreclassname"id="ignoreclassname"type="text"size="55"maxlength="100000"value="#[ignoreclassname]#"onblur="if (this.value=='') this.value='';"/></td><td>comma-separated list of <div> or <nav> element class names which should be filtered out</td></tr>
</table>
</dd>
</dl>
</fieldset>
<fieldset>
<legend>Clean-Up before Crawl Start</legend>
<dl>
<dt><label for="cleanSearchCache">Clean up search events cache</label></dt>
<imgsrc="env/grafics/i16.gif"width="16"height="16"alt="Clean up search events cache info"/>
<spanstyle="right:0px;"id="cleanSearchCacheInfo">
Check this option to be sure to get fresh search results including newly crawled documents. Beware that it will also interrupt any refreshing/resorting of search results currently requested from browser-side.
<imgsrc="env/grafics/i16.gif"width="16"height="16"alt="Clean up search events cache info"/>
<spanstyle="right:0px;"id="cleanSearchCacheInfo">
Check this option to be sure to get fresh search results including newly crawled documents. Beware that it will also interrupt any refreshing/resorting of search results currently requested from browser-side.
</span></span><inputtype="radio"name="deleteold"id="deleteoldoff"value="off"#(deleteold_off)#::checked="checked"#(/deleteold_off)#/>Do not delete any document before the crawl is started.</dd>
<dt>Delete sub-path</dt>
<dd><inputtype="radio"name="deleteold"id="deleteoldon"value="on"#(deleteold_on)#::checked="checked"#(/deleteold_on)#/>For each host in the start url list, delete all documents (in the given subpath) from that host.</dd>
<dt>Delete only old</dt>
<dd><inputtype="radio"name="deleteold"id="deleteoldage"value="age"#(deleteold_age)#::checked="checked"#(/deleteold_age)#/>Treat documents that are loaded
<dd><inputtype="radio"name="deleteold"id="deleteoldage"value="age"#(deleteold_age)#::checked="checked"#(/deleteold_age)#/>Treat documents that are loaded
</select> ago as stale and delete them before the crawl is started.
</dd>
</dl>
</fieldset>
<fieldset>
<legend>Double-Check Rules</legend>
then the url is treated as double when you check the 'no doubles' option. A url may be loaded again when it has reached a specific age;
to use that, check the 're-load' option.
</span></span><input type="radio" name="recrawl" id="reloadoldoff" value="nodoubles" #(recrawl_nodoubles)#::checked="checked"#(/recrawl_nodoubles)#/>Never load any page that is already known. Only the start-url may be loaded again.</dd>
<dt>Re-load</dt>
<dd><inputtype="radio"name="recrawl"id="reloadoldage"value="reload"#(recrawl_reload)#::checked="checked"#(/recrawl_reload)#/>Treat documents that are loaded
Because YaCy can be used as a replacement for commercial search appliances
(like the Google Search Appliance aka GSA), the user must be able to crawl all web pages that are granted to such commercial platforms.
Not having this option would be a strong handicap for professional use of this software. Therefore you are able to select
alternative user agents here which have different crawl timings, identify themselves with another user agent, and obey the corresponding robots rules.
<divclass="info">Only XML snapshots can be generated. as the <ahref="https://wkhtmltopdf.org/"target="_blank">wkhtmltopdf</a> util is not found by YaCy on your system.
It is required to generate PDF snapshots from crawled pages that can then be converted to images.</div>
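<!-- Illustrative note, not rendered: a minimal sketch of the external tools the snapshot feature relies on,
     assuming wkhtmltopdf and ImageMagick are installed on the host (hypothetical shell session):
       wkhtmltopdf https://example.com/ snapshot.pdf
       convert snapshot.pdf snapshot.png
     The first command renders the crawled page as a PDF; the second converts that PDF to an image.
     Without the wkhtmltopdf binary on the PATH, only XML snapshots remain available. -->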