New "Media Type detection" section in the advanced crawl start page
lets the user choose between:
- not loading URLs with an unknown or unsupported file extension, without
checking the actual Media Type (relying on the Content-Type header for now).
This was the old default behavior: faster, but not really accurate.
- always cross-checking the URL file extension against the actual Media Type.
This makes it possible to properly parse URLs ending with an apparently odd
file extension but which actually have a supported Media Type such as
text/html.
Sample URLs with misleading file extensions were added as documentation in
the crawl start page.
Fixes issue #244
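The two modes described above can be sketched roughly as follows. This is an illustration only, not the code of this commit: the class MediaTypeCheckSketch, its helper methods, and the two small sets of supported extensions and Media Types are hypothetical stand-ins for the crawler's real parser registry. The sketch also assumes that URLs without any file extension are always loaded.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class MediaTypeCheckSketch {

    // Hypothetical stand-ins for the real parser registry (assumption, not the actual lists)
    private static final Set<String> SUPPORTED_EXTENSIONS =
            Set.of("html", "htm", "txt", "pdf", "png");
    private static final Set<String> SUPPORTED_MEDIA_TYPES =
            Set.of("text/html", "text/plain", "application/pdf", "image/png");

    /** Returns the lower-cased extension of the URL path, or "" when there is none. */
    static String fileExtension(final String url) {
        final String path = URI.create(url).getPath();
        if (path == null) {
            return "";
        }
        final int slash = path.lastIndexOf('/');
        final int dot = path.lastIndexOf('.');
        if (dot <= slash || dot == path.length() - 1) {
            return ""; // no dot in the last path segment, or a trailing dot
        }
        return path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /**
     * Decides whether a URL is loaded at all.
     * With alwaysCheckMediaType the URL is always loaded, so that the
     * Content-Type header can be inspected afterwards; otherwise only
     * the file extension is considered.
     */
    static boolean shouldLoad(final String url, final boolean alwaysCheckMediaType) {
        if (alwaysCheckMediaType) {
            return true;
        }
        final String ext = fileExtension(url);
        // Assumption: a URL without extension is loaded in both modes
        return ext.isEmpty() || SUPPORTED_EXTENSIONS.contains(ext);
    }

    /** Checks a Content-Type header value, ignoring parameters such as charset. */
    static boolean isSupportedMediaType(final String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return false;
        }
        final int semicolon = contentTypeHeader.indexOf(';');
        final String mediaType = (semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader).trim().toLowerCase(Locale.ROOT);
        return SUPPORTED_MEDIA_TYPES.contains(mediaType);
    }

    public static void main(final String[] args) {
        final List<String> urls = List.of(
                "https://en.wikipedia.org/wiki/.de",
                "https://en.wikipedia.org/wiki/Ask.com",
                "https://commons.wikimedia.org/wiki/File:YaCy_logo.png");
        for (final String url : urls) {
            System.out.println(url
                    + " extension-only: " + shouldLoad(url, false)
                    + ", always-check: " + shouldLoad(url, true));
        }
    }
}
```

In the extension-only mode the first two sample URLs are rejected before any request is made, even though the server would answer with text/html. In the always-check mode they are loaded, and the decision to parse is deferred to isSupportedMediaType on the response header.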
Not loading URLs with an unsupported file extension is faster but less accurate.
Indeed, for some web resources the actual Media Type is not consistent with the URL file extension. Here are some examples:
<ul>
<li><a href="https://en.wikipedia.org/wiki/.de" target="_blank">https://en.wikipedia.org/wiki/.de</a> : the .de extension is unknown, but the actual Media Type of this page is text/html</li>
<li><a href="https://en.wikipedia.org/wiki/Ask.com" target="_blank">https://en.wikipedia.org/wiki/Ask.com</a> : the .com extension is not supported (executable file format), but the actual Media Type of this page is text/html</li>
<li><a href="https://commons.wikimedia.org/wiki/File:YaCy_logo.png" target="_blank">https://commons.wikimedia.org/wiki/File:YaCy_logo.png</a> : the .png extension is a supported image format, but the actual Media Type of this page is text/html</li>
</ul>
</span>
</div>
<label>
<input type="radio" aria-describedby="mediaTypeCheckingInfo" name="crawlerAlwaysCheckMediaType" value="false" #(crawlerAlwaysCheckMediaType)#checked="checked"::#(/crawlerAlwaysCheckMediaType)#/> Do not load URLs with an unsupported file extension
</label>
<label>
<input type="radio" name="crawlerAlwaysCheckMediaType" value="true" #(crawlerAlwaysCheckMediaType)#::checked="checked"#(/crawlerAlwaysCheckMediaType)#/> Always cross check file extension against Content-Type header
CrawlStacker.log.fine("CrawlStacker.stackCrawl of URL " + entry.url().toNormalform(true) + " - not pushed to " + NoticedURL.StackType.NOLOAD + " stack : " + warning);
}
return null;
}
error = "URL '" + entry.url().toString() + "' file extension is not supported and indexing of linked non-parsable documents is disabled.";
CRAWLER_ALWAYS_CHECK_MEDIA_TYPE("crawlerAlwaysCheckMediaType", false, CrawlAttribute.BOOLEAN, "Always cross check file extension against actual Media Type"),
final boolean addAllLinksToCrawlStack = response.profile().isIndexNonParseableUrls() /* unsupported resources have to be indexed as pure links if no parser supports them */
|| response.profile().isCrawlerAlwaysCheckMediaType() /* the crawler must always load resources to double-check the actual Media Type even on unsupported file extensions */;