New "Media Type detection" section in the advanced crawl start page
allow to choose between :
- not loading URLs with unknown or unsupported file extension without
checking the actual Media Type (relying Content-Type header for now).
This was the old default behavior, faster, but not really accurate.
- always cross check URL file extension against the actual Media Type.
This lets properly parse URLs ending with an apparently odd file
extension, but which have actually a supported Media Type such as
text/html.
Sample URLs with misleading file extensions added as documentation in
the crawl start page.
fixes issue #244
Not using the JDK URLDecoder.decode() function, as it strips '+'
characters when they occur after '?' (both characters having regular
expression semantics when used in blacklist path patterns)
Normalize blacklist path patterns using percent-encoding, at pattern
edition in web interface and at loading from configuration files.
Fixes issue #237
- To meet current browsers security rules, which prevent selecting a
full file path with an html input field of type 'file'
- As it does not make sense to select a local file path when a the
administered YaCy server is remote (not on the same computer as the
browser)
To make seed upload (in /Settings_p.html?page=seed page) with SCP easier
when the user specify a remote target directory path.
See report by @vikulin in issue #227
- support of "%h" and "%t" pattern components
- more proper initialization of file handler when the data folder is not
the default one, notably to prevent a non blocking but ugly error stack
trace reported by the log manager at startup with that kind of setup
Previously, if YaCy log folder was for example at
`/home/user/yacy/DATA/LOG`, because of improper truncation of log path,
an unnecessary directory creation was atempted at `/home/us`.
Considering that the sliders usage on that page is very basic, using
standard HTML5 inputs of type "range" has here the following advantages
:
- better keyboard accessibility
- remove not very necessary additional jquery dependencies
Today browsers suport for range inputs is good, and even on old
unsupporting browsers such as IE < 10 they nicely fall back to text
inputs.
When searching images, thumbnails that could not be rendered (because of
a load error such as HTTP 404, networking issue or an internal error on
the rendering servlet) are now hidden as default. But can be revealed
with a button if desired.
Fix for issue #217