By default, when the Snap package is installed, YaCy data is stored in a
versioned user folder, which makes it possible to revert to previous
data, for example after a package refresh. But this can consume a lot of
disk space, so it is now possible to tell the YaCy snap not to version
its data, with the Snap configuration setting "data.versioned=false".
- set the chunksize to 100 to meet the max of the embedded solr
- re-enable sorting (the case where we switched it off should no longer occur)
- enable recrawling on remote-solr
wkhtmltopdf (used for snapshot generation) is now built from sources,
as its package is only available on the Alpine edge branch and is not
compatible with the current Alpine (3.8) stable base image used for
YaCy.
- Use the configured administrator user name instead of always
defaulting to "admin"
- Do not echo the password in clear text
- Check the minimum password length, as it will be applied in
ConfigAccounts_p
- Let the user type a password when it is not provided as a parameter
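
The password handling listed above can be sketched as follows. This is
only an illustration in Python, not the actual script: the function
names and the MIN_PASSWORD_LENGTH value are hypothetical, and the real
minimum is whatever ConfigAccounts_p enforces.

```python
import getpass

# Hypothetical value for illustration; the real minimum is the one
# applied by ConfigAccounts_p.
MIN_PASSWORD_LENGTH = 3


def check_password(password, min_length=MIN_PASSWORD_LENGTH):
    """Return an error message, or None when the password is long enough."""
    if len(password) < min_length:
        return f"Password must be at least {min_length} characters long."
    return None


def read_password(argv, prompt=getpass.getpass):
    """Use the password given as a parameter, else ask for it without echo."""
    if len(argv) > 1:
        return argv[1]
    # getpass.getpass reads from the terminal without echoing the input.
    return prompt("New password: ")
```

Passing the prompt function as an argument keeps the sketch testable
without a terminal.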
When the YACY_DATA_PATH environment variable is set, shell scripts will
now use the given path instead of relative ../DATA which remains the
default when the variable is not set.
Necessary in the context of the Snap package (see issue #254), as YaCy is
started with startYACY.sh and an absolute DATA parent path as a parameter.
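
The fallback rule described above is sketched below in Python, although
the real logic lives in the shell scripts. The function name is
hypothetical, and treating an empty variable like an unset one is an
assumption.

```python
import os


def resolve_data_path(environ=os.environ):
    """Return YACY_DATA_PATH when set, else the relative default ../DATA."""
    # Assumption: an empty value is treated the same as an unset variable.
    return environ.get("YACY_DATA_PATH") or "../DATA"
```

For example, resolve_data_path({"YACY_DATA_PATH": "/srv/yacy/DATA"})
returns the given path (a made-up location), while resolve_data_path({})
falls back to "../DATA".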
Since the upgrade from Solr 5.5 to Solr 6.6 (commit 6fe7359), hard
autocommits were still enabled to regularly persist the Solr index to
the file system, but new index entries were no longer automatically made
available for use by the application (soft autocommit).
Therefore, YaCy features that do not perform an explicit commit (as
recommended by the Solr documentation), such as index statistics, were
no longer accurate.
Soft autocommit is now restored as a default, with a time period
expected to be sufficient for accuracy while adding only a reasonable
system load overhead.
Fixes issue #251
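
The difference between the two kinds of commit can be pictured with the
toy model below. It is not Solr code: the ToyIndex class is hypothetical
and assumes that a hard commit persists documents without making them
searchable, which is the situation described above.

```python
class ToyIndex:
    """Toy model of commit visibility; not Solr code."""

    def __init__(self):
        self.docs = []    # every document added so far
        self.durable = 0  # how many documents a hard commit has persisted
        self.visible = 0  # how many documents queries can see

    def add(self, doc):
        self.docs.append(doc)

    def hard_commit(self):
        # Persist everything, but do not expose it to queries.
        self.durable = len(self.docs)

    def soft_commit(self):
        # Expose everything added so far to queries.
        self.visible = len(self.docs)

    def count(self):
        # What a statistics page would report without an explicit commit.
        return self.visible
```

In this model, adding two documents and calling hard_commit() leaves
count() at 0 until soft_commit() runs, which mirrors the inaccurate
statistics reported in the issue.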
Processing of gzip encoded incoming requests (on /yacy/transferRWI.html
and /yacy/transferURL.html) has no longer worked since the upgrade to
Jetty 9.4.12 (see commit 51f4be1).
To prevent any conflicting behavior with Jetty internals, the
GzipHandler provided by Jetty is now used to decompress incoming gzip
encoded requests, rather than the previously used custom
GZIPRequestWrapper.
Fixes issue #249
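
What decompressing a gzip encoded request amounts to is sketched below
in Python. This is neither Jetty's GzipHandler nor the former
GZIPRequestWrapper: the function name is hypothetical, and headers are
assumed to be a plain dict with an exact "Content-Encoding" key.

```python
import gzip


def read_request_body(headers, raw_body):
    """Return the request body, decompressed when it is gzip encoded."""
    encoding = headers.get("Content-Encoding", "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw_body)
    # Any other (or missing) encoding is passed through unchanged.
    return raw_body
```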
Under some conditions (especially when reaching a timeout), concurrent
Solr query tasks used by /HostBrowser.html and /api/linkstructure.json
never terminated, thus leaking resources, as reported by @Vort in issue
#246.
The new "Media Type detection" section in the advanced crawl start page
lets you choose between:
- not loading URLs with an unknown or unsupported file extension, without
checking the actual Media Type (relying on the Content-Type header for
now). This was the old default behavior: faster, but not really accurate.
- always cross-checking the URL file extension against the actual Media
Type. This allows proper parsing of URLs ending with an apparently odd
file extension, but which actually have a supported Media Type such as
text/html.
Sample URLs with misleading file extensions added as documentation in
the crawl start page.
Fixes issue #244
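
The two modes can be sketched as follows. This is a Python illustration
rather than the YaCy implementation: the function name, both sets and
the get_media_type callback are hypothetical, and URLs without any file
extension are out of scope here.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical sets for illustration; the real lists depend on the
# parsers available in YaCy.
SUPPORTED_EXTENSIONS = {"html", "htm", "txt", "pdf"}
SUPPORTED_MEDIA_TYPES = {"text/html", "text/plain", "application/pdf"}


def should_load(url, always_cross_check, get_media_type):
    """Decide whether a URL is worth loading and parsing."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    if not always_cross_check:
        # Old default: decide from the extension alone, no extra lookup.
        return extension in SUPPORTED_EXTENSIONS
    # Cross-check mode: the Content-Type value decides, parameters dropped.
    media_type = get_media_type(url).split(";")[0].strip().lower()
    return media_type in SUPPORTED_MEDIA_TYPES
```

With a made-up URL such as http://example.org/page.xyz served as
text/html, the first mode skips the URL without asking for its Media
Type, while the second mode loads it.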