For finer control over which parsed documents can trigger an addition of
their links to the crawl stack, complementary to the existing crawl
depth parameter.
Acces rate limitations to this search mode by unauthenticated users are
set low by default to prevent unwanted server overload but can be
customized through the SearchAccessRate_p.html configuration page
Fixes#291
Previously, when mixing results from local RWI and local Solr (Stealth
mode), total local Solr count could be ignored on last result pages,
when the page offset was higher than local Solr count but lower than
total RWI count.
- Do not use spaces in logger identifier name so the log level can be
configured in yacy.logging
- Hold the logger instance to avoid the logging system to look for it
from its name at each appended log message
Previously search navigators/facets elements were sorted only by counts.
Now from the ConfigSearchPage_p.html admin page, sort direction
(ascending/descending) and type (on counts or labels) can be customized
independently for each navigator.
- set the chunksize to 100 to meet the max of the embedded solr
- re-enable sorting (the case where we switched it of should be away)
- enable recrawling on remote-solr
Processing of gzip encoded incoming requests (on /yacy/transferRWI.html
and /yacy/transferURL.html) was no more working since upgrade to Jetty
9.4.12 (see commit 51f4be1).
To prevent any conflicting behavior with Jetty internals, use now the
GzipHandler provided by Jetty to decompress incoming gzip encoded
requests rather than the previously used custom GZIPRequestWrapper.
Fixes issue #249
On some conditions (especially when reaching timeout), concurrent Solr
query tasks used by the /HostBrowser.html and /api/linkstructure.json
never terminated, thus leaking resources, as reported by @Vort in issue
#246
New "Media Type detection" section in the advanced crawl start page
allow to choose between :
- not loading URLs with unknown or unsupported file extension without
checking the actual Media Type (relying Content-Type header for now).
This was the old default behavior, faster, but not really accurate.
- always cross check URL file extension against the actual Media Type.
This lets properly parse URLs ending with an apparently odd file
extension, but which have actually a supported Media Type such as
text/html.
Sample URLs with misleading file extensions added as documentation in
the crawl start page.
fixes issue #244
Not using the JDK URLDecoder.decode() function, as it strips '+'
characters when they occur after '?' (both characters having regular
expression semantics when used in blacklist path patterns)
Normalize blacklist path patterns using percent-encoding, at pattern
edition in web interface and at loading from configuration files.
Fixes issue #237
To make seed upload (in /Settings_p.html?page=seed page) with SCP easier
when the user specify a remote target directory path.
See report by @vikulin in issue #227
- support of "%h" and "%t" pattern components
- more proper initialization of file handler when the data folder is not
the default one, notably to prevent a non blocking but ugly error stack
trace reported by the log manager at startup with that kind of setup
Previously, if YaCy log folder was for example at
`/home/user/yacy/DATA/LOG`, because of improper truncation of log path,
an unnecessary directory creation was atempted at `/home/us`.
- Removed calls to no more existing clearResources functions (on PDFont
class and its children) since upgrade to pdfbox 2.n.n
- Removed hacky usage of protected internal ClassLoader function. This
removes the warnings displayed when running with JDK9 or JDK10 :
[java] WARNING: Illegal reflective access by
net.yacy.document.parser.pdfParser$ResourceCleaner (file:<path>) to
method java.lang.ClassLoader.findLoadedClass(java.lang.String)
[java] WARNING: Please consider reporting this to the maintainers
of net.yacy.document.parser.pdfParser$ResourceCleaner
[java] WARNING: Use --illegal-access=warn to enable warnings of
further illegal reflective access operations
[java] WARNING: All illegal access operations will be denied in a
future release
Crawling thousands of pdf documents from various sources after
modifications applied, revealed no new memory leak related to pdfbox
(measurements done with JVisualVM).
As reported by @vikulin in issue #187, crawling websites using a raw
IPv6 address as host name in their URL failed when running on Microsoft
Windows platforms (FAT32 or NTFS filesystems) when YaCy crawler created
the crawl queue folder, as the ':' character which is part of an IPV6
address is forbidden on these filesystems.
The status of the library in the DictionaryLoader_p.html page now also
advertises the user that an upgrade can be applied when an older dump is
already loaded.
Upgrade applied as suggested by Niklas Andrus @fapth_gitlab on Gitter
chat.
Relative URLs to CSS stylesheets were not properly rendered when using
the Solr html response writer and the "/solr/collection1/select" entry
point instead of "/solr/select".
Was useless as done in an already synchronized block, and the lock
object was assigned a new value in that same block, and nowhere else a
lock is requested on that same object.
When using the 'From Link-List of URL' as a crawl start, with lists in
the order of one or more thousands of links, the failreason_s Solr field
maximum size (32kb) was exceeded by the string representation of the URL
must-match filter when a crawl URL was rejected because not matching.
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).
Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).
Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).
Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).