As reported by @vikulin in issue #187, crawling websites using a raw
IPv6 address as host name in their URL failed when running on Microsoft
Windows platforms (FAT32 or NTFS filesystems) when YaCy crawler created
the crawl queue folder, as the ':' character which is part of an IPV6
address is forbidden on these filesystems.
The status of the library in the DictionaryLoader_p.html page now also
advertises the user that an upgrade can be applied when an older dump is
already loaded.
Upgrade applied as suggested by Niklas Andrus @fapth_gitlab on Gitter
chat.
Relative URLs to CSS stylesheets were not properly rendered when using
the Solr html response writer and the "/solr/collection1/select" entry
point instead of "/solr/select".
Was useless as done in an already synchronized block, and the lock
object was assigned a new value in that same block, and nowhere else a
lock is requested on that same object.
When using the 'From Link-List of URL' as a crawl start, with lists in
the order of one or more thousands of links, the failreason_s Solr field
maximum size (32kb) was exceeded by the string representation of the URL
must-match filter when a crawl URL was rejected because not matching.
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).
Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).
Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).
Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
This makes possbile to set up much more advanced document crawl filters,
by filtering on one or more document indexed fields before inserting in
the index.
For a better control on the maximum simultaneous outgoing http
connections, as already done for any other http connections (crawls, rwi
search, p2p protocol) using the net.yacy.cora.protocol.http.HTTPClient
If not interested in displaying this on your search results and notably
on a peer with limited resources this can help saving some CPU and
outgoing network connections.
Consistently with the custom Solr http client used for https connections
to remote Solr peers or to YaCy external Solr storage.
This prevent remote Solr requests threads to wait for establishing a
connection to a remote peer longer than the configured timeout.
Initializing Thread names using the Thread constructor parameter is
faster as it already sets a thread name even if no customized one is
given, while an additional call to the Thread.setName() function
internally do synchronized access, eventually runs access check on the
security manager and performs a native call.
Profiling a running YaCy server revealed that the total processing time
spent on Thread.setName() for a typical p2p search was in the range of
seconds.
* less CPU usage using the Solr 'allowedTime' parameter
* increase chances to get some results even when a first operation step
goes in time out by letting some time for final snippets results
processing
Solr can provide partial results for example when a processing time
limit (specified with the parameter `timeAllowed`) is exceeded.
Before this fix, getting partial results from an embedded Solr index
resulted in a ClassCastException :
"org.apache.solr.common.SolrDocumentList cannot be cast to
org.apache.solr.response.ResultContext".
That was caused by concurrent modifications (with addHighlightField()
function) to the same SolrQuery instance when requesting Solr on remote
peers in p2p search.