a main problem when crawling is long waiting time cuased by crawl-delay
values from robots.txt entries. that attribute is not supported by
google and interpreted by yandex and bing in different ways. In large
crawls there is always one host which blocks the whole crawl with
extreme large values. YaCy now still obeys crawl-delay but limits them
to 10 seconds.
Additionally the blocking logic when loading new robots.txt was analyzed
and a deadlock was removed. Furthermore the construction of new queue
lists was redesigned and it was ensured that always a large list of
different hosts for host-balancing is provided for the loader.
We will use the default value for now on.
This is much better for resource economy and fits better into a
container/docker/kubernetes strategy.
Furthermore, a small memory footprint is essential for the usage on
small devices like RaspberryPi.
- replace new guava 30 with older 25 because that is the correct
dependency for solr 8.8.1. The newer one did actually not work!
- index will be crated in a DATA/INDEX/freeworld/SEGMENTS/solr_8_8_1
subfolder. The older solr_6_6 index is not touched but also not
migrated. The index starts with fresh (empty) content.
- Older indexes must be migrated by hand (export/import) so far until a
better solution is found.
- Large schema adoptions for lucene 8.8.1
variables
To use that feature, set an environment variable with prefix "yacy." and
suffix identical to the yacy configuration attribute name.
Additionaly we implemented a way to set a peer name using the setting
"network.unit.agent". This can therefore now be used to set a peer name
with the java call parameter
-Dyacy.network.unit.agent=anonymous
The purpose for this feature is the ability to set peer names in
mass-deployed kubernetes clusters to the same name to prevent that we
are flooding peer name statistics with auto-deployment-generated names.
Acces rate limitations to this search mode by unauthenticated users are
set low by default to prevent unwanted server overload but can be
customized through the SearchAccessRate_p.html configuration page
Fixes#291
Previously search navigators/facets elements were sorted only by counts.
Now from the ConfigSearchPage_p.html admin page, sort direction
(ascending/descending) and type (on counts or labels) can be customized
independently for each navigator.
Since upgrade from Solr 5.5 to Solr 6.6 (commit 6fe7359), hard
autocommits were still enabled to regularly persist the Solr index to
the file system, but new index entries were no more automatically made
available for use by the application (soft autocommit).
Therefore, YaCy features such as index statistics, that do not perform
an explicit commit (as recommended by Solr documentation) were no more
accurate.
Soft autocommit is now restored as a default, with a time period
expected to be sufficient for accuracy while adding only a reasonable
system load overhead.
Fixes issue #251
If not interested in displaying this on your search results and notably
on a peer with limited resources this can help saving some CPU and
outgoing network connections.
This is necessary when you want to attach to a dedicated external Solr
server protected with basic http authentication and requested over https
but having only a self-signed certificate.
With the appropriate vocabulary settings in Vocabulary_p.html page, this
can produce Vocabulary search facets displaying item types referenced in
html documents by microdata annotation.
Tested notably, but not limited to, vocabulary classes/types defined by
Schema.org and Dublin Core.
Thus allowing to choose at configuration or per search request, whether
extending or not results beyond strict content domain filter (image,
video, audio or application).
Related graphical controls to be added to user interface.
Introduced through the new configurable setting
network.unit.protocol.https.preferred, defaulting to false for now.
Let choose to prefer using https when available on remote peers to
perform YaCy protocol operations including notably hello or transferRWI.
Not yet implemented for every YaCy protocol operations.
Upgraded to InetAccessHandler.
Added InetPathAccessHandler extension to InetAccessHandler to maintain
path patterns capability previously available in IPAccessHandler but
lost in InetAccessHandler.
Filtering on IPv6 addresses is now supported.
Support for deprecated pattern formats such as "192.168." and
"192.168.1.1/path" has been removed, but startup automated migration
should convert such patterns eventually present in serverClient.
Default is still http to prevent any regressions, but a new setting is
available to choose https as the preferred protocol to perform remote
searches.
New configuration setting 'remotesearch.https.preferred' is manually
editable in yacy.conf file or in Advanced Properties page
(/ConfigProperties_p.html).
Should be enabled as default in the future for improved privacy.
Https could also eventually be used for other peers communications.
Thus allowing a more convenient way (wihout the need to go to the admin
section) to login when searching on your remote or password protected
peer and benefit from extended search features such as Heuristics,
Bookmarking or JavasScript resorting.
Can be disabled using the ConfigSearchPage_p.html.