a main problem when crawling is long waiting time cuased by crawl-delay
values from robots.txt entries. that attribute is not supported by
google and interpreted by yandex and bing in different ways. In large
crawls there is always one host which blocks the whole crawl with
extreme large values. YaCy now still obeys crawl-delay but limits them
to 10 seconds.
Additionally the blocking logic when loading new robots.txt was analyzed
and a deadlock was removed. Furthermore the construction of new queue
lists was redesigned and it was ensured that always a large list of
different hosts for host-balancing is provided for the loader.
- replace new guava 30 with older 25 because that is the correct
dependency for solr 8.8.1. The newer one did actually not work!
- index will be crated in a DATA/INDEX/freeworld/SEGMENTS/solr_8_8_1
subfolder. The older solr_6_6 index is not touched but also not
migrated. The index starts with fresh (empty) content.
- Older indexes must be migrated by hand (export/import) so far until a
better solution is found.
- Large schema adoptions for lucene 8.8.1
you can i.e. do:
export YACY_PORT=8092 && ./startYACY.sh
Just append "YACY_" to uppercase version of environment variables and
replace all "." with "_".
protocol completely
If you set now an empty password, then the http server will not ask to
authentify. This is required for environment where we attach an outside
authentification service like keycloak or similar using authentication
in an ingress proxy.
This change is part of the approach to run YaCy inside of a kubernetes
cluster where we do not want individual authentication of peers and want
to apply a ingress authentication.
variables
To use that feature, set an environment variable with prefix "yacy." and
suffix identical to the yacy configuration attribute name.
Additionaly we implemented a way to set a peer name using the setting
"network.unit.agent". This can therefore now be used to set a peer name
with the java call parameter
-Dyacy.network.unit.agent=anonymous
The purpose for this feature is the ability to set peer names in
mass-deployed kubernetes clusters to the same name to prevent that we
are flooding peer name statistics with auto-deployment-generated names.
Added also an example for one of the existing APIs. The problem is the
comma separator between objects which must not be there for the last
entry in a sequence. The new syntax adds the separator symbol
automatically.
This does not affect security because:
- it is going to localhost only
- only users who have already access to the pw hash can do this
- no clear text pw is transmitted because that is not stored anywhere
The switch to basic is required because these commands are required
in the context of hosting on root servers and docker containers
where a password change must be done. But the password shell command
was not working without password which made the concept unusable.
This deficit made it virtually impossible for root server operators
to use YaCy because they had been unable to set up a proper password.
For finer control over which parsed documents can trigger an addition of
their links to the crawl stack, complementary to the existing crawl
depth parameter.
Acces rate limitations to this search mode by unauthenticated users are
set low by default to prevent unwanted server overload but can be
customized through the SearchAccessRate_p.html configuration page
Fixes#291
Previously, when mixing results from local RWI and local Solr (Stealth
mode), total local Solr count could be ignored on last result pages,
when the page offset was higher than local Solr count but lower than
total RWI count.
- Do not use spaces in logger identifier name so the log level can be
configured in yacy.logging
- Hold the logger instance to avoid the logging system to look for it
from its name at each appended log message
Previously search navigators/facets elements were sorted only by counts.
Now from the ConfigSearchPage_p.html admin page, sort direction
(ascending/descending) and type (on counts or labels) can be customized
independently for each navigator.
- set the chunksize to 100 to meet the max of the embedded solr
- re-enable sorting (the case where we switched it of should be away)
- enable recrawling on remote-solr
Processing of gzip encoded incoming requests (on /yacy/transferRWI.html
and /yacy/transferURL.html) was no more working since upgrade to Jetty
9.4.12 (see commit 51f4be1).
To prevent any conflicting behavior with Jetty internals, use now the
GzipHandler provided by Jetty to decompress incoming gzip encoded
requests rather than the previously used custom GZIPRequestWrapper.
Fixes issue #249
On some conditions (especially when reaching timeout), concurrent Solr
query tasks used by the /HostBrowser.html and /api/linkstructure.json
never terminated, thus leaking resources, as reported by @Vort in issue
#246
New "Media Type detection" section in the advanced crawl start page
allow to choose between :
- not loading URLs with unknown or unsupported file extension without
checking the actual Media Type (relying Content-Type header for now).
This was the old default behavior, faster, but not really accurate.
- always cross check URL file extension against the actual Media Type.
This lets properly parse URLs ending with an apparently odd file
extension, but which have actually a supported Media Type such as
text/html.
Sample URLs with misleading file extensions added as documentation in
the crawl start page.
fixes issue #244
Not using the JDK URLDecoder.decode() function, as it strips '+'
characters when they occur after '?' (both characters having regular
expression semantics when used in blacklist path patterns)
Normalize blacklist path patterns using percent-encoding, at pattern
edition in web interface and at loading from configuration files.
Fixes issue #237
To make seed upload (in /Settings_p.html?page=seed page) with SCP easier
when the user specify a remote target directory path.
See report by @vikulin in issue #227
- support of "%h" and "%t" pattern components
- more proper initialization of file handler when the data folder is not
the default one, notably to prevent a non blocking but ugly error stack
trace reported by the log manager at startup with that kind of setup
Previously, if YaCy log folder was for example at
`/home/user/yacy/DATA/LOG`, because of improper truncation of log path,
an unnecessary directory creation was atempted at `/home/us`.
- Removed calls to no more existing clearResources functions (on PDFont
class and its children) since upgrade to pdfbox 2.n.n
- Removed hacky usage of protected internal ClassLoader function. This
removes the warnings displayed when running with JDK9 or JDK10 :
[java] WARNING: Illegal reflective access by
net.yacy.document.parser.pdfParser$ResourceCleaner (file:<path>) to
method java.lang.ClassLoader.findLoadedClass(java.lang.String)
[java] WARNING: Please consider reporting this to the maintainers
of net.yacy.document.parser.pdfParser$ResourceCleaner
[java] WARNING: Use --illegal-access=warn to enable warnings of
further illegal reflective access operations
[java] WARNING: All illegal access operations will be denied in a
future release
Crawling thousands of pdf documents from various sources after
modifications applied, revealed no new memory leak related to pdfbox
(measurements done with JVisualVM).
As reported by @vikulin in issue #187, crawling websites using a raw
IPv6 address as host name in their URL failed when running on Microsoft
Windows platforms (FAT32 or NTFS filesystems) when YaCy crawler created
the crawl queue folder, as the ':' character which is part of an IPV6
address is forbidden on these filesystems.
The status of the library in the DictionaryLoader_p.html page now also
advertises the user that an upgrade can be applied when an older dump is
already loaded.
Upgrade applied as suggested by Niklas Andrus @fapth_gitlab on Gitter
chat.
Relative URLs to CSS stylesheets were not properly rendered when using
the Solr html response writer and the "/solr/collection1/select" entry
point instead of "/solr/select".
Was useless as done in an already synchronized block, and the lock
object was assigned a new value in that same block, and nowhere else a
lock is requested on that same object.
When using the 'From Link-List of URL' as a crawl start, with lists in
the order of one or more thousands of links, the failreason_s Solr field
maximum size (32kb) was exceeded by the string representation of the URL
must-match filter when a crawl URL was rejected because not matching.
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).
Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).
Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).
Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
This makes possbile to set up much more advanced document crawl filters,
by filtering on one or more document indexed fields before inserting in
the index.
For a better control on the maximum simultaneous outgoing http
connections, as already done for any other http connections (crawls, rwi
search, p2p protocol) using the net.yacy.cora.protocol.http.HTTPClient
If not interested in displaying this on your search results and notably
on a peer with limited resources this can help saving some CPU and
outgoing network connections.
Consistently with the custom Solr http client used for https connections
to remote Solr peers or to YaCy external Solr storage.
This prevent remote Solr requests threads to wait for establishing a
connection to a remote peer longer than the configured timeout.
Initializing Thread names using the Thread constructor parameter is
faster as it already sets a thread name even if no customized one is
given, while an additional call to the Thread.setName() function
internally do synchronized access, eventually runs access check on the
security manager and performs a native call.
Profiling a running YaCy server revealed that the total processing time
spent on Thread.setName() for a typical p2p search was in the range of
seconds.
* less CPU usage using the Solr 'allowedTime' parameter
* increase chances to get some results even when a first operation step
goes in time out by letting some time for final snippets results
processing
Solr can provide partial results for example when a processing time
limit (specified with the parameter `timeAllowed`) is exceeded.
Before this fix, getting partial results from an embedded Solr index
resulted in a ClassCastException :
"org.apache.solr.common.SolrDocumentList cannot be cast to
org.apache.solr.response.ResultContext".
That was caused by concurrent modifications (with addHighlightField()
function) to the same SolrQuery instance when requesting Solr on remote
peers in p2p search.
By not generating MD5 hashes on all words of indexed texts, processing
time is reduced by 30 to 50% on indexed documents with more than 1Mbytes
of plain text.
if at least one of the image size fields is enabled in index (images_height_val,
images_width_val, images_pixel_val).
Previously all fields were required to be enabled (hint: default setting
is height + width enabled)
- properly handle IPv6 loopback address replacement
- replace loopback address or host only when accessing peer remotely
- replace loopback part with the peer hostname as requested rather than
with its seed public IP as this works better for Intranet mode and when
peer is behind a reverse proxy.
Otherwise once this operation is applied, the remote Solr(s) instances
are deconnected and the embedded Solr is connected even if disabled by
setting "core.service.fulltext".
Also use constants for related default setting values.
- Use the EnhancedXMLResponseWriter only when requested output is "exml"
- Use the Standard Solr writers when possible, for example for json, xml
or javabin output formats
- Return an error when the requested format can not been rendered with
an external Solr server only
Important : this modification is necessary for peers using exclusively
an external Solr server to be reachable as robinson targets in p2p
search, as the binary format ("javabin") is the default Solr exchange
format for peers.
Before this, when a peer requested a remote one attached only to an
external Solr (no embedded one), it ended with "Invalid type" error, as
the remote peer answered with xml although binary format was requested.