introduce FederateSearchManager handling search heuristic to external systems via specific FederateSearchConnectors,
which provide the query() functionallity, the translation to YaCy schema .toYaCySchema() and the search() routine to deliver results to searchevents, which is generally implemented in Abstract connector.
The manager enforces now a min 15s delay between calls to external systems.
Besides the OpensearchConnector a SolrFederateSearchConnector is available. It uses a additional config file for fieldname translation.
default heuristicopensearch.conf:
- openbdb.com removed - seems not longer to deliver results
- config via solrconnector to datacite.org added (large technical library archive)
thread pools will flush their cached (dead) threads after 60 seconds.
This will cause that YaCy now runs constantly withl about 50 threads,
about 100 at peak times. Previously, about 400 threads had been cached
and kept in a hibernation state, which caused that the numproc counter
in /proc/user_beancounters (exists only in VM-hosted linux) was as high
as the cached number of threads. This caused that VM supervisors
terminated whole VM sessions if a limit was reached. Many VM providers
have limits of numproc=96 which made it virtually impossible to run YaCy
on such machines. With this change, it will be possible to run many YaCy
instances even on VM hosts.
during document parsing; instead use the same references that would also
be written into the webgraph. That should cause that the webgraph and
the citation index express the exact same semantic.
- close stackfile inputstream at end of ChunkIterator
This should solve startup delay while unfinished crawl jobs exist (maybe also too many open file situation)
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into it's own queue. The primary
rule for urls taken from any queue is, that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermorem the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create 10.000s of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only if an access is wanted (OnDemandOpenFileIndex). The usage of
such on-demand file reader shall prevent that the number of file
pointers is over the system limit, which is usually about 10.000 open
files. Some parts of YaCy had to be adopted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adopted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.