JDK-based loggers tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a major performance issue. To overcome
this problem, this is an add-on to JDK logging that puts log entries on a
concurrent message queue and logs the messages one by one using a
separate process.
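The queueing idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual YaCy class: the name QueuedLogSketch, its methods and the use of a single daemon thread as the separate consumer are all assumptions.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical sketch; names are not taken from the YaCy sources.
class QueuedLogSketch {
    // Marker record that tells the worker to stop.
    private static final LogRecord POISON = new LogRecord(Level.OFF, "");

    private final BlockingQueue<LogRecord> queue = new LinkedBlockingQueue<>();
    private final Logger target;
    private final Thread worker;

    QueuedLogSketch(Logger target) {
        this.target = target;
        this.worker = new Thread(this::drain, "log-worker");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    // Callers only enqueue; they never enter Logger.log() themselves.
    void info(String message) {
        queue.add(new LogRecord(Level.INFO, message));
    }

    // Single consumer: takes entries one by one and hands them to the JDK logger.
    private void drain() {
        try {
            LogRecord record;
            while ((record = queue.take()) != POISON) {
                target.log(record);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Writes out everything queued so far, then stops the worker.
    void shutdown() {
        queue.add(POISON);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With this shape only the worker thread ever calls Logger.log(), so producer threads cannot block each other there; the cost is that log output is delayed until the worker catches up.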
- FTPClient uses the concurrent logging instead of the log4j logger
- move setting of system property solr.directoryFactory=solr.MMapDirectoryFactory to solrcore.properties
- add check of os.arch for a 64-bit system; if it fails, use default/solrcore.x86.properties (if it exists) as solrcore.properties
reason: on 32-bit systems MMapDirectoryFactory may fail with:
Caused by: java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:849)
at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)
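A check like the one described in the list above might look like the following sketch. The file names come from the commit message; the class and method names, the directory argument and the "contains 64" heuristic for os.arch are assumptions for illustration.

```java
import java.io.File;

// Hypothetical sketch; not the actual YaCy implementation.
class SolrCorePropertiesSketch {
    // Heuristic: common 64-bit values of os.arch contain "64"
    // (for example amd64, x86_64, aarch64), while x86 and i386 do not.
    static boolean is64Bit(String osArch) {
        return osArch != null && osArch.contains("64");
    }

    // Picks the properties file to use as solrcore.properties.
    // The x86 variant is chosen only on non-64-bit systems and only if it exists.
    static File select(File defaultsDir, String osArch) {
        File regular = new File(defaultsDir, "solrcore.properties");
        File x86 = new File(defaultsDir, "solrcore.x86.properties");
        return (!is64Bit(osArch) && x86.exists()) ? x86 : regular;
    }

    public static void main(String[] args) {
        File chosen = select(new File("default"), System.getProperty("os.arch"));
        System.out.println("using " + chosen.getPath());
    }
}
```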
soft commits, reduced the cache size of search events, and ensured that
solr results are processed before the connection is closed so that they
are not kept in RAM too long
API recording for this method so it can be repeated automatically. The
index dump generation is now also available for API recording. Added
some synchronization in the backend, which was necessary for this.
id to be tested, but with a collection of ids. This will cause only a
single call to solr instead of many. The result is much better
performance when testing the existence of many urls. This should
cause much less IO during index transmission, both on the sender and
the receiver side.
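One way to test a whole collection of ids with a single request is to combine them into one query string. The sketch below only builds that string; the field name "id", the class name and the method name are assumptions, and the actual connector code may work differently.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.StringJoiner;

// Hypothetical sketch; not the actual YaCy connector code.
class BatchIdQuerySketch {
    // Builds one Lucene/Solr query string that matches any of the given ids.
    // The field name "id" is an assumption for illustration.
    static String existsQuery(Collection<String> ids) {
        if (ids.isEmpty()) {
            throw new IllegalArgumentException("at least one id is required");
        }
        StringJoiner q = new StringJoiner(" OR ", "id:(", ")");
        for (String id : ids) {
            // Quote each id and escape embedded backslashes and quotes.
            q.add('"' + id.replace("\\", "\\\\").replace("\"", "\\\"") + '"');
        }
        return q.toString();
    }

    public static void main(String[] args) {
        // prints: id:("abc" OR "def")
        System.out.println(existsQuery(Arrays.asList("abc", "def")));
    }
}
```

The ids returned by that one query are the ones that exist; every id from the input that is missing in the result does not exist.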
- removed httpclient 3.1, which was used by solrj < 4.x.x and is no
longer used
- fixed some parts in YaCy which used methods from httpclient 3.1
Because the index size is now provided by solr, and the only way to do
that is a match for [* TO *], a size computation is quite complex and
time-consuming. Therefore this patch avoids calling the method wherever
possible and, where a call is still necessary, puts a DoS-preventing
barrier in front of it.
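A barrier of this kind could, for example, cache the last computed size and refuse to recompute it more often than once per interval. The following sketch shows that idea only; the class name, the interval parameter and the caching approach itself are assumptions, not the mechanism used in the patch.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of one possible barrier; not the actual YaCy code.
class ThrottledSizeSketch {
    private final LongSupplier expensiveSize; // stands in for the costly size query
    private final long minIntervalMs;
    private long lastComputed = 0L;
    private long cachedSize = -1L;

    ThrottledSizeSketch(LongSupplier expensiveSize, long minIntervalMs) {
        this.expensiveSize = expensiveSize;
        this.minIntervalMs = minIntervalMs;
    }

    // Returns the cached size unless it is missing or older than minIntervalMs.
    synchronized long size() {
        long now = System.currentTimeMillis();
        if (cachedSize < 0 || now - lastComputed >= minIntervalMs) {
            cachedSize = expensiveSize.getAsLong();
            lastComputed = now;
        }
        return cachedSize;
    }
}
```

However often size() is called, the expensive query then runs at most once per interval, at the price of a slightly stale value.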
- introduced raw queries for the re-introduced byId queries (they are
hopefully faster than full edismax queries)
- removed the cached solr connector (testing this) to rely only on the
solr built-in search caches. That should also save some RAM. We will
see if this is usable.
- removed unused solr access classes
- made snippet generation for documents from YaCy RWI/DHT concurrent (as
it was before the removal of the search process)
- reduced the number of remote results in the settings file because
adding such a mass of documents is too CPU-intensive (in Solr)
- removed 'worker' processes
- no internal time-out behaviour: methods either are successful or
return null
- waiting is only done on top-level
- removed snippet-production; this is replaced by solr snippets
- removed statistics based on solr size queries (they took VERY
long); the statistics (like suggestions or tag cloud) are now again
based on the old but very fast RWI index. In portal or intranet mode the
RWI index is usually switched off; if you would like to have statistics
again, you must switch the RWI index on again in this mode.
- fixed many bugs regarding the correctness of the page counter
The default schema uses only some of them and the resulting search index
now has the following properties:
- the webgraph index will have about 40 times as many entries as the
default index
- the complete index size will increase and may be about double its
current size
As testing showed, not much indexing performance is lost. The default
index will be smaller (moved fields out of it); thus searching
can be faster.
The new index will allow some old parts of YaCy to be removed,
i.e. specialized webgraph data and the noload crawler. The new index
will make it possible to:
- search within link texts of linked but not indexed documents (about 20
times the size of the document index!)
- get a very detailed link graph
- enhance ranking using a complete link graph
To give full access to the new index, the solr API now has two
access points: one with attribute core=collection1 for the default
search index and one with core=webgraph for the new webgraph search
index. This is also available for p2p operation, but client access is
not yet implemented.
structure, but is not filled yet. To make a second core possible,
multi-core functionality had to be implemented in the deeply embedded
solr:
- migrated the solr_40 directory content to a subdirectory
'collection1'; the previously used default core is now called
collection1
- added solr_40/webgraph subdirectory as second core
- added a servlet configuration for the second core 'webgraph' in
/IndexSchema_p.html
- added instance handling as an addition to solr connections: all solr
connectors are now instances of a solr 'instance' object; this required
a complete re-design of the solr embedding
- also migrated caching and sharding on top of the new instance handling
- migrated the search apis to handle access to a specific core; the
default core is named 'collection1'
- migrated the remote solr search interface to access shards of cores;
for the yacy remote search the default core is now called 'solr'; using
the peer address as solr address
- migrated the solr backup and restore process: old backups cannot be
used after this migration!
- redesign of solr instance handling in all methods which access the
instances: they cannot hold copies of these instances any more; they must
retrieve the actual connection object every time they want to write to
it (this also solves some bugs when switching the index/network)
- added another schema 'solr.webgraph.schema', the old solr.keys.list is
replaced by solr.collection.schema
multiple solr cores instead of just one. Therefore it is now necessary
to distinguish between solr server connections (called an 'Instance')
and a connection to a single solr core. One Instance may now have
multiple connector classes assigned to it, each connecting to a single
core.
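The relationship between an Instance and its per-core connectors can be pictured with the following sketch. All class and method names here are hypothetical and much simpler than the real classes; it only illustrates that one server connection hands out one connector per core name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; names do not match the actual YaCy classes.
class InstanceSketch {
    // One connector talks to exactly one core.
    static final class CoreConnector {
        final String coreName;
        CoreConnector(String coreName) { this.coreName = coreName; }
    }

    private final String serverAddress;
    private final Map<String, CoreConnector> connectors = new ConcurrentHashMap<>();

    InstanceSketch(String serverAddress) {
        this.serverAddress = serverAddress;
    }

    String serverAddress() {
        return serverAddress;
    }

    // Returns the connector for a core, creating it on first use.
    CoreConnector connector(String coreName) {
        return connectors.computeIfAbsent(coreName, CoreConnector::new);
    }
}
```

Callers would ask the instance for the connector each time, for example connector("collection1") or connector("webgraph"), instead of keeping their own copy.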
To support multiple cores it is also necessary to distinguish between
the connection configuration and the configuration of the index schema.
We will have multiple schema configurations in the future, one for
each solr core. Because of this, the IndexFederated servlet had to be
split into two parts; the new servlet for the schema editor is now the
IndexSchema servlet.
one request:
- allow larger match-fields in html interface
- delete all host hashes at once from zurl
- when deleting by host, do not count size of deleted entries since that
was the reason it took so long
4.0.0 there is a new softcommit feature which implements a
near-real-time (NRT) search option. The softcommit does not do IO and
does not cause performance issues.
YaCy now has an extension in its solr connectors to use the softcommit
feature. The softcommit call now replaces all places where a hard commit
was used. Furthermore, the commit strategy when doing a search from
the web interface was changed (a commit is done every time before a
search is done).
The softcommit feature was implemented because it was needed for the
following changes (customer demands), which are also included in this
git commit:
- added a feature to identify all documents which have unique titles
and/or unique descriptions. These unique flags are disabled by default.
- added also a feature to set a flag when the url from a canonical tag
is equal to the document url. This is also disabled by default.
To support the new softcommit strategy, the commitWithinMs option was
set to -1 to disable automatic commits based on document insert times.
If documents are inserted continuously, a commit would also happen
continuously whenever the commitWithinMs time is reached. This would
conflict with the regular autocommit of 10 minutes and the new
softcommit strategy.
- migrates all entries in old urldb
Metadata coordinate (lat / lon) NumberFormatExceptions still occur relatively often (see excerpt below):
- added try/catch for URIMetadataRow (seems not to be needed in URIMetaDataNode, as Solr internally checks for number format)
- removed possible type conversion for lat() / lon() comparison with 0.0f, changed to 0.0 (leaving it to the compiler/optimizer to choose number format)
current log excerpt for NumberFormatException:
W 2013/01/14 00:10:07 StackTrace For input string: "-"
java.lang.NumberFormatException: For input string: "-"
at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
at java.lang.Double.parseDouble(Unknown Source)
at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525)
at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279)
at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277)
at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329)
at transferURL.respond(transferURL.java:152)
...
Caused by: java.lang.NumberFormatException: For input string: "-"
at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
at java.lang.Double.parseDouble(Unknown Source)
at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525)
at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279)
at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277)
at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329)
at transferURL.respond(transferURL.java:152)
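The kind of guard described above could look like this sketch. The class and method names are hypothetical, and returning 0.0 for unparsable input is an assumption made for illustration; the actual fix in URIMetadataRow may handle the error differently.

```java
// Hypothetical sketch; not the actual URIMetadataRow code.
class CoordinateParseSketch {
    // Parses a latitude or longitude value.
    // Returns 0.0 for missing or malformed input such as "-" (assumed fallback).
    static double parseCoordinate(String raw) {
        if (raw == null) return 0.0;
        try {
            return Double.parseDouble(raw.trim());
        } catch (NumberFormatException e) {
            return 0.0;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseCoordinate("-"));    // malformed input, falls back
        System.out.println(parseCoordinate("8.68")); // regular value
    }
}
```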
metadata and old rwi and for the citation index. The important
advancement is the separation of the citation index deletion because
that index is responsible for the linkdepth calculation. Now a search
index can be deleted without the citation index, which should mean
that fewer clickdepths must be post-processed.