causes Solr error (and wordindex likely finds suggestion)
org.apache.solr.core.SolrCore org.apache.solr.common.SolrException: Cannot parse 'text_t:""d"': Lexical error at line 1, column 12. Encountered: <EOF> after : ""
at org.apache.solr.handler.component.QueryComponent.prepare(
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(
at org.apache.solr.handler.RequestHandlerBase.handleRequest(
at net.yacy.cora.federate.solr.connector.EmbeddedSolrConnector.query(
at net.yacy.cora.federate.solr.connector.EmbeddedSolrConnector$DocListSearcher.<init>(
at net.yacy.cora.federate.solr.connector.EmbeddedSolrConnector.getCountByQuery(
at net.yacy.cora.federate.solr.connector.MirrorSolrConnector.getCountByQuery(
at net.yacy.cora.federate.solr.connector.ConcurrentUpdateSolrConnector.getCountByQuery(
at suggest.respond(
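
The parse error above comes from an unbalanced double quote reaching the Solr query parser. A minimal sketch of escaping user input before it is placed into a field query could look like the following. The class and method names are hypothetical and not taken from the YaCy code base; the escaped character set is modeled on what SolrJ's ClientUtils.escapeQueryChars handles.

```java
public final class SolrQueryEscaper {

    // Characters with special meaning in the Lucene/Solr query syntax
    // (assumption: same set that SolrJ's ClientUtils.escapeQueryChars escapes).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    private SolrQueryEscaper() {}

    // Prefix every special character and every whitespace character with a backslash.
    public static String escape(final String input) {
        final StringBuilder sb = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            final char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Build a "field:value" query from untrusted input (hypothetical helper).
    public static String fieldQuery(final String field, final String userInput) {
        return field + ":" + escape(userInput);
    }

    public static void main(final String[] args) {
        // A lone quote in the input no longer opens an unterminated phrase.
        System.out.println(fieldQuery("text_t", "\"d"));
    }
}
```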
- the admin user name can be configured; in apiExec calls the default "admin" username is used.
TODO: the bin/ script should likely take that into account.
as path for solr index dumps (instead of the SEGMENTS path). This will
make maintenance of index backups easier. It will also provide a tool
to migrate from a freeworld index to a webportal index.
at net.yacy.http.servlets.SolrServlet.service(
at org.eclipse.jetty.servlet.ServletHolder.handle(
at org.eclipse.jetty.servlet.ServletHandler.doHandle(
via Jetty IPAccessHandler to allow access only from configured IPs.
Handler is only loaded if a restriction is configured.
Since IPAccessHandler (Jetty 8) does not support IPv6 system property
Testing showed that System.setProperty seems to be sensitive to the point where it is called (the earliest possible time seems to be best, i.e. early in yacy.main).
Also moved the "isrunning..." (just open browser) check to the new routine so that yacy.config is pre-read only once.
hash even if localhost access is disabled. This is urgently needed for
the script since that is used for high-availability set-up
(checkalive and indexdump for index mirroring)
call response with post=0 (if post empty) simulating previous behavior.
(template servlets typically test for post==null,
found one more where an empty post caused a problem,
i.e. defaults were not set correctly)
with proxy handler, what is currently
- use switched on in config
- access from a local IP / hostname
fix shutdown exception for crashprotection handler on interrupted connections.
execAPIActions require http to be up. The 10s sleep was sufficient to allow Jetty to start,
but it's more robust to place the call after http is assigned to switchboard/serverSwitch.
at net.yacy.cora.federate.solr.responsewriter.GSAResponseWriter.highlight(
at net.yacy.cora.federate.solr.responsewriter.GSAResponseWriter.write(
at net.yacy.http.servlets.SolrServlet.service(
- added default filename filter to select field (as only additions to a *.black list are permanent)
- modified Blacklist_p header/legend to show all active blacklists
(to support understanding that all configured lists are active)
- removed obsolete code in Blacklist_p servlet
- metatags may be null
Caused by: java.lang.NullPointerException
at IndexControlRWIs_p.genSearchresult(
at IndexControlRWIs_p.respond(
at net.yacy.peers.Network.peerPing(
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.lang.reflect.Method.invoke(
- a user entry in UserDB with admin right can log in to access protected pages
- ditto for the admin user; the chosen username is stored in conf (adminAccountUserName=)
- userDB is not synced with Jetty credentials; as of now only the std. admin account can log in
switched initial browser open with ssl active back to std. http port
regular expressions cause no results. Using '*' followed by a dot or
any expression now causes this expression to be used as a filetype
!!! attention !!! to make sure YaCy can start, https will be disabled if port 8443 is used
- added ping test for above to migration
- as of now port for https is hardcoded to default 8443
- if not urgently required I'd leave it this way (it's standard) to use different ports for http and https
- post https port on ConfigBasic.html (if active)
a dublin core record inside of surrogate input files may now contain
tokens within the namespace 'md' (short for: metadata). The token names
must be valid within the namespace of the solr field names. All
md-tokens inside of surrogate files then overwrite values within solr
documents before they are written to the solr index. This makes it
possible to assign collection names to each surrogate entry and to add
ranking information. Please see the example file.
work around the unfolding process in Solr's BinaryResponseWriter.
This was a huge performance bottleneck in the embedded solr connector
and the problem is actually on the Solr side, but we now have a workaround.
- This made it possible to abstract a high-performance index access
method which is implemented as method getDocumentListByParams. That
method is also implemented in the SolrServerConnector and provides a
very efficient access to a solr index if the index is embedded.
- a popular use of the document list retrieval is a result count which
can now also make use of the new method, via getDocumentCountByParams.
- enhanced the Error cache which now does not store error documents
within the ram cache if the document is also written to solr. When
documents are retrieved from the cache, they are partly read from the
ram cache and if not existent there, from the Solr index.
and it is highly recommended to close every SolrRequest.
Every request that is not closed leaves behind a Searcher with its caches,
which then cannot be garbage-collected.
servlet since YaCy 1.63. This is much more performant for the client
than using the XMLResponseWriter because parsing of XML data is very CPU
intensive. Older YaCy peers are still requested using the
XMLResponseWriter but the majority of YaCy peers already respond with
the binary writer. This makes remote searches much faster and less CPU
the class org.apache.pdfbox.pdmodel.font.PDFont occupies 8MB of space
which cannot be cleaned if PDFont.clearResources is called.
The attempt to clean the class cache therefore causes the class to be
loaded and this cache to be initialized with some rubbish. I tried to
prevent instantiation of this class by using a hacked findLoadedClass
call to the SystemClassLoader (which is protected ...).
Now, without using the PDF parser at all, 8MB of RAM space is not
occupied; however, when the first PDF arrives this space will be taken
and never given back to GC.
not-flushed Solr cache is now handled in this way:
- it is smaller by default
- a Solr-internal process is started to flush the cache periodically
(this does NOT clean the cache, just removes old objects)
- a Solr-external process (the standard YaCy cleanup-process) now has
direct access to the Solr-internal cache and flushes it completely.
The time frame for such a flush is defined by the cleanup-process
frequency, by default 10 minutes.
- there are numerous idx entries with content_type image but without url_file_ext_s (for various reasons) which should be included in the result
- try it yourself with the following sample query
/solr/select?q=content_type:image/* AND -url_file_ext_s:[* TO *]&defType=edismax&fl=sku,url_file_ext_s,content_type
also addresses possible URLs without an extension or with a deviating extension.
the embedded Solr (the default). This was obtained by circumventing solrj
search encapsulation and the implementation of direct index access
methods to Solr.
The effect will not only be seen during search, but this has also a
strong effect on suggestions (much more) and less CPU power usage during
index distribution (which needs many search requests)
- based on Jetty ProxyServlet
- at this time use existing HTTPD ProxyHandler for url rewrite
- add jetty-client jar (dependency in Jetty ProxyServlet)
reuse ProxyHandler.convertHeaderFromJetty in YaCyDefaultServlet
- transformed log lines to String before they are stored because the
storage space is about 1:250 (45kb for one line before transformation,
180 bytes afterwards)
- this saves up to 10MB RAM so we can increase the number of lines to
1000 again.
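
A minimal sketch of the idea, assuming the buffer previously held java.util.logging.LogRecord objects (an assumption; the source does not name the type): each record is flattened to a String on arrival and only the newest lines are kept. The class name and the line format are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.logging.Level;
import java.util.logging.LogRecord;

public final class LogLineBuffer {

    private final int capacity;
    private final Deque<String> lines = new ArrayDeque<>();

    public LogLineBuffer(final int capacity) {
        this.capacity = capacity;
    }

    // Convert the record to a compact String right away so the heavy
    // LogRecord object can be garbage-collected.
    public synchronized void add(final LogRecord record) {
        final String line = record.getLevel() + " " + record.getMessage();
        if (lines.size() >= capacity) {
            lines.removeFirst(); // drop the oldest line
        }
        lines.addLast(line);
    }

    public synchronized int size() {
        return lines.size();
    }

    public synchronized String oldest() {
        return lines.peekFirst();
    }

    public static void main(final String[] args) {
        final LogLineBuffer buffer = new LogLineBuffer(1000);
        buffer.add(new LogRecord(Level.INFO, "peer started"));
        System.out.println(buffer.oldest());
    }
}
```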
which had a problem because of badly used concurrency.
This fix also caused a redesign of the whole host deletion process.
This should fix bug
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
- fixed the new setting that images shall be loaded for a better image
- both fixes together now make it possible to crawl sites that make use of 'funny' document names (e.g.
ending with .jpg while the document is html)
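
The preference for a known mime type over the file extension can be sketched as follows. This is only an illustration under assumptions: the enum, the class name and the small extension table are hypothetical and do not mirror the actual YaCy classifier.

```java
import java.util.Locale;

public final class ContentDomainSketch {

    public enum Domain { TEXT, IMAGE, AUDIO, VIDEO, OTHER }

    private ContentDomainSketch() {}

    // Prefer the mime type when it is decisive; fall back to the extension otherwise.
    public static Domain classify(final String mimeType, final String fileExtension) {
        if (mimeType != null) {
            final String m = mimeType.toLowerCase(Locale.ROOT);
            if (m.startsWith("text/") || m.startsWith("application/xhtml+xml")) return Domain.TEXT;
            if (m.startsWith("image/")) return Domain.IMAGE;
            if (m.startsWith("audio/")) return Domain.AUDIO;
            if (m.startsWith("video/")) return Domain.VIDEO;
        }
        if (fileExtension == null) return Domain.OTHER;
        switch (fileExtension.toLowerCase(Locale.ROOT)) {
            case "jpg": case "jpeg": case "png": case "gif":
                return Domain.IMAGE;
            case "html": case "htm": case "txt":
                return Domain.TEXT;
            default:
                return Domain.OTHER;
        }
    }

    public static void main(final String[] args) {
        // A document named *.jpg that is served as HTML is treated as text.
        System.out.println(classify("text/html", "jpg"));
    }
}
```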
may have contained multiple identical expressions within the disjunction of
domain-restrictions. This fix removes the redundant restrictions and
makes the regex shorter.
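
Removing redundant alternatives from such a disjunction can be sketched like this. The class and method names are hypothetical, and the real regex built by YaCy most likely has a different shape; the sketch only shows order-preserving deduplication before joining with '|'.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public final class DomainDisjunction {

    private DomainDisjunction() {}

    // Build "(a|b|...)" from the given domains, keeping each one only once.
    public static String build(final List<String> domains) {
        final Set<String> unique = new LinkedHashSet<>(); // keeps first-seen order
        for (final String domain : domains) {
            unique.add(Pattern.quote(domain)); // treat the domain as a literal
        }
        return "(" + String.join("|", unique) + ")";
    }

    public static void main(final String[] args) {
        final String regex = build(Arrays.asList("a.org", "b.org", "a.org"));
        System.out.println(regex);
        System.out.println(Pattern.matches(regex, "b.org"));
    }
}
```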