* less CPU usage using the Solr 'timeAllowed' parameter
* increase chances to get some results even when a first operation step
times out, by leaving some time for the final snippet results
processing
Solr can provide partial results, for example when a processing time
limit (specified with the parameter `timeAllowed`) is exceeded.
Before this fix, getting partial results from an embedded Solr index
resulted in a ClassCastException:
"org.apache.solr.common.SolrDocumentList cannot be cast to
org.apache.solr.response.ResultContext".
That was caused by concurrent modifications (through the
addHighlightField() function) to the same SolrQuery instance when
querying Solr on remote peers in p2p search.
By not generating MD5 hashes on all words of indexed texts, processing
time is reduced by 30 to 50% on indexed documents with more than 1 MB
of plain text.
if at least one of the image size fields is enabled in index (images_height_val,
images_width_val, images_pixel_val).
Previously all fields were required to be enabled (hint: default setting
is height + width enabled)
- properly handle IPv6 loopback address replacement
- replace loopback address or host only when accessing peer remotely
- replace loopback part with the peer hostname as requested rather than
with its seed public IP as this works better for Intranet mode and when
peer is behind a reverse proxy.
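
As an illustration only, the replacement rule described above could be
sketched as follows. The class and method names are hypothetical and are
not taken from the YaCy code base; the loopback test is a simplified
textual check (no DNS lookup) that covers only the most common IPv4 and
IPv6 loopback notations.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical helper: swaps a loopback host for the peer hostname. */
final class LoopbackHostReplacer {

    /** Simplified textual loopback test; it never triggers a DNS lookup. */
    static boolean isLoopbackHost(String host) {
        if (host == null) {
            return false;
        }
        String h = host.toLowerCase(Locale.ROOT);
        // java.net.URI reports IPv6 literals with their surrounding brackets
        if (h.startsWith("[") && h.endsWith("]")) {
            h = h.substring(1, h.length() - 1);
        }
        return h.equals("localhost") || h.matches("127(\\.\\d{1,3}){3}")
                || h.equals("::1") || h.equals("0:0:0:0:0:0:0:1");
    }

    /**
     * Replaces the loopback part only when the peer is accessed remotely,
     * using the hostname that was requested for that peer.
     */
    static String replaceLoopback(String url, String peerHost, boolean remoteAccess) {
        URI uri = URI.create(url);
        if (!remoteAccess || !isLoopbackHost(uri.getHost())) {
            return url; // local access or non-loopback host: keep as-is
        }
        try {
            return new URI(uri.getScheme(), uri.getUserInfo(), peerHost, uri.getPort(),
                    uri.getPath(), uri.getQuery(), uri.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(replaceLoopback("http://[::1]:8090/solr/", "peer.example.org", true));
    }
}
```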
Otherwise, once this operation is applied, the remote Solr instance(s)
are disconnected and the embedded Solr is connected, even if disabled by
the setting "core.service.fulltext".
Also use constants for related default setting values.
- Use the EnhancedXMLResponseWriter only when requested output is "exml"
- Use the Standard Solr writers when possible, for example for json, xml
or javabin output formats
- Return an error when the requested format cannot be rendered with
an external Solr server only
Important: this modification is necessary for peers using exclusively
an external Solr server to be reachable as robinson targets in p2p
search, as the binary format ("javabin") is the default Solr exchange
format for peers.
Before this, when a peer queried a remote one attached only to an
external Solr (no embedded one), the request ended with an "Invalid
type" error, as the remote peer answered with xml although the binary
format was requested.
This is necessary when you want to attach to a dedicated external Solr
server protected with basic http authentication and requested over https
but having only a self-signed certificate.
The current default boost function
(`recip(ms(NOW,last_modified),3.16e-11,1,1)`) for the Date ranking
profile works fine.
What can trigger the error `unexpected docvalues type NUMERIC for field
'last_modified'` is the previous default boost function (quite old now)
or any custom one using the Solr `ord` or `rord` functions on the
last_modified field.
The actual problem was that the migration code in the Switchboard,
supposed to detect the old date boost function, was incorrect (one extra
trailing right parenthesis), so the deprecated function remained.
This fixes issue #169.
- Fixes issue #160: properly handle syntax exceptions with a
user-friendly message
- Fixes loss of information on multiple blacklist entry edits
- Fixes loss of entries when moving entries from one list to another
Previously, when clicking a selected facet in the search results page to
unselect it, any other selected modifiers/facets were also removed.
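
A minimal sketch of the intended behavior, assuming the query is a
whitespace-separated list of terms and modifiers (the class and method
names are hypothetical, not the actual YaCy implementation): only the
unselected modifier is dropped, and every other token is kept.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating facet unselection. */
final class FacetQueryEditor {

    /** Removes the first occurrence of the given modifier, keeping all other tokens. */
    static String removeModifier(String query, String modifier) {
        List<String> kept = new ArrayList<>();
        boolean removed = false;
        for (String token : query.trim().split("\\s+")) {
            if (!removed && token.equalsIgnoreCase(modifier)) {
                removed = true; // drop only the facet being unselected
                continue;
            }
            if (!token.isEmpty()) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        System.out.println(removeModifier("yacy site:example.org filetype:pdf", "filetype:pdf"));
    }
}
```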
With the appropriate vocabulary settings in the Vocabulary_p.html page,
this can produce Vocabulary search facets displaying item types
referenced in HTML documents by microdata annotation.
Tested notably, but not limited to, vocabulary classes/types defined by
Schema.org and Dublin Core.
This adds the possibility for the HTML parser to gather typed item URLs
annotated in HTML tags with itemscope and itemtype attributes (see
microdata specification https://www.w3.org/TR/microdata/ ), notably
Types from the schema.org vocabulary, but also Types/Classes from any
other vocabulary, such as the common ones listed in the RDFa core
context ( https://www.w3.org/2011/rdfa-context/rdfa-1.1.html ).
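
To make the idea concrete, here is a rough sketch of how typed item URLs
could be gathered from start tags carrying both itemscope and itemtype.
It is regex-based for brevity and is not the YaCy HTML parser: the class
name is hypothetical, and attribute values containing ">" or unquoted
attribute values are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical scanner collecting microdata item types from HTML start tags. */
final class MicrodataTypeScanner {
    private static final Pattern START_TAG = Pattern.compile("<[a-zA-Z][^>]*>");
    private static final Pattern ITEMSCOPE =
            Pattern.compile("\\sitemscope(?=[\\s=/>])", Pattern.CASE_INSENSITIVE);
    private static final Pattern ITEMTYPE =
            Pattern.compile("\\sitemtype\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns every itemtype URL found on tags that also declare itemscope. */
    static List<String> typedItemUrls(String html) {
        List<String> types = new ArrayList<>();
        Matcher tags = START_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            // itemtype is only meaningful on elements that have itemscope
            if (!ITEMSCOPE.matcher(tag).find()) {
                continue;
            }
            Matcher type = ITEMTYPE.matcher(tag);
            if (type.find()) {
                // the attribute value is a space-separated list of URLs
                for (String url : type.group(1).trim().split("\\s+")) {
                    if (!url.isEmpty()) {
                        types.add(url);
                    }
                }
            }
        }
        return types;
    }

    public static void main(String[] args) {
        String html = "<div itemscope itemtype=\"https://schema.org/Person\">"
                + "<span itemprop=\"name\">Ada</span></div>";
        System.out.println(typedItemUrls(html));
    }
}
```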
The default recrawl profile was already used for the crawl stacker
acceptance check, but request entries were still created with the
"snippetGlobalText" profile.
Associate cached content to the last redirection location, instead of
the first URL of a redirection chain:
- for proper base URL processing in parsers (fixes mantis 636 -
http://mantis.tokeek.de/view.php?id=636)
- to prevent duplicated content in Solr index when recrawling a
redirected URL
- with only a light constraint on the load date of known indexed
documents, as it can already be controlled by the selection query, and
the goal of the job is to recrawl the selected documents now
- using the iffresh cache strategy
check case insensitive.
As keywords are compared in lower case, make sure that user input
keyword:Key or keyword:key is shown as active in facet entry key.
This allows choosing, in the configuration or per search request,
whether or not to extend results beyond the strict content domain filter
(image, video, audio or application).
Related graphical controls remain to be added to the user interface.
Required for proper operation when the default system locale is Turkish,
as dotless and dotted i characters have specific case conversion rules
in this language.
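
The underlying Java behavior can be reproduced with the short example
below. It only demonstrates the locale-sensitive case conversion; the
helper name is hypothetical, and using Locale.ROOT is shown as one common
remedy rather than as the exact change made here.

```java
import java.util.Locale;

/** Demonstrates why case conversion should not depend on the default locale. */
final class LocaleSafeCase {

    /** Locale-independent lower-casing, suitable for identifiers and keywords. */
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // With Turkish rules, 'I' lower-cases to dotless 'ı' (U+0131)
        System.out.println("TITLE".toLowerCase(turkish));
        // and 'i' upper-cases to dotted 'İ' (U+0130)
        System.out.println("title".toUpperCase(turkish));
        // Locale.ROOT gives the same result on every system
        System.out.println(normalize("TITLE"));
    }
}
```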
Introduced through the new configurable setting
network.unit.protocol.https.preferred, defaulting to false for now.
This lets you choose to prefer https, when available on remote peers, to
perform YaCy protocol operations, notably hello or transferRWI.
Not yet implemented for every YaCy protocol operation.
When a crawl is started, a new field to exclude content from scraping is
available. The content to exclude is identified by the class name of div
tags. All text contained in a div tag whose class matches one of the
configured class names is not indexed, while the rest of the page is
indexed.
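
The filtering rule can be sketched as below. This is a simplified,
regex-based illustration with hypothetical names, not the actual YaCy
scraper: it tracks nested div tags so that everything inside an ignored
div is skipped, and it does not handle HTML comments or attribute values
containing ">".

```java
import java.util.Arrays;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical filter dropping the text of div tags with ignored class names. */
final class DivClassFilter {
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>");
    private static final Pattern CLASS_ATTR =
            Pattern.compile("\\sclass\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** True when the tag attributes declare at least one ignored class name. */
    static boolean hasIgnoredClass(String attributes, Set<String> ignored) {
        Matcher m = CLASS_ATTR.matcher(attributes);
        if (!m.find()) {
            return false;
        }
        return Arrays.stream(m.group(1).trim().split("\\s+")).anyMatch(ignored::contains);
    }

    /** Returns the page text with the content of ignored div tags left out. */
    static String indexableText(String html, Set<String> ignoredClasses) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int pos = 0;
        int skipDepth = 0; // > 0 while inside an ignored div (counts nested divs)
        while (m.find()) {
            if (skipDepth == 0) {
                out.append(html, pos, m.start()).append(' ');
            }
            pos = m.end();
            if (!"div".equalsIgnoreCase(m.group(2))) {
                continue;
            }
            boolean closing = !m.group(1).isEmpty();
            if (closing) {
                if (skipDepth > 0) {
                    skipDepth--;
                }
            } else if (skipDepth > 0) {
                skipDepth++; // nested div inside an ignored one
            } else if (hasIgnoredClass(m.group(3), ignoredClasses)) {
                skipDepth = 1;
            }
        }
        if (skipDepth == 0) {
            out.append(html, pos, html.length());
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>Keep</p><div class=\"nav menu\"><div>inner</div>Skip</div>"
                + "<div class=\"main\">Body</div>";
        System.out.println(indexableText(html, Set.of("nav")));
    }
}
```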
Upgraded to InetAccessHandler.
Added InetPathAccessHandler extension to InetAccessHandler to maintain
path patterns capability previously available in IPAccessHandler but
lost in InetAccessHandler.
Filtering on IPv6 addresses is now supported.
Support for deprecated pattern formats such as "192.168." and
"192.168.1.1/path" has been removed, but an automated migration at
startup should convert any such patterns present in serverClient.