Was inadequately modified in my previous related commits (which made
the next-page buttons unavailable in Search portal mode), as
SearchEvent.local_solr_available did not count the total filtered
results but only the ones within the currently fetched result page(s).
Using unfiltered detailed counts (local and remote entries found before
duplicate detection and before applying query modifiers) was confusing
and inconsistent with the total count. It could lead users to expect
more results on the following pages, without understanding why they are
not displayed.
This modification actually has a limited impact, as any query modifiers
are already applied when requesting the local Solr index.
It mainly affects duplicates detected with results from remote peers.
Also updated javadocs for clarification.
As a server-side oriented alternative to the JavaScript realtime
resorting feature proposed in PR #104.
The goal is the same as in that PR: being able to compensate for the
network latency of fetching results from various peers and to obtain,
as soon as possible, a consistently ranked result set.
As reported by edycop in mantis 765 (
http://mantis.tokeek.de/view.php?id=765 ), parsing of xlsx files was
quite incomplete.
Now properly supports the "Shared String Table" entry in Office Open XML
spreadsheets, and also detects embedded URLs.
Integrating the Apache poi-ooxml library could be an option for finer
support of OOXML formats, but their SAX style parsing example (
http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api ) tends to
show that a custom SAX handler is still efficient for lightweight and
low memory footprint processing.
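To illustrate the idea, here is a minimal sketch of a lightweight SAX
handler collecting the entries of a shared string table. It is not the
actual YaCy parser code: the class name SharedStringsSketch and its
parse() helper are hypothetical, and it assumes the usual layout of
xl/sharedStrings.xml (one <si> element per entry, text held in <t>
elements, possibly split across several rich text runs).

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: collects shared strings in document order. */
final class SharedStringsSketch extends DefaultHandler {
    private final List<String> strings = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private boolean inText = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("si".equals(qName)) {
            current.setLength(0); // a new shared string entry starts
        } else if ("t".equals(qName)) {
            inText = true; // text content follows
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inText) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("t".equals(qName)) {
            inText = false;
        } else if ("si".equals(qName)) {
            // all runs of the entry are concatenated into one string
            strings.add(current.toString());
        }
    }

    /** Parses a sharedStrings.xml stream; the list index is the shared string id. */
    static List<String> parse(InputStream in) {
        SharedStringsSketch handler = new SharedStringsSketch();
        try {
            // the default factory is not namespace aware, so qName is the plain tag name
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse shared strings", e);
        }
        return handler.strings;
    }
}
```

A cell of type "s" in a sheet then only holds an index into the returned
list, so the table has to be read before cell values can be resolved.
This sketch does not skip phonetic runs (<rPh>), which a complete
handler would probably want to ignore.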
Previously, when checking for the first time the robots.txt policy on an
unknown host (not cached in the robots table), the result was always
empty in the /getpageinfo_p.xml api and in the /CrawlCheck_p.html page.
Subsequent calls, however, returned the correct information.
Complements the recent modification related to images in commit 7f395ef.
Unfortunately, much of the document metadata fetched from the freeworld
p2p network has only partial information about embedded images. Without
proper error handling, this caused many searches in p2p mode to fail
completely.
This should help in building a preview of search results.
The image is taken from the list of embedded images; it is
always the first image in that list.
In RSS-type results the image is presented as
<media:content medium="image" url="https://abc.xyz/logo.png"/>
as defined in
http://www.rssboard.org/media-rss#media-content
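As a rough sketch of the rule above (first embedded image, rendered as
a media:content element), assuming a hypothetical helper class
MediaContentSketch that is not part of the actual code base:

```java
import java.util.List;

/** Hypothetical sketch: renders the preview image of a result as Media RSS. */
final class MediaContentSketch {

    /** Escapes the characters that are not allowed in an XML attribute value. */
    static String escapeAttr(String value) {
        return value.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Returns the element for the first image, or an empty string if there is none. */
    static String mediaContent(List<String> imageUrls) {
        if (imageUrls == null || imageUrls.isEmpty() || imageUrls.get(0) == null) {
            return ""; // no usable image: emit nothing rather than fail
        }
        return "<media:content medium=\"image\" url=\""
                + escapeAttr(imageUrls.get(0)) + "\"/>";
    }
}
```

Returning an empty string for a missing image is a choice made for this
sketch only; it keeps a result without images from breaking the feed.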
Fix Conjunction.addOperator to do nothing if the term is empty.
This prevents producing a query string with a repeated logical operator
like "field:term AND AND field:term",
possibly causing out-of-memory errors in postprocessing_doublecontent.
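The guard can be sketched as follows. This is only an illustration of
the principle, not the actual Conjunction class: the name
ConjunctionSketch and the methods addOperand() and toQuery() are
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: builds an AND query and ignores empty terms. */
final class ConjunctionSketch {
    private final List<String> terms = new ArrayList<>();

    /** Adds a term; null or blank terms are silently skipped. */
    void addOperand(String term) {
        if (term == null || term.trim().isEmpty()) {
            return; // nothing to add, so no dangling operator is produced
        }
        terms.add(term.trim());
    }

    /** Joins the collected terms with exactly one operator between two terms. */
    String toQuery() {
        return String.join(" AND ", terms);
    }
}
```

With this guard, adding "field:a", an empty string and "field:b" yields
"field:a AND field:b" instead of "field:a AND AND field:b".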
Dependency required by poi-3.16.
The dependency was not provided in YaCy, although it was already
declared by previous poi versions. This only became a problem with the
upgrade from poi-3.15 to poi-3.16 (commit dedc6552d3). Indeed, in
this new poi release, a poi component used in some YaCy parser code
paths now explicitly needs a class from the commons-collections4
library: org.apache.poi.hpsf.Section now uses
org.apache.commons.collections4.bidimap.TreeBidiMap.
Affected YaCy parsers: xlsParser, pptParser, docParser.
Issue detected by the following failing JUnit tests:
ParserTest.testpptParsers(), ParserTest.testdocParsers(),
xlsParserTest.testParse()
to make sure updated documents are indexed with their last-modified
date as provided in the current crawl.
(Always patching moddate with firstseen might risk missing actual
updates.)
This overrides Solr's default of using a managed schema. As we don't use
programmatic schema changes, this directs Solr to use schema.xml,
eliminating the warning.