prevent building a query string with a repeated logical operator,
like "field:term AND AND field:term",
possibly causing an out of memory error in postprocessing_doublecontent
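A minimal sketch of such a guard, as a hypothetical helper (not the
actual YaCy code): only non-blank clauses are joined, so a duplicated
"AND" cannot appear:

    import java.util.List;
    import java.util.StringJoiner;

    /** Hypothetical helper: joins only non-blank clauses with a single AND. */
    public class QueryJoiner {
        public static String joinWithAnd(List<String> clauses) {
            StringJoiner joiner = new StringJoiner(" AND ");
            for (String clause : clauses) {
                if (clause != null && !clause.trim().isEmpty()) {
                    joiner.add(clause.trim());
                }
            }
            return joiner.toString();
        }

        public static void main(String[] args) {
            // an empty clause would naively produce "field:term AND  AND field:term"
            System.out.println(joinWithAnd(List.of("field:term", "", "field:term")));
        }
    }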
Dependency required by poi-3.16.
The dependency was not provided in YaCy but was already defined by
previous poi versions. This only became problematic with the upgrade
from poi-3.15 to poi-3.16 (commit dedc6552d3). Indeed, in
this new poi release, a poi component used in some YaCy parser code
paths now explicitly needs a class from the commons-collections4
library: org.apache.poi.hpsf.Section now uses
org.apache.commons.collections4.bidimap.TreeBidiMap.
Impacted YaCy parsers: xlsParser, pptParser, docParser.
Issue detected by the following failing JUnit tests:
ParserTest.testpptParsers(), ParserTest.testdocParsers(),
xlsParserTest.testParse()
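A quick way to reproduce the missing dependency outside of the parsers
is a plain classpath check; this is an illustrative sketch, not part of
the YaCy code base:

    /** Fails with ClassNotFoundException when commons-collections4 is
     *  missing; the poi-3.16 parsers fail with NoClassDefFoundError
     *  for the same reason. */
    public class DependencyCheck {
        public static void main(String[] args) throws ClassNotFoundException {
            Class.forName("org.apache.commons.collections4.bidimap.TreeBidiMap");
            System.out.println("commons-collections4 is available");
        }
    }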
to make sure updated documents are indexed with the last-modified date
provided by the current crawl
(always patching moddate with firstseen might bear the risk of missing
actual updates).
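A minimal sketch of the intended behavior, using hypothetical field
handles (firstSeen and modDate are illustrative names, not the actual
YaCy Solr fields):

    import java.util.Date;

    /** Illustrative sketch: keep the date the document was first
     *  discovered, but always take the last-modified date from the
     *  current crawl so actual updates are not missed. */
    public class RecrawlDates {
        Date firstSeen;
        Date modDate;

        void applyRecrawl(Date previousFirstSeen, Date lastModifiedFromCrawl) {
            this.firstSeen = previousFirstSeen;   // preserved across recrawls
            this.modDate = lastModifiedFromCrawl; // follows the current crawl
        }
    }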
This overrides Solr's default of using a managed schema. As we don't
make programmatic schema changes, this directs Solr to use schema.xml,
eliminating the warning.
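For reference, a minimal sketch of the corresponding solrconfig.xml
declaration (standard Solr configuration; the classic factory selects
schema.xml over the managed schema):

    <!-- use the classic schema.xml instead of the default managed schema -->
    <schemaFactory class="ClassicIndexSchemaFactory"/>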
Some web servers provide both 'Content-Encoding: "gzip"' and
'Content-Type: "application/x-gzip"' HTTP headers on their ".gz" files.
It was annoying to fail on such resources, which are not so uncommon,
even though they are non-conforming (see RFC 7231 section 3.1.2.2 for
the "Content-Encoding" header specification:
https://tools.ietf.org/html/rfc7231#section-3.1.2.2)
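One way to handle the combination, sketched as a hypothetical helper
(not the actual YaCy handler): when Content-Encoding already triggered
transport-level decompression, the gzip parser selected from the
Content-Type must not decompress a second time:

    import java.util.Locale;

    /** Hypothetical helper: true when the HTTP client already decoded
     *  the body because of Content-Encoding: gzip, so the stream handed
     *  to the application/x-gzip parser is no longer gzip-compressed. */
    public class GzipHeaderCheck {
        public static boolean alreadyDecodedByTransport(String contentEncoding, String contentType) {
            return contentEncoding != null
                    && contentEncoding.toLowerCase(Locale.ROOT).contains("gzip")
                    && contentType != null
                    && contentType.toLowerCase(Locale.ROOT).startsWith("application/x-gzip");
        }
    }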
This allows parsing and previewing large files, while preventing
unwanted OutOfMemory errors, which are likely to occur when adding
resources larger than the configured crawler limits to the Solr index.
It thus enables the getpageinfo_p API to return something in a
reasonable amount of time on resources in the megabytes size range.
Support is added first with the generic XML parser; for other formats
the regular crawler limits apply as usual.
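The core idea can be sketched as a size-bounded stream (illustrative
only, not the actual YaCy implementation): the parser reads at most
maxBytes and sees a truncated document instead of loading the whole
resource into memory:

    import java.io.IOException;
    import java.io.InputStream;

    /** Illustrative sketch: stops delivering bytes after maxBytes,
     *  so downstream parsers produce a partial preview instead of
     *  risking an OutOfMemory error on huge resources. */
    public class BoundedStream extends InputStream {
        private final InputStream in;
        private long remaining;

        public BoundedStream(InputStream in, long maxBytes) {
            this.in = in;
            this.remaining = maxBytes;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) return -1; // simulate EOF at the limit
            int b = in.read();
            if (b >= 0) remaining--;
            return b;
        }
    }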
Recursive processing was removed in commit
67beef657f, but one instance remained for anchor
content (likely omitted from the refactoring). It is no longer
necessary: other links, such as images embedded in anchors, are now
correctly detected by the parser.
More annoying: that remaining recursive processing could lead to almost
endless processing when encountering some (invalid) HTML structures
involving nested anchors, as detected and reported by lucipher on the
YaCy forum ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6005 ).
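An illustrative example of the kind of invalid markup involved (nested
anchors are forbidden by the HTML specification; this fragment is a
hypothetical reconstruction, not taken from the reported page):

    <!-- invalid: <a> elements must not be nested -->
    <a href="outer.html">outer <a href="inner.html">inner</a> text</a>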
As reported by davide on the YaCy forums (
http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6004 ), when the
system is under high load it could be difficult to understand why
remote search results are not fetched, unless one reads the YaCy
configuration file carefully.
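A minimal sketch of the intended behavior, with hypothetical names
(maxLoadForRemoteSearch stands in for the actual configuration
setting): when remote search is skipped because of load, say so
explicitly instead of silently returning no remote results:

    /** Illustrative sketch: make the load-based skip visible to the user. */
    public class RemoteSearchLoadGuard {
        public static boolean remoteSearchAllowed(double systemLoad, double maxLoadForRemoteSearch) {
            if (systemLoad > maxLoadForRemoteSearch) {
                System.out.println("remote search skipped: system load " + systemLoad
                        + " exceeds configured maximum " + maxLoadForRemoteSearch);
                return false;
            }
            return true;
        }
    }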
When the content size is known from HTTP headers, this terminates the
connection faster and improves the quality of error reports by emitting
the relevant message "Content to download exceed maximum value..."
rather than the previous "no response (NULL) for url...".
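A minimal sketch of the early check, as a hypothetical helper (message
text approximated from the report above, not the exact YaCy code):

    import java.io.IOException;

    /** Illustrative sketch: when the server declares a Content-Length
     *  above the configured limit, fail fast with a relevant message
     *  instead of downloading the body and failing later. */
    public class SizeCheck {
        public static void checkDeclaredSize(long contentLength, long maxFileSize, String url)
                throws IOException {
            // contentLength is -1 when the header is absent
            if (contentLength >= 0 && contentLength > maxFileSize) {
                throw new IOException("Content to download exceed maximum value of "
                        + maxFileSize + " bytes for URL " + url);
            }
        }
    }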