HTTP "Referer" header sent by the browser when using YaCy can now be
controlled either with the referrer meta tag as a global policy, or only
for search result links by adding the attribute rel="noreferrer".
To improve privacy with the less possible regressions, the default is
set as meta tag with value "origin-when-cross-origin" : internal YaCy
links behavior is not affected, but when visiting external websites
referrer url is not empty but stripped from query parameters and path.
Older browsers, Safari, MS IE and Edge do not support the referrer meta
tag, so the standard but less flexible noreferrer link type can also be
enabled as an alternative.
User-friendly settings page to be implemented.
low number of found documents - by adding additional end condition to
remove processed query with number of found docs <= process-chunck-size.
Noticed on query h4_txt:[* TO *], found 21, process 21, call of commit happend
but on next cycle same query again 21 docs found (while h4_txt was removed
from schema and committed inputdocuments).
(expected scheme e.g. http, was protocol version).
Depreceate obsolete custom X-...-Scheme header constant.
Use existing FORMAT_ANSIC Dateformatter in HeaderFramework.
Correct htmlParserTest (del one not intended println)
recognized as tag like 1<a
reported in https://github.com/yacy/yacy_search_server/issues/109
Script content is ignored by default, but the text is filtered for html
tags. Modified scraper to skip tag filtering while within a <script>
section (until a closing tag is detected </script>.
Possible side effect, missing </script> end-tag will truncate trailing
content text.
Fix NPE on disabled local SolrIndex, occuring on search moving to the 2nd result page.
The debug purpose only setting to disabeling local SolrIndex (System Admin -> Debug Settings) should long term probably be removed from production code.
These fields are default enabled but with no doubt not strictly
mandatory with the current code base.
As reported by @reger24, splitting between essential mandatory and
optional fields is still to be improved to reflect the current YaCy
needs.
- Added a new method to check activation of mandatory fields on
Collection Configuration commit, consistently with checks previously
performed in Switchboard startup and with mandatory fields in the
default schema.
- Reorganized default schema and CollectionConfiguration enumeration :
moved no more mandatory fields in a specific section, and moved fields
enabled at startup to the mandatory section.
- Marked mandatory fields as required and with stronger font in the
IndexSchema_p.html page
sq=solr-querystring) to allow to filter simple text queries for processing,
remove toString for counter parameter
use more predefined constants in solrservlet
As reported by Palulukas in YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=18&t=5944&sid=dcef5b899ab4aa9b40e3a3d158c13aed#p33454)
the Index Export operation can fails, notably when the Solr index
contains one or more documents with empty (despite required)
"load_date_dt" field.
This fixes the export failure when the situation finally occurs, but
more should be done to harden verifications on minimum required fields.
to avoid repeat of tokenized url as description,
continuation of 7e09bff4a11409cabe8b
Add some javadoc, and not needed remove of omitted fields in postprocessing.
As noticed by @reger24, abusive use of OpenSearch systems should be
prevented, especially if allowing to parse and reuse HTML results.
robots.txt file is now checked before requesting an external OpenSearch
system to respect the host exclusions and eventual crawl-delay value.
The check is also performed when trying to add a new OpenSearch URL
template through the /ConfigHeuristics_p.html admin page.
Many OpenSearch systems do not provide results as standard RSS/Atom
feeds but only as HTML.
This modification add some support for custom OpenSearch HTML results
through the use of mapping files (as already done for federated Solr
search) relying on CSS-like selectors to retrieve information from HTML
content.
An example mapping file is provided to map results from the
www.npmjs.com OpenSearch URL.
As discussed in PR #93 with @JeremyRand and @reger24 this new advanced
settings page includes:
- a new setting to control remote Solr responses encoding
- some existing debug settings which could not be set through the admin
user interface
On timeout, closing remote Solr requests is proper than simply using
Thread.interrupt() that is not effective in most cases. Closing does not
ask commit on remote solr, but release http connections resources and is
more likely to end those threads that can else wait indefinitely.
Other related improvements included :
- no more marking remote peer as not available when remote search is
interrupted before timeout by the cleanup job.
- added a short fine log level trace of failing remote solr requests
index schema.
To assemble the original link url for out-/inboundlinks, icons and pictures
the *_protocol_sxt and *_urlstub_sxt is needed (due to the used data-reduced
storage methode). Auto-enable *_protocol_sxt if *_urlstub_sxt is enabled.
to be able to correctly assemble the original link url.
As mentioned in issue #103, control settings over YaCy disk usage
already existed but lacked a user-friendly way to set them.
I added it to the Performance_p.html administration page with a little
refactoring on the "Resource Observer" fieldset for improved
accessibility and HTML standards respect.
Also added the possibility to enable/disable the autoregulation fonction
from this page.
As reported by @tglman on issue #90, when searching images on the local
index only, pages next to the first were always empty. This was a
regression from commit c25e48e969.
This new "documentStructure" parameter can be set to false to only get
hosts accumulated references on a resource and thus prevent scraping the
specified URL and getting citations references.
Also set WebStructureGraph constants as final and updated the Javadoc
with example api call URLs.
As described in mantis 721 (http://mantis.tokeek.de/view.php?id=721)
WatchWebStructure_p.html failed to include in its structure view https
and other protocols and ports than default http.
As described in mantis 720 (http://mantis.tokeek.de/view.php?id=720),
when requesting this API with a domain name instead of a complete URL
only HTTP references on default port were listed.
This ensure consistent implementation of the url host hash generation
and easier usage finding in source code.
Also added a unit test for this function.