When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote
file is now streamed and processed directly, allowing import of dumps of
several GB even on a low-memory remote peer, without the need to
download the dump file manually first.
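
A minimal sketch of the streaming idea, not the actual importer code: the
class and method names are hypothetical, the dump is assumed to be
uncompressed XML, and counting <page> lines stands in for real processing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of YaCy.
final class StreamingDumpSketch {

    /** Counts the lines containing a <page> start tag; only one line is held in memory at a time. */
    static long countPages(InputStream in) {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            long pages = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("<page>")) {
                    pages++;
                }
            }
            return pages;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Opens the remote dump and processes it as it arrives, without writing a local copy. */
    static long countPagesFromUrl(String dumpUrl) {
        try (InputStream in = URI.create(dumpUrl).toURL().openStream()) {
            return countPages(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```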
Detected when importing recent MediaWiki dumps containing some pages
with script content in plain text format (see Scribunto extension
https://www.mediawiki.org/wiki/Extension:Scribunto ).
Further improvement: modify the MediawikiImporter to prevent processing
revisions whose <model> is not wikitext.
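
The proposed improvement could look roughly like the sketch below. It is
not the MediawikiImporter code: the class name is hypothetical and the JDK
StAX parser is used here only for illustration. Pages whose <model> is not
"wikitext" (e.g. Scribunto modules) are skipped.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class, not part of YaCy.
final class ModelFilterSketch {

    /** Returns the titles of the pages whose revision <model> is "wikitext". */
    static List<String> wikitextTitles(String dumpXml) {
        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader xml = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(dumpXml));
            String title = null;
            String model = null;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if ("page".equals(name)) {
                        title = null; // reset the state for each new page
                        model = null;
                    } else if ("title".equals(name)) {
                        title = xml.getElementText();
                    } else if ("model".equals(name)) {
                        model = xml.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "page".equals(xml.getLocalName())) {
                    // keep the page only when its content model is wikitext
                    if ("wikitext".equals(model)) {
                        titles.add(title);
                    }
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return titles;
    }
}
```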
Creating a MultiProtocolURL instance from a File object and then
retrieving a File with getFSFile() was inconsistent for file paths
containing spaces or non-ASCII characters.
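
The same kind of pitfall can be shown with plain JDK classes (a sketch
only, MultiProtocolURL itself is not used and the class name is
hypothetical): the raw URL path keeps percent-escapes, while going through
a URI decodes them again.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URI;

// Hypothetical illustration class, not part of YaCy.
final class FileUrlRoundTripSketch {

    /** Converts a File to a URI and back; new File(URI) decodes percent-escapes such as %20. */
    static File roundTrip(File file) {
        URI uri = file.getAbsoluteFile().toURI();
        return new File(uri);
    }

    /** The raw URL path keeps escapes such as %20, so it must not be used directly as a file path. */
    static String rawUrlPath(File file) {
        try {
            return file.getAbsoluteFile().toURI().toURL().getPath();
        } catch (MalformedURLException e) {
            throw new IllegalStateException(e);
        }
    }
}
```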
count.
This might be tangentially related to http://mantis.tokeek.de/view.php?id=736
as the example includes a local index search, while rwi results are not
counted.
The keywords field string is split into words as navigator entries.
A keyword navigator facet is essential for search appliance usage, where
documents and metadata often use specialized keyword vocabularies to
filter search results. This navigator can be used without a custom index
schema.
As we have not yet defined a search query command to filter "keywords",
filtering is limited to adding the keyword to the search query.
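
A rough sketch of the splitting and counting step. The delimiter set
(whitespace, comma, semicolon), the lower-casing and the class name are
assumptions made for illustration, not the navigator implementation.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of YaCy.
final class KeywordNavigatorSketch {

    /** Splits each keywords field into words and counts how many times each word occurs. */
    static Map<String, Integer> countKeywords(Iterable<String> keywordFields) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String field : keywordFields) {
            if (field == null) {
                continue;
            }
            // assumed delimiters: whitespace, comma and semicolon
            for (String word : field.trim().split("[\\s,;]+")) {
                if (!word.isEmpty()) {
                    counts.merge(word.toLowerCase(Locale.ROOT), 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Each map entry then corresponds to one navigator entry with its count.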
WARC = Web ARChive file format.
WARC files with the extension .warc or compressed .warc.gz can be placed in
DATA/surrogate/in, and the contained responses are imported into the index.
The library used is stream-based, so it can easily be extended later to
load WARC files from the net.
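
To illustrate the stream-based handling, the sketch below picks a plain
or gzip stream by file name and reads the header block of the first
record. It is hand-rolled for illustration and does not use the WARC
library mentioned above; the class name and the "version" map key are
hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

// Hypothetical illustration class, not part of YaCy.
final class WarcHeaderSketch {

    /** Wraps the raw stream in a gzip decoder when the file name ends with .gz. */
    static InputStream open(String fileName, InputStream raw) {
        try {
            return fileName.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the header lines of the next record, up to the first blank line. */
    static Map<String, String> readRecordHeader(InputStream in) {
        Map<String, String> header = new LinkedHashMap<>();
        try {
            String line;
            while ((line = readLine(in)) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    header.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
                } else {
                    header.put("version", line); // first line, e.g. WARC/1.0
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return header;
    }

    /** Reads one CRLF- or LF-terminated line byte by byte, so the payload is not consumed. */
    private static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') {
                buffer.write(b);
            }
        }
        if (b == -1 && buffer.size() == 0) {
            return null; // end of stream
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}
```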
- enabled HTTP POST calls with Digest HTTP authentication
- made API calls compatible with API newly restricted to HTTP POST only
with transaction token validation
- ensured backward compatibility with older entries recorded as HTTP
GET
- ensure use of the HTTP POST method: HTTP GET should only be used for
information retrieval and not for operations with server-side effects
(see HTTP standard https://tools.ietf.org/html/rfc7231#section-4.2.1)
- a transaction token is now required for these administrative form
submissions to ensure the request cannot be embedded in an external
site and performed silently or by mistake by the user's browser
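
The two rules above (POST only, valid transaction token) can be sketched
as follows. This is an illustration under simplifying assumptions, not
the YaCy token handling: the class name is hypothetical and a single
token per instance stands in for a per-user or per-session token.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;

// Hypothetical illustration class, not part of YaCy.
final class TransactionTokenSketch {

    private final SecureRandom random = new SecureRandom();
    private String currentToken;

    /** Generates a random token to embed as a hidden field in the administrative form. */
    String issueToken() {
        byte[] bytes = new byte[16];
        random.nextBytes(bytes);
        currentToken = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        return currentToken;
    }

    /** Accepts the submission only when it is an HTTP POST carrying the issued token. */
    boolean isAllowed(String httpMethod, String submittedToken) {
        if (!"POST".equalsIgnoreCase(httpMethod)) {
            return false; // GET must not trigger server-side effects
        }
        if (currentToken == null || submittedToken == null) {
            return false;
        }
        // constant-time comparison of the two token values
        return MessageDigest.isEqual(
                currentToken.getBytes(StandardCharsets.UTF_8),
                submittedToken.getBytes(StandardCharsets.UTF_8));
    }
}
```

An external site cannot read the token from the form, so a request it
triggers in the user's browser is rejected.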
When programmatically requesting the local peer with the Apache HTTP client,
authentication credentials must be passed as clear-text values.
This extension to the Apache org.apache.http.impl.auth.DigestScheme
permits use of the YaCy encoded password stored in the
adminAccountBase64MD5 configuration property.
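
Why an already hashed password can be enough for Digest authentication:
the response only depends on HA1 = MD5(user:realm:password), never on the
clear-text password itself. The sketch below assumes that the encoded
value can play the role of HA1 (the text above does not spell out its
format), uses qop=auth, and has hypothetical class and method names; it is
not the DigestScheme extension itself.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration class, not part of YaCy or Apache HttpClient.
final class DigestFromHa1Sketch {

    /** Returns the lower-case hexadecimal MD5 digest of the given string. */
    static String md5Hex(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.ISO_8859_1));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** HA1 as defined by HTTP Digest: the only place where the clear-text password is needed. */
    static String ha1(String user, String realm, String password) {
        return md5Hex(user + ":" + realm + ":" + password);
    }

    /** Computes the Digest response (qop=auth) from a precomputed HA1 value. */
    static String response(String ha1, String method, String uri,
                           String nonce, String nc, String cnonce) {
        String ha2 = md5Hex(method + ":" + uri);
        return md5Hex(ha1 + ":" + nonce + ":" + nc + ":" + cnonce + ":auth:" + ha2);
    }
}
```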
A port value of -1 will disable this option.
If set to a value greater than 0, YaCy listens on this port on the local
loopback address (127.0.0.1) for a shutdown or restart signal.
E.g. connecting to http://localhost:8005/shutdown will stop the YaCy server.
http://localhost:8005/restart will restart it.
This option allows stopping YaCy locally, independently of the web
frontend (which might be configured for password protected remote access).
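
A simplified sketch of such a listener, not the YaCy implementation (the
class, enum and method names are hypothetical, and no HTTP response is
written back): it binds to the loopback address only and maps the request
path to a command.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of YaCy.
final class ShutdownPortSketch {

    enum Command { SHUTDOWN, RESTART, NONE }

    /** Maps an HTTP request line such as "GET /shutdown HTTP/1.1" to a command. */
    static Command parse(String requestLine) {
        if (requestLine == null) {
            return Command.NONE;
        }
        String[] parts = requestLine.split(" ");
        if (parts.length < 2) {
            return Command.NONE;
        }
        switch (parts[1]) {
            case "/shutdown": return Command.SHUTDOWN;
            case "/restart":  return Command.RESTART;
            default:          return Command.NONE;
        }
    }

    /** Blocks until a shutdown or restart request arrives on the loopback address. */
    static Command waitForCommand(int port) throws IOException {
        if (port <= 0) {
            return Command.NONE; // e.g. -1: the option is disabled
        }
        try (ServerSocket server = new ServerSocket(port, 1, InetAddress.getLoopbackAddress())) {
            while (true) {
                try (Socket client = server.accept();
                     BufferedReader reader = new BufferedReader(new InputStreamReader(
                             client.getInputStream(), StandardCharsets.US_ASCII))) {
                    Command command = parse(reader.readLine());
                    if (command != Command.NONE) {
                        return command;
                    }
                }
            }
        }
    }
}
```

Binding to the loopback address keeps the port unreachable from other
hosts, which is what makes an unauthenticated signal acceptable here.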
by using icu.ULocale for languages not already covered (ICU normalizes
to ISO 639-1 two-character codes).
Add test class
Use DublinCore vocabulary declarations in DCEntry and SurrogateReader
for easier usage debugging.
Init SurrogateReader.inputSource on first use.
following comment "use of properties as header values is discouraged"
in case where (proxy)HTTPClient overwrites values with supplied url.
Use defined request.referer procedure in response class.
The HTTP "Referer" header sent by the browser when using YaCy can now be
controlled either with the referrer meta tag as a global policy, or only
for search result links by adding the attribute rel="noreferrer".
To improve privacy with the fewest possible regressions, the default is
set as a meta tag with the value "origin-when-cross-origin": internal YaCy
links are not affected, but when visiting external websites the
referrer URL is not empty but stripped of its path and query parameters.
Older browsers, Safari, MS IE and Edge do not support the referrer meta
tag, so the standard but less flexible noreferrer link type can also be
enabled as an alternative.
User-friendly settings page to be implemented.
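
The effect of "origin-when-cross-origin" can be sketched as below. The
browser applies this logic, not YaCy; the class and method names are
hypothetical, default ports are not normalized and URL fragments are
ignored.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration class, not part of YaCy.
final class ReferrerPolicySketch {

    /** The meta tag that enables the policy globally for a page. */
    static final String META_TAG =
            "<meta name=\"referrer\" content=\"origin-when-cross-origin\">";

    /** Scheme and authority of a URL, e.g. http://localhost:8090 */
    static String origin(URI uri) {
        return uri.getScheme().toLowerCase(Locale.ROOT) + "://"
                + uri.getAuthority().toLowerCase(Locale.ROOT);
    }

    /** Referrer sent when following a link from pageUrl to targetUrl under the policy. */
    static String referrerFor(String pageUrl, String targetUrl) {
        String pageOrigin = origin(URI.create(pageUrl));
        String targetOrigin = origin(URI.create(targetUrl));
        // same origin: full URL is kept; cross origin: path and query are stripped
        return pageOrigin.equals(targetOrigin) ? pageUrl : pageOrigin + "/";
    }
}
```

The search terms in the query string therefore stay inside the peer and
are not disclosed to external sites.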
low number of found documents - by adding an additional end condition to
remove a processed query when the number of found docs <= process-chunk-size.
Noticed on query h4_txt:[* TO *]: found 21, processed 21, commit was called,
but on the next cycle the same query again found 21 docs (while h4_txt was
removed from the schema and the input documents were committed).
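
The additional end condition can be sketched like this. It is a simulation
only: the class name, the queue of queries and the chunk size of 100 used
in the usage note are assumptions, and the actual processing and commit
are left out.

```java
import java.util.Deque;
import java.util.function.ToLongFunction;

// Hypothetical illustration class, not part of YaCy.
final class ReindexLoopSketch {

    /** Returns the number of cycles run until the query queue is empty (capped at maxCycles). */
    static int run(Deque<String> queries, ToLongFunction<String> countDocs,
                   int chunkSize, int maxCycles) {
        int cycles = 0;
        while (!queries.isEmpty() && cycles < maxCycles) {
            String query = queries.peekFirst();
            long found = countDocs.applyAsLong(query);
            // ... process up to chunkSize documents and commit here ...
            cycles++;
            if (found <= chunkSize) {
                // every match fit into this chunk, so the query is done
                // even if the index still reports the same documents afterwards
                queries.removeFirst();
            }
        }
        return cycles;
    }
}
```

With the case described above (21 docs found on every cycle, chunk size
100), the query is removed after the first cycle instead of being
repeated.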
(expected scheme e.g. http, was protocol version).
Deprecate obsolete custom X-...-Scheme header constant.
Use existing FORMAT_ANSIC Dateformatter in HeaderFramework.
Correct htmlParserTest (removed one unintended println)
recognized as tag like 1<a
reported in https://github.com/yacy/yacy_search_server/issues/109
Script content is ignored by default, but the text is filtered for html
tags. Modified scraper to skip tag filtering while within a <script>
section (until a closing </script> tag is detected).
Possible side effect: a missing </script> end-tag will truncate the trailing
content text.
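
A much simplified sketch of the behavior described above, not the YaCy
scraper (the class name is hypothetical, and any tag starting with
"<script" is treated as a script start): tag filtering is suspended inside
a script section, so a "<" in script code is no longer taken as a tag.

```java
// Hypothetical illustration class, not part of YaCy.
final class ScriptAwareScraperSketch {

    private static final String SCRIPT_OPEN = "<script";
    private static final String SCRIPT_CLOSE = "</script>";

    /** Returns the text content of the html, dropping tags and whole script sections. */
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            if (html.regionMatches(true, i, SCRIPT_OPEN, 0, SCRIPT_OPEN.length())) {
                // inside a script: ignore everything up to the closing tag, no tag filtering
                int end = indexOfIgnoreCase(html, SCRIPT_CLOSE, i);
                if (end < 0) {
                    break; // missing end-tag: the trailing content is truncated
                }
                i = end + SCRIPT_CLOSE.length();
            } else if (html.charAt(i) == '<') {
                int close = html.indexOf('>', i); // ordinary tag: skip up to '>'
                if (close < 0) {
                    break;
                }
                i = close + 1;
            } else {
                text.append(html.charAt(i));
                i++;
            }
        }
        return text.toString();
    }

    /** Case-insensitive indexOf starting at the given position. */
    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int j = from; j <= s.length() - needle.length(); j++) {
            if (s.regionMatches(true, j, needle, 0, needle.length())) {
                return j;
            }
        }
        return -1;
    }
}
```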
Fix NPE on disabled local SolrIndex, occurring on search when moving to the 2nd result page.
The debug-only setting to disable the local SolrIndex (System Admin -> Debug Settings) should probably be removed from production code in the long term.
These fields are enabled by default but are undoubtedly not strictly
mandatory with the current code base.
As reported by @reger24, the split between essential mandatory and
optional fields still needs to be improved to reflect current YaCy
needs.
- Added a new method to check activation of mandatory fields on
Collection Configuration commit, consistently with checks previously
performed in Switchboard startup and with mandatory fields in the
default schema.
- Reorganized default schema and CollectionConfiguration enumeration:
moved fields that are no longer mandatory to a specific section, and moved
fields enabled at startup to the mandatory section.
- Marked mandatory fields as required and with stronger font in the
IndexSchema_p.html page
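
The mandatory field check on commit can be pictured as a simple set
difference. This is a sketch with a hypothetical class name and
hypothetical field names, not the CollectionConfiguration code.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class, not part of YaCy.
final class MandatoryFieldCheckSketch {

    /** Returns the mandatory fields that are not enabled; empty when the configuration is valid. */
    static Set<String> missingMandatory(Set<String> mandatory, Set<String> enabled) {
        Set<String> missing = new TreeSet<>(mandatory);
        missing.removeAll(enabled);
        return missing;
    }
}
```

A commit would then be rejected, or the missing fields re-enabled,
whenever the returned set is not empty.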