- Added a new method to check that mandatory fields are activated on
Collection Configuration commit, consistent with the checks previously
performed at Switchboard startup and with the mandatory fields of the
default schema (a sketch of such a check follows this list).
- Reorganized the default schema and the CollectionConfiguration
enumeration: moved fields that are no longer mandatory to a dedicated
section, and moved fields enabled at startup to the mandatory section.
- Marked mandatory fields as required and rendered them with a bolder
font on the IndexSchema_p.html page.
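A minimal sketch of the shape of such a commit-time check, using a plain
Set and illustrative field names rather than the actual
CollectionConfiguration API:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Set;

    /** Illustrative commit-time validation: refuse to apply a collection
     *  configuration that leaves a mandatory Solr field disabled. */
    public class MandatoryFieldCheck {

        // Illustrative field names, not the complete mandatory list.
        private static final List<String> MANDATORY =
                Arrays.asList("id", "sku", "load_date_dt");

        public static void checkMandatoryFields(final Set<String> enabledFields) {
            for (final String field : MANDATORY) {
                if (!enabledFields.contains(field)) {
                    throw new IllegalStateException(
                            "Mandatory schema field is not enabled: " + field);
                }
            }
        }
    }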
sq=solr-querystring) to allow filtering simple text queries for
processing; remove toString for the counter parameter
Use more predefined constants in SolrServlet
As reported by Palulukas in the YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=18&t=5944&sid=dcef5b899ab4aa9b40e3a3d158c13aed#p33454)
the Index Export operation can fail, notably when the Solr index
contains one or more documents with an empty (despite required)
"load_date_dt" field.
This fixes the export failure when that situation occurs, but more
should be done to harden the checks on the minimum required fields.
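A hedged sketch of the defensive-read idea with SolrJ's SolrDocument;
the helper name and the epoch fallback are illustrative, not the actual
YaCy code:

    import java.util.Date;

    import org.apache.solr.common.SolrDocument;

    public final class ExportDateGuard {

        /** Read the required date field, falling back to a default instead of
         *  failing the whole export when the field is unexpectedly empty. */
        public static Date loadDateOrFallback(final SolrDocument doc) {
            final Object value = doc.getFieldValue("load_date_dt");
            if (value instanceof Date) {
                return (Date) value;
            }
            // Document violates the schema (required field is empty):
            // use the epoch as a neutral fallback so the export can continue.
            return new Date(0L);
        }
    }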
to avoid repeating the tokenized URL as description,
continuation of 7e09bff4a11409cabe8b
Add some Javadoc, and remove a no longer needed removal of omitted
fields in postprocessing.
As noticed by @reger24, abusive use of OpenSearch systems should be
prevented, especially when parsing and reusing their HTML results is
allowed.
The robots.txt file is now checked before requesting an external
OpenSearch system, to respect host exclusions and any crawl-delay value.
The check is also performed when trying to add a new OpenSearch URL
template through the /ConfigHeuristics_p.html admin page.
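A stand-alone, heavily simplified sketch of such a pre-check (YaCy has
its own robots.txt handling; this one only reads the "User-agent: *"
group, ignores crawl-delay, and assumes the OpenSearch URL is already
expanded to a concrete query URL):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public final class OpenSearchRobotsCheck {

        /** Returns false when a "Disallow" rule of the wildcard group matches
         *  the path of the given search URL. */
        public static boolean isAllowed(final String searchUrl) throws Exception {
            final URI target = new URI(searchUrl);
            final URL robots = new URI(target.getScheme(), target.getHost(),
                    "/robots.txt", null).toURL();
            boolean inWildcardGroup = false;
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    robots.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    final String l = line.trim().toLowerCase();
                    if (l.startsWith("user-agent:")) {
                        inWildcardGroup = l.endsWith("*");
                    } else if (inWildcardGroup && l.startsWith("disallow:")) {
                        final String path = l.substring("disallow:".length()).trim();
                        if (!path.isEmpty() && target.getPath().startsWith(path)) {
                            return false;
                        }
                    }
                }
            }
            return true;
        }
    }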
Many OpenSearch systems do not provide results as standard RSS/Atom
feeds but only as HTML.
This modification adds support for custom OpenSearch HTML results
through the use of mapping files (as already done for federated Solr
search) relying on CSS-like selectors to retrieve information from HTML
content.
An example mapping file is provided to map results from the
www.npmjs.com OpenSearch URL.
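The mapping files are read by YaCy's own parser; the following
Jsoup-based snippet, with made-up selectors and HTML, only illustrates
the CSS-like selector idea behind them:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public final class HtmlResultSelectorDemo {

        public static void main(String[] args) {
            // Made-up HTML result snippet and selectors, for illustration only.
            final String html = "<ul><li class='result'>"
                    + "<a class='title' href='https://example.org/pkg'>example-pkg</a>"
                    + "<p class='description'>An example package</p></li></ul>";
            final Document doc = Jsoup.parse(html);
            for (final Element result : doc.select("li.result")) {
                final String title = result.select("a.title").text();
                final String link = result.select("a.title").attr("href");
                final String description = result.select("p.description").text();
                System.out.println(title + " | " + link + " | " + description);
            }
        }
    }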
As discussed in PR #93 with @JeremyRand and @reger24, this new advanced
settings page includes:
- a new setting to control remote Solr responses encoding
- some existing debug settings which could not be set through the admin
user interface
On timeout, closing remote Solr requests is more appropriate than
simply using Thread.interrupt(), which is not effective in most cases.
Closing does not request a commit on the remote Solr, but it releases
HTTP connection resources and is more likely to end threads that could
otherwise wait indefinitely (a sketch of this follows after the list
below).
Other related improvements included:
- a remote peer is no longer marked as not available when the remote
search is interrupted before timeout by the cleanup job.
- added a short, fine log level trace of failing remote Solr requests
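A sketch of the close-instead-of-interrupt idea with a plain SolrJ
HttpSolrClient and a placeholder URL (YaCy's own remote instance
wrappers differ):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public final class RemoteSolrTimeoutDemo {

        public static void main(String[] args) throws Exception {
            final HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/collection1").build(); // placeholder URL
            final ExecutorService pool = Executors.newSingleThreadExecutor();
            final Future<QueryResponse> response =
                    pool.submit(() -> client.query(new SolrQuery("*:*")));
            try {
                System.out.println(
                        response.get(10, TimeUnit.SECONDS).getResults().getNumFound());
            } catch (final TimeoutException e) {
                // Closing releases the underlying HTTP connections and does not
                // send a commit; the waiting thread can then terminate.
                client.close();
            } finally {
                pool.shutdownNow();
                client.close();
            }
        }
    }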
To assemble the original link URL for out-/inbound links, icons and
pictures, both *_protocol_sxt and *_urlstub_sxt are needed (due to the
data-reduced storage method used). Auto-enable *_protocol_sxt in the
index schema if *_urlstub_sxt is enabled, to be able to correctly
assemble the original link URL.
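A simplified illustration of why the protocol list is needed; the plain
parallel lists below stand in for the more compact data-reduced storage
actually used:

    import java.util.Arrays;
    import java.util.List;

    public final class LinkReassemblyDemo {

        /** Rebuild a link from its stored stub; when no protocol entry is
         *  present, "http" is assumed as the default. */
        static String assemble(final List<String> protocols,
                final List<String> urlstubs, final int i) {
            final String protocol = i < protocols.size() && protocols.get(i) != null
                    ? protocols.get(i) : "http";
            return protocol + "://" + urlstubs.get(i);
        }

        public static void main(String[] args) {
            final List<String> protocols = Arrays.asList(null, "https");
            final List<String> urlstubs = Arrays.asList("example.org/a", "example.org/b");
            for (int i = 0; i < urlstubs.size(); i++) {
                System.out.println(assemble(protocols, urlstubs, i));
            }
        }
    }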
As mentioned in issue #103, control settings over YaCy disk usage
already existed but lacked a user-friendly way to set them.
I added them to the Performance_p.html administration page, with a
little refactoring of the "Resource Observer" fieldset for improved
accessibility and better compliance with HTML standards.
Also added the possibility to enable/disable the auto-regulation
function from this page.
As reported by @tglman on issue #90, when searching images on the local
index only, result pages after the first were always empty. This was a
regression introduced by commit c25e48e969.
This new "documentStructure" parameter can be set to false to only get
hosts accumulated references on a resource and thus prevent scraping the
specified URL and getting citations references.
Also set WebStructureGraph constants as final and updated the Javadoc
with example api call URLs.
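For example, assuming the default local port and the webstructure API
endpoint, a call restricted to host references might look like:
    http://localhost:8090/api/webstructure.xml?about=example.org&documentStructure=false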
As described in mantis 721 (http://mantis.tokeek.de/view.php?id=721),
WatchWebStructure_p.html failed to include in its structure view HTTPS
and other protocols, as well as ports other than the default HTTP one.
As described in mantis 720 (http://mantis.tokeek.de/view.php?id=720),
when requesting this API with a domain name instead of a complete URL,
only HTTP references on the default port were listed.
This ensures a consistent implementation of the URL host hash
generation and makes its usages easier to find in the source code.
Also added a unit test for this function.
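The added test is along these lines; hostHash() below is only a
placeholder for the real implementation, the point being that equivalent
hosts must always map to the same hash:

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class HostHashConsistencyTest {

        // Placeholder standing in for the actual host hash implementation.
        private String hostHash(final String host) {
            return Integer.toHexString(host.toLowerCase().hashCode());
        }

        @Test
        public void sameHostYieldsSameHash() {
            assertEquals(hostHash("example.org"), hostHash("EXAMPLE.ORG"));
        }
    }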
Measurement sample: import from a local blacklist file containing about
15000 entries
- before refactoring: several minutes
- after refactoring: a few seconds!
Use Solr setField instead of addField for host_id_s, to prevent:
java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
at net.yacy.kelondro.data.meta.URIMetadataNode.hosthash(URIMetadataNode.java:247)
at net.yacy.search.query.SearchEvent.addNodes(SearchEvent.java:966)
at net.yacy.peers.Protocol.solrQuery(Protocol.java:1242)
at net.yacy.peers.RemoteSearch$2.run(RemoteSearch.java:349)
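A small SolrJ illustration of the difference (the field value is a
placeholder): addField appends and turns the field multi-valued, while
setField replaces the value:

    import org.apache.solr.common.SolrInputDocument;

    public final class SetFieldVsAddFieldDemo {

        public static void main(String[] args) {
            final SolrInputDocument doc = new SolrInputDocument();
            doc.addField("host_id_s", "abc123def456"); // placeholder value
            doc.addField("host_id_s", "abc123def456"); // second add -> multi-valued
            System.out.println(doc.getFieldValues("host_id_s").size()); // 2

            doc.setField("host_id_s", "abc123def456"); // replaces instead of appending
            System.out.println(doc.getFieldValues("host_id_s").size()); // 1
        }
    }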
to use javax.servlet.http.Cookie parameters.
Deprecate the now obsolete getHeaderCookies.
Adjust the setting of MaxAge to the spec: apply it if >= 0, otherwise
keep the default.
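A sketch of that MaxAge rule with the standard javax.servlet.http API
(the helper name is illustrative):

    import javax.servlet.http.Cookie;
    import javax.servlet.http.HttpServletResponse;

    public final class CookieMaxAgeHelper {

        public static void addCookie(final HttpServletResponse response,
                final String name, final String value, final int maxAgeSeconds) {
            final Cookie cookie = new Cookie(name, value);
            if (maxAgeSeconds >= 0) {
                // Per spec: 0 deletes the cookie, a positive value sets its lifetime.
                cookie.setMaxAge(maxAgeSeconds);
            } // negative: keep the default, i.e. a session cookie
            response.addCookie(cookie);
        }
    }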
The main shutdown hook thread was not properly waiting for the main
thread to terminate, so the main thread could not properly close its
resources and threads. After terminating a running YaCy peer this way
(Ctrl+C in a console, or kill <pid> for example), you could still see
the leftover DATA/yacy.running file (a sketch of the underlying idea
follows the test notes below).
Tested with:
- Debian Jessie, OpenJDK 7 and 8: regular shutdown, Ctrl+C, kill
command, system restart while YaCy is running
- Windows 10, Oracle JDK 7 and 8: no regression on regular shutdown
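A minimal, self-contained illustration of the idea (not the actual YaCy
shutdown code): the shutdown hook joins the main thread, with a timeout,
so cleanup can complete before the JVM exits:

    public final class ShutdownJoinDemo {

        public static void main(String[] args) throws InterruptedException {
            final Thread mainThread = Thread.currentThread();
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                try {
                    // Wait (bounded) for the main thread to finish its cleanup,
                    // e.g. deleting a "running" marker file.
                    mainThread.join(30000);
                } catch (final InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
            // The application main loop and its cleanup would run here.
            Thread.sleep(1000);
        }
    }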
add "datetime" property of <time> tag to scrapers startdate list.
Datetime is parsed as iso8601 (xml) date, html5 allows partial as well
as duration (not handled by this)
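A sketch of such parsing with java.time (YaCy's own date parsing
differs); full ISO 8601 datetimes and plain dates are accepted, partial
dates and durations are not:

    import java.time.LocalDate;
    import java.time.OffsetDateTime;
    import java.time.format.DateTimeParseException;

    public final class TimeTagParser {

        public static LocalDate parseStartDate(final String datetime) {
            try {
                return OffsetDateTime.parse(datetime).toLocalDate(); // e.g. 2016-08-01T10:00:00Z
            } catch (final DateTimeParseException e) {
                return LocalDate.parse(datetime); // e.g. 2016-08-01
            }
        }

        public static void main(String[] args) {
            System.out.println(parseStartDate("2016-08-01T10:00:00Z"));
            System.out.println(parseStartDate("2016-08-01"));
        }
    }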
Applied rules (a sketch follows this list):
- when the FTP URL denotes a file resource, stack it as any other start
URL: any embedded links can then be followed, applying the usual depth
rules
- when the FTP URL denotes a directory, list the files under this
directory and stack them for crawling, then repeat the process on
subfolders until the crawl depth is reached
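A stand-alone sketch of the directory rule using Apache Commons Net
(YaCy uses its own FTP client); the host, credentials and depth limit
are placeholders:

    import org.apache.commons.net.ftp.FTPClient;
    import org.apache.commons.net.ftp.FTPFile;

    public final class FtpCrawlStackingDemo {

        /** List a directory, stack files for crawling and recurse into
         *  subdirectories until the depth limit is reached. */
        static void stackDirectory(final FTPClient ftp, final String path,
                final int depth, final int maxDepth) throws Exception {
            if (depth > maxDepth) return;
            for (final FTPFile entry : ftp.listFiles(path)) {
                final String name = entry.getName();
                if (".".equals(name) || "..".equals(name)) continue;
                final String entryPath = path + "/" + name;
                if (entry.isDirectory()) {
                    stackDirectory(ftp, entryPath, depth + 1, maxDepth);
                } else {
                    System.out.println("stack for crawl: ftp://"
                            + ftp.getRemoteAddress().getHostName() + entryPath);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            final FTPClient ftp = new FTPClient();
            ftp.connect("ftp.example.org"); // placeholder host
            ftp.login("anonymous", "anonymous@example.org");
            stackDirectory(ftp, "", 0, 2);
            ftp.logout();
            ftp.disconnect();
        }
    }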