- added some missing increments from RWI results
- decrement relevant navigator counts when solr or RWI results are
evicted by duplicate detection or by constraints checked belatedly
- do not compute facets when unnecessary to avoid unwanted CPU load
- do not increment from facets when already done
- do not rely on facets from requests to remote solr peers, as most of
the time only a limited part of their total results is fetched (thus
also preventing unnecessary load on remote peers)
- use a concurrency friendly score map for the dates navigators to
prevent unwanted ConcurrentModificationExceptions
This improves the situation for the most obvious inconsistencies in
search navigator counts, but more has to be done for true accuracy
(notably when query modifier constraints are applied belatedly - after
the solr or RWI retrieval request - such as the content domain
constraint)
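The concurrency-friendly counting described above for the date navigators
can be sketched as follows. This is only a minimal illustration built on
java.util.concurrent.ConcurrentHashMap; the class NavigatorCounts and its
methods are hypothetical names, not YaCy's actual score map classes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical navigator counter; NOT YaCy's actual score map class. */
final class NavigatorCounts {
    private final ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();

    /** Atomically adds one to the count of the given navigator value. */
    void inc(String key) {
        counts.merge(key, 1, Integer::sum);
    }

    /** Atomically removes one from the count; the entry is dropped at zero. */
    void dec(String key) {
        counts.computeIfPresent(key, (k, v) -> v > 1 ? v - 1 : null);
    }

    /** Returns the current count, or 0 when the value is unknown. */
    int get(String key) {
        return counts.getOrDefault(key, 0);
    }

    /** Number of distinct navigator values currently counted. */
    int size() {
        return counts.size();
    }
}
```

Iterating over a ConcurrentHashMap while other threads call inc() or dec()
is weakly consistent and does not throw ConcurrentModificationException,
which is the property wanted here.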
Was inadequately modified in my previous related commits (making next
pages buttons unavailable in Search portal mode), as
SearchEvent.local_solr_available did not count the total filtered
results but only the ones within the currently fetched result page(s).
Using unfiltered detailed counts (local and remote entries found before
duplicate detection and before applying query modifiers) was confusing
and inconsistent with the total count. It could suggest that more
results are to come in the next pages, without explaining why they are
not displayed.
As a server-side oriented alternative to the JavaScript realtime
resorting feature proposed in PR #104.
The goal is the same as in that PR: making it possible to compensate for
the network latency of fetching results from various peers and to
obtain, as soon as possible, a consistently ranked result set.
Previously, when checking the robots.txt policy for the first time on an
unknown host (not cached in the robots table), the result was always
empty in the /getpageinfo_p.xml api and in the /CrawlCheck_p.html page.
Subsequent calls, however, returned the correct information.
Complements the recent modification related to images in commit 7f395ef.
Unfortunately, the metadata of many documents fetched from the freeworld
p2p network has only partial information about embedded images. Without
proper error handling, this caused many searches in p2p mode to fail
completely.
This should help to build a preview of search results.
The image is taken from the list of embedded images: it is always the
first image in that list.
In rss-type results the image is presented like
<media:content medium="image" url="https://abc.xyz/logo.png"/>
as defined in
http://www.rssboard.org/media-rss#media-content
This allows parsing and previewing large files, while preventing
unwanted OutOfMemory errors which are likely to occur when adding to the
Solr index resources larger than the configured crawler limits.
It also enables the getpageinfo_p API to return something in a
reasonable amount of time on resources over the megabyte size range.
Support is added first for the generic XML parser; for other formats the
regular crawler limits apply as usual.
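The idea of reading only a bounded part of a large resource can be
illustrated with the sketch below. It is an assumption about the general
technique, not the parser code of this commit: BoundedReader, Result and
the maxBytes parameter are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: keeps at most maxBytes of a stream in memory. */
final class BoundedReader {

    /** The bytes kept, and whether the source had more content. */
    static final class Result {
        final byte[] data;
        final boolean truncated;

        Result(byte[] data, boolean truncated) {
            this.data = data;
            this.truncated = truncated;
        }
    }

    static Result read(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int remaining = maxBytes;
        while (remaining > 0) {
            int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
            if (n < 0) {
                // The source ended within the limit: nothing was cut off.
                return new Result(out.toByteArray(), false);
            }
            out.write(buffer, 0, n);
            remaining -= n;
        }
        // Limit reached: one more read tells whether content was left unread.
        boolean truncated = in.read() >= 0;
        return new Result(out.toByteArray(), truncated);
    }
}
```

A caller can then parse result.data for a preview and report
result.truncated to the user instead of failing on memory.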
As reported by davide on YaCy forums (
http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6004 ) when the
system is under high load, it could be difficult to understand why
remote search results are not fetched, unless carefully reading the YaCy
configuration file.
Otherwise, on a malformed getpageinfo_p XML response (from the browser
point of view), JavaScript errors were thrown and the ajax status
steering wheel remained displayed indefinitely.
Especially for Turkish speaking users using "tr" as their system default
locale: strings for technical stuff (URLs, tag names, constants...)
must not be lower cased with the default locale, as 'I' does not become
'i' like in other locales such as "en", but becomes 'ı'.
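The locale issue described above can be reproduced with plain JDK calls.
The sketch below only illustrates the behavior and the usual remedy of
passing Locale.ROOT; the class LocaleSafeLowerCase is a hypothetical
helper, not a class from YaCy.

```java
import java.util.Locale;

/** Hypothetical helper showing locale-independent lower casing. */
final class LocaleSafeLowerCase {

    /** Lower-cases technical strings independently of the default locale. */
    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    /** What a Turkish default locale would produce with toLowerCase(). */
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }
}
```

With the Turkish locale, "TITLE" becomes "tıtle" (with a dotless ı), so a
later comparison with the tag name "title" fails; with Locale.ROOT the
result is "title" on every system.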
This prevents rendering a big and inconvenient scrollbar on resources
containing many links.
If really needed, a preview of all links is still available with a "Show
all links" button.
This doesn't affect the number of links used once the crawl is actually
started, as the list is then loaded again server-side.
This enables the keyword navigator to filter on keywords. Added search
page output and layout config for keywords, allowing the keywords to be
displayed, e.g. in Intranet use. No styling or links are applied to the
keyword text (but this is desirable, possibly in combination with
bootstrap-tagsinput for future/intranet).
Redirections set for the transition of any possible external uses:
- /api/getpageinfo.xml to /api/getpageinfo_p.xml
- /api/getpageinfo.json to /api/getpageinfo_p.json
Replaced by shortcuts defined by the HTML "accesskey" attribute, which
has the advantage of being advertised by screen readers when focusing
the corresponding buttons, contrary to custom JavaScript key handlers.
Now with Firefox:
- "Alt + Shift + n" for next page
- "Alt + Shift + p" for previous page
Following ARIA recommendation : "keyboard shortcuts enhance, not
replace, standard keyboard access." ( see
https://www.w3.org/TR/wai-aria-practices/#kbd_shortcuts_behavior_design)
Fix for mantis 711 (http://mantis.tokeek.de/view.php?id=711)
Added as an additional icon with title in the search progress bar, to
inform about background search feeder threads terminated or still
running. While giving a bit more information to users about the p2p
search process, this can help to decide whether or not to wait a little
longer before going to the next page, in order to get results from
various sources sorted as best as possible (see #91 for a discussion
about sorting accuracy and network latency).
Other related modifications included :
- regular updates to statistics in the progress bar until the
background feeders are completely terminated.
- removed some uses of insecure and discouraged JavaScript elements
- added the new setting as configurable in the "Debug/Analysis" settings
page. Debug/analysis is its main purpose for now as there is currently
no nice and "understandable" ranking score info servlet (see forum
discussion http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5884 )
- render in the "Search Page Layout" page preview when enabled
- added constants
When the import thread has terminated:
- now stop refreshing and stay on the monitoring page to give the user
feedback after a long running import
- added link to the next monitoring step : results from surrogates
reader
- added link to new import
On the new import page, added a link to the last import report, if any.
Take into account the already existing default limit value (especially
useful after a long crawl or surrogates import), or a custom one from
parameter "count".
Added a "Show all" link for convenience.
When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote
file is now directly streamed and processed, allowing import of dumps of
several GB even with a low memory remote peer, and without the need to
manually download the dump file first.
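Streamed processing of a dump can be sketched with the JDK's StAX API,
which reads the XML as a sequence of events instead of loading it whole.
This is a simplified illustration, not the importer code of this commit:
StreamingDumpCounter is a hypothetical name, and the sketch assumes an
uncompressed XML stream whose entries are "page" elements.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: counts page elements without buffering the dump. */
final class StreamingDumpCounter {

    static int countPages(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Remote content is untrusted: do not process DTDs.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        int pages = 0;
        try {
            while (reader.hasNext()) {
                // Only the current event is held in memory at any time.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "page".equals(reader.getLocalName())) {
                    pages++;
                }
            }
        } finally {
            reader.close();
        }
        return pages;
    }
}
```

The InputStream could come from java.net.URL.openStream() for a public
HTTP URL, so memory use stays roughly constant whatever the dump size.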
Using an HTML "file" input was confusing (as reported by promocore on
YaCy forum : http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5965) ,
and it only worked with MS IE/Edge on a local YaCy peer :
- for security reasons some current major browsers such as Firefox or
Chrome do not allow sending full file path information when using a file
form input
- the local file system selection popup doesn't make sense when you
want to import a dump on a remote YaCy server
count.
This might be tangentially related to http://mantis.tokeek.de/view.php?id=736
as the example includes a local index search, while rwi results are not
counted.
- ensure use of HTTP POST method when performing server side effect
operations
- transaction token required to ensure the request has actually been
triggered by user interaction
- enabled HTTP POST calls with Digest HTTP authentication
- made API calls compatible with API newly restricted to HTTP POST only
with transaction token validation
- ensured backward compatibility with older entries recorded as HTTP
GET
- ensure use of HTTP POST method : HTTP GET should only be used for
information retrieval and not to perform server side effect operations
(see HTTP standard https://tools.ietf.org/html/rfc7231#section-4.2.1)
- a transaction token is now required for these administrative form
submissions to ensure the request cannot be embedded in an external
site and performed silently or by mistake by the user's browser
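The combination of "HTTP POST only" and a transaction token can be
sketched as below. This is a generic illustration of the technique under
stated assumptions, not YaCy's token implementation: TransactionTokens,
issue() and accept() are hypothetical names, and tokens are assumed to be
valid for a single submission.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical one-time token store for administrative form submissions. */
final class TransactionTokens {
    private final SecureRandom random = new SecureRandom();
    private final Set<String> issued = ConcurrentHashMap.newKeySet();

    /** Creates an unpredictable token to embed as a hidden form field. */
    String issue() {
        byte[] bytes = new byte[16];
        random.nextBytes(bytes);
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        issued.add(token);
        return token;
    }

    /** Accepts only POST requests carrying a token that was issued and not yet used. */
    boolean accept(String method, String token) {
        if (!"POST".equalsIgnoreCase(method) || token == null) {
            return false;
        }
        // remove() returns true only once per token: a replay is rejected.
        return issued.remove(token);
    }
}
```

A page on an external site cannot read the token from the rendered form,
so a request it triggers silently in the user's browser is rejected.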
A port value of -1 will disable this option.
If set to a value greater than 0, YaCy listens on this port on the local
loopback address (127.0.0.1) for a shutdown or restart signal.
E.g. connecting to http://localhost:8005/shutdown will stop the YaCy server.
http://localhost:8005/restart will restart it.
This option allows stopping YaCy locally, independently of the web
frontend (which might be configured for password protected remote access).
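The behavior described above can be sketched as follows. This is not the
code of this commit: LocalControlPort, open() and actionFor() are
hypothetical names, and the sketch only shows binding to the loopback
address and mapping a request line to an action.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

/** Hypothetical sketch of a loopback-only shutdown/restart listener. */
final class LocalControlPort {
    enum Action { SHUTDOWN, RESTART, NONE }

    /** Returns null when the option is disabled (e.g. port -1). */
    static ServerSocket open(int port) throws IOException {
        if (port <= 0) {
            return null;
        }
        // Bound to the loopback address only: not reachable from other hosts.
        return new ServerSocket(port, 1, InetAddress.getLoopbackAddress());
    }

    /** Maps an HTTP request line such as "GET /shutdown HTTP/1.1" to an action. */
    static Action actionFor(String requestLine) {
        if (requestLine == null) {
            return Action.NONE;
        }
        String[] parts = requestLine.trim().split("\\s+");
        if (parts.length < 2) {
            return Action.NONE;
        }
        switch (parts[1]) {
            case "/shutdown": return Action.SHUTDOWN;
            case "/restart": return Action.RESTART;
            default: return Action.NONE;
        }
    }
}
```

Because the socket is bound to the loopback address, the signal can only
be sent from the machine YaCy runs on, whatever the web frontend settings.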
This is a fix for mantis 715 (http://mantis.tokeek.de/view.php?id=715).
A possible scenario that could lead to this case:
- YaCy is running low in memory
- a search is requested
- before the end of search results rendering, the cleanup job runs and
deletes the running search event from the cache because of short memory
- then yacysearchitem renders with "-UNRESOLVED_PATTERN-" parameter
values passed to the statistics() JavaScript function
HTTP "Referer" header sent by the browser when using YaCy can now be
controlled either with the referrer meta tag as a global policy, or only
for search result links by adding the attribute rel="noreferrer".
To improve privacy with as few regressions as possible, the default is
set as meta tag with value "origin-when-cross-origin": internal YaCy
links behavior is not affected, but when visiting external websites the
referrer url is not empty but stripped of query parameters and path.
Older browsers, Safari, MS IE and Edge do not support the referrer meta
tag, so the standard but less flexible noreferrer link type can also be
enabled as an alternative.
User-friendly settings page to be implemented.
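The effect of the "origin-when-cross-origin" policy described above is
applied by the browser, not by YaCy. The sketch below only models what
referrer a supporting browser would send; ReferrerPolicySketch is a
hypothetical name, and the origin comparison is simplified (default ports
are not normalized, protocol downgrades are not handled).

```java
import java.net.URI;

/** Hypothetical model of the origin-when-cross-origin referrer policy. */
final class ReferrerPolicySketch {

    /** Referrer sent when navigating from pageUrl to targetUrl. */
    static String referrerFor(String pageUrl, String targetUrl) {
        String pageOrigin = origin(URI.create(pageUrl));
        if (pageOrigin.equals(origin(URI.create(targetUrl)))) {
            // Same origin: the full URL is sent, without its fragment.
            int hash = pageUrl.indexOf('#');
            return hash < 0 ? pageUrl : pageUrl.substring(0, hash);
        }
        // Cross origin: path and query parameters are stripped.
        return pageOrigin + "/";
    }

    private static String origin(URI uri) {
        String origin = uri.getScheme() + "://" + uri.getHost();
        return uri.getPort() >= 0 ? origin + ":" + uri.getPort() : origin;
    }
}
```

For a search page at http://localhost:8090/yacysearch.html?query=secret,
an external site would only see http://localhost:8090/ as the referrer,
so the search terms are not disclosed.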
- Added a new method to check activation of mandatory fields on
Collection Configuration commit, consistently with checks previously
performed in Switchboard startup and with mandatory fields in the
default schema.
- Reorganized default schema and CollectionConfiguration enumeration:
moved fields that are no longer mandatory to a specific section, and
moved fields enabled at startup to the mandatory section.
- Marked mandatory fields as required and with stronger font in the
IndexSchema_p.html page
As noticed by @reger24, abusive use of OpenSearch systems should be
prevented, especially when allowing HTML results to be parsed and reused.
The robots.txt file is now checked before requesting an external
OpenSearch system, to respect the host exclusions and any crawl-delay
value.
The check is also performed when trying to add a new OpenSearch URL
template through the /ConfigHeuristics_p.html admin page.
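The kind of check described above can be sketched with a deliberately
minimal robots.txt reader. This is not YaCy's robots.txt parser:
RobotsSketch is a hypothetical class that only understands the
"User-agent: *" group with Disallow prefixes and a Crawl-delay value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical, simplified robots.txt reader (wildcard group only). */
final class RobotsSketch {
    final List<String> disallowed = new ArrayList<>();
    long crawlDelayMillis = 0;

    static RobotsSketch parse(String robotsTxt) {
        RobotsSketch robots = new RobotsSketch();
        boolean inWildcardGroup = false;
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank, comment-only or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same group.
                inWildcardGroup = (lastWasAgent && inWildcardGroup) || value.equals("*");
                lastWasAgent = true;
                continue;
            }
            lastWasAgent = false;
            if (!inWildcardGroup) {
                continue;
            }
            if (field.equals("disallow") && !value.isEmpty()) {
                robots.disallowed.add(value);
            } else if (field.equals("crawl-delay")) {
                try {
                    robots.crawlDelayMillis = (long) (Double.parseDouble(value) * 1000);
                } catch (NumberFormatException e) {
                    // Malformed delay: keep the previous value.
                }
            }
        }
        return robots;
    }

    /** True when no Disallow prefix of the wildcard group matches the path. */
    boolean allows(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

Before calling an OpenSearch URL, a caller would check allows() on its
path and wait crawlDelayMillis between two requests to the same host.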