As a server-side oriented alternative to the JavaScript realtime
resorting feature proposed in PR #104.
The goal is the same as in this PR : having the possibility compensate
the network latency of various peers results fetching and obtain once
possible a consistently ranked result set.
Previously, when checking for the first time the robots.txt policy on a
unknown host (not cached in the robots table), result was always empty
in the /getpageinfo_p.xml api and in the /CrawlCheck_p.html page. Next
calls returned however the correct information.
Complements the recent modification related to images in commit 7f395ef.
Unfortunately many documents metadata fetched from the freeworld p2p
network have only partial information about embedded images. Without
proper error handling, this made many searches in p2p mode to fail
completely.
This should be a help to make a preview of search results.
The image is computed from the list of embedded images, it is
always the first image in that list.
In rss-type results the image is presented like
<media:content medium="image" url="https://abc.xyz/logo.png"/>
as defined in
http://www.rssboard.org/media-rss#media-content
This allow large files parsing and preview, while preventing unwanted
OutOfMemory errors which are likely to occur when adding to the Solr
Index resources larger than configured crawler limits.
Thus enable getpageinfo_p API to return something in a reasonable amount
of time on resources over MegaBytes size range.
Support added first with the generic XML parser, for other formats
regular crawler limits apply as usual.
As reported by davide on YaCy forums (
http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6004 ) when the
system is on high load, unless reading carefully YaCy configuration
file, it could be difficult to understand why remote search results are
not fetched.
Otherwise on a malformed getpageinfo_p XML response (from the browser
point of view), JavaScript errors where thrown and the ajax status
steering wheel remained displayed indefinitely.
Especially for Turkish speaking users using "tr" as their system default
locale : strings for technical stuff (URLs, tag names, constants...)
must not be lower cased with the default locale, as 'I' doesn't becomes
'i' like in other locales such as "en", but becomes 'ı'.
This prevent rendering a big and inconvenient scrollbar on resources
containing many links.
If really needed, preview of all links is still available with a "Show
all links" button.
Doesn't affect the number of links used once the crawl is effectively
started, as the list is then loaded again server-side.
This enables keyword navigator to filter on keywords. Added search page
output and layout config for keywords, allowing e.g. in Intranet use
to display the keywords. No styling or links applied to the keyword
text (but is desirable possibly in combination with bootstrap-tagsinput
for future/intranet).
Redirections set for the transition of any eventual external uses:
- /api/getpageinfo.xml to /api/getpageinfo_p.xml
- /api/getpageinfo.json to /api/getpageinfo_p.json
Replaced by shortcuts defined by the HTML "accesskey" attribute which
has the advantage to be advertised by screen readers when focusing the
corresponding buttons, contrary to custom JavasScript key handlers.
Now With Firefox :
- "Alt + Shift + n" for next page
- "Alt + Shift + p" for previous page
Following ARIA recommendation : "keyboard shortcuts enhance, not
replace, standard keyboard access." ( see
https://www.w3.org/TR/wai-aria-practices/#kbd_shortcuts_behavior_design)
Fix for mantis 711 (http://mantis.tokeek.de/view.php?id=711)
Added as an additional icon with title in the search progress bar, to
inform about background search feeder threads terminated or still
running. While giving a bit more information to users about the p2p
search process, this can help choosing whether or not wait a little bit
more time before going to the next page, in order to get results from
various sources sorted as best as possible (see #91 for a discussion
about sorting accuracy and network latency).
Other related modifications included :
- regular updates to statistics in the progress bar until the
background feeders are completely terminated.
- removed some uses of unsecure and discouraged JavaScript elements
- added the new setting as configurable in the "Debug/Analysis" settings
page. Debug/analysis is its main purpose for now as there is currently
no nice and "understansable" ranking score info servlet (see forum
discussion http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5884 )
- render in the "Search Page Layout" page preview when enabled
- added constants
When import thread is terminated :
- now stop refreshing and stay on the monitoring page to give user a
feedback after a long running import
- added link to the next monitoring step : results from surrogates
reader
- added link to new import
On the new import page, added a link on the eventual last import report.
Take into account the already existing default limit value (especially
useful after a long crawl or surrogates import), or a custom one from
parameter "count".
Added a "Show all" link for convenience.
When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote
file now is directly streamed and processed, allowing import of several
GB dumps even with a low memory remote peer, and without need to
manually download the dump file first.
Using an HTML "file" input was confusing (as reported by promocore on
YaCy forum : http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5965) ,
and it only worked with MS IE/Edge on a local YaCy peer :
- for security reasons some current major browsers such as Firefox or
Chrome do not allow to send full file path information when using a file
form input
- the local file system selection popup doesn't make sense when you
want to import a dump on a remote YaCy server
count.
This might be tangential related to http://mantis.tokeek.de/view.php?id=736
as the example includes a local index search, while rwi results are not
counted.
- ensure use of HTTP POST method when performing server side effect
operations
- transaction token required to ensure the request has effectively been
requested by user interaction
- enabled HTTP POST calls with Digest HTTP authentication
- made API calls compatible with API newly restricted to HTTP POST only
with transaction token validation
- ensured backward compatibility with older entries recorded as HTTP
GET
- ensure use of HTTP POST method : HTTP GET should only be used for
information retrieval and not to perform server side effect operations
(see HTTP standard https://tools.ietf.org/html/rfc7231#section-4.2.1)
- a transaction token is now required for these administrative form
submissions to ensure the request can not be included in an external
site and performed silently/by mistake by the user browser
A port value of -1 will disable this option.
If set to a value greater 0, YaCy listens on this of on the local loopback
address (127.0.0.1) for a shutdown or restart signal.
E.g. connect to http://localhost:8005/shutdown will stop the YaCy server.
http://localhost:8005/restart will restart it.
This option allows to stop YaCy locally independant from the web web
frontend (which might be configured for password protected remote access).
This is a fix for mantis 715 (http://mantis.tokeek.de/view.php?id=715).
A possible path scenario that could leading to this case :
- YaCy is running low in memory
- a search is requested
- before the end of search results rendering, the cleanup job runs and
deletes the running search event from the cache because of short memory
- then yacysearchitem renders with "-UNRESOLVED_PATTERN-" parameter
values passed to the statistics() JavaScript function
HTTP "Referer" header sent by the browser when using YaCy can now be
controlled either with the referrer meta tag as a global policy, or only
for search result links by adding the attribute rel="noreferrer".
To improve privacy with the less possible regressions, the default is
set as meta tag with value "origin-when-cross-origin" : internal YaCy
links behavior is not affected, but when visiting external websites
referrer url is not empty but stripped from query parameters and path.
Older browsers, Safari, MS IE and Edge do not support the referrer meta
tag, so the standard but less flexible noreferrer link type can also be
enabled as an alternative.
User-friendly settings page to be implemented.
- Added a new method to check activation of mandatory fields on
Collection Configuration commit, consistently with checks previously
performed in Switchboard startup and with mandatory fields in the
default schema.
- Reorganized default schema and CollectionConfiguration enumeration :
moved no more mandatory fields in a specific section, and moved fields
enabled at startup to the mandatory section.
- Marked mandatory fields as required and with stronger font in the
IndexSchema_p.html page
As noticed by @reger24, abusive use of OpenSearch systems should be
prevented, especially if allowing to parse and reuse HTML results.
robots.txt file is now checked before requesting an external OpenSearch
system to respect the host exclusions and eventual crawl-delay value.
The check is also performed when trying to add a new OpenSearch URL
template through the /ConfigHeuristics_p.html admin page.
As discussed in PR #93 with @JeremyRand and @reger24 this new advanced
settings page includes:
- a new setting to control remote Solr responses encoding
- some existing debug settings which could not be set through the admin
user interface
This property has been deprecated four years ago by commit
d74472f562. For any active search event
id, it was then always filled with "-UNRESOLVED_PATTERN-".
This is to be more easily understandable and to reflect more accurately
the current memory strategies implementations that eventually set the
"proper" state not only because DHT reception.
As described in mantis 725 (http://mantis.tokeek.de/view.php?id=725) the
HostBrowser.xml api directory entries had incorrect count attribute
value.
This was because the HostBrowser html page and backing template servlet
evolved, but modifications were not reported on the xml api.
This will prevent mistakenly hiding a div element not designed to be an
infobox but having a ".info" parent (After having previously added the
possibility for a div - and not only a span element - to be an infobox).
As mentioned in issue #103, control settings over YaCy disk usage
already existed but lacked a user-friendly way to set them.
I added it to the Performance_p.html administration page with a little
refactoring on the "Resource Observer" fieldset for improved
accessibility and HTML standards respect.
Also added the possibility to enable/disable the autoregulation fonction
from this page.
Made use of ol, li, thead, th, tbody, h1 and h2 html tags.
Added aria-label attributes to provide alternative textual information
previously only conveyed by color cue.
Tested behavior with NVDA 2016.4 screen reader.
Also removed deprecated HTML attributes uses.
Validation performed with Nu Html Checker 17.1.0.
Cross browser tested with :
- Debian Jessie : Firefox ESR 45.6.0
- MS Windows 10 : Firefox 50.1.0, Chrome 55.0.2883.87, MS Edge
In the /HostBrowser.html page "only hosts with urls pending in the
crawler", "only with load errors" and "Administration Options" all
require administration credentials. But they were displayed even to
unauthenticated users, and clicking them did nothing and returned the
/HostBrowser.html page empty.
This new "documentStructure" parameter can be set to false to only get
hosts accumulated references on a resource and thus prevent scraping the
specified URL and getting citations references.
Also set WebStructureGraph constants as final and updated the Javadoc
with example api call URLs.
Host names should not contain XML special characters such as quotation
mark, but at this stage the WebGraph may have mistakenly recorded a host
name with such characters. What's more the DigestURL constructor does
not prevent this.
By the way using serverObjects.putXML to encode host names we ensure
here the rendered XML is well formed and can be parsed by external tools
even if an structure entry is incorrect.
As described in mantis 721 (http://mantis.tokeek.de/view.php?id=721)
WatchWebStructure_p.html failed to include in its structure view https
and other protocols and ports than default http.
As described in mantis 720 (http://mantis.tokeek.de/view.php?id=720),
when requesting this API with a domain name instead of a complete URL
only HTTP references on default port were listed.
This ensure consistent implementation of the url host hash generation
and easier usage finding in source code.
Also added a unit test for this function.
The augmented Browsing option was reduced to the web proxy functionallity.
Augmented browsing is not available and no known plan exist to reimplement
alteration of result pages with additional information.
Measurement sample : import from blacklist local file containing about
15000 entries
- before refactoring : several minutes
- after refactoring : a few seconds!
Favicon display only makes sense for http(s) websites, being public or
intranet. So I modified the favicon conditional display to verify the
result URL protocol rather than if we are in intranet mode.
Also prevented rendering an img HTML tag with empty src on other results
protocols such as ftp or file.
Fixing this thanks to priest2 report
(http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5923).
to use javax.servlet.http.Cookie parameters.
Depreciate now obsolete getHeaderCookies.
Adjust setting of MaxAge to spec if >= 0 otherwise keep default.
For this on the header of the viewed result a "add bookmark" button is
available (for authenticated users).
Currently the bookmark is added to a (virtual) bookmark folder "/proxy"
w/o any additional tags etc.
to not already actives.
Dht results are now included in count this might over shoot on redundant
dht and solr, while the previous solr facet based was always low.
Fixes issue #90 for local queries only: Stealth mode, Portal mode or
Intranet mode.
For P2p mode, the issue would probably be difficult to solve with
reasonable performance. This is still to dig.
Also switched some InterreputedException catch log messages to warn
level as this is normal behavior when shutting down a peer.
Fixed yacysearch buttons navbar behavior to deal correctly with total
results count or offset over 1000. Also improved the buttons navbar to
be able to navigate over 10th page for local queries.
When a downloaded archive release is corrupted, empty, or can not be
opened for any reason, the update script must not be launched because it
erases the existing lib/*.jar libraries.
Fixes second part of mantis 708
(http://mantis.tokeek.de/view.php?id=708)
The bootstrap-switch component has some sizing issues with long labels,
which are not likely to be solved soon due to a lack of resources on
that project (see issue
https://github.com/nostalgiaz/bootstrap-switch/issues/419 )
This fix works by applying the following ideas :
- labels are long, so font-size and padding are reduced on small screen
sizes using a media query
- use relative percent width values on the component wrappers to
prevent overlapping on the neighbour content
- disable animation because it relies on absolute pixels width values
keeping order of added nav's.
The search page preview template displays active navs. Therefore a select
and add button has been added below the preview (to keep it close to actual).
This should in future likely be done by drag&drop (html5 feature).
- using a icon-only admin button at small and medium screen size
- using a icon-only "Search Interfaces" button at small screen size
- hiding the YaCy brand at extra-small screen size
Fixes the header part of mantis 708
(http://mantis.tokeek.de/view.php?id=708).
Navigator button overlapping is still to fix.
This error occurs on /ConfigSearchPage_p.html and on search results page
when Metadata links are enabled.
The fix was to remove unnecessary use of hs.htmlExpand() which is now
part of highslide-full.js library file, currently not distributed with
YaCy (only includes highslide.js). The Metadata links work correctly and
the initial dynamic expansion offered by htmlExpand() did not bring much
usability.
As reported by @reger24, image and favicon viewing was broken with
unauthenticated requests on peers configured to require authentication
even from localhost.
So I unified viewing rights check in a single new function on
ImageViewer class.
This makes YaCy easier to configure when running behind a reverse Proxy.
The check on status avoids trying to update the page with error text
content when the server returned a 404 or 500 error message for example.
to work directly with javax.servlet.http.Cookie (rename headerProps to
cookieStore as is only used for this).
(Re)implement set-cookie in DefaultServlet to make cookieAuthentication
work as designed.
When starting a crawl from a file containing thousands of links,
configuration setting "crawler.MaxActiveThreads" is effective to prevent
saturating the system with too many outgoing HTTP connections threads
launched by the crawler.
But robots.txt was not affected by this setting and was indefinitely
increasing the number of concurrently loading threads until most ot the
connections timed out.
To improve performance control, added a pool of threads for Robots.txt,
consistently used in its ensureExist() and massCrawlCheck() methods.
The Robots.txt threads pool max size can now be configured in the
/PerformanceQueus_p.html page, or with the new
"robots.txt.MaxActiveThreads" setting, initialized with the same default
value as the crawler.
It can take any Date field of the index and displays a list of year strings
in reverse order by the year (not the score/count).
To allow to define the index field to use, the fieldname (and title can be
appended to the navi's name "year" e.g. year:load_date_dt:LoadDate
It works also with dates_in_content_dts field (from the graphical date
navigator). Here the query parameter from: to: are used on selection as
Query modifier (for other dates currently no query parameter available, so
selection won't work to filter search results).
Not included in the UI Searchpage layout config so far (for experiment with
it manual change to conf needed).
Upgraded the following JavaScript libraries dependencies :
- bootstrap-switch to 3.3.2
- html5shiv to 3.7.3 and switched to minified version
- typeahead to 0.10.5
- jQuery to 1.12.4
Removed unused bootstratp-rtl.css and bootstrap-rtl.min.css.
Tested non regressions on the following systems :
- Debian Jessie :
- Firefox 45.4.0
- MS Windows 10 :
- Chrome 54.0.2840.99
- Firefox 50.0
- Edge
- Emulated IE 11, 10 and 9
to make all readily available information from the original ServletRequest
available to YaCy servlets (without converting data to internal structures).
The implementation of the common interface allows easier integration of
YaCy servlets with the servlet standard (e.g. shared login service with
the servlet container etc.)
3bcd9d622b
crawler servlet log warning line on failure in one of multiple urls (instead of exception msg)
indexcontrolrwi skip not needed type conversion on ranking
When WikiCode inserted in a peer hosted Blog, Wiki, Messages or Profile
contains relative links (images or any content, hosted in DATA/HTDOCS),
it is more reliable to keep these links relative, especially when the
peer is behind any kind of reverse Proxy.
This is more reliable when YaCy is behind a reverse proxy.
Also updated integration examples to keep the current protocol part
(http or https) in the example address.
the navigator to include counts all matches (rwi+fulltext).
Fixing also unresolved_pattern in navigators title (of the counter)
The use of inurl: query modifier as filter has not been changed keeping
it as soft (unsharp) filter facet.
Upd StringNavigator to prevent empty string form multivalued solr fields,
removed date value conversion (better handled elsewhere, not need here).
Prepared the first basic navigators (for authors and collections) for the
list of SearchEvent.navigatorPlugins and adjusted servlet to use these.
- this allows to configure display order of these navigators (by ordering config string)
- eventually allows for additional and/or custom navigators using any
available index field without need for changing servlets
- the Collection navigation has been adjusted to exclude the internal,
default robot_* and dht collections from displaying
- rwi results are now also checked for navigatior by the refactored navi's
So far no config options were added to customize or add navigators (may
come later if route of upcoming modularization/plugin system is defined).
used for rwi ranking.
Main changes:
- introduce a posintext() to access the stored value. This reduces also mem alloc of position array for WordReferenceRow (index access)
- use the positions() array for joined references on multi-word queries if needed (otherwise allow positions() to be null
- adjust assignments and the min() max() and distance() calculation accordingly
Applied strategy : when there is no restriction on domains or
sub-path(s), stack anchor links once discovered by the content scraper
instead of waiting the complete parsing of the file.
This makes it possible to handle a crawling start file with thousands of
links in a reasonable amount of time.
Performance limitation : even if the crawl start faster with a large
file, the content of the parsed file still is fully loaded in memory.
This pages were already no more XHTML 1.0 because made use of the HTML5
syntax and elements.
Applied current (2016) HTML standard recommended Doctype declaration
(see https://www.w3.org/TR/html/syntax.html#the-doctype ).
was using servlet for network access and missing network.unit.name
fix for http://mantis.tokeek.de/view.php?id=694
+ prevent unresoved_pattern in yacy/list servlet
It had incorrect "-UNRESOLVED_PATTERN-" value (see second part of
mantis 691 http://mantis.tokeek.de/view.php?id=691 )
Note : crawlingDomFilterDepth is apparently unused in current (2016)
YaCy code-base. It was also unnecessary because crawlingDomFilterCheck
hidden field is set to "off".
Removed scheme, host and port from URL to avoid dealing with http/https,
external host and port retrieving issues.
What's more, this is consistent with how URL are displayed in
/Tables_p.html?table=api&count=100&reverse=on&search= or
Tables_p.xml?table=api&count=100&search=
This fixes mantis 691 first part
(http://mantis.tokeek.de/view.php?id=691)
This page was already no more XHTML 1.0 as it makes use of the HTML5
<progress> element.
Applied current HTML standard recommended Doctype declaration (see
https://www.w3.org/TR/html/syntax.html#the-doctype ).
These are no valid link relationships, and do not appear to be used in
scripting or styling.
If necessary, a valid alternative could be to add an attribute such as
data-count="[count]"
This file is used by Bootstrap documentation website
(http://getbootstrap.com/) but is not part of the Bootstrap distribution
and has not be included in a Bootstrap based application.
This ensure consistency between the index link and the
opensearchdescription, even when switching language after having added
your YaCy peer to the browser engines list.
- move the maxcount limit restriction completely to getTopicNavigator (as there not used in getTopics)
- let search servlet use getTopics by default (w/o RWI connected check, as of now, Topics are available w/o any additional index interaction)
New or modified translation (via /Translator_p.html) can be shared/distributed
via the YaCy internal news service. Remote peers can see and vote on the
translation via the new http://localhost:8090/TransNews_p.html servlet.
A positive vote will add the received translation to the local translation
list and post a voting message to the news service.
(at this no processing of received votings is implemented)
+ fixed the msg service retention time check (NewsPool.automaticProcessP)
If language is set to "browser" the client/user browser language is used to choose from
available translation.
simply: one users browser speaks English -> YaCy responds in English, other users browser speaks French -> YaCy responds in French.
! To make a translation/language available you have to activate the language once !
(or manually use the utility class TranslateAll)
In ConfigBasic.html availabel translations are marked green on setting language=Browser
The client language is determined by http header Accept-Language (checked in DefaultServlet)
use directly HttpServletRequest. This is used to get the http protocol version
in HTTPDProxyHandler.fulfillRequestFromWeb() for error response to client.
- adjust YaCyProxyServlet and UrlProxyServlet accordingly
- use more http_version constants in headerframework and httpdeamon
- equalize servlets (3) use of HeaderFramework.CONNECTION_PROP_HOST to HeaderFramework.HOST
We now use the same protocol as the one used to display the config page
: so when using https, the content is not blocked by the browser
detecting mixed-content.
process 1. load default from locales/*.*
2. load and merge(overwrite) from DATA/LOCALE/*.* (can be partial translation as it is merged)
- include all entries from DATA/LOCAL to be edited in Translator servlet
and save just modifications (instead of full list) to DATA/LOCALE
This shall make it easy to share modifications.