if at least one of the image size fields is enabled in index (images_height_val,
images_width_val, images_pixel_val).
Previously all fields were required to be enabled (hint: default setting
is height + width enabled)
Otherwise once this operation is applied, the remote Solr(s) instances
are deconnected and the embedded Solr is connected even if disabled by
setting "core.service.fulltext".
Also use constants for related default setting values.
- Use the EnhancedXMLResponseWriter only when requested output is "exml"
- Use the Standard Solr writers when possible, for example for json, xml
or javabin output formats
- Return an error when the requested format can not been rendered with
an external Solr server only
Important : this modification is necessary for peers using exclusively
an external Solr server to be reachable as robinson targets in p2p
search, as the binary format ("javabin") is the default Solr exchange
format for peers.
Before this, when a peer requested a remote one attached only to an
external Solr (no embedded one), it ended with "Invalid type" error, as
the remote peer answered with xml although binary format was requested.
This is necessary when you want to attach to a dedicated external Solr
server protected with basic http authentication and requested over https
but having only a self-signed certificate.
The current default boost function
(`recip(ms(NOW,last_modified),3.16e-11,1,1)`) for the Date ranking
profile is indeed working fine.
What can trigger the error `unexpected docvalues type NUMERIC for field
'last_modified'` is the previous default boost function (quite old now)
or any custom one using the Solr `ord` or `rord` functions on the
last_modified field.
Then the problem was that the migration code in the Switchboard supposed
to detect the old date boost function was incorrect (one trailing right
parenthesis in excess), so the deprecated function remained.
This fixes issue #169.
Previously, when clicking a selected facet in the search results page to
unselect it, all other eventually selected modifiers/facets were also
removed.
With the appropriate vocabulary settings in Vocabulary_p.html page, this
can produce Vocabulary search facets displaying item types referenced in
html documents by microdata annotation.
Tested notably, but not limited to, vocabulary classes/types defined by
Schema.org and Dublin Core.
check case insensitive.
As keywords are compared lower case, make sure user input keyword:Key
or keyword:key will be shown as active in facet entry key.
Thus allowing to choose at configuration or per search request, whether
extending or not results beyond strict content domain filter (image,
video, audio or application).
Related graphical controls to be added to user interface.
Required for proper operation when the default system locale is Turkish,
as dottless and dotted i characters have specific case conversion rules
in this language.
Introduced through the new configurable setting
network.unit.protocol.https.preferred, defaulting to false for now.
Let choose to prefer using https when available on remote peers to
perform YaCy protocol operations including notably hello or transferRWI.
Not yet implemented for every YaCy protocol operations.
When a crawl is started, a new field to exclude content from scraping is
available. The field can be identified with the class name of div tags.
All text contained in such a div tag where the configured class name(s)
match are not indexed, while the remaining page is indexed.
Allow typing directly internationalized domain names including non ASCII
characters in the search field.
Search is done using the ASCII Compatible Encoding (ACE) representation.
Default is still http to prevent any regressions, but a new setting is
available to choose https as the preferred protocol to perform remote
searches.
New configuration setting 'remotesearch.https.preferred' is manually
editable in yacy.conf file or in Advanced Properties page
(/ConfigProperties_p.html).
Should be enabled as default in the future for improved privacy.
Https could also eventually be used for other peers communications.
Also ensure authentication is not lost by Digest timeout when navigating
between index.html and search results page.
This way, running searches with extended features on a remote peer or a
password protected peer works with a regular user (with "Extended
search" rights).
When authenticating on the search page with a user without "Extended
search" rights, it appears as authenticated, but has just its usual
access to the public search features.
Thus allowing a more convenient way (wihout the need to go to the admin
section) to login when searching on your remote or password protected
peer and benefit from extended search features such as Heuristics,
Bookmarking or JavasScript resorting.
Can be disabled using the ConfigSearchPage_p.html.
The SearchEvent listen to changes on each of its navigators, and the
information about their overall state is sent with each fetched search
item (as a "data-nav-generation" attribute). Then the browser can
regularly fetch a fresh version of yacysearchtrailer.html only if
necessary (when that nav-generation value change).
When the limit is reached, a button allow expanding/collapsing remaining
tags.
When this feature is activated without a limit to the number of
displayed tags, when encountering search results with a very large
number of keywords, the results page can become almost unusable (very
long vertical scrollbar)
This is a fix for mantis 766 ( http://mantis.tokeek.de/view.php?id=766 )
Since the upgrade to Digest authentication, access to protected search
features was indeed disabled once the Digest nonce timed out.
After Digest auth timeout the browser no more sent authentication
information and as the search results page is not private, protected
features were simply be hidden without asking browser again for
authentication.
Adding a supplementary parameter when accessing the search results as
authenticated fixes this.
Inspired from the existing one used on image search, and consistent with
post filtering on content domain applied in SearchEvent.addNodes().
These filters are quite simplistic but at least audio, video or
application search now return results. Previously, when filtering on
these content domains, many results pages (and often even the first
page) were empty while the total results count suggested that results
should be available. This was because filtering on domain was only
applied AFTER requesting Solr indexes.
- added some missing increments from RWI results
- decrement relevant navigator counts when solr or RWI results are
evicted because duplicates detection or constraints checked belatedly
- do not compute facets when unnecessary to avoid unwanted CPU load
- do not increment from facets when already done
- do not rely on facets on remote solr peers requests, as most of the
time only a limited part of their total results if fetched (thus also
preventing unnecessary load on remote peers)
- use a concurrency friendly score map for the dates navigators to
prevent unwanted ConcurrentModificationExceptions
This improves the situation for the most obvious inconsistencies in
search navigators counts, but more has to be done for a true accuracy
(notably when query modifiers constraints are applied belatedly - after
the solr or RWI retrieval request - such as the content domain
constraint)
Was inadequately modified in my previous related commits (making next
pages buttons unavailable in Search portal mode), as
SearchEvent.local_solr_available did not count the total filtered
results but only the ones within the currently fetched result page(s).
This modification has indeed low incidence as eventual query modifiers
are already applied when requesting the local solr index.
It mainly impact doublons detected with results from remote peers.
Also updated javadocs for clarification.
As a server-side oriented alternative to the JavaScript realtime
resorting feature proposed in PR #104.
The goal is the same as in this PR : having the possibility compensate
the network latency of various peers results fetching and obtain once
possible a consistently ranked result set.
to make sure updated documents are indexed with their last-modified
date as provided in current crawl.
(to patch moddate always with firstseen might bear the risk of miss
actual updates).
As reported by davide on YaCy forums (
http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6004 ) when the
system is on high load, unless reading carefully YaCy configuration
file, it could be difficult to understand why remote search results are
not fetched.
Especially for Turkish speaking users using "tr" as their system default
locale : strings for technical stuff (URLs, tag names, constants...)
must not be lower cased with the default locale, as 'I' doesn't becomes
'i' like in other locales such as "en", but becomes 'ı'.
Using a Reentrant lock instead of the intrinsic synchronization lock
permits limiting the blocking time to acquire a lock.
Useful on a very busy Cache concurrently accessed by many threads : when
the time to acquire a lock is too high, getting/storing content on the
cache becomes inefficient, and it is then better to fall back to loading
remote resources.
Illustrated by the CacheTest stress test and some traces reported in
mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )
Also add when possible a warning level log message on input stream
closing error instead of failing silently. This could help understanding
some IO exceptions such as "too many files open".
This enables keyword navigator to filter on keywords. Added search page
output and layout config for keywords, allowing e.g. in Intranet use
to display the keywords. No styling or links applied to the keyword
text (but is desirable possibly in combination with bootstrap-tagsinput
for future/intranet).
When Webgraph Solr core is enabled, crawling and removing from index an
URL whose hash starts with the '-' character (example URL :
https://cs.wikipedia.org/ whose hash is "-2-HuTEndn4x") produced a full
ParseException stack trace in YaCy logs. This was not blocking because
the Solr query parser is able to escape itself the query and run it
successfully, but filled uselessly YaCy logs.
Added as an additional icon with title in the search progress bar, to
inform about background search feeder threads terminated or still
running. While giving a bit more information to users about the p2p
search process, this can help choosing whether or not wait a little bit
more time before going to the next page, in order to get results from
various sources sorted as best as possible (see #91 for a discussion
about sorting accuracy and network latency).
Other related modifications included :
- regular updates to statistics in the progress bar until the
background feeders are completely terminated.
- removed some uses of unsecure and discouraged JavaScript elements
- added the new setting as configurable in the "Debug/Analysis" settings
page. Debug/analysis is its main purpose for now as there is currently
no nice and "understansable" ranking score info servlet (see forum
discussion http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5884 )
- render in the "Search Page Layout" page preview when enabled
- added constants
Revealed by commit c77e43a : the exception was then thrown when indexing
pages containing mailto: scheme URL links with the Solr Webgraph core
enabled.
Fixed the error case and restored filtering on mailto links in
Document.resortLinks() as these URLs still should not appear in
Document.hyperlinks.
When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote
file now is directly streamed and processed, allowing import of several
GB dumps even with a low memory remote peer, and without need to
manually download the dump file first.
count.
This might be tangential related to http://mantis.tokeek.de/view.php?id=736
as the example includes a local index search, while rwi results are not
counted.
The keywords field string is split into words as navigator entries.
A keyword navigator facet is essential for search appliance usage were
documents and metadata use often specialized keyword vocabularies to
filter search results. This navi can be used without custom index schema.
As we don't have defined a search query command to filter "keywords" yet,
the filtering is limited by adding the keyword to the search query.
warc = Web ARChive File Format.
Warc files with extension .warc or compressed warc.gz can be placed in the
DATA/surrogate/in and contained responses are imported to the index.
The used library is stream based so we can easily extend it later to use
and load warc's from the net.
A port value of -1 will disable this option.
If set to a value greater 0, YaCy listens on this of on the local loopback
address (127.0.0.1) for a shutdown or restart signal.
E.g. connect to http://localhost:8005/shutdown will stop the YaCy server.
http://localhost:8005/restart will restart it.
This option allows to stop YaCy locally independant from the web web
frontend (which might be configured for password protected remote access).
HTTP "Referer" header sent by the browser when using YaCy can now be
controlled either with the referrer meta tag as a global policy, or only
for search result links by adding the attribute rel="noreferrer".
To improve privacy with the less possible regressions, the default is
set as meta tag with value "origin-when-cross-origin" : internal YaCy
links behavior is not affected, but when visiting external websites
referrer url is not empty but stripped from query parameters and path.
Older browsers, Safari, MS IE and Edge do not support the referrer meta
tag, so the standard but less flexible noreferrer link type can also be
enabled as an alternative.
User-friendly settings page to be implemented.
low number of found documents - by adding additional end condition to
remove processed query with number of found docs <= process-chunck-size.
Noticed on query h4_txt:[* TO *], found 21, process 21, call of commit happend
but on next cycle same query again 21 docs found (while h4_txt was removed
from schema and committed inputdocuments).
Fix NPE on disabled local SolrIndex, occuring on search moving to the 2nd result page.
The debug purpose only setting to disabeling local SolrIndex (System Admin -> Debug Settings) should long term probably be removed from production code.
These fields are default enabled but with no doubt not strictly
mandatory with the current code base.
As reported by @reger24, splitting between essential mandatory and
optional fields is still to be improved to reflect the current YaCy
needs.
- Added a new method to check activation of mandatory fields on
Collection Configuration commit, consistently with checks previously
performed in Switchboard startup and with mandatory fields in the
default schema.
- Reorganized default schema and CollectionConfiguration enumeration :
moved no more mandatory fields in a specific section, and moved fields
enabled at startup to the mandatory section.
- Marked mandatory fields as required and with stronger font in the
IndexSchema_p.html page
sq=solr-querystring) to allow to filter simple text queries for processing,
remove toString for counter parameter
use more predefined constants in solrservlet
As reported by Palulukas in YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=18&t=5944&sid=dcef5b899ab4aa9b40e3a3d158c13aed#p33454)
the Index Export operation can fails, notably when the Solr index
contains one or more documents with empty (despite required)
"load_date_dt" field.
This fixes the export failure when the situation finally occurs, but
more should be done to harden verifications on minimum required fields.
to avoid repeat of tokenized url as description,
continuation of 7e09bff4a11409cabe8b
Add some javadoc, and not needed remove of omitted fields in postprocessing.
As discussed in PR #93 with @JeremyRand and @reger24 this new advanced
settings page includes:
- a new setting to control remote Solr responses encoding
- some existing debug settings which could not be set through the admin
user interface
index schema.
To assemble the original link url for out-/inboundlinks, icons and pictures
the *_protocol_sxt and *_urlstub_sxt is needed (due to the used data-reduced
storage methode). Auto-enable *_protocol_sxt if *_urlstub_sxt is enabled.
to be able to correctly assemble the original link url.
As mentioned in issue #103, control settings over YaCy disk usage
already existed but lacked a user-friendly way to set them.
I added it to the Performance_p.html administration page with a little
refactoring on the "Resource Observer" fieldset for improved
accessibility and HTML standards respect.
Also added the possibility to enable/disable the autoregulation fonction
from this page.
As reported by @tglman on issue #90, when searching images on the local
index only, pages next to the first were always empty. This was a
regression from commit c25e48e969.
This ensure consistent implementation of the url host hash generation
and easier usage finding in source code.
Also added a unit test for this function.
to use javax.servlet.http.Cookie parameters.
Depreciate now obsolete getHeaderCookies.
Adjust setting of MaxAge to spec if >= 0 otherwise keep default.
The main shutdown hook thread was not properly waiting for the main
thread termination which consequently could not properly close resources
and threads. After terminating a running YaCy peer this way (Ctrl+C in
console, or kill <pid> for example), you could see the still existing
DATA/yacy.running file.
Tested with :
- Debian Jessie openjdk 7 and 8 : regular shutdown, Ctrl+C, kill
command, system restart while yacy is running
- Windows 10 Oracle JDK 7 and 8 : non regression on regular shutdown
Applied rules :
- when the FTP URL denotes a file resource, stack it as any start URL :
eventually embedded links can be followed applying the usual depth rules
- when the FTP URL denotes a directory, list files under this directory
and stack them for crawl, and repeat the process on sub folders until
crawl depth is reached
to not already actives.
Dht results are now included in count this might over shoot on redundant
dht and solr, while the previous solr facet based was always low.
Fixes issue #90 for local queries only: Stealth mode, Portal mode or
Intranet mode.
For P2p mode, the issue would probably be difficult to solve with
reasonable performance. This is still to dig.
Also switched some InterreputedException catch log messages to warn
level as this is normal behavior when shutting down a peer.
Fixed yacysearch buttons navbar behavior to deal correctly with total
results count or offset over 1000. Also improved the buttons navbar to
be able to navigate over 10th page for local queries.
When a downloaded archive release is corrupted, empty, or can not be
opened for any reason, the update script must not be launched because it
erases the existing lib/*.jar libraries.
NullPointerException occurred when using and Identificator instance
which encountered and error in its constructor.
This error could be caused by a missing "langdetect" folder in the
current folder of the main process, or by simultaneous first calls to
the constructor, initializing concurrently the DetectorFactory.langlist.
Fixes the mantis 714 (http://mantis.tokeek.de/view.php?id=714)
When starting a crawl from a file containing thousands of links,
configuration setting "crawler.MaxActiveThreads" is effective to prevent
saturating the system with too many outgoing HTTP connections threads
launched by the crawler.
But robots.txt was not affected by this setting and was indefinitely
increasing the number of concurrently loading threads until most ot the
connections timed out.
To improve performance control, added a pool of threads for Robots.txt,
consistently used in its ensureExist() and massCrawlCheck() methods.
The Robots.txt threads pool max size can now be configured in the
/PerformanceQueus_p.html page, or with the new
"robots.txt.MaxActiveThreads" setting, initialized with the same default
value as the crawler.
It can take any Date field of the index and displays a list of year strings
in reverse order by the year (not the score/count).
To allow to define the index field to use, the fieldname (and title can be
appended to the navi's name "year" e.g. year:load_date_dt:LoadDate
It works also with dates_in_content_dts field (from the graphical date
navigator). Here the query parameter from: to: are used on selection as
Query modifier (for other dates currently no query parameter available, so
selection won't work to filter search results).
Not included in the UI Searchpage layout config so far (for experiment with
it manual change to conf needed).
the navigator to include counts all matches (rwi+fulltext).
Fixing also unresolved_pattern in navigators title (of the counter)
The use of inurl: query modifier as filter has not been changed keeping
it as soft (unsharp) filter facet.
Upd StringNavigator to prevent empty string form multivalued solr fields,
removed date value conversion (better handled elsewhere, not need here).
Prepared the first basic navigators (for authors and collections) for the
list of SearchEvent.navigatorPlugins and adjusted servlet to use these.
- this allows to configure display order of these navigators (by ordering config string)
- eventually allows for additional and/or custom navigators using any
available index field without need for changing servlets
- the Collection navigation has been adjusted to exclude the internal,
default robot_* and dht collections from displaying
- rwi results are now also checked for navigatior by the refactored navi's
So far no config options were added to customize or add navigators (may
come later if route of upcoming modularization/plugin system is defined).