Under some conditions (especially when reaching a timeout), concurrent Solr
query tasks used by /HostBrowser.html and /api/linkstructure.json
never terminated, thus leaking resources, as reported by @Vort in issue
#246
New "Media Type detection" section in the advanced crawl start page
allow to choose between :
- not loading URLs with unknown or unsupported file extension without
checking the actual Media Type (relying Content-Type header for now).
This was the old default behavior, faster, but not really accurate.
- always cross check URL file extension against the actual Media Type.
This lets properly parse URLs ending with an apparently odd file
extension, but which have actually a supported Media Type such as
text/html.
Sample URLs with misleading file extensions added as documentation in
the crawl start page.
fixes issue #244
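A minimal sketch of the cross-check idea, assuming a Java 11 HttpClient
and a simplified isSupportedMediaType() helper (illustrative only, not
the actual YaCy crawler API): ask the server for the advertised
Content-Type before deciding whether to load a URL with an odd file
extension.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Locale;

    public class MediaTypeCheck {
        private static final HttpClient CLIENT = HttpClient.newHttpClient();

        /* Hypothetical helper: true when one of the parsers handles this Media Type. */
        static boolean isSupportedMediaType(String contentType) {
            return contentType != null
                    && contentType.toLowerCase(Locale.ROOT).startsWith("text/html");
        }

        /* Cross-check mode: trust the actual Media Type rather than the file extension. */
        static boolean shouldLoad(String url) throws Exception {
            HttpRequest head = HttpRequest.newBuilder(URI.create(url))
                    .method("HEAD", HttpRequest.BodyPublishers.noBody())
                    .build();
            HttpResponse<Void> response = CLIENT.send(head, HttpResponse.BodyHandlers.discarding());
            return isSupportedMediaType(response.headers().firstValue("Content-Type").orElse(null));
        }
    }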
Stopped using the JDK URLDecoder.decode() function, as it strips '+'
characters when they occur after '?' (both characters having regular
expression semantics when used in blacklist path patterns).
Blacklist path patterns are now normalized using percent-encoding, both
when editing a pattern in the web interface and when loading patterns
from configuration files.
Fixes issue #237
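A small illustration of the underlying JDK behavior (plain java.net
usage, not YaCy code): URLDecoder.decode() treats '+' as an encoded
space, so a blacklist path pattern that uses '+' as a regular expression
quantifier would be silently altered, which is why percent-encoding
normalization is used instead.

    import java.net.URLDecoder;
    import java.nio.charset.StandardCharsets;

    public class PlusDecodingDemo {
        public static void main(String[] args) {
            // A blacklist path pattern using '+' with its regular expression meaning
            String pattern = "search\\?query=.+";
            // URLDecoder.decode() turns '+' into a space and breaks the pattern (Java 10+ overload)
            System.out.println(URLDecoder.decode(pattern, StandardCharsets.UTF_8));
        }
    }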
- To meet current browsers' security rules, which prevent selecting a
full file path with an HTML input field of type 'file'
- As it does not make sense to select a local file path when the
administered YaCy server is remote (not on the same computer as the
browser)
Considering that slider usage on that page is very basic, using
standard HTML5 inputs of type "range" has the following advantages here:
- better keyboard accessibility
- removal of additional jQuery dependencies that are not really necessary
Browser support for range inputs is good today, and even old browsers
without support, such as IE < 10, nicely fall back to text inputs.
When searching images, thumbnails that could not be rendered (because of
a load error such as HTTP 404, a networking issue or an internal error on
the rendering servlet) are now hidden by default, but can be revealed
with a button if desired.
Fix for issue #217
Manually replacing the '+' character or "%20" with a space character in
the search query parameter was necessary in YaCy a long time ago to
properly decode the application/x-www-form-urlencoded format (commit
9842fab6e4 in 2010).
Since the introduction of Jetty as the embedded HTTP server (commit
4b77733e59 in 2013), this is no longer necessary, as Jetty internals
already do this for us in
org.eclipse.jetty.util.UrlEncoded.decodeUtf8To().
So we can now remove this duplicated decoding, as it prevents proper use
of the '+' character in search requests, as reported in issue #216.
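A rough illustration of why the duplicated decoding hurt (assumed
request flow, not the actual servlet code): Jetty already turns "%2B"
into '+' and '+' into a space, so a second manual pass destroys literal
plus signs the user actually typed.

    public class DoubleDecodeDemo {
        public static void main(String[] args) {
            // What the browser sends for the query "c++" in application/x-www-form-urlencoded form
            String rawParameter = "c%2B%2B";
            // Jetty (org.eclipse.jetty.util.UrlEncoded) already decodes this to the intended query
            String decodedByJetty = "c++";
            // The old manual second pass replaced '+' with spaces, mangling the query into "c  "
            String manuallyDecodedAgain = decodedByJetty.replace('+', ' ');
            System.out.println(rawParameter + " -> " + decodedByJetty + " -> [" + manuallyDecodedAgain + "]");
        }
    }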
The library status in the DictionaryLoader_p.html page now also informs
the user that an upgrade can be applied when an older dump is already
loaded.
Upgrade applied as suggested by Niklas Andrus @fapth_gitlab on Gitter
chat.
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formatting dates, as it is not thread-safe
(it internally holds a calendar instance that is not synchronized).
Now prefer DateTimeFormatter where possible, as it is thread-safe without
a concurrent-access performance bottleneck (it does not internally use
synchronization locks).
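A minimal sketch of the preferred pattern (a generic formatter, not the
exact formatters used in YaCy): a single shared DateTimeFormatter can
safely be reused from any thread, whereas a shared SimpleDateFormat
would need external synchronization.

    import java.time.ZoneOffset;
    import java.time.ZonedDateTime;
    import java.time.format.DateTimeFormatter;

    public class DateFormatting {
        // Immutable and thread-safe: one instance can be shared by all threads
        private static final DateTimeFormatter HTTP_DATE =
                DateTimeFormatter.RFC_1123_DATE_TIME.withZone(ZoneOffset.UTC);

        public static String format(ZonedDateTime date) {
            return HTTP_DATE.format(date);
        }

        public static ZonedDateTime parse(String date) {
            return ZonedDateTime.parse(date, HTTP_DATE);
        }
    }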
This makes it possible to set up much more advanced document crawl
filters, by filtering on one or more indexed document fields before
inserting into the index.
If you are not interested in displaying this on your search results,
notably on a peer with limited resources, this can help save some CPU
and outgoing network connections.
By not generating MD5 hashes for all words of indexed texts, processing
time is reduced by 30 to 50% on indexed documents with more than 1 MB
of plain text.
Ensuring the HTTP POST method is used for operations with server-side
effects (in accordance with HTTP semantics), and that a valid transaction
token is provided by the user agent.
- to prevent unwanted exposure of index entries about private
local/intranet documents when switching from "Intranet Indexing" mode
while attached to remote Solr instance(s)
- to warn the user about remote Solr instance(s) still attached when
switching from modes other than "Intranet Indexing"
- properly handle IPv6 loopback address replacement
- replace the loopback address or host only when accessing the peer
remotely
- replace the loopback part with the peer hostname as requested, rather
than with its seed public IP, as this works better in Intranet mode and
when the peer is behind a reverse proxy.
Otherwise, once this operation is applied, the remote Solr instance(s)
are disconnected and the embedded Solr is connected even if disabled by
the "core.service.fulltext" setting.
Also use constants for the related default setting values.
This is necessary when you want to attach to a dedicated external Solr
server protected with basic HTTP authentication and accessed over HTTPS
but having only a self-signed certificate.
When inlined (for example in the CrawlProfileEditor_p.html page):
search only on the comment, as the URL is not visible.
On regular display: search on comment OR URL, instead of comment AND
URL. Otherwise, searching on comment terms is almost useless, as these
terms are not necessarily present in the URL.
Applied the same default results page size as when a type filter is
defined, for proper and consistent page navigation when combining a type
filter and a search query.
Otherwise, when interested in viewing `Link List` for example, each time
you typed a new URL, the `Parsed Sentences` view mode was selected by
default and you had to select again the view mode you were interested
in.
As discussed in issue #160, blacklist entries currently cannot be
"complete" regular expressions, but must be structured as a domain part,
a separator character ('/'), and a path part (see the sketch after this
list).
- Fixes issue #160: properly handle syntax exceptions with a
user-friendly message
- Fixes loss of information when editing multiple blacklist entries
- Fixes loss of entries when moving entries from one list to another
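A simplified sketch of the host/path structure and of the friendlier
syntax check (illustrative names, not the actual Blacklist API): split
the entry at the first '/' and report a PatternSyntaxException as a
readable message instead of failing silently.

    import java.util.regex.Pattern;
    import java.util.regex.PatternSyntaxException;

    public class BlacklistEntryCheck {
        /* Returns null when the entry is valid, otherwise a user-friendly error message. */
        static String validate(String entry) {
            int separator = entry.indexOf('/');
            if (separator < 0) {
                return "A blacklist entry must contain a '/' separating the host part from the path part.";
            }
            String host = entry.substring(0, separator);
            String path = entry.substring(separator + 1);
            if (host.isEmpty()) {
                return "The host part of a blacklist entry must not be empty.";
            }
            try {
                Pattern.compile(path); // the path part is interpreted as a regular expression
            } catch (PatternSyntaxException e) {
                return "Invalid path pattern in '" + entry + "': " + e.getDescription();
            }
            return null;
        }
    }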
Previously, when clicking a selected facet in the search results page to
unselect it, any other currently selected modifiers/facets were also
removed.
With the appropriate vocabulary settings in the Vocabulary_p.html page,
this can produce vocabulary search facets displaying item types
referenced in HTML documents by microdata annotations.
Tested notably, but not exclusively, with vocabulary classes/types
defined by Schema.org and Dublin Core.
Thus allowing to choose, at configuration time or per search request,
whether or not to extend results beyond the strict content domain filter
(image, video, audio or application).
Related graphical controls are still to be added to the user interface.
Required for proper operation when the default system locale is Turkish,
as dotless and dotted i characters have specific case conversion rules
in this language.
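A short reminder of the underlying pitfall (plain JDK behavior, not
YaCy-specific code): with a Turkish default locale, lowercasing "I"
yields a dotless 'ı', so case conversions of protocol or configuration
keys should pass an explicit locale such as Locale.ROOT.

    import java.util.Locale;

    public class TurkishLocaleDemo {
        public static void main(String[] args) {
            String header = "CONTENT-TYPE";
            // With a Turkish locale, 'I' becomes the dotless 'ı' and ASCII comparisons break
            System.out.println(header.toLowerCase(new Locale("tr", "TR")));
            // Locale-insensitive conversion keeps the expected ASCII result
            System.out.println(header.toLowerCase(Locale.ROOT)); // content-type
        }
    }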
Introduced through the new configurable setting
network.unit.protocol.https.preferred, defaulting to false for now.
Lets you choose to prefer using HTTPS, when available on remote peers, to
perform YaCy protocol operations, notably including hello or transferRWI.
Not yet implemented for every YaCy protocol operation.
When a crawl is started, a new field to exclude content from scraping is
available. The field accepts class names of div tags. All text contained
in a div tag whose class name matches the configured class name(s) is not
indexed, while the rest of the page is indexed.
Upgraded to InetAccessHandler.
Added the InetPathAccessHandler extension to InetAccessHandler to
maintain the path pattern capability previously available in
IPAccessHandler but lost in InetAccessHandler.
Filtering on IPv6 addresses is now supported.
Support for deprecated pattern formats such as "192.168." and
"192.168.1.1/path" has been removed, but an automated migration at
startup should convert such patterns if present in serverClient.
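For reference, a minimal standalone sketch of how Jetty's plain
InetAccessHandler is wired (generic Jetty 9.4 usage; the path-pattern
support described above lives in the YaCy-specific InetPathAccessHandler
subclass and is not shown here).

    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.handler.DefaultHandler;
    import org.eclipse.jetty.server.handler.InetAccessHandler;

    public class InetAccessExample {
        public static void main(String[] args) throws Exception {
            Server server = new Server(8090);

            InetAccessHandler accessHandler = new InetAccessHandler();
            // Allow localhost (IPv4 and IPv6) and a private range; other addresses are rejected
            accessHandler.include("127.0.0.1");
            accessHandler.include("::1");
            accessHandler.include("192.168.0.0-192.168.255.255");
            accessHandler.setHandler(new DefaultHandler());

            server.setHandler(accessHandler);
            server.start();
            server.join();
        }
    }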
Previously named with their ISO 3166-1 country codes: as a result, when
setting the language to "Browser" in ConfigBasic.html, it did not work
properly when the browser's preferred language was Chinese or Greek, as
their respective language codes are "zh" and "el" (not "cn" and "gr",
which are their country codes)
Also ensure authentication is not lost through the Digest timeout when
navigating between index.html and the search results page.
This way, running searches with extended features on a remote peer or a
password-protected peer works with a regular user (with "Extended
search" rights).
When authenticating on the search page with a user lacking "Extended
search" rights, the user appears as authenticated, but only has its usual
access to the public search features.
Otherwise, when authenticated as admin and navigating from search
results or admin pages to the search start page (/index.html), if
nothing is done on that page within the HTTP Digest Auth timeout (about
2 minutes), the search is performed without authentication and thus
without extended search features.
Thus also including the same optional login link/status in the search
start page as in the results page, for the same convenient login without
the need to use the Administration section.
Restores the behavior introduced eleven years ago (see commit
479861a3cf) and lost by mistake 3 years
ago (see commit 617dd9c97b), when the
click handler started referencing a missing HTML id.
Thus allowing a more convenient way (without the need to go to the admin
section) to log in when searching on your remote or password-protected
peer and benefit from extended search features such as Heuristics,
Bookmarking or JavaScript resorting.
Can be disabled using ConfigSearchPage_p.html.
Resizing JPEG snapshot images through /api/snapshot.jpg failed when
running on OpenJDK, but rendered successfully with an Oracle JDK.
Details in mantis 772 ( http://mantis.tokeek.de/view.php?id=772 ).
Removing any alpha component (useless in snapshot images) from the
rendered resized image solves the issue.
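A minimal sketch of the fix idea (generic ImageIO usage, not the exact
snapshot servlet code): draw the resized image onto an RGB-only
BufferedImage before encoding, so that no alpha channel ever reaches the
JPEG writer.

    import java.awt.Graphics2D;
    import java.awt.Image;
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import javax.imageio.ImageIO;

    public class SnapshotResize {
        static void writeResizedJpeg(BufferedImage source, int width, int height, File target)
                throws IOException {
            Image scaled = source.getScaledInstance(width, height, Image.SCALE_SMOOTH);
            // TYPE_INT_RGB has no alpha component, which keeps the JPEG writer happy on OpenJDK
            BufferedImage rgb = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
            Graphics2D graphics = rgb.createGraphics();
            graphics.drawImage(scaled, 0, 0, null);
            graphics.dispose();
            ImageIO.write(rgb, "jpg", target);
        }
    }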
Previously rendered as a broken URL containing the absolute file path of
a snapshot on the search server.
Now rendered as a valid URL linking to the /api/snapshot API to provide
the available snapshot content. The snapshot format is selected among the
available ones in the following order of preference: JPG/PNG, PDF, and
XML.
The SearchEvent listens to changes on each of its navigators, and
information about their overall state is sent with each fetched search
item (as a "data-nav-generation" attribute). The browser can then
regularly fetch a fresh version of yacysearchtrailer.html only if
necessary (when that nav-generation value changes).
When the limit is reached, a button allows expanding/collapsing the
remaining tags.
When this feature is activated without a limit on the number of
displayed tags and search results with a very large number of keywords
are encountered, the results page can become almost unusable (very long
vertical scrollbar)
This is a fix for mantis 766 ( http://mantis.tokeek.de/view.php?id=766 )
Since the upgrade to Digest authentication, access to protected search
features was indeed disabled once the Digest nonce timed out.
After the Digest auth timeout, the browser no longer sent authentication
information and, as the search results page is not private, protected
features were simply hidden without asking the browser again for
authentication.
Adding a supplementary parameter when accessing the search results as
authenticated fixes this.
If switched on, a link (click) on the tag adds the keyword to the search
query.
If a keyword navigator is active, the selected keyword adds or replaces
a query keyword: modifier (currently replace was chosen, as multiple
keywords are not fully supported yet)