Otherwise once this operation is applied, the remote Solr(s) instances
are deconnected and the embedded Solr is connected even if disabled by
setting "core.service.fulltext".
Also use constants for related default setting values.
This is necessary when you want to attach to a dedicated external Solr
server protected with basic http authentication and requested over https
but having only a self-signed certificate.
The current default boost function
(`recip(ms(NOW,last_modified),3.16e-11,1,1)`) for the Date ranking
profile is indeed working fine.
What can trigger the error `unexpected docvalues type NUMERIC for field
'last_modified'` is the previous default boost function (quite old now)
or any custom one using the Solr `ord` or `rord` functions on the
last_modified field.
Then the problem was that the migration code in the Switchboard supposed
to detect the old date boost function was incorrect (one trailing right
parenthesis in excess), so the deprecated function remained.
This fixes issue #169.
With the appropriate vocabulary settings in Vocabulary_p.html page, this
can produce Vocabulary search facets displaying item types referenced in
html documents by microdata annotation.
Tested notably, but not limited to, vocabulary classes/types defined by
Schema.org and Dublin Core.
When a crawl is started, a new field to exclude content from scraping is
available. The field can be identified with the class name of div tags.
All text contained in such a div tag where the configured class name(s)
match are not indexed, while the remaining page is indexed.
Also ensure authentication is not lost by Digest timeout when navigating
between index.html and search results page.
This way, running searches with extended features on a remote peer or a
password protected peer works with a regular user (with "Extended
search" rights).
When authenticating on the search page with a user without "Extended
search" rights, it appears as authenticated, but has just its usual
access to the public search features.
Thus allowing a more convenient way (wihout the need to go to the admin
section) to login when searching on your remote or password protected
peer and benefit from extended search features such as Heuristics,
Bookmarking or JavasScript resorting.
Can be disabled using the ConfigSearchPage_p.html.
Using a Reentrant lock instead of the intrinsic synchronization lock
permits limiting the blocking time to acquire a lock.
Useful on a very busy Cache concurrently accessed by many threads : when
the time to acquire a lock is too high, getting/storing content on the
cache becomes inefficient, and it is then better to fall back to loading
remote resources.
Illustrated by the CacheTest stress test and some traces reported in
mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )
Also add when possible a warning level log message on input stream
closing error instead of failing silently. This could help understanding
some IO exceptions such as "too many files open".
warc = Web ARChive File Format.
Warc files with extension .warc or compressed warc.gz can be placed in the
DATA/surrogate/in and contained responses are imported to the index.
The used library is stream based so we can easily extend it later to use
and load warc's from the net.
A port value of -1 will disable this option.
If set to a value greater 0, YaCy listens on this of on the local loopback
address (127.0.0.1) for a shutdown or restart signal.
E.g. connect to http://localhost:8005/shutdown will stop the YaCy server.
http://localhost:8005/restart will restart it.
This option allows to stop YaCy locally independant from the web web
frontend (which might be configured for password protected remote access).
- Added a new method to check activation of mandatory fields on
Collection Configuration commit, consistently with checks previously
performed in Switchboard startup and with mandatory fields in the
default schema.
- Reorganized default schema and CollectionConfiguration enumeration :
moved no more mandatory fields in a specific section, and moved fields
enabled at startup to the mandatory section.
- Marked mandatory fields as required and with stronger font in the
IndexSchema_p.html page
This ensure consistent implementation of the url host hash generation
and easier usage finding in source code.
Also added a unit test for this function.
to use javax.servlet.http.Cookie parameters.
Depreciate now obsolete getHeaderCookies.
Adjust setting of MaxAge to spec if >= 0 otherwise keep default.
The main shutdown hook thread was not properly waiting for the main
thread termination which consequently could not properly close resources
and threads. After terminating a running YaCy peer this way (Ctrl+C in
console, or kill <pid> for example), you could see the still existing
DATA/yacy.running file.
Tested with :
- Debian Jessie openjdk 7 and 8 : regular shutdown, Ctrl+C, kill
command, system restart while yacy is running
- Windows 10 Oracle JDK 7 and 8 : non regression on regular shutdown
Applied rules :
- when the FTP URL denotes a file resource, stack it as any start URL :
eventually embedded links can be followed applying the usual depth rules
- when the FTP URL denotes a directory, list files under this directory
and stack them for crawl, and repeat the process on sub folders until
crawl depth is reached
When a downloaded archive release is corrupted, empty, or can not be
opened for any reason, the update script must not be launched because it
erases the existing lib/*.jar libraries.
NullPointerException occurred when using and Identificator instance
which encountered and error in its constructor.
This error could be caused by a missing "langdetect" folder in the
current folder of the main process, or by simultaneous first calls to
the constructor, initializing concurrently the DetectorFactory.langlist.
Fixes the mantis 714 (http://mantis.tokeek.de/view.php?id=714)
When starting a crawl from a file containing thousands of links,
configuration setting "crawler.MaxActiveThreads" is effective to prevent
saturating the system with too many outgoing HTTP connections threads
launched by the crawler.
But robots.txt was not affected by this setting and was indefinitely
increasing the number of concurrently loading threads until most ot the
connections timed out.
To improve performance control, added a pool of threads for Robots.txt,
consistently used in its ensureExist() and massCrawlCheck() methods.
The Robots.txt threads pool max size can now be configured in the
/PerformanceQueus_p.html page, or with the new
"robots.txt.MaxActiveThreads" setting, initialized with the same default
value as the crawler.
- after last_exec_date is altered, next_exec_date should be recalculated
- makes the recalculation of next_exec in advance (without api call surely made) in Switchbard.schedulerJob() obsolete
Slightly modify next_exec calc. on missed event to now+schedule_time (from fix 10min)
fix for http://mantis.tokeek.de/view.php?id=677
The difference is on scheduling a large number of rss feeds and loading
is not finished before shutdown of YaCy. The change makes sure not already
loaded RSS will be loaded by the scheduler on next startup.
to save the resources and keep handler chain small if the feature is not used.
+add a warning message on settingsack_p page to restart on first activation
- Above brought up that parser start url parameter, declared as AnchorURL uses only methodes of parent object DigestURL (changed parameter declaration accordingly).
remote crawl.
On startup we save the resources for remote crawler if disabled. Once started
threads are running idle after disable remote crawl. Now threads are terminated
to save the resources also while disabeling during runtime.
+ remove empty class Channels
(regardless if these fields part of update).
Switch partial update option off in postprocessing if schema contains *_dts (multivalued date field).
see http://mantis.tokeek.de/view.php?id=601
bayesian filters. This can be used to classify documents during
indexing-time using a pre-definied bayesian filter.
New wordings:
- a context is a class where different categories are possible. The
context name is equal to a facet name.
- a category is a facet type within a facet navigation. Each context
must have several categories, at least one custom name (things you want
to discover) and one with the exact name "negative".
To use this, you must do:
- for each context, you must create a directory within
DATA/CLASSIFICATION with the name of the context (the facet name)
- within each context directory, you must create text files with one
document each per line for every categroy. One of these categories MUST
have the name 'negative.txt'.
Then, each new document is classified to match within one of the given
categories for each context.
during surrogate reading: those attributes from the dump are removed
during the import process and replaced by new detected attributes
according to the setting of the YaCy peer.
This may cause that all such attributes are removed if the importing
peer has no synonyms and/or no vocabularies defined.
This is a very complex migration: many classes had been renamed or
removed, dependencies changed and the solr index type is now aligned to
be a solr cloud repository.
Together with the Solr 5.2 library update, one other dependent library
had been updated as well: httpclient 4.4->4.4.1
Older indexes are migrated from 4_10 to 5_2. However, the new index
structure is more efficient and we recommend to re-index everything.
Please use the index export before you do the update to a large
surrogate xml file. After the update, start with an empty index and then
initialize this with your dump.
If remote crawl option is not activated, skip init of remoteCrawlJob to save the resources of queue and ideling thread.
Deploy of the remoteCrawlJob deferred on activation of the option.
keeping surrogates after processing is essential for some users. If the
space they are taking is too high, please set up an automatic deletion
process (like a cronjob).
as we flag if a peer is accesible via https, we need to know the port if we want to use is (e.g. for interYaCy communication)
start to provide / tansport the port by recording it in peers dna.
- add https link on the Network.html lock symbol
to support the new time parser and search functions in YaCy a high
precision detection of date and time on the day is necessary. That
requires that the time zone of the document content and the time zone of
the user, doing a search, is detected. The time zone of the search
request is done automatically using the browsers time zone offset which
is delivered to the search request automatically and invisible to the
user. The time zone for the content of web pages cannot be detected
automatically and must be an attribute of crawl starts. The advanced
crawl start now provides an input field to set the time zone in minutes
as an offset number. All parsers must get a time zone offset passed, so
this required the change of the parser java api. A lot of other changes
had been made which corrects the wrong handling of dates in YaCy which
was to add a correction based on the time zone of the server. Now no
correction is added and all dates in YaCy are UTC/GMT time zone, a
normalized time zone for all peers.
- date navigation
The date is taken from the CONTENT of the documents / web pages, NOT
from a date submitted in the context of metadata (i.e. http header or
html head form). This makes it possible to search for documents in the
future, i.e. when documents contain event descriptions for future
events.
The date is written to an index field which is now enabled by default.
All documents are scanned for contained date mentions.
To visualize the dates for a specific search results, a histogram
showing the number of documents for each day is displayed. To render
these histograms the morris.js library is used. Morris.js requires also
raphael.js which is now also integrated in YaCy.
The histogram is now also displayed in the index browser by default.
To select a specific range from a search result, the following modifiers
had been introduced:
from:<date>
to:<date>
These modifiers can be used separately (i.e. only 'from' or only 'to')
to describe an open interval or combined to have a closed interval. Both
dates are inclusive. To select a specific single date only, use the
'to:' - modifier.
The histogram shows blue and green lines; the green lines denot weekend
days (saturday and sunday).
Clicking on bars in the histogram has the following reaction:
1st click: add a from:<date> modifier for the date of the bar
2nd click: add a to:<date> modifier for the date of the bar
3rd click: remove from and date modifier and set a on:<date> for the bar
When the on:<date> modifier is used, the histogram shows an unlimited
time period. This makes it possible to click again (4th click) which is
then interpreted as a 1st click again (sets a from modifier).
The display feature is NOT switched on by default; to switch it on use
the /ConfigSearchPage_p.html servlet.
given css class and extends a given vocabulary with a term consisting
with the text content of the html class tag. Additionally, the term is
included into the semantic facet of the document. This allows the
creation of faceted search to documents without the pre-creation of
vocabularies; instead, the vocabulary is created on-the-fly, possibly
for use in other crawls. If any of the term scraping for a specific
vocabulary is successful on a document, this vocabulary is excluded for
auto-annotation on the page.
To use this feature, do the following:
- create a vocabulary on /Vocabulary_p.html (if not existent)
- in /CrawlStartExpert.html you will now see the vocabularies as column
in a table. The second column provides text fields where you can name
the class of html entities where the literal of the corresponding
vocabulary shall be scraped out
- when doing a search, you will see the content of the scraped fields in
a navigation facet for the given vocabulary
introduce FederateSearchManager handling search heuristic to external systems via specific FederateSearchConnectors,
which provide the query() functionallity, the translation to YaCy schema .toYaCySchema() and the search() routine to deliver results to searchevents, which is generally implemented in Abstract connector.
The manager enforces now a min 15s delay between calls to external systems.
Besides the OpensearchConnector a SolrFederateSearchConnector is available. It uses a additional config file for fieldname translation.
default heuristicopensearch.conf:
- openbdb.com removed - seems not longer to deliver results
- config via solrconnector to datacite.org added (large technical library archive)