to make all readily available information from the original ServletRequest
available to YaCy servlets (without converting data to internal structures).
The implementation of the common interface allows easier integration of
YaCy servlets with the servlet standard (e.g. shared login service with
the servlet container etc.)
When the peer is behind a reverse proxy providing SSL/TLS encryption,
the rendered absolute URLs should start with https when the user's browser
requested https: added limited support for the X-Forwarded-Proto HTTP
header, notably provided on the Heroku platform.
Also added some unit tests.
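A minimal sketch of the header handling described above, not the actual
YaCy code. The class and method names are hypothetical, and the rule of
taking the first value of a comma-separated header is an assumption.

```java
import java.util.Locale;

/** Hypothetical helper (not YaCy code): picks the scheme for rendered absolute URLs. */
final class ForwardedProtoSketch {

    /**
     * @param requestScheme  scheme of the connection as seen by the peer, e.g. "http"
     * @param forwardedProto raw X-Forwarded-Proto header value, may be null
     * @return "http" or "https" when the header names one of them, else requestScheme
     */
    static String effectiveScheme(final String requestScheme, final String forwardedProto) {
        if (forwardedProto == null) {
            return requestScheme; // no proxy header: keep the scheme of the request
        }
        // Assumption: when several values are chained, the first one describes the client side.
        final int comma = forwardedProto.indexOf(',');
        final String first = (comma >= 0 ? forwardedProto.substring(0, comma) : forwardedProto)
                .trim().toLowerCase(Locale.ROOT);
        if (first.equals("https") || first.equals("http")) {
            return first;
        }
        return requestScheme; // unknown value: ignore the header
    }
}
```

The header is only trustworthy when the peer is reachable exclusively
through the proxy, since any client can send it.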
When WikiCode inserted in a peer-hosted Blog, Wiki, Messages or Profile
contains relative links (images or any content hosted in DATA/HTDOCS),
it is more reliable to keep these links relative, especially when the
peer is behind any kind of reverse proxy.
the navigator to include counts all matches (rwi+fulltext).
Also fixed unresolved_pattern in the navigator's title (of the counter).
The use of the inurl: query modifier as filter has not been changed, keeping
it as a soft (unsharp) filter facet.
Updated StringNavigator to prevent empty strings from multivalued Solr fields,
removed date value conversion (better handled elsewhere, not needed here).
Prepared the first basic navigators (for authors and collections) for the
list of SearchEvent.navigatorPlugins and adjusted servlet to use these.
- this allows to configure display order of these navigators (by ordering config string)
- eventually allows for additional and/or custom navigators using any
available index field without need for changing servlets
- the Collection navigation has been adjusted to exclude the internal,
default robot_* and dht collections from displaying
- rwi results are now also checked for navigators by the refactored navigators
So far no config options were added to customize or add navigators (may
come later if route of upcoming modularization/plugin system is defined).
used for rwi ranking.
Main changes:
- introduce a posintext() to access the stored value. This also reduces mem alloc of position array for WordReferenceRow (index access)
- use the positions() array for joined references on multi-word queries if needed (otherwise allow positions() to be null)
- adjust assignments and the min() max() and distance() calculation accordingly
During rwi search result processing, the word distance calculation is affected
by concurrent update (normalization) of the min/max ranking parameter for
word positions. On update of min/max the exception is raised in the distance
calculation and is now caught.
This concurrent update and change of ranking results is needed for speed
but should be further checked for optimization.
Applied strategy: when there is no restriction on domains or
sub-path(s), stack anchor links as soon as they are discovered by the
content scraper instead of waiting for the complete parsing of the file.
This makes it possible to handle a crawling start file with thousands of
links in a reasonable amount of time.
Performance limitation: even if the crawl starts faster with a large
file, the content of the parsed file is still fully loaded in memory.
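A minimal sketch of the strategy above, not the YaCy scraper itself. The
class name, the regex-based link detection and the listener type are
illustrative assumptions. Each link is handed to a callback as soon as it
is seen, so stacking can begin before parsing ends.

```java
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: report anchors while scanning instead of after the full parse. */
final class AnchorStreamingSketch {

    // Simplified link detection, only good enough for this illustration.
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /**
     * Scans the input line by line and passes every link to the listener immediately.
     *
     * @return the number of links reported
     */
    static int scrape(final Iterable<String> lines, final Consumer<String> onAnchor) {
        int count = 0;
        for (final String line : lines) {
            final Matcher m = HREF.matcher(line);
            while (m.find()) {
                onAnchor.accept(m.group(1)); // the crawl stacker could enqueue the link here
                count++;
            }
        }
        return count;
    }
}
```

When domains or sub-paths are restricted, the links would still have to
be filtered before stacking. The sketch leaves that out.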
(again, description in http://mantis.tokeek.de/view.php?id=698)
as the root cause was not seen, just added a workaround in favour over a
try/catch (for easier follow-up).
Issue was the calculation in AbstractReference with positions.clear() call,
this made distance result always 0 (distance needs min 2 positions) and created concurrency issues.
+ unit test of changes
refactor to using long in URIMetadataNode too (and related call parameters)
As remote rwi scores are not used (since v1.83), skip reading the float score,
but keep in toString() for communication with older versions.
- correct WordReferenceVars.toRowEntry posintext parameter
to set expected min posintext (the difference is on multi-word queries,
while positions are ordered by search word order).
- modified posofphrase/posinphrase join operation
- to set min posofphrase
- and keep posinphrase if not same posofphrase (was set to 0, no differentiation during ranking)
+ fix compiler msg (missing type declaration)
to make sure current dates are recognized (was fixed to 2014 - 2016)
+ adjust holiday date parser from pattern.match to pattern.find to deal with leading and trailing text
+ moved relative date recognition (morgen, tomorrow) to parseline (used by query parser only), as not working and problematic for indexing
+ add test case for parseline (used by query parser)
Shutdown was hanging in CrawlQueues.close() at
this.workerQueue.put(POISON_REQUEST) when config value
crawler.MaxActiveThreads was greater than 200.
Revealed by "Collision" thread dumps in mantis 689
(http://mantis.tokeek.de/view.php?id=689#c1312)
Fixed consistency between this.worker.length and this.workerQueue
capacity, and made the process more reliable using non-blocking offer()
function.
First fix for mantis 689 (http://mantis.tokeek.de/view.php?id=689).
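A minimal sketch of the difference between put() and offer() on a bounded
queue, not the CrawlQueues code. The class name, the String poison marker
and the return value are assumptions for illustration.

```java
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch: signal shutdown without ever blocking the closing thread. */
final class PoisonShutdownSketch {

    static final String POISON = "POISON"; // stand-in for POISON_REQUEST

    /**
     * Tries to enqueue one poison marker per worker.
     *
     * @return how many markers the queue accepted
     */
    static int signalShutdown(final BlockingQueue<String> queue, final int workerCount) {
        int accepted = 0;
        for (int i = 0; i < workerCount; i++) {
            // offer() returns false when the queue is full; put() would wait here,
            // possibly forever if no worker takes elements any more.
            if (queue.offer(POISON)) {
                accepted++;
            }
        }
        return accepted;
    }
}
```

With a queue capacity smaller than the number of workers, put() can block
on the first marker that does not fit, while offer() simply reports it.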
On Debian Linux, with a headless JRE and no open browser,
browser.openBrowserClassic() was called and waited forever for the browser
process to end (p.waitFor()). YaCy shutdown was therefore not working until
the browser was closed.
Also modified the browser opening command for Unix platforms to open the
default browser (with the xdg-open util) instead of Firefox.
xdg-open also has the advantage of being asynchronous (not blocking).
file.separator to compute equal hashes (by normalizing path for computation)
+ expand test case to check mixed java / windows file url notation
like e.g. file:///c:/test/file.html vs. file:///c:\test/file.html
- relates partially to http://mantis.tokeek.de/view.php?id=692
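A minimal sketch of the normalization idea, not the actual hash
computation. The class and method names are hypothetical. Both notations
are mapped to the same path string before any hash is derived from it.

```java
/** Hypothetical sketch: make Windows and Java style file paths compare equal. */
final class FilePathNormalizeSketch {

    /** Replaces every backslash with a forward slash; everything else is left unchanged. */
    static String normalizePath(final String path) {
        return path.replace('\\', '/');
    }
}
```

Any hash computed from the normalized string is then identical for
file:///c:/test/file.html and file:///c:\test/file.html.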
Even after a network switch, ErrorCache was still holding a reference to
the previous Solr cores, thus becoming useless until the next YaCy restart.
Initial error cache filling with recent errors from the index was also
missing after the switch.
The embedded core holds a lock on the index and must be closed. Earlier commit
comment states that core should be closed with solr instance instead on close
of connector.
Adjusted the InstanceMirror.close() to take care of closing the embedded
instance to release the lock.
In 2 routines of fulltext this was already explicitly implemented (disconnectLocalSolr).
Now this disconnect is part of the InstanceMirror.close().
This file is used by Bootstrap documentation website
(http://getbootstrap.com/) but is not part of the Bootstrap distribution
and does not need to be included in a Bootstrap based application.
Purpose of the test case is to be able to (controlled) analyse the rwi ranking for
multi word searches (with focus on posintext and word-distance ranking)
version numbers are expressed in a different way as we expect. That
could cause that YaCy does not run on systems which are appropriate but
we simply do not understand the version string.
including a small change, word posintext counting.
We remember/store 1st posintext. Previously following words got a handle (posintext)
excluding found. Now it just counts and assigns true posintext as handle (posintext)
This is needed and enables existing word position ranking for RWI.
The upcoming concurrency issue in word position min/max calculation was eliminated
by an iterator.hasNext check before next() access.
- move the maxcount limit restriction completely to getTopicNavigator (as it is not used in getTopics)
- let search servlet use getTopics by default (w/o RWI connected check, as of now, Topics are available w/o any additional index interaction)
+ changed the postRanking to add one score only if word appears more than once.
+ getTopics() unused code block rem'd (save performance) -> routine needs rework!
htroot is supposed to be a subfolder of appPath and not of dataPath,
as assumed in other places where htroot is loaded. This issue was not
visible when dataPath and appPath are equal.
Added Javadocs to refactored methods.
Added log warnings instead of silently failing some errors.
Only fill collection1hosts when required (shallComputeCR true).
New or modified translation (via /Translator_p.html) can be shared/distributed
via the YaCy internal news service. Remote peers can see and vote on the
translation via the new http://localhost:8090/TransNews_p.html servlet.
A positive vote will add the received translation to the local translation
list and post a voting message to the news service.
(at this time no processing of received votes is implemented)
+ fixed the msg service retention time check (NewsPool.automaticProcessP)
If language is set to "browser", the client/user browser language is used to choose from
the available translations.
Simply put: one user's browser speaks English -> YaCy responds in English, another user's browser speaks French -> YaCy responds in French.
! To make a translation/language available you have to activate the language once !
(or manually use the utility class TranslateAll)
In ConfigBasic.html available translations are marked green on setting language=Browser
The client language is determined by http header Accept-Language (checked in DefaultServlet)
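A minimal sketch of choosing a translation from the Accept-Language
header with the JDK's own language range matching. This is not the
DefaultServlet code. The class name, method name and fallback parameter
are assumptions.

```java
import java.util.Collection;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: map an Accept-Language header to one of the activated translations. */
final class BrowserLanguageSketch {

    /**
     * @param acceptLanguage raw header value, e.g. "fr-FR,fr;q=0.9,en;q=0.8", may be null
     * @param available      language tags of the activated translations, e.g. "en", "de", "fr"
     * @param fallback       language to use when nothing matches
     */
    static String chooseLanguage(final String acceptLanguage,
                                 final Collection<String> available,
                                 final String fallback) {
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return fallback;
        }
        try {
            // parse() orders the ranges by descending weight (q value)
            final List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
            // lookupTag() returns the best matching available tag or null
            final String match = Locale.lookupTag(ranges, available);
            return match != null ? match : fallback;
        } catch (final IllegalArgumentException e) {
            return fallback; // malformed header
        }
    }
}
```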
use directly HttpServletRequest. This is used to get the http protocol version
in HTTPDProxyHandler.fulfillRequestFromWeb() for error response to client.
- adjust YaCyProxyServlet and UrlProxyServlet accordingly
- use more http_version constants in headerframework and httpdeamon
- equalize servlets (3) use of HeaderFramework.CONNECTION_PROP_HOST to HeaderFramework.HOST
to also support handling of urls w/o corresponding file-extension.
For this, refactor use of document.getParserObject() to always return a Parser (for clean logic)
and define/move the scraperObject as local var of AbstractParser.
Adjust related calls to getParserObject (where actually a scraperObject is wanted).
Additionally skip appending url token to parsed text for dht metadata entries
(by default returned as result by rwi index).
- after last_exec_date is altered, next_exec_date should be recalculated
- makes the recalculation of next_exec in advance (without api call surely made) in Switchboard.schedulerJob() obsolete
Slightly modify next_exec calc. on missed event to now+schedule_time (from fix 10min)
fix for http://mantis.tokeek.de/view.php?id=677
The difference shows when scheduling a large number of RSS feeds whose loading
is not finished before shutdown of YaCy. The change makes sure that RSS feeds
not yet loaded will be loaded by the scheduler on next startup.
after restart, put hosthash in the queue's filename (which is used as primary
key for the crawl queue). Hint: initial hosthash from url and recalculated hosthash
from just hostname:port are not the same.
fixes http://mantis.tokeek.de/view.php?id=668 (partially)
(by using the resultcontainer.size instead of input docList.size)
skip waiting for write-search-result-to-local-index
(by removing the Thread.join - which will bring a small performance increase)
be translated. This avoids translation of partial terms/words, e.g.
key="TEST" in sourcetext="this is a myTESTcase for it".
Add check of word boundary before and after sourcetext (incl. taking care
of the current practice for keys to be delimited by > <)
+ add test case
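A minimal sketch of such a word boundary check using regex lookarounds,
not the actual Translator code. The class and method names are
hypothetical. The boundary is only enforced on a side where the key
itself starts or ends with a letter or digit, so keys delimited by > <
still match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: replace a key only where it is not part of a longer word. */
final class WordBoundaryTranslateSketch {

    private static final String NOT_AFTER_WORD_CHAR = "(?<![\\p{L}\\p{N}])";
    private static final String NOT_BEFORE_WORD_CHAR = "(?![\\p{L}\\p{N}])";

    static String translate(final String source, final String key, final String translation) {
        if (key.isEmpty()) {
            return source;
        }
        final boolean startsWithWordChar = Character.isLetterOrDigit(key.charAt(0));
        final boolean endsWithWordChar = Character.isLetterOrDigit(key.charAt(key.length() - 1));
        // Pattern.quote() keeps regex meta characters in the key literal.
        final String regex = (startsWithWordChar ? NOT_AFTER_WORD_CHAR : "")
                + Pattern.quote(key)
                + (endsWithWordChar ? NOT_BEFORE_WORD_CHAR : "");
        return Pattern.compile(regex).matcher(source)
                .replaceAll(Matcher.quoteReplacement(translation));
    }
}
```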
process 1. load default from locales/*.*
2. load and merge(overwrite) from DATA/LOCALE/*.* (can be partial translation as it is merged)
- include all entries from DATA/LOCALE to be edited in Translator servlet
and save just modifications (instead of full list) to DATA/LOCALE
This shall make it easy to share modifications.
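A minimal sketch of the two steps above on plain maps (load and merge,
then save only the modifications). File handling is left out, and the
class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch: merge local translations over defaults and extract the differences. */
final class TranslationMergeSketch {

    /** Entries from local overwrite defaults; local may be a partial translation. */
    static Map<String, String> merge(final Map<String, String> defaults,
                                     final Map<String, String> local) {
        final Map<String, String> merged = new LinkedHashMap<>(defaults);
        merged.putAll(local);
        return merged;
    }

    /** Returns only the entries of edited that are new or differ from defaults. */
    static Map<String, String> modifications(final Map<String, String> defaults,
                                             final Map<String, String> edited) {
        final Map<String, String> diff = new LinkedHashMap<>();
        for (final Map.Entry<String, String> e : edited.entrySet()) {
            if (!Objects.equals(e.getValue(), defaults.get(e.getKey()))) {
                diff.put(e.getKey(), e.getValue());
            }
        }
        return diff;
    }
}
```

Saving only the result of modifications() keeps the local file small,
which is what makes it easy to share.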
This is the 1st rudimentary approach to support the translation utilities.
It allows currently to edit untranslated text and save it in a local translation file
in the DATA/LOCALE directory.
+ refactor Translator (less static's) to leverage on class overrides and support garbage collection for this 1 time routine
+ adjust TranslatorXliff to check for local translations in DATA/LOCALE,
this includes storing manually downloaded translation files in DATA as well
(to keep default untouched)
+ on 1st call of Translator_p a master translation file is generated, checking
the supported languages for missing translation text (later this masterfile is planned to be part of the distribution, to harmonize translation key text between the languages)
Outlook: the local modifications (possibly as translation fragments instead of complete file) to be shared with maintainer using xliff features.
- in intranet mode getip returns null causing a NPE
- adjust starturl (which was set to http://localip/repository) which is never the start url for the Mediawiki
+ correct javadoc for seed.getIP()
to split query into multiple parameter on line separator in input query.
e.g. split "crawldepth_i_0^10.0 \n crawldepth_i:1^5.0"
but allow "url_file_ext_s:jpg OR url_file_ext_s:png" to remain unsplit
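A minimal sketch of the splitting rule, not the servlet code. The class
and method names are hypothetical. Only line separators split the input;
spaces and OR stay inside one parameter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: one query parameter per input line. */
final class QueryLineSplitSketch {

    static List<String> splitOnLines(final String input) {
        final List<String> parts = new ArrayList<>();
        // \R matches any line break sequence (\n, \r\n, \r, ...)
        for (final String line : input.split("\\R")) {
            final String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }
}
```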
translation master as source to harmonize individual translation files
Included a main to create masters in YaCy and xliff format for testing
+ restrict TranslatorXliff to use only entries with State=translated
P.S. used https://open-language-tools.java.net/editor/about-xliff-editor.html to
experiment with xlf output (haven't a Pootle avail.)
This eases up suggested initiatives from http://mantis.tokeek.de/view.php?id=649
In the longer term it also allows storing translation maps for the htroot files
in standardized/reusable xliff format ( http://docs.oasis-open.org/xliff/xliff-core/xliff-core.html ).
+ added test case creating and comparing xliff file with internal custom prop file.
(currently the introduced class is not used in core code)
to save the resources and keep handler chain small if the feature is not used.
+add a warning message on settingsack_p page to restart on first activation
mainly to reduce the frequent metadata checks like
> EmbeddedSolrConnector.query QUERY: q={!cache=false raw f=id}xXxXxX&rows=1&start=0&fl=id,load_date_dt
(p.s. direct servlet queries logged via AccessTracker.addToDump)
been differently - and wrong for several files. Also: base64-encoding
for gzipped push files because our data structures currently only
support ASCII POST pushes.
of minutes in the past and reverted latest change. The export file dump
will now contain four data elements: f - first date of index entry write
date, l - last date of index write date, n - now-date of index dump
time, c - count of numbers inside the dump. '0N' denotes a series of
changes which will lead to the opportunity to exchange index data dumps
in a way that is needed to integrate ZeroNet index data. This will be
based on index dump sharing; that causes this commit.
- The above brought up that the parser start url parameter, declared as AnchorURL, uses only methods of parent object DigestURL (changed parameter declaration accordingly).
- DefaultServlet includes already a class cache "templateMethodCache" which is emptied
on low mem status
- avoid the classloader cache, which gets no hits but over time holds all (used) servlet classes
Main image processing is now in ImageViewer, used by both ViewImage and
ViewFavicon.
Fixed URIMetadataNode.getFavicon to use non-standard icons with no size
as fallback.
Language identification may show poor performance on documents with short or no
title but clear lang indication in text content. Using content text too
improves lang detection.
+ remove double caching of text in Identificator
In the 2 cases where servlet calls servlet the jvm classloader chain is
invoked and servlet class loaded by jvm loader (successful while requiring
htroot in system classpath). This patch uses the standard override design
for loaders to handle these cases (making it no longer crucial to have htroot
in system classpath, as this classLoader is mainly used for servlets and
looks in this case for the class in the configured path).
+ As the default classloader is parallel capable we should register this one too.
remote crawl.
On startup we save the resources for the remote crawler if disabled. Once started,
threads kept running idle after disabling remote crawl. Now threads are terminated
to save the resources also when disabling during runtime.
+ remove empty class Channels
- add filename to parameter fieldname
- add filecontent to special parameter fieldname$file
(some servlets use this $file parameter)
fix for http://mantis.tokeek.de/view.php?id=542
in worker thread.
Writer of importer needs a poison to close the file. On exception (e.g. OOM)
add a poison marker in the outermost try/catch to assure the output queue will terminate
in this condition too (and closes+renames the surrogate/in/xxx.prt file)
Differentiate mime() and getContentType() which gives the raw header field.
This improves parser detection if charsets are included in http content-type field.
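A minimal sketch of the distinction, not the actual YaCy methods. The
class and method names are hypothetical. The raw header field may carry
parameters such as charset, while the mime type used for parser
detection is only the part before the first semicolon.

```java
import java.util.Locale;

/** Hypothetical sketch: derive the bare mime type from a raw Content-Type header. */
final class MimeFromContentTypeSketch {

    /** "text/html; charset=UTF-8" becomes "text/html"; null stays null. */
    static String mime(final String contentType) {
        if (contentType == null) {
            return null;
        }
        final int semicolon = contentType.indexOf(';');
        final String type = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return type.trim().toLowerCase(Locale.ROOT);
    }
}
```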
calculated doc hash if different.
Testing showed that in some cases delivered url doesn't match the local
calculated hash. In this case replace doc.id (and host_id_s) with calculation
from url.
Update the result score result field with the result queue ranking value to reflect
the actual calculated/used score,
for rwi & solr stack results.
(calc. etc. is unchanged, it's just that result entry carries the latest val
as api retrieves the number from it)
Collection is not available in pure rwi entries (but in local solr metadata)
But if user wishes to filter by query constraint also rwi shall adhere to this
(even if only rwi entries with parsed or solr received metadata may fit)
In rare cases the hostname may not be a valid filesystem directory name,
which can't be created (e.g. containing '*' char). Prevent the crawl queue
from looping on this invalid entry by throwing a MalformedURLException.
The JVM registers each file in a list regardless of whether it has already
been deleted and never cleans up the list during runtime.
This accumulates to a considerable amount of mem during large crawls and/or
long uptime.
To tackle this, all temp files are now created in a subdir of java.io.tmpdir
and the jvm tmpdir property is set to this subdir, which is deleted by
code on shutdown.
Additionally let pdfParser use this tmp subdir too.
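A minimal sketch of the approach, not the YaCy implementation. The class
name, method names and file prefix are hypothetical. Temp files are
created inside one dedicated directory, which is removed as a whole at
shutdown instead of registering every file for deletion.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical sketch: keep all temp files in one sub directory and delete it on shutdown. */
final class TempSubdirSketch {

    /** Creates (if needed) the sub directory below base, e.g. below java.io.tmpdir. */
    static Path createTempSubdir(final Path base, final String name) {
        try {
            return Files.createDirectories(base.resolve(name));
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a temp file inside dir without registering it for delete-on-exit. */
    static Path newTempFile(final Path dir) {
        try {
            return Files.createTempFile(dir, "sketch", ".tmp");
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes dir with all its content; meant to be called once during shutdown. */
    static void deleteRecursively(final Path dir) {
        if (!Files.exists(dir)) {
            return;
        }
        try {
            Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(final Path file, final BasicFileAttributes attrs)
                        throws IOException {
                    Files.delete(file);
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult postVisitDirectory(final Path d, final IOException exc)
                        throws IOException {
                    if (exc != null) {
                        throw exc;
                    }
                    Files.delete(d); // directory is empty at this point
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Pointing the java.io.tmpdir system property at the sub directory, as
described above, would make code that relies on the default temp
location use it as well.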