list of latest/oldest entries in the snapshot database. This is an
example:
http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100
The properties depth, order, host and maxcount can be omited. The
meaning of the fields are:
host: select only urls from this host or all, if not given
depth: select only urls at that crawl depth or all, if not given
maxcount: select at most the given number of urls or 10, if not given
order: either LATESTFIRST to select the youngest entries, OLDESTFIRST to
select the first entries or ANY to select any
The rss feed needs administration rights to work, a call to this servlet
with rss extension must attach login credentials.
so viewed text and metadata (stored) info is similar
- to archive it, use request with profile to allow indexing (defaultglobaltext) and update index
(the resource is loaded, parsed anyway, so it's not a expensive operation)
Request: remove 2 unused init parameter
- number of anchors of the parent
- forkfactor sum of anchors of all ancestors
thread pools will flush their cached (dead) threads after 60 seconds.
This will cause that YaCy now runs constantly withl about 50 threads,
about 100 at peak times. Previously, about 400 threads had been cached
and kept in a hibernation state, which caused that the numproc counter
in /proc/user_beancounters (exists only in VM-hosted linux) was as high
as the cached number of threads. This caused that VM supervisors
terminated whole VM sessions if a limit was reached. Many VM providers
have limits of numproc=96 which made it virtually impossible to run YaCy
on such machines. With this change, it will be possible to run many YaCy
instances even on VM hosts.
TimeoutRequests. The purpose is to test if YaCy runs better on VMs where
there is a limitation of concurrent processes; see
/proc/user_beancounters in row numproc; this value is limited and should
be low. Try to set timeoutrequests to keep this low. (works only after
restart)
be transcoded into jpg for image previews. To create such pdfs you must
do:
Add wkhtmltopdf and imagemagick to your OS, which you can do:
On a Mac download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from
http://wkhtmltopdf.org/downloads.html and downloadh
ttp://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip
In Debian do "apt-get install wkhtmltopdf imagemagick"
Then check in /Settings_p.html?page=ProxyAccess: "Transparent Proxy" and
"Always Fresh" - this is used by wkhtmltopdf to fetch web pages using
the YaCy proxy. Using "Always Fresh" it is possible to get all pages
from the proxy cache.
Finally, you will see a new option when starting an expert web crawl.
You can set a maximum depth for crawling which should cause a pdf
generation. The resulting pdfs are then available in
DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf
concurrently. When a RWI search result is flushed into the result set,
id does Solr Queries (which replaced the old-style Metadata Queries) and
they are possibly running concurrently to a previously startet Solr
search. Both methods may block each other with IO. To enhance the speed,
they are now serialized. Because the Solr search results may result in
better results using the more advanced and configurable Ranking methods,
this result is preverred over the RWI search result. However, remote RWI
search results are still feeded concurrently into the search result as
well.
set to 'Always Fresh' the cache is always used if the entry in the cache
exist. This is a good way to archive web content and access it without
going online again in case the documents exist.
To do so, open /Settings_p.html?page=ProxyAccess and check the "Always
Fresh" checkbox.
This is set do false which behave as set before.
If you set this to true, then you have your web archive in DATA/HTCACHE.
Copy this to carry around your private copy of the internet!
if file in DATA/SETTINGS it is loaded otherwise file in ./defaults is loaded
(if locale ./defaults/stopwords.xx doesn't exist take solr/lang/stopwords_xx.txt as default)
move yacy.stopwords, yacy.stopwords.de and yacy.badwords.example out of root directory to ./defaults directory
encodings (preselected windows-type character encoding which is typical
for CSV files). Fixed also other problems with character encoding in
dictionary files. Automatically generated vocabularies are now also
noted in the API steering.
to always get fresh lists of documents. This is necessary since the
postprocessing changes the same documents which the
postprocessing-collection query selects.
hold a date for each URL to record when a url was first seen. This is
then used to overwrite the modification date for urls upon recrawl in
case that the first-seen date is before the latest document date. This
behaviour is necessary due to the common behaviour of content management
systems which attach always the current date to all documents. Using the
firstSeen database it is possible to approximate a real first document
creation date in case that the crawler starts frequently for the same
domain. As a result the search results ordered by date have a much
better quality and the usage of YaCy as search agent for latest news has
a better quality.
postprocessing the solr documents are now not completely retrieved.
instead, only fiels, needed for the postprocessing are extracted. When
Solr document are written, this is done using partial updates.
This increases postprocessing speed by about 50% for embedded Solr
configurations. For external Solr configurations the enhancement should
be much higher because the postprocessing with remote Solr is very slow.
When doing partial updates to a remote Solr, this method should perform
much better than before, it is expected that this is even much higher
than the increase with local Solr.
This is the first element of a new 'decoration' component which may hold
switches for different external appearance parameters.
The first switch in that context is decoration.audio (as usual in
yacy.init). This value is set to false by default, that means the audio
feedback element is switched off by default. To switch it on, set
decoration.audio = true (using /ConfigProperties_p.html). You will then
hear sounds for the following events:
- remote searches
- incoming dht transmissions
- new documents from the crawler
Sound clips are stored in htroot/env/soundclips/ which is done so
because a future implementation will read these files using the http
client and with configurable urls which will make it very easy for the
user to replace the given sounds with own sounds.
*) will try other ports if YaCy standard ports are not available
*) distinguish between internal and external port (not sure if this
works 100%)
Still to add: propery in config to enter own external port (in case of
manually configured NAT)
this extracts clickable links in pdf and adds it to the list of links
include a test case for this function
this is the corrected comment for commit:
aa2e15d846