Michael Peter Christen
97f6089a41
YaCy can now create web page snapshots as pdf documents which can later
be transcoded into jpg for image previews. To create such pdfs you must
do the following:
Add wkhtmltopdf and imagemagick to your OS:
On a Mac, download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from
http://wkhtmltopdf.org/downloads.html and download
http://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip
On Debian, run "apt-get install wkhtmltopdf imagemagick"
Then check "Transparent Proxy" and "Always Fresh" in
/Settings_p.html?page=ProxyAccess - wkhtmltopdf fetches web pages
through the YaCy proxy, and with "Always Fresh" all pages can be served
from the proxy cache.
Finally, you will see a new option when starting an expert web crawl:
you can set a maximum crawl depth up to which pdfs are generated.
The resulting pdfs are then available in
DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf
10 years ago
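The snapshot steps described in commit 97f6089a41 could be driven from Java roughly as follows. This is a minimal sketch, not YaCy's actual code: the class and method names are hypothetical, the proxy address stands for whatever host and port the local YaCy proxy listens on, and it assumes the `wkhtmltopdf` and ImageMagick `convert` binaries are on the PATH. It only builds the command lines and does not start any process.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds the external commands for one snapshot.
final class SnapshotCommandSketch {

    /** Command that renders a URL to a pdf, fetching it through an HTTP proxy. */
    static List<String> pdfCommand(String proxyHostPort, String url, String pdfPath) {
        // "--proxy" is the wkhtmltopdf option for routing requests through a proxy
        return Arrays.asList("wkhtmltopdf", "--proxy", proxyHostPort, url, pdfPath);
    }

    /** Command that transcodes the first pdf page into a jpg preview. */
    static List<String> jpgCommand(String pdfPath, String jpgPath) {
        // "[0]" is ImageMagick syntax for selecting the first page of the input
        return Arrays.asList("convert", pdfPath + "[0]", jpgPath);
    }
}
```

Each returned list can be handed to `new ProcessBuilder(command)` to run the tool; error handling and timeouts are left out of the sketch.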
reger
ff80700aff
replace deprecated Solr DateField.formatExternal with recommended TrieDateField.formatExternal
10 years ago
Michael Peter Christen
9ea120dbe5
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
reger
0c97cc2440
skip unused call parameter for hashSentence()
10 years ago
reger
5790c7242e
do not tokenize punctuation as words in WordTokenizer
remove unused variables in condenser related to Tokenizer
10 years ago
reger
f07392ff17
add. use host port parameter in YaCyApp
10 years ago
Michael Peter Christen
09d2867050
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
Michael Peter Christen
ad0da5f246
added new web page snapshot infrastructure which will lead to the
ability to have web page previews in the search results.
(This is a stub, no function available with this yet...)
10 years ago
Michael Peter Christen
5f5c7d69d1
added image screenshot generator
10 years ago
Michael Peter Christen
1d45d9405a
security bugfix
10 years ago
Michael Peter Christen
ff728b4aa5
ignore url errors during search
10 years ago
Michael Peter Christen
8317914ce3
changed vocabulary navigator object type to TreeMap to give the
vocabularies a defined order. The order is now lexicographic, which is
less random than a hashed order
10 years ago
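The effect of the TreeMap change in commit 8317914ce3 can be illustrated with a small example. This is only a sketch with made-up vocabulary names and a hypothetical class name; it shows the general property that a `TreeMap` iterates its keys in lexicographic order, whereas a `HashMap` gives no ordering guarantee.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical example class; the vocabulary names are invented.
final class VocabularyOrderSketch {

    /** Returns the map keys in the iteration order of a TreeMap (natural String order). */
    static List<String> sortedNames(Map<String, Integer> vocabularies) {
        return new ArrayList<>(new TreeMap<>(vocabularies).keySet());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>(); // iteration order unspecified
        counts.put("places", 12);
        counts.put("animals", 7);
        counts.put("colors", 3);
        System.out.println(sortedNames(counts)); // prints [animals, colors, places]
    }
}
```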
Michael Peter Christen
d5c1b07768
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
Michael Peter Christen
c0f9f6ac66
added option to change the navbar-default, i.e. usable for dark skins
10 years ago
Michael Peter Christen
10794e8efd
trying facet.method fc instead of fcs to handle large facets
10 years ago
Michael Peter Christen
041b605cfe
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
10 years ago
Michael Peter Christen
f1f74e8626
toString fix
10 years ago
Michael Peter Christen
30276a2b48
prevent a local Solr search and a local RWI search from running
concurrently. When a RWI search result is flushed into the result set,
it does Solr queries (which replaced the old-style metadata queries) and
they may run concurrently with a previously started Solr
search. Both methods may block each other with IO. To enhance the speed,
they are now serialized. Because the Solr search may produce
better results using the more advanced and configurable ranking methods,
its result is preferred over the RWI search result. However, remote RWI
search results are still fed concurrently into the search result as
well.
10 years ago
Michael Peter Christen
84763126e0
added option to make the YaCy proxy act as if the cache is never stale. If
set to 'Always Fresh' the cache is always used if the entry exists in the
cache. This is a good way to archive web content and access it without
going online again in case the documents exist.
To do so, open /Settings_p.html?page=ProxyAccess and check the "Always
Fresh" checkbox.
This is set to false by default, which behaves as before.
If you set this to true, then you have your web archive in DATA/HTCACHE.
Copy this to carry around your private copy of the internet!
10 years ago
reger
1e7ee72240
fix path lookup to ./defaults/yacy.badwords
(fix of commit ee277b9b3e)
10 years ago
reger
7d863d6254
fix empty text facet entry
(noticed on Author facet)
10 years ago
Michael Peter Christen
a39419f2ef
more stacks shall be considered for on-demand loading, not only
deep-depth stacks to prevent "too many open files" problem
10 years ago
Michael Peter Christen
5bb52f79be
reduce number of calls to queue.size() because that may be a bottleneck
during crawling
10 years ago
Michael Peter Christen
4920ab7b76
optimize usage of size() cache
10 years ago
reger
ee277b9b3e
allow for local yacy.stopwords and yacy.badwords list (in DATA/SETTINGS/)
if a file exists in DATA/SETTINGS it is loaded, otherwise the file in ./defaults is loaded
(if the locale file ./defaults/stopwords.xx doesn't exist, solr/lang/stopwords_xx.txt is taken as default)
move yacy.stopwords, yacy.stopwords.de and yacy.badwords.example out of root directory to ./defaults directory
10 years ago
reger
de56266bcb
remove redundant toLower for topwords
10 years ago
Michael Peter Christen
a34f837592
better delete all files in path when removing host crawl stack
10 years ago
Michael Peter Christen
10b1db430a
if we have many hosts, use on-demand earlier
10 years ago
Michael Peter Christen
1324927e66
prevent division by zero
10 years ago
Michael Peter Christen
2beb6abeb6
disabled crazy sleep loop
10 years ago
Michael Peter Christen
70f03f7c8e
do not cache search requests to Solr if the result is used for
doublechecking. If a double-check comes from cached results the
doublecheck fails.
10 years ago
Michael Peter Christen
a0b84e4def
use a LinkedHashMap for facets to maintain facet order as given by solr
10 years ago
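Why a LinkedHashMap helps in commit a0b84e4def can be shown in a few lines. The sketch below is an assumption-laden illustration, not the YaCy code: the class name, method name and facet values are hypothetical. It relies only on the documented behaviour that `LinkedHashMap` iterates in insertion order, so entries stay in the order in which they arrived.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class; facet names and counts are invented.
final class FacetOrderSketch {

    /** Copies facet entries into a map that keeps the order in which they were received. */
    static Map<String, Integer> inReceivedOrder(String[] names, int[] counts) {
        Map<String, Integer> facet = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            facet.put(names[i], counts[i]);
        }
        return facet;
    }

    public static void main(String[] args) {
        // Entries arrive sorted by count, highest first
        Map<String, Integer> facet = inReceivedOrder(new String[] {"en", "de", "fr"}, new int[] {90, 40, 5});
        List<String> keys = new ArrayList<>(facet.keySet());
        System.out.println(keys); // prints [en, de, fr]
    }
}
```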
reger
ef5dc68313
include domtype in searcheventcache id
to differentiate between local / global events for reuse of cached events
fix for http://mantis.tokeek.de/view.php?id=493
10 years ago
Michael Peter Christen
0dc6e0a5f2
added option to enrich vocabularies with synonyms from synonym database
10 years ago
Michael Peter Christen
6a2a669db4
added loading of the synonyms file from addon/synonyms into the
knowledge loader
10 years ago
Michael Peter Christen
c67c5c0709
added new solr schema fields which record the occurrences of vocabulary
matchings. These matches can be used for result boosting, i.e. if a
document contains words from a specific vocabulary, boost it.
10 years ago
Michael Peter Christen
a67a465415
fix field counter for multi-fields in html writer for the solr servlet
10 years ago
Michael Peter Christen
ec9d021568
added option in vocabulary editor to import CSV files with different
encodings (preselected windows-type character encoding which is typical
for CSV files). Fixed also other problems with character encoding in
dictionary files. Automatically generated vocabularies are now also
noted in the API steering.
10 years ago
reger
3c818fc912
add a check of java version string >=1.7 to startup class
stop startup with an error message on version < 1.7
10 years ago
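A version check like the one described in commit 3c818fc912 could look like the following. This is a hedged sketch rather than the startup code itself: the class and method names are hypothetical, and the parsing rule (split on `.`, `_`, `-`, `+` and compare the first two numbers) is an assumption chosen to handle both old-style strings such as `1.7.0_71` and newer ones such as `11.0.2`.

```java
// Hypothetical example class; not the actual YaCy startup code.
final class JavaVersionCheckSketch {

    /** Returns true if the given java.version string denotes Java 1.7 or newer. */
    static boolean isAtLeast17(String version) {
        String[] parts = version.split("[._\\-+]");
        int major = leadingInt(parts[0]);
        int minor = parts.length > 1 ? leadingInt(parts[1]) : 0;
        return major > 1 || (major == 1 && minor >= 7);
    }

    /** Parses the leading digits of s; returns 0 if there are none. */
    private static int leadingInt(String s) {
        int end = 0;
        while (end < s.length() && Character.isDigit(s.charAt(end))) {
            end++;
        }
        return end == 0 ? 0 : Integer.parseInt(s.substring(0, end));
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        if (!isAtLeast17(version)) {
            System.err.println("Java 1.7 or newer is required, found " + version);
            System.exit(1);
        }
        System.out.println("Java version ok: " + version);
    }
}
```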
Michael Peter Christen
0550b54d56
added fix to postprocessing: avoid caching of postprocessing collection
to always get fresh lists of documents. This is necessary since the
postprocessing changes the same documents which the
postprocessing-collection query selects.
10 years ago
Michael Peter Christen
68e8039fd1
added high-precision scheduler for API processes. This allows also to
make the execution in dependency of available RAM or CPU load. The
default value for CPU load is 4.0 and the check runs once a minute.
10 years ago
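The resource-dependent execution mentioned in commit 68e8039fd1 might be gated along these lines. This is a sketch under stated assumptions: the class, method and parameter names are hypothetical, the free-memory threshold is invented for illustration, and treating an unavailable load value as "ok" is a choice made here, not something the commit states. Only the 4.0 load limit comes from the commit message.

```java
import java.lang.management.ManagementFactory;

// Hypothetical example class; not the actual YaCy scheduler.
final class LoadGateSketch {

    /**
     * Decides whether a scheduled process may run now.
     * A negative systemLoad means the platform cannot report a load average;
     * this sketch assumes the process may run in that case.
     */
    static boolean mayRun(double systemLoad, double maxLoad, long freeBytes, long minFreeBytes) {
        boolean loadOk = systemLoad < 0 || systemLoad <= maxLoad;
        return loadOk && freeBytes >= minFreeBytes;
    }

    public static void main(String[] args) {
        // getSystemLoadAverage() returns a negative value if the load is unavailable
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        long free = Runtime.getRuntime().freeMemory();
        // 4.0 is the default CPU load limit named in the commit; 16 MB is an invented threshold
        System.out.println(mayRun(load, 4.0, free, 16L * 1024 * 1024));
    }
}
```

A scheduler would evaluate such a check periodically (the commit mentions once a minute) and defer the process while it returns false.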
Michael Peter Christen
8aee7f940e
added missing class for latest changes
10 years ago
Michael Peter Christen
97039049e4
fix in key enumeration methods for cases where the enumeration is done
in reverse order.
10 years ago
Michael Peter Christen
7e1b0b6712
fix for wildcard patch in search queries
10 years ago
Michael Peter Christen
0a879c98e7
added new 'firstSeen' database table and necessary data structures which
hold a date for each URL to record when a url was first seen. This is
then used to overwrite the modification date for urls upon recrawl in
case the first-seen date is before the latest document date. This
is necessary because content management systems commonly
attach the current date to all documents. Using the
firstSeen database it is possible to approximate a real first document
creation date if the crawler starts frequently for the same
domain. As a result, search results ordered by date have a much
better quality, and so does the usage of YaCy as a search agent for
latest news.
10 years ago
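The rule described in commit 0a879c98e7 can be sketched as follows. This is an illustration only: the class and method names are hypothetical, dates are plain epoch milliseconds, and an in-memory map stands in for the 'firstSeen' database table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class; an in-memory map replaces the real firstSeen table.
final class FirstSeenSketch {

    private final Map<String, Long> firstSeen = new HashMap<>();

    /**
     * Records when a URL was first seen and returns the date to index:
     * the document date, unless the first-seen date is earlier.
     */
    long effectiveDate(String urlHash, long documentDateMillis, long nowMillis) {
        firstSeen.putIfAbsent(urlHash, nowMillis); // only the first crawl sets the value
        long first = firstSeen.get(urlHash);
        return Math.min(first, documentDateMillis);
    }

    public static void main(String[] args) {
        FirstSeenSketch sketch = new FirstSeenSketch();
        // First crawl at t=1000: the document reports an older date, which is kept
        System.out.println(sketch.effectiveDate("urlhash", 500L, 1000L));  // prints 500
        // Recrawl at t=5000: the CMS reports "now", so the first-seen date wins
        System.out.println(sketch.effectiveDate("urlhash", 5000L, 5000L)); // prints 1000
    }
}
```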
Michael Peter Christen
421ee64f33
another fix to ordering of table indexes; fixes also network stats
graphics
10 years ago
Michael Peter Christen
1db476c67e
fix for bad table iteration
10 years ago
reger
e4316e2d74
skip creation of local var in proxyhandler.storetocache
10 years ago
sixcooler
9c6e3a6b1c
fix assertion failure in version string for Solr-4.10.2 by changing
the assert - hope that is ok
+ add forgotten NetBeans project changes
10 years ago
sixcooler
725b206fb4
update to solr-/lucene-4.10.2
10 years ago
Michael Peter Christen
5c97ecb30f
fix of bad query generation for search facets
10 years ago
Michael Peter Christen
95d87f00b3
fix for bad query generation in doublecheck in postprocessing
10 years ago
orbiter
72c2bc5189
fix for search in case where local peer has no local seed address in
portal mode
10 years ago
orbiter
5be352da99
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
10 years ago
orbiter
0fcd8097a3
removed unused options from BusyThreads
10 years ago
Michael Peter Christen
fe8b1d137d
emergency bugfix for 100% CPU in image drawing
10 years ago
Michael Peter Christen
92007e5d2d
more enhancements to postprocessing speed
10 years ago
Michael Peter Christen
9a7fe9e0d1
fix for bad timing computation in postprocessing
10 years ago
Michael Peter Christen
bd16119a00
another fix for postprocessing (the query for "" on numeric field did
not work in external solr)
10 years ago
Michael Peter Christen
327e83bfe7
more fixes in postprocessing: partitioning of the complete queue to
enable smaller queries
10 years ago
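Commit 327e83bfe7 does not say how the postprocessing queue is partitioned. One simple way to turn a large work list into smaller queries is fixed-size chunking, sketched below; the class name, method name and chunk size are assumptions for illustration and not taken from YaCy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical example class; fixed-size chunking is an assumed strategy.
final class QueuePartitionSketch {

    /** Splits items into consecutive chunks of at most chunkSize elements. */
    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(partition(ids, 2)); // prints [[1, 2], [3, 4], [5]]
    }
}
```

Each chunk could then be processed by its own, smaller query instead of one query over the complete queue.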
orbiter
2bc6199408
more concurrency for postprocessing
10 years ago
orbiter
a83cf26c38
more fixes and enhancements to postprocessing
10 years ago
orbiter
71758f0d62
enhanced postprocessing by usage of a field-list generation to prevent
lazy initialization of the documents. This is useful because the
documents must be read completely anyway.
10 years ago
orbiter
7856fbdbe8
fix for npe (in rare cases)
10 years ago
orbiter
8a2b569d7c
fix for literal computation
10 years ago
orbiter
856da2712b
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
10 years ago
orbiter
ca9cd7b58a
more IPv6 fixes
10 years ago
Michael Peter Christen
b4585e9546
added new index size history image in /Status.html page
10 years ago
Michael Peter Christen
167c5a51f0
IPv6 fix
10 years ago
Michael Peter Christen
fe537679de
fix for exact_signature_unique_b, exact_signature_copycount_i,
fuzzy_signature_unique_b and fuzzy_signature_copycount_i: apply same
criteria for 'valid document' as for title and description uniqueness
test.
10 years ago
sixcooler
eb9d2705d2
fix for ConnectionInfo.cleanup of server-connections
10 years ago
Michael Peter Christen
2e5214eb21
added field postprocessing.partialUpdate to settings which can be used
to switch on or off partial updates. Both options should cause the same
result. Default is on.
10 years ago
Michael Peter Christen
11074d8d24
fix for an ssl bug that appears only in java 7.
The bug was reported in
http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5407&p=30956#p30956
a solution was described in
http://teknosrc.com/javax-net-ssl-sslprotocolexception-handshake-alert-unrecognized_name-solved/
which worked for the example given in the yacy forum
10 years ago
Michael Peter Christen
e96490e3a1
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
Michael Peter Christen
77662e08e1
concurrently initialize the error cache; extended also the cache by
factor 10 up to 1000 entries. This error cache is only used to catch up
paused crawls between shutdown+startup
10 years ago
sixcooler
d8fcc4a2f5
added a timeout on Jetty connectors
10 years ago
Michael Peter Christen
0f0b60404b
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
sixcooler
72561926aa
do not overwrite yacy.conf in case of an exception
may be a fix for http://mantis.tokeek.de/view.php?id=180
10 years ago
Michael Peter Christen
07c5b57953
removed warnings
10 years ago
orbiter
fa2ad101ec
enhanced graphics computation (avoiding long string parsing for colours)
10 years ago
orbiter
ef813cec91
added proper copyright notice to OSM tiles presented at the search
result page
10 years ago
Michael Peter Christen
fca11701f0
better profiling of solr queries
10 years ago
Michael Peter Christen
2e09da9832
npe fix
10 years ago
Michael Peter Christen
d80418f1b1
added partial updates to solr during postprocessing: during
postprocessing the solr documents are no longer retrieved completely.
Instead, only the fields needed for the postprocessing are extracted. When
Solr documents are written, this is done using partial updates.
This increases postprocessing speed by about 50% for embedded Solr
configurations. For external Solr configurations the enhancement should
be much higher because the postprocessing with remote Solr is very slow.
When doing partial updates to a remote Solr, this method should perform
much better than before; the gain is expected to be even larger
than the increase with local Solr.
10 years ago
Michael Peter Christen
b1cfbc4a04
added new solr field url_paths_count_i which can be used to enhance the
index browser and maybe also for ranking; possibly also for
SEO-with-YaCy applications.
10 years ago
Michael Peter Christen
e69883d5ab
fix-fix for 30d4402cd1
10 years ago
Michael Peter Christen
30d4402cd1
fixed location search
10 years ago
Michael Peter Christen
6983dff334
explain crawl denial when not switched to intranet mode
10 years ago
Michael Peter Christen
f818f84adb
more ipv6 fixes
10 years ago
Michael Peter Christen
afd5bd5f5f
slightly enhanced Network table computation by using a lazy initialized
bitfield for peer flags
10 years ago
Michael Peter Christen
2c2b50e65d
refactoring (class name should start with uppercase letter)
10 years ago
Michael Peter Christen
bc275dca07
added network history graph image /NetworkHistory.png which can show
many different statistics about the history of the peer.
10 years ago
Marc Nause
ce9368246b
Merge branch 'master' of gitorious.org:yacy/rc1
10 years ago
Marc Nause
5603809deb
Minor changes:
*) reduced visibility of a method
*) updated comments
10 years ago
Michael Peter Christen
d8beafba3a
fix for values in CrawlProfileEditor table and xml; now the full profile
is available in the xml.
10 years ago
Michael Peter Christen
ec95dfa2e6
fixed crawl profile xml result which did not show the correct crawl
status.
10 years ago
Michael Peter Christen
8c1a89cb34
added another decoration flag to switch off network graphics in crawler
monitor and index browser: decoration.grafics.linkstructure
Please set this to false to remove the graphics from the interface.
10 years ago
Michael Peter Christen
ee27be3399
misc bugfixes (concurrency, memory protection)
10 years ago
Michael Peter Christen
9b1958e8ca
more ipv6 bugfixes
10 years ago
Michael Peter Christen
7817fc50c9
added a high cpu cycle monitor to PerformanceQueues
10 years ago