Michael Peter Christen
7db2888336
fixed font size and print page generation in pdf snapshots
10 years ago
reger
24f68a4eb7
refactor opensearch heuristic
...
introduce FederateSearchManager, which handles search heuristics for external systems via specific FederateSearchConnectors.
These provide the query() functionality, the translation to the YaCy schema via .toYaCySchema(), and the search() routine that delivers results to searchevents, which is generally implemented in the Abstract connector.
The manager now enforces a minimum delay of 15 s between calls to external systems.
Besides the OpensearchConnector, a SolrFederateSearchConnector is available. It uses an additional config file for fieldname translation.
default heuristicopensearch.conf:
- openbdb.com removed - it no longer seems to deliver results
- config via solrconnector to datacite.org added (large technical library archive)
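
The minimum delay between calls to external systems mentioned above could look roughly like the following. This is only a sketch: the class name `MinDelayGate`, the method `tryAcquire`, and the choice to track the delay per target system are assumptions for illustration and are not taken from the FederateSearchManager code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not YaCy code): allows a call to a target only if the minimum delay has passed. */
final class MinDelayGate {
    private final long minDelayMillis;
    // Assumption: the delay is tracked per target system, keyed by an arbitrary name.
    private final Map<String, Long> lastCall = new HashMap<>();

    MinDelayGate(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /** Returns true and records the call time if the target may be queried at nowMillis. */
    synchronized boolean tryAcquire(String target, long nowMillis) {
        Long last = lastCall.get(target);
        if (last != null && nowMillis - last < minDelayMillis) {
            return false; // too early, the caller should skip this target for now
        }
        lastCall.put(target, nowMillis);
        return true;
    }

    public static void main(String[] args) {
        MinDelayGate gate = new MinDelayGate(15_000L); // 15 s, as stated in the commit message
        long now = System.currentTimeMillis();
        System.out.println(gate.tryAcquire("example-target", now));          // first call is allowed
        System.out.println(gate.tryAcquire("example-target", now + 1_000L)); // one second later is rejected
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic and easy to test; a real implementation would more likely read the clock itself.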
10 years ago
Michael Peter Christen
3b51636ecb
fix for mediawiki import
10 years ago
Michael Peter Christen
8cafdb989a
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
reger
4214f250d0
Add option for extended search (Autosearch) to Bookmark.html asking all connected peers for the searchterm added as description to the bookmark created by the bookmark icon.
...
Intended for searches/research projects with insufficient results from local and DHT selected remote target peers.
Function: the process checks newly created bookmarks for a description starting with "query=..." and uses it to ask every peer for 20 search results, which are added to the local index in a background job.
link to start/stop the process added to /Bookmarks.html
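
The check for bookmark descriptions starting with "query=" can be sketched as follows. The class `AutosearchQuery` and its method are hypothetical names chosen for illustration; how the actual process trims or validates the search term is not stated in the commit message and is assumed here.

```java
import java.util.Optional;

/** Hypothetical helper (not YaCy code): extracts the search term from a bookmark description. */
final class AutosearchQuery {
    static final String PREFIX = "query=";

    /** Returns the search term if the description starts with "query=", otherwise an empty Optional. */
    static Optional<String> extract(String description) {
        if (description == null || !description.startsWith(PREFIX)) {
            return Optional.empty();
        }
        // Assumption: surrounding whitespace is ignored and an empty term is treated as "no query".
        String term = description.substring(PREFIX.length()).trim();
        return term.isEmpty() ? Optional.empty() : Optional.of(term);
    }

    public static void main(String[] args) {
        System.out.println(extract("query=yacy federated search"));
        System.out.println(extract("just a note"));
    }
}
```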
10 years ago
reger
bb37cb32e4
Add title import for bookmark icon
...
if available in index
10 years ago
reger
8e751d754a
- add javadoc to busythread with hint about the init parameter usage
...
- remove obsolete 10_httpd config parameter
10 years ago
Michael Peter Christen
0871e43fcc
better scale
10 years ago
Michael Peter Christen
35c24608cc
fix for division by zero (rare cases)
10 years ago
reger
4eb89d7f15
revert clickservlet
...
(the default was indeed a mistake)
10 years ago
reger
ebe5faeb01
added url to bookmark icon link
...
The url is needed anyway; this saves an index lookup and works without a committed url.
Removed unused order parameter
10 years ago
reger
d44d8996d0
Added a “don't store remote search results” option
...
This is intended for peers who want to participate in the P2P network but don't wish to load/fill-up their index with metadata of every received search result.
The DHT transfer is not affected by this option (and will work as usual, so that a peer disabling the new store to index switch still receives and holds the metadata according to DHT rules).
The downside for the local peer is that search speed will not improve if search terms are only available remotely or by quick hits in the local index.
To still be able to improve the local index, a Click-Servlet option was added.
If switched on, all search result links point to this servlet, which forwards the user's browser (by html header) to the desired page and feeds the page to the fulltext-index.
The servlet accepts a parameter defining the action to perform (see defaults/web.xml: index, crawl, crawllinks).
The option check-boxes are placed in ConfigPortal.html
10 years ago
reger
d729386787
fix NPE in viewimage
...
Caused by: java.lang.NullPointerException
at net.yacy.peers.graphics.EncodedImage.<init>(EncodedImage.java:73)
at ViewImage.respond(ViewImage.java:156)
10 years ago
reger
4ff018c9e4
fix ConfigPortal jumps to iframe focus
...
add focus parameter to yacysearch.html too
10 years ago
Michael Peter Christen
5b810f6d70
Merge branch 'master' of gitorious.org:yacy/whitrs-rc1
10 years ago
Ryszard Goń
3cdbd5f5c6
Fix for progress table background not resizing
...
when the post-processing started/ended.
10 years ago
reger
0dfeee154a
adjustments for Bookmark icon to act on BookmarkDB,
...
it acts on YMarks but the YMark interface seems not to be maintained;
for future features (e.g. query memory) BookmarkDB is the likely choice to expand. Besides the crawlstart bookmark, the result bookmark icon now also adds to BookmarkDB.
The YMark related code is (for now) left untouched so both tables are updated.
10 years ago
Michael Peter Christen
513e9259f5
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
10 years ago
reger
e177d69387
remove obsolete config footer option (ConfigPortal user.login)
...
no footer or footer-option in use
remove unused yacy.init item allowUnlimitedReceiveIndexFrom
10 years ago
Michael Peter Christen
5d4167f977
reactivated clear stacks code for termination of all crawls because this
...
did not work without that part of the code
10 years ago
Michael Peter Christen
ecb6a59e9e
do not translate gif images into png images for thumbnails. Instead,
...
stream the original to the search result thumb viewer. This has two
reasons:
- animated gifs cause 100% cpu and deadlocks in the jvm gif parser; a
known bug which is obviously not yet fixed
- animated gifs now appear in the search result also as animation
10 years ago
Michael Peter Christen
d9603039ff
automatically set the Q flag for smb/ftp start urls (split pdf support)
10 years ago
Michael Peter Christen
8600ea01dd
automatically switch on query option in case intranet protocols (smb/ftp)
...
are used. This supports the new split-pdf option.
10 years ago
Ryszard Goń
3144313974
Postprocessing progress bar fix
...
(Make it work as [probably] actually intended)
10 years ago
reger
7e4e9f7e32
improve yacysearchitem,
...
prevent allocation of String (modifyURL) if feature not used
10 years ago
Michael Peter Christen
8ef56eda90
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
10 years ago
Michael Peter Christen
9fce8bf2a5
crawling of multi-page pdfs with artificial post part on smb or ftp
...
shares is not possible with the disabled setting; this is now temporarily
disabled until a better solution is at hand.
10 years ago
reger
682dd94925
fix div by 0 in hello
...
Caused by: java.lang.ArithmeticException: / by zero
at hello.respond(hello.java:159)
10 years ago
Michael Peter Christen
003ec43bee
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
Michael Peter Christen
bef689d0a2
NPE fix
10 years ago
reger
1de33c6a53
add hint to Heuristics Config on "Greedy Learning Mode" in portal config,
...
to point to an option to make this setting permanent.
10 years ago
Michael Peter Christen
84e2cccab4
fix to prevent assertion error in ranking servlet if no vocabularies are
...
present that could be evaluated
10 years ago
Michael Peter Christen
9e588944fa
prevent NPE during initialization of very large vocabularies
10 years ago
Michael Peter Christen
aaf7d4775a
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
Michael Peter Christen
85773ebd4f
removed debug lines
10 years ago
reger
198102304b
refactor size() -> filesize() of URIMetadataNode
...
(harmonize with ResultEntry and to not get confused with Collection.size())
10 years ago
Michael Peter Christen
445fafeb7c
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
Michael Peter Christen
0d69089c61
fix for division by zero
10 years ago
reger
ac61a39828
use peeraddress for link in remote crawl list
...
to make link work without enabled proxy
upd pom for Jetty (missing in last commit)
10 years ago
Michael Peter Christen
5516819354
preventing the use of no-cache and expires in case that images are
...
generated dynamically which will stay static in the future. This applies
mainly to the search result favicon in front of search hits. These icons
will now be generated once, but then cached in the browser. There is
also a YaCy-internal cache for these icons which had prevented the
re-generation of the icons in YaCy, but this cache is now superfluous
since the browser should not call the servlet ViewImage again.
10 years ago
Michael Peter Christen
d3e71ed070
fixes for searches when initialization of large autotagging libraries
...
have not been finished
10 years ago
Michael Peter Christen
28683530cd
fixes to usage of no-cache: use and recognize also the no-store
...
directive
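
As an illustration of recognizing both directives, the following sketch checks a Cache-Control header value for no-cache or no-store. The class `CacheDirectives` and the method name are hypothetical and not taken from YaCy; treating the two directives alike is a simplification made for this example, since under HTTP caching rules no-cache requires revalidation while no-store forbids storing.

```java
import java.util.Locale;

/** Hypothetical helper (not YaCy code): detects no-cache / no-store in a Cache-Control value. */
final class CacheDirectives {

    /** Returns true if the header value contains a no-cache or no-store directive. */
    static boolean bypassesCache(String cacheControl) {
        if (cacheControl == null) {
            return false;
        }
        for (String part : cacheControl.split(",")) {
            String directive = part.trim().toLowerCase(Locale.ROOT);
            // Drop an optional argument such as no-cache="set-cookie".
            int eq = directive.indexOf('=');
            if (eq >= 0) {
                directive = directive.substring(0, eq).trim();
            }
            if (directive.equals("no-cache") || directive.equals("no-store")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(bypassesCache("no-store"));
        System.out.println(bypassesCache("public, max-age=3600"));
    }
}
```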
10 years ago
Michael Peter Christen
932faafffe
reactivated on-demand snapshot loading
10 years ago
Michael Peter Christen
2362ad7c34
fix for a count issue in snapshot api
10 years ago
Michael Peter Christen
9971e197e0
Added a transaction interface to the snapshots: all documents in the
...
snapshots can now be processed with transactions using commit and
rollback commands. Furthermore, a large number of monitoring methods have
been added to check the success of transactions.
The transactions for snapshots have two main components: a rss search
API to get information about latest/oldest entries and a commit/rollback
API to move entries away from the rss results. This is done by usage of
two storage locations for the snapshots, INVENTORY and ARCHIVE. New
snapshots are placed in INVENTORY, committed snapshots move to ARCHIVE,
rolled-back snapshots move to INVENTORY again.
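
The movement between the two storage locations can be modelled in memory as below. This is a sketch under assumptions, not the snapshot implementation: the class `SnapshotStates` is hypothetical, real snapshots live in directories rather than sets, and returning "fail" when the urlhash is not in the expected location is assumed behaviour.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical in-memory model (not YaCy code) of the INVENTORY / ARCHIVE locations. */
final class SnapshotStates {
    private final Set<String> inventory = new LinkedHashSet<>();
    private final Set<String> archive = new LinkedHashSet<>();

    /** New snapshots are placed in INVENTORY. */
    void add(String urlhash) {
        inventory.add(urlhash);
    }

    /** Moves an entry from INVENTORY to ARCHIVE; returns "success" or "fail". */
    String commit(String urlhash) {
        if (!inventory.remove(urlhash)) {
            return "fail"; // assumption: committing an unknown or already archived entry fails
        }
        archive.add(urlhash);
        return "success";
    }

    /** Moves an entry from ARCHIVE back to INVENTORY; returns "success" or "fail". */
    String rollback(String urlhash) {
        if (!archive.remove(urlhash)) {
            return "fail";
        }
        inventory.add(urlhash);
        return "success";
    }

    /** Returns "INVENTORY", "ARCHIVE", or null if the urlhash is unknown. */
    String stateOf(String urlhash) {
        if (inventory.contains(urlhash)) return "INVENTORY";
        if (archive.contains(urlhash)) return "ARCHIVE";
        return null;
    }

    public static void main(String[] args) {
        SnapshotStates states = new SnapshotStates();
        states.add("upiFJ7Fh1hyQ");
        System.out.println(states.commit("upiFJ7Fh1hyQ") + " -> " + states.stateOf("upiFJ7Fh1hyQ"));
    }
}
```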
Normal Workflow:
Besides all the options below, usually it is sufficient to process data
like this:
- call
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
- process the rss result and use the <guid> value as <urlhash> (see next
command)
- for each processed result call
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
- then you can call the rss feed again and the committed urls are omitted
from the next set of items.
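
A client following this workflow mainly needs to build the feed and commit/rollback URLs. The sketch below only constructs the URL strings shown in this commit message; the class `SnapshotApiUrls` and its method names are hypothetical, and fetching the feed and reading the <guid> values is left out.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not YaCy code): builds URLs for the snapshot API described above. */
final class SnapshotApiUrls {
    private final String base; // e.g. "http://localhost:8090"

    SnapshotApiUrls(String base) {
        this.base = base;
    }

    /** URL of the rss feed for a state (INVENTORY, ARCHIVE) and an order (LATESTFIRST, OLDESTFIRST, ANY). */
    String feed(String state, String order) {
        return base + "/api/snapshot.rss?state=" + state + "&order=" + order;
    }

    /** URL that commits the snapshot identified by the urlhash taken from the feed's guid field. */
    String commit(String urlhash) {
        return command("commit", urlhash);
    }

    /** URL that rolls the snapshot back to INVENTORY. */
    String rollback(String urlhash) {
        return command("rollback", urlhash);
    }

    private String command(String name, String urlhash) {
        // Encoding is a precaution in this sketch; the example hash in the commit message needs none.
        return base + "/api/snapshot.json?command=" + name
                + "&urlhash=" + URLEncoder.encode(urlhash, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        SnapshotApiUrls urls = new SnapshotApiUrls("http://localhost:8090");
        System.out.println(urls.feed("INVENTORY", "LATESTFIRST"));
        System.out.println(urls.commit("upiFJ7Fh1hyQ"));
    }
}
```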
These are the commands to control this:
The rss feed:
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=ANY
The feed will return a <urlhash> in the <guid> - field of the rss. This
must be used for commit/rollback:
Commit/Rollback:
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
http://localhost:8090/api/snapshot.json?command=rollback&urlhash=<urlhash>
The json will return a property list containing the property "result"
with possible values "success" or "fail", according to the result. If a
"fail" occurs, please look into the log for further info.
Monitoring:
http://localhost:8090/api/snapshot.json?command=status
This shows the total number of entries in the INVENTORY and the ARCHIVE
http://localhost:8090/api/snapshot.json?command=list
This returns a list of all hosts which have snapshots and the number
of entries for the hosts. Counts for INVENTORY and ARCHIVE are listed in
the properties "count.INVENTORY" and "count.ARCHIVE"
http://localhost:8090/api/snapshot.json?command=list&depth=2
The list can be restricted to documents at a specific depth. The list
then contains the same host names, but the count values change because
only documents at that specific crawl depth are counted
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80
This lists all urlhashes for the given host, not only an accumulated
list of the number of entries
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80&depth=0
This restricts the list of urlhashes for that host to the given depth
http://localhost:8090/api/snapshot.json?command=list&state=INVENTORY
http://localhost:8090/api/snapshot.json?command=list&state=ARCHIVE
This selects either the INVENTORY or ARCHIVE for all list commands,
default is ALL which means that from both snapshot directories the host
information is collected and combined. You can use the state option for
all the commands as listed above
Detailed Information:
http://localhost:8090/api/snapshot.json?command=metadata&urlhash=upiFJ7Fh1hyQ
This collects metadata information for the given urlhash. This can also
be restricted with state=INVENTORY and state=ARCHIVE to test if the
document is in one of these snapshot directories. If a urlhash
is not found, an empty result is returned. If an entry was found and the
state was not restricted, then the result contains a state property
containing the name of the location where the document is, either
INVENTORY or ARCHIVE.
Hint:
If a very large number of documents is inside of INVENTORY, then it
could be better to call the rss feed with
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
because that is very efficient.
10 years ago
reger
6c3f36def1
- fix path to default heuristic.cfg
...
- deprecate unused ProxyServlet
10 years ago
Michael Peter Christen
c3c2b6999b
fixes on wkhtmltopdf
10 years ago
Michael Peter Christen
ff035a20e7
fix for vocabulary import (double term detection)
10 years ago
Michael Peter Christen
e6650050fe
fix for Is Facet checkbox
10 years ago
Michael Peter Christen
bd3ed5cae5
added charset detection to vocabulary reader
10 years ago