Michael Peter Christen
9fce8bf2a5
crawling of multi-page pdfs with artificial post part on smb or ftp
...
shares is not possible with the disabled setting; this is now temporarily
disabled until a better solution is at hand.
10 years ago
reger
682dd94925
fix div by 0 in hello
...
Caused by: java.lang.ArithmeticException: / by zero
at hello.respond(hello.java:159)
10 years ago
reger
17808898c6
update to SLF4J 1.7.9
10 years ago
reger
f856edecb6
fix proxy redirect (http status 302) response
...
fixes http://mantis.tokeek.de/view.php?id=517
The url given in the bug report uses a gzip input stream, which causes HTTPClient.writeto() to throw an IOException due to an incomplete input stream. This in turn prevents the 302 response from reaching the client browser.
Serving target content only on httpstatus=200 lets the proxy pass the header response through, so the client browser's redirect settings can be honored.
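The rule described in this fix can be sketched as a single decision. This is an illustration only: the class and method names are hypothetical and are not taken from YaCy's proxy code.

```java
// Hypothetical helper, not YaCy's actual proxy implementation.
final class ProxyBodyPolicy {
    /**
     * Stream the upstream body only for status 200. For any other status
     * (for example 302) only the status line and headers are relayed, so a
     * broken or incomplete body stream cannot block the redirect.
     */
    static boolean shouldStreamBody(int httpStatus) {
        return httpStatus == 200;
    }
}
```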
10 years ago
Michael Peter Christen
cc090bcb01
enhanced initialization of autotagging
10 years ago
Michael Peter Christen
003ec43bee
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
Michael Peter Christen
bef689d0a2
NPE fix
10 years ago
reger
1de33c6a53
add hint to Heuristics Config on "Greedy Learning Mode" in portal config,
...
to point to an option to make this setting permanent.
10 years ago
reger
5332c9df21
update to commons-fileupload-1.3.1.jar
...
(includes a security fix)
10 years ago
Michael Peter Christen
a0576ec737
fix for pdf sub-page result preparation
10 years ago
Michael Peter Christen
6ad43c4a8b
removed debug code
10 years ago
Michael Peter Christen
407cfff010
fix to wkhtmltopdf usage
10 years ago
Michael Peter Christen
5d321d3dc5
fixes to wkhtmltopdf call
10 years ago
Michael Peter Christen
eb78388a98
changed prefer strategy for http unique in such a way that http is
...
preferred over https. While this is a bad idea from the standpoint of
security it is more commonly applicable for environments where http and
https mix and for some domains https is not available. Then the
double-check is possible even if no postprocessing is performed.
10 years ago
Michael Peter Christen
84e2cccab4
fix to prevent assertion error in ranking servlet if no vocabularies are
...
present that could be evaluated
10 years ago
Michael Peter Christen
9e588944fa
prevent NPE during initialization of very large vocabularies
10 years ago
Michael Peter Christen
aaf7d4775a
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
Michael Peter Christen
8c3e5b7b6d
added experimental pdf splitting which enables YaCy to split pdfs during
...
parsing into individual pages and add them all using different URLs.
These constructed urls are generated from the source url with an
appended page=<pagenumber> attribute to the url get/post properties.
This will distinguish the different page entries. The search result list
will then replace the post parameter with a url anchor (# mark), so
that the original url is presented in the search result. These
URLs can be opened directly on the correct page using pdf.js, which is
now built into Firefox. That means: if you find a search hit on page
5 and click on the search result, Firefox will open the pdf viewer and
show page 5.
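A minimal sketch of the URL handling described in this commit, under stated assumptions: the class and method names are hypothetical, the anchor form #page=<n> is assumed (it is the form pdf.js understands), and source URLs that already carry a fragment are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not YaCy's actual parser code.
final class PdfPageUrls {
    // Matches a trailing page=<n> parameter and captures the URL before it.
    private static final Pattern PAGE_PARAM = Pattern.compile("(.*)[?&]page=(\\d+)");

    /** Builds the per-page index URL by appending page=<n> to the source URL. */
    static String pageUrl(String sourceUrl, int page) {
        String separator = sourceUrl.contains("?") ? "&" : "?";
        return sourceUrl + separator + "page=" + page;
    }

    /** Turns the trailing page=<n> parameter back into an anchor for display. */
    static String displayUrl(String pageUrl) {
        Matcher m = PAGE_PARAM.matcher(pageUrl);
        return m.matches() ? m.group(1) + "#page=" + m.group(2) : pageUrl;
    }
}
```

With these helpers, a hit on page 5 of http://example.org/doc.pdf would be indexed under the page=5 URL and shown with the #page=5 anchor in the result list.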
10 years ago
Michael Peter Christen
85773ebd4f
removed debug lines
10 years ago
Michael Peter Christen
d14114697c
the miss cache does not seem to work, it sometimes contains urlhashes
...
from documents which actually are inside the index. This can be
reproduced using the crawl result table at
http://localhost:8090/CrawlResults.html?process=5
The cache is temporarily disabled to remove the bad behaviour; however, a
later reactivation of that feature may be possible.
10 years ago
reger
deb75a1dbe
fix refactored size() -> filesize() in YMarkMetadata
10 years ago
reger
198102304b
refactor size() -> filesize() of URIMetadataNode
...
(harmonize with ResultEntry and to not get confused with Collection.size())
10 years ago
reger
c6f634a4f2
remove redundant caching of urlhash in URIMetadataNode
...
(is already cached in underlying DigestURL .url)
upd pom keyword for maven-antrun-plugin
10 years ago
Michael Peter Christen
445fafeb7c
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
Michael Peter Christen
0d69089c61
fix for division by zero
10 years ago
reger
ac61a39828
use peeraddress for link in remote crawl list
...
to make link work without enabled proxy
upd pom for Jetty (missing in last commit)
10 years ago
reger
fe5d4e6c7b
update to Jetty 9.2.6
10 years ago
Michael Peter Christen
5516819354
preventing the use of no-cache and expires in case that images are
...
generated dynamically which will stay static in the future. This applies
mainly to the search result favicon in front of search hits. These icons
will now be generated once and then cached in the browser. There is
also a YaCy-internal cache for these icons which had prevented the
re-generation of the icons in YaCy, but this cache is now superfluous
since the browser should not call the servlet ViewImage again.
10 years ago
Michael Peter Christen
d3e71ed070
fixes for searches when initialization of large autotagging libraries
...
has not finished
10 years ago
Michael Peter Christen
28683530cd
fixes to usage of no-cache: use and recognize also the no-store
...
directive
10 years ago
Michael Peter Christen
c9c700b510
reduction of http requests to YaCy using the correct cache-control,
...
expires and last-modified headers in http response.
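As an illustration of the three headers named above, the following sketch builds them for a response that may be cached. The class name, the "public, max-age" policy and the lifetime parameter are assumptions made for the example; the commit does not state which values YaCy sends.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not YaCy's actual servlet code.
final class CacheableResponseHeaders {
    // HTTP date format with a fixed two-digit day and GMT suffix.
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH);

    /** Returns Cache-Control, Expires and Last-Modified for a cacheable response. */
    static Map<String, String> headers(ZonedDateTime lastModified, ZonedDateTime now, long maxAgeSeconds) {
        Map<String, String> h = new LinkedHashMap<>();
        h.put("Cache-Control", "public, max-age=" + maxAgeSeconds);
        h.put("Expires", httpDate(now.plusSeconds(maxAgeSeconds)));
        h.put("Last-Modified", httpDate(lastModified));
        return h;
    }

    private static String httpDate(ZonedDateTime t) {
        // Convert to UTC first so the literal GMT suffix is correct.
        return HTTP_DATE.format(t.withZoneSameInstant(ZoneOffset.UTC));
    }
}
```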
10 years ago
reger
eca578a5fa
update to PDFBox 1.8.8
10 years ago
reger
13cca2b114
fix missing AppPath
...
upd Maven plugin versionid
10 years ago
Michael Peter Christen
d7e2f08a89
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
reger
0f7d4c42e9
include xmpcore.jar in classpath
...
used by metadata-extractor
10 years ago
malykhin.dmitry
bd39e009ac
Update russian translation
10 years ago
Michael Peter Christen
65125439fe
added query modifier 'on'. This makes it possible to search for date
...
occurrences within the (web) page documents (not the document
last-modified!). This works only if the solr field dates_in_content_sxt
is enabled. A search request may then have the form "term on:<date>",
like
gift on:24.12.2014
gift on:2014/12/24
* on:2014/12/31
For the date format you may use any kind of human-readable date
representation(!yes!) - the on:<date> parser tries to identify language
and also knows event names, like:
bunny on:eastern
.. as long as the date term has no spaces inside (use a dot). Further
enhancements will be made to also accept strings enclosed in
quotes.
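To make the "term on:<date>" form concrete, here is a small sketch that splits a query into its search terms and the date of the on: modifier. It is not YaCy's parser: the class name is hypothetical, only the two numeric formats from the examples above are handled, and language detection and event names are left out.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not YaCy's actual query parser.
final class OnModifier {
    // Assumed subset of formats: 24.12.2014 and 2014/12/24.
    private static final List<DateTimeFormatter> FORMATS = List.of(
            DateTimeFormatter.ofPattern("d.M.uuuu"),
            DateTimeFormatter.ofPattern("uuuu/M/d"));

    /** Returns the date of the first on:<date> token that matches a known format. */
    static Optional<LocalDate> date(String query) {
        for (String token : query.trim().split("\\s+")) {
            if (!token.startsWith("on:")) continue;
            String raw = token.substring(3);
            for (DateTimeFormatter format : FORMATS) {
                try {
                    return Optional.of(LocalDate.parse(raw, format));
                } catch (DateTimeParseException e) {
                    // Not this format, try the next one.
                }
            }
        }
        return Optional.empty();
    }

    /** Returns the query without any on:<date> tokens. */
    static String terms(String query) {
        List<String> kept = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (!token.startsWith("on:")) kept.add(token);
        }
        return String.join(" ", kept);
    }
}
```

Because the query is split on whitespace, a date term containing spaces would not survive as one token, which matches the restriction mentioned above.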
10 years ago
Michael Peter Christen
1cfddea578
added (very experimental) Solr response writer for snapshot image
...
results
10 years ago
Michael Peter Christen
7287dd764e
added url, date, time and page number on pdf snapshot footer
10 years ago
Michael Peter Christen
8b5d074715
fix for image parser (there is a class missing!)
10 years ago
Michael Peter Christen
932faafffe
reactivated on-demand snapshot loading
10 years ago
Michael Peter Christen
2362ad7c34
fix for a count issue in snapshot api
10 years ago
Michael Peter Christen
3354cd63be
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
10 years ago
Michael Peter Christen
9971e197e0
Added a transaction interface to the snapshots: all documents in the
...
snapshots can now be processed with transactions using commit and
rollback commands. Furthermore, a large number of monitoring methods have
been added to check the success of transactions.
The transactions for snapshots have two main components: an rss search
API to get information about latest/oldest entries and a commit/rollback
API to move entries away from the rss results. This is done by usage of
two storage locations for the snapshots, INVENTORY and ARCHIVE. New
snapshots are placed in INVENTORY, committed snapshots move to ARCHIVE,
rolled-back snapshots move to INVENTORY again.
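The two-location model can be pictured with a small in-memory sketch. This is an assumption-laden illustration: the real snapshots live in storage directories, and the class and method names here are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory model, not YaCy's snapshot storage.
final class SnapshotStates {
    enum State { INVENTORY, ARCHIVE }

    private final Map<String, State> entries = new LinkedHashMap<>();

    /** New snapshots always start in INVENTORY. */
    void add(String urlhash) {
        entries.put(urlhash, State.INVENTORY);
    }

    /** Moves an entry from INVENTORY to ARCHIVE; false if it is not in INVENTORY. */
    boolean commit(String urlhash) {
        return entries.replace(urlhash, State.INVENTORY, State.ARCHIVE);
    }

    /** Moves an entry from ARCHIVE back to INVENTORY; false if it is not in ARCHIVE. */
    boolean rollback(String urlhash) {
        return entries.replace(urlhash, State.ARCHIVE, State.INVENTORY);
    }

    /** Number of entries currently in the given state. */
    long count(State state) {
        return entries.values().stream().filter(s -> s == state).count();
    }
}
```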
Normal Workflow:
Besides all the options below, usually it is sufficient to process data
like this:
- call
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
- process the rss result and use the <guid> value as <urlhash> (see next
command)
- for each processed result call
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
- then you can call the rss feed again and the committed urls are omitted
from the next set of items.
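The workflow above could be driven by a client along these lines. This is a sketch under assumptions: the class and method names are hypothetical, the <guid> values are pulled out with a simple regular expression instead of an XML parser, and the loop has not been run against a live peer. The URLs are the ones given in this commit message.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical client, not part of YaCy.
final class SnapshotWorkflow {
    private static final Pattern GUID = Pattern.compile("<guid[^>]*>([^<]+)</guid>");

    static String feedUrl(String base, String state, String order) {
        return base + "/api/snapshot.rss?state=" + state + "&order=" + order;
    }

    static String commandUrl(String base, String command, String urlhash) {
        return base + "/api/snapshot.json?command=" + command
                + "&urlhash=" + URLEncoder.encode(urlhash, StandardCharsets.UTF_8);
    }

    /** Extracts the <guid> values, which serve as urlhashes. */
    static List<String> guids(String rss) {
        List<String> hashes = new ArrayList<>();
        Matcher m = GUID.matcher(rss);
        while (m.find()) hashes.add(m.group(1).trim());
        return hashes;
    }

    static String get(HttpClient client, String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Fetches the latest INVENTORY entries and commits each one after processing. */
    static void processInventory(String base) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String rss = get(client, feedUrl(base, "INVENTORY", "LATESTFIRST"));
        for (String urlhash : guids(rss)) {
            // Process the snapshot identified by urlhash here, then commit it.
            get(client, commandUrl(base, "commit", urlhash));
        }
    }
}
```

Calling processInventory("http://localhost:8090") repeatedly would work through INVENTORY, since committed entries no longer appear in the feed.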
These are the commands to control this:
The rss feed:
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=ANY
The feed will return a <urlhash> in the <guid> - field of the rss. This
must be used for commit/rollback:
Commit/Rollback:
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
http://localhost:8090/api/snapshot.json?command=rollback&urlhash=<urlhash>
The json will return a property list containing the property "result"
with possible values "success" or "fail", according to the result. If a
"fail" occurs, please look into the log for further info.
Monitoring:
http://localhost:8090/api/snapshot.json?command=status
This shows the total number of entries in the INVENTORY and the ARCHIVE
http://localhost:8090/api/snapshot.json?command=list
This returns a list of all hosts which have snapshots and the number
of entries for the hosts. Counts for INVENTORY and ARCHIVE are listed in
the properties "count.INVENTORY" and "count.ARCHIVE"
http://localhost:8090/api/snapshot.json?command=list&depth=2
The list can be restricted to documents at a specific depth. The list
then contains the same host names, but the count values change because
only documents at that specific crawl depth are listed
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80
This lists all urlhashes for the given host, not only an accumulated
list of the number of entries
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80&depth=0
This restricts the list of urlhashes for that host for the given depth
http://localhost:8090/api/snapshot.json?command=list&state=INVENTORY
http://localhost:8090/api/snapshot.json?command=list&state=ARCHIVE
This selects either the INVENTORY or ARCHIVE for all list commands,
default is ALL which means that from both snapshot directories the host
information is collected and combined. You can use the state option for
all the commands as listed above
Detailed Information:
http://localhost:8090/api/snapshot.json?command=metadata&urlhash=upiFJ7Fh1hyQ
This collects metadata information for the given urlhash. This can also
be restricted with state=INVENTORY and state=ARCHIVE to test if the
document is in one of these snapshot directories. If a urlhash
is not found, an empty result is returned. If an entry was found and the
state was not restricted, then the result contains a state property
containing the name of the location where the document is, either
INVENTORY or ARCHIVE.
Hint:
If a very large number of documents is inside of INVENTORY, then it
could be better to call the rss feed with
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
because that is very efficient.
10 years ago
reger
63846ddb89
add final SolrQueryRequest.close to SolrServlet
10 years ago
reger
9edc7308aa
update to metadata-extractor-2.7.0.jar
...
add 2 simple JUnit test cases for jpeg and tif parsing
10 years ago
Michael Peter Christen
578ae29f1e
added a note that the servlet is linked using web.xml
10 years ago
reger
6c3f36def1
- fix path to default heuristic.cfg
...
- deprecate unused ProxyServlet
10 years ago
reger
00113dcfbd
add chardet.jar to Maven dependencies
10 years ago
reger
446f374ba9
fix yacy.init comment
...
http://mantis.tokeek.de/view.php?id=513
10 years ago