reger
9619b8743c
add Solr Servlet
12 years ago
reger
13fc86c960
Merge remote-tracking branch 'origin/master' into jetty
12 years ago
reger
850609937f
update Info.plist for Jetty 9 jars
12 years ago
reger
f7f86d8a5d
update to Jetty 9 jars
- include javax.servlet 3.0
12 years ago
reger
603368fc3e
remove redundant declaration of USER_AGENT
12 years ago
reger
bd71b14d25
add mandatory p2p parameter to templatePattern
12 years ago
reger
b8da176c5d
adjust setHandled to request of call parameter
12 years ago
reger
127adbf5cf
remove references to 10_http thread (legacy http server)
and add needed get/set function to jetty http server wrapper
12 years ago
Michael Peter Christen
1a8c64117f
decreased the responseHeaderDB database which is now flushed more
frequently. This will preserve more documents in the cache in case of a
crash.
12 years ago
Michael Peter Christen
3e22d05290
added option for daterange properties in GSA interface to use a left-
or right-open date range;
i.e. using daterange=..2013-09-09 or daterange=2013-09-02.. additional
to daterange=2013-09-02..2013-09-09
12 years ago
reger
36b7159282
- remove double initialization of jetty
- refactor some var assignments
12 years ago
reger
8e52271491
- delete not needed old jetty jars from libt
- add jetty to Info.plist
12 years ago
reger
63ed04260a
Merge remote-tracking branch 'origin/master' into jetty
12 years ago
reger
fe87fb638a
adjust test/ParserTest to dc_description data type
12 years ago
Michael Peter Christen
35ab2cef7b
added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
html meta fields to get a correct (or: better) date timestamp. The
http:last-modified mostly does not work because most CMS set it to the
current date.
12 years ago
reger
2ee68f76f6
added reading of parameters from multi-part form fields (nasty quick-fix)
12 years ago
Michael Peter Christen
9cc8468b30
added tools to visualize image generation (i.e. during testing)
12 years ago
reger
105cf8f593
changes to adjust jetty to recent code changes
12 years ago
reger
aafef72a8a
merged current rc1/master into jetty branch to allow further development with latest version
ServerSideIncludes and servlet return values need further work (for working jetty integration)
- TODO: added nasty quickfix to allow SSI - needs further work
- TODO: YaCy servlet return values/parameters are not handled
12 years ago
Michael Peter Christen
dbef8ccfcb
forced deletion of ZURL entries for a specific host for each host that
appears in the crawl url list
12 years ago
Michael Peter Christen
e137ff4171
refactoring (in preparation for new removeHost method)
12 years ago
Michael Peter Christen
7a5574cd51
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
12 years ago
Michael Peter Christen
85456f46b2
added two new fields, exact_signature_copycount_i and
fuzzy_signature_copycount_i, which count the number of copies of
non-unique documents and assign this count to each document. Thus, each
document is assigned a number which shows how many copies of this
document exist.
These fields are disabled by default.
12 years ago
orbiter
26366596d9
fix for a problem which occurs when a site is crawled where the start
url is redirected.
12 years ago
Michael Peter Christen
a2511b5600
turned images_alt_txt back to images_alt_sxt because it is not necessary
to index the alt text. Indexed image Text is in images_text_t
12 years ago
Michael Peter Christen
85b1922244
activated image type navigation for image search
12 years ago
Michael Peter Christen
9e12fdff23
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
12 years ago
Michael Peter Christen
ab1201fdfd
fixed wrong facet count
12 years ago
Michael Peter Christen
049c3b3f2e
added an option to exclude image search results from text search. This
is on by default.
12 years ago
Michael Peter Christen
69f85265e1
added an option to put image links to the crawl queue and handle these
like normal documents. Using this option (by default on at this moment;
this might change soon) it is possible to get the exif data into the
search index to be used in image search.
12 years ago
Michael Peter Christen
e8e558a9b7
fix for content domain classification in URIMetadataNode
12 years ago
Michael Peter Christen
a8c5bfcf58
avoid creating unnecessary objects
12 years ago
Michael Peter Christen
5a0de1b77d
moving image description text to image text field
12 years ago
Michael Peter Christen
dc179bd61f
fix for catchall query goal for image search
12 years ago
Michael Peter Christen
5d71a4c8bc
fix for dc:description field
12 years ago
reger
392174de8c
remove all_words, all_strings lists from QueryGoal
- only used for text highlighting in parser text (ViewFile.html) which can be done with include_strings only
12 years ago
Michael Peter Christen
169ef8963d
one more fix for image search
12 years ago
Michael Peter Christen
cb85b22725
redesign of the image search process (with much better results;
unfortunately the index schema has changed and p2p image search will not
be much better until many people update)
12 years ago
Michael Peter Christen
6184fd9d9a
fix for solr/gsa result logging
12 years ago
reger
29967102a2
optimized QueryGoal (reducing mem and computation by removing all_hashes)
- all_hashes used for text highlighting and word distance computation which can be done with include_hashes only
12 years ago
orbiter
f106345eef
link strings should not be tokenized
12 years ago
orbiter
deadeb406e
image alt tag strings should be tokenized
12 years ago
orbiter
5b14bdfffd
npe fix
12 years ago
orbiter
3e5f8e29e2
next development release step to reflect the extension of the solr api
with javabin format capability
12 years ago
orbiter
1ca4b9612c
added special handling of the BinaryResponseWriter in the solr interface
which makes it possible to use solrj with the javabin format which is
much better (compressed, no xml overhead, java object streams) and
faster. Furthermore, this enables the 'shards' option in the solr
interface which connects one solr (YaCy) to another solr (YaCy) ad-hoc.
12 years ago
reger
d0e78082d1
return field names in index instead of in schema for SolrServerConnector.getFields
12 years ago
Michael Peter Christen
1a3e42eca4
index migration to lucene 4.4
12 years ago
Michael Peter Christen
a88a62f7aa
added a feature to set a collection for a crawl result based on a
regular expression on the url: the collection attribute for a crawl start
may now be either a token or a list of tokens, separated by ',' where a
token is either a string or a pair <string,pattern> where the string is
separated from the pattern by a ':' and the string is assigned to the
document as collection only if the pattern matches the url.
12 years ago
Michael Peter Christen
3c5abedabf
NPE during shutdown fix
12 years ago
Michael Peter Christen
e4cbe9232d
fixed a crawler bug where a double-occurring url was not re-crawled
because the double-check error was written to the error-db and never
deleted. Now the error-db is cleared on every start and these
double-messages are not written to the error-db any more.
12 years ago