Michael Peter Christen
07cee6b99c
removed more unused code
11 years ago
Michael Peter Christen
20b48f894f
refactoring: moving all servlets to the same package (the solr servlet
...
is currently actually a filter which should be changed somehow)
11 years ago
Michael Peter Christen
84167adb49
removed unused anomichttpd code after migration to jetty
11 years ago
Michael Peter Christen
b461a27abb
fixed the SolrServlet
11 years ago
Michael Peter Christen
7603e879dc
Merge branch 'master' into HEAD
...
Conflicts:
.classpath
source/net/yacy/cora/federate/solr/SolrServlet.java
11 years ago
Michael Peter Christen
25250405f1
solr servlet preparation for join with jetty branch
11 years ago
Michael Peter Christen
2f16770681
migrated to solr 4.6.0
11 years ago
Michael Peter Christen
57f0f71ac6
added patch to allow binary response writer
11 years ago
orbiter
937273d4e3
added parsing of metadata to surrogate reading:
...
a dublin core record inside of surrogate input files may now contain
tokens within the namespace 'md' (short for: metadata). The token names
must be valid within the namespace of the solr field names. All
md-tokens inside of surrogate files then overwrite values within solr
documents before they are written to the solr index. This makes it
possible to assign collection names to each surrogate entry and also
ranking information can be added. Please see the example file.
11 years ago
reger
18497f6475
remove unused init parameter from DefaultServlet
...
- remove "RelativeResourceBase" parameter
11 years ago
orbiter
4de3fefdb5
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
orbiter
7e346e1d79
using stringbuilder in query construction
11 years ago
reger
c84c313fe1
Merge origin/master into jetty
11 years ago
Michael Peter Christen
2702d9e56b
- added a SolrQueryResponse2SolrDocumentList method which is able to
...
work around the unfolding process in Solr's BinaryResponseWriter.
This was a huge performance bottleneck in the embedded solr connector
and the problem is actually on Solr side, but we have now a workaround.
- This made it possible to abstract a high-performance index access
method which is implemented as method getDocumentListByParams. That
method is also implemented in the SolrServerConnector and provides a
very efficient access to a solr index if the index is embedded.
- a popular use of the document list retrieval is a result count which
can now also make use of the new method, via getDocumentCountByParams.
- enhanced the Error cache which now does not store error documents
within the ram cache if the document is also written to solr. When
documents are retrieved from the cache, they are partly read from the
ram cache and if not existent there, from the Solr index.
11 years ago
Michael Peter Christen
74466d731a
use pre-compiled patterns in ymark
11 years ago
Michael Peter Christen
34633044b4
made pattern computation static
11 years ago
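The two entries above (pre-compiled patterns in ymark, static pattern computation) describe a common Java optimization. A minimal sketch of the idea, assuming a hypothetical whitespace-normalizing helper that is not taken from the YaCy source:

```java
import java.util.regex.Pattern;

final class PrecompiledPatternSketch {
    // Compiled once when the class is loaded, then shared by every call.
    // Calling String.replaceAll("\\s+", " ") instead would recompile the regex each time.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: collapses runs of whitespace into single spaces.
    static String normalize(String s) {
        return WHITESPACE.matcher(s.trim()).replaceAll(" ");
    }
}
```

Pattern instances are immutable and safe to share between threads, which is why a static final field works here.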
Michael Peter Christen
ef7ddbc933
added date parser caches to prevent re-calculation of costly date
...
parsing
11 years ago
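The date parser cache mentioned above can be pictured roughly as follows. This is only a sketch under assumptions: the class name, the ISO date format and the unbounded map are illustrative choices, and a real cache would probably need a size limit.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class DateParseCacheSketch {
    // Maps the raw date string to its parsed value (unbounded here, for brevity).
    private static final Map<String, LocalDate> CACHE = new ConcurrentHashMap<>();

    // Counts how often the real parser ran; only there to make the effect visible.
    static final AtomicInteger PARSER_CALLS = new AtomicInteger();

    static LocalDate parse(String isoDate) {
        // The costly parse runs only on a cache miss.
        return CACHE.computeIfAbsent(isoDate, s -> {
            PARSER_CALLS.incrementAndGet();
            return LocalDate.parse(s, DateTimeFormatter.ISO_LOCAL_DATE);
        });
    }
}
```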
Michael Peter Christen
552ef9f18e
fix for bad ErrorCache.exists test (bug from latest commit)
11 years ago
Michael Peter Christen
09412ea3a4
counting search requests in solr interface
11 years ago
Michael Peter Christen
303f5694ba
avoid usage of existsByQuery. If a document can be loaded by the ID
...
before testing other fields from the existsByQuery request, then a
document cache fills and queries after that one can be avoided.
11 years ago
reger
b43bbd3cc4
join DefaultServlet and Jetty8 implementation
...
- removing Jetty 8 specific dependencies
11 years ago
reger
089c5007ee
move conditionalHeader to DefaultServlet
...
- by removing Jetty specific implementation detail
11 years ago
Michael Peter Christen
79771c60c0
IPv6 fixes
11 years ago
reger
92d9c56f9f
Merge origin/master into jetty
11 years ago
Michael Peter Christen
78eac85161
better calibration of caches and queue maximum sizes
11 years ago
Michael Peter Christen
c8af19bd37
removed unnecessary check which causes a NPE when searching with empty
...
search string
11 years ago
Michael Peter Christen
e3c2f09de9
- reduce computation in case that specific postprocessing fields are not
...
selected
- de-select citation rank computation
11 years ago
Michael Peter Christen
cfa08024c7
removed optimization before postprocessing because that may cause a
...
time-out, which makes postprocessing fail.
11 years ago
Michael Peter Christen
6f3a923691
fixed urlmask which was not able to combine several constraints
11 years ago
Michael Peter Christen
9a27bf6e82
removed filter computation in Protocol class for remote searches because
...
that is already done in the QueryParams class
11 years ago
Michael Peter Christen
f1b5db2c45
- performance graph does not show peer ping in memory monitor any more
...
- after a forced GC, the PerformanceMemory view switches to automatic
update by default
11 years ago
Michael Peter Christen
a125904a1c
fixed a NPE in surrogate processing
11 years ago
Michael Peter Christen
0db8e34625
enhanced webgraph processing
11 years ago
reger
ac067b5236
clean-up Jetty handler classes
11 years ago
reger
b75e92aac3
add read queryparameter in gsaservlet
11 years ago
reger
1e94719084
fix NPE on mime detection of unknown file extension
11 years ago
reger
effea4bca0
Merge origin/master into jetty
...
Conflicts:
source/net/yacy/cora/federate/solr/SolrServlet.java
11 years ago
sixcooler
2c2ebb0d92
tried some hardening so that no Solr searchers are left open
11 years ago
Michael Peter Christen
a16534cb0a
tried to fix timeout and connection-lost problems when using an outside
...
solr.
11 years ago
Michael Peter Christen
c3dcbdc8d5
try to recover from an OOM during citation index reading and fail-over
...
to second solr core in case of unrecoverable OOM.
11 years ago
Michael Peter Christen
9932c441c8
fixed a problem with Date fields parsing Solr results if a remote Solr
...
is attached.
11 years ago
sixcooler
94db054aff
memory-leak-fix: the DocListSearcher fires a query in its constructor
...
and it is highly recommended to close every SolrRequest.
Every request which is not closed leaves a Searcher with its caches and
cannot be garbage-collected.
11 years ago
reger
26bb1e37b7
implement core selection in SolrServlet
...
- making initcore() obsolete
11 years ago
Michael Peter Christen
ae55d69ef6
include/exclude size NPE fix (recently added)
11 years ago
Michael Peter Christen
2c39b65409
fixes for searches containing stopwords. The fix was done using a
...
reconstruction of the search word set access method to protect that
words are deleted from the sets from the outside of the QueryGoal class.
11 years ago
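The stopword fix above protects the search word sets from being modified by callers. One way to get that protection in Java is to hand out read-only views and explicit copies. The class below is a hypothetical stand-in, not the actual QueryGoal code:

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

final class QueryWordsSketch {
    private final Set<String> includeWords = new LinkedHashSet<>();

    QueryWordsSketch(String... words) {
        Collections.addAll(includeWords, words);
    }

    // Read-only view: callers can iterate, but remove() throws
    // UnsupportedOperationException instead of silently changing the query.
    Set<String> includeWords() {
        return Collections.unmodifiableSet(includeWords);
    }

    // Callers that need a filtered set get their own copy.
    Set<String> withoutStopwords(Set<String> stopwords) {
        Set<String> copy = new LinkedHashSet<>(includeWords);
        copy.removeAll(stopwords);
        return copy;
    }
}
```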
Michael Peter Christen
5592ea57f0
hack to remove compiler warnings about deprecated classes. It would be
...
better to remove the deprecated usage but to do this the Solr core must
adopt the latest apache http core changes as well .. this is not our
fault.
11 years ago
orbiter
037cd0a57c
using the BinaryResponseWriter which is supported within the YaCy solr
...
servlet since YaCy 1.63. This is much more performant for the client
than using the XMLResponseWriter because parsing of XML data is very CPU
intensive. Older YaCy peers are still requested using the
XMLResponseWriter but the majority of YaCy peers already respond with
the binary writer. This makes remote searches much faster and less CPU
intensive.
11 years ago
orbiter
61409788eb
less word hash computations (removing some overhead because of MD5
...
calcs) using the clear word in a normalized form.
11 years ago
reger
f23471c471
add check to prevent index entries containing url_file_ext_s with ";jsession=xyz"
...
note: the check could be implemented in MultiProtocolURL (but at this time the possible implications had not been assessed)
11 years ago
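The entry above concerns file extensions polluted by a path parameter such as ";jsession=xyz". A minimal sketch of such a check, assuming a hypothetical helper that works on the URL path only (the actual YaCy check may be placed and written differently):

```java
import java.util.Locale;

final class FileExtensionSketch {
    // Hypothetical helper: returns the lower-case file extension of a URL path,
    // ignoring anything after a ';' (for example ";jsession=xyz").
    static String fileExtension(String path) {
        int semicolon = path.indexOf(';');
        if (semicolon >= 0) path = path.substring(0, semicolon);
        // Look at the last path segment only, so dots in directories are ignored.
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```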
reger
5c4a3d1c01
Merge origin/master into jetty
11 years ago
reger
444a9ae674
remove unused options and attributes from DefaultServlet
...
cleanup obsolete class files
11 years ago
reger
8da75a4b0c
fix contentType definition for Solr html responswriter
...
from xml to html
(hint: value is currently not used, but is in SolrServlet)
11 years ago
Michael Peter Christen
ccf2f4e43b
refactoring of seed attributes (introduced more constants)
11 years ago
Michael Peter Christen
1f0bfa8fec
added test to Base64Order (runs successfully!)
11 years ago
orbiter
b7f1e5af51
added new servlet which generates the same file as the principal peers
...
upload to a bootstrap position
you can call it either with
http://localhost:8090/yacy/seedlist.html
or to generate json (or jsonp) with
http://localhost:8090/yacy/seedlist.json
http://localhost:8090/yacy/seedlist.json?callback=seedlist
11 years ago
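The callback parameter in the last URL above turns the JSON output into JSONP. How a servlet might wrap the payload can be sketched like this; the class, the method and the callback-name validation are assumptions for illustration, not the seedlist servlet's code:

```java
final class JsonpSketch {
    // Wraps a JSON document in a JavaScript function call when a callback name is given.
    static String wrap(String json, String callback) {
        if (callback == null || callback.isEmpty()) return json; // plain JSON
        // Assumed safety check: accept identifier-like names only, so the
        // parameter cannot inject arbitrary script into the response.
        if (!callback.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("invalid callback name");
        }
        return callback + "(" + json + ");";
    }
}
```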
orbiter
3e552550d1
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
orbiter
c2d720cdaf
purge a lucene cache - possible memory leak fix
11 years ago
reger
e4f49fb175
for searchresults with empty title use filename as title
...
- to not store a title in index which isn't extracted from source
the title is empty check only added to ResultEntry class
11 years ago
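The fallback described above (show the file name when a search result has no title) might look like the following sketch. The helper name and the handling of query strings are assumptions; the fallback is applied at display time only, in line with the note that no invented title should be stored in the index.

```java
final class TitleFallbackSketch {
    // Returns the document title, or the last path segment of the URL if the title is empty.
    static String displayTitle(String title, String url) {
        if (title != null && !title.trim().isEmpty()) return title;
        String path = url;
        int query = path.indexOf('?');
        if (query >= 0) path = path.substring(0, query); // drop the query string
        String name = path.substring(path.lastIndexOf('/') + 1);
        // A URL ending in '/' has no file name; fall back to the URL itself.
        return name.isEmpty() ? url : name;
    }
}
```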
reger
b1dc9a6f52
- disable Jetty servlet defaultUseCache (prevent double caching)
...
- include short memory status check for class cache in DefaultServlet
- remove obsolete Resource interface for Jetty8YaCyDefaultServlet
11 years ago
reger
f111f30ace
Merge origin/master into jetty
11 years ago
reger
94293176a3
use writeOptionHeaders with ServletResponse parameter only
11 years ago
orbiter
ff86cb683f
fixed some XSS bugs reported by Marius from http://ctf365.com/
11 years ago
orbiter
da33ee0d77
extended also timeout for webgraph postprocessing
11 years ago
orbiter
74f9e40747
extended timeout during postprocessing of 30 minutes.
11 years ago
orbiter
19a051bec8
more monitoring for postprocessing and enhanced layout in Crawler
...
monitor page
11 years ago
Michael Peter Christen
9cf9727685
fix for wrong counter
11 years ago
Michael Peter Christen
fceac8cffd
more monitoring for postprocessing
11 years ago
Michael Peter Christen
6842783761
fixed and enhanced postprocessing
11 years ago
Michael Peter Christen
219d5934a4
fixed termination bug in Solr Connector
11 years ago
Michael Peter Christen
bf1bdd52a6
prevent requesting of 0-facets (which actually exist)
11 years ago
Michael Peter Christen
9d5895f643
enhanced and fixed postprocessing
11 years ago
Michael Peter Christen
f86fe90eda
enhanced mass storage speed to remote solr servers
11 years ago
Michael Peter Christen
6ed9821209
fixed several problems in solr connectors
11 years ago
Michael Peter Christen
191fd3d7e7
added an optimization option to HandleSet mass data storage structure
11 years ago
Michael Peter Christen
94b565ea0d
fixed keepalive min value
11 years ago
reger
b26787dc2d
- DefaultServlet: remove static gzip option
...
YaCy doesn't use pre-gzip'ed static html pages
- ProxyServlet: remove unneeded procedure
- Server init: skip one overlapping servlet context
11 years ago
Michael Peter Christen
24a052ecb9
removed debug code for existsByIds
11 years ago
Michael Peter Christen
087df05e24
added option to Config_Network_p.html to enable remote search while
...
DHT-Receive is switched off.
11 years ago
Michael Peter Christen
1a4a69c226
set more logger to 'final static'
11 years ago
Michael Peter Christen
c60947360d
logger should be static
11 years ago
Michael Peter Christen
69b8d61c47
fix for search requests in GSA interface which contain 'funny'
...
characters (like ':' etc.)
11 years ago
orbiter
b085cb522b
replaced old existsByIds for embedded Solr with obviously much faster
...
new selection method (including still existing debug code to test that
this is in fact better)
11 years ago
reger
b29d262e70
implement Jetty8HttpServerImpl.generateSocketAddress
...
(code 1:1 copied from serverCore)
11 years ago
orbiter
4234b0ed6c
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
orbiter
909bbb49d8
added (partly commented) test code for url rewrite methods .. to be
...
completed
11 years ago
reger
066a1ecf0a
add highlight queryparams to solrservlet if missing
...
- modify query params in Solr parameter map (instead of querystring)
11 years ago
Michael Peter Christen
899e7e92b0
added debug code
11 years ago
reger
4684330505
Merge origin/master into jetty
...
Conflicts:
source/net/yacy/cora/federate/solr/responsewriter/HTMLResponseWriter.java
11 years ago
reger
1437c45383
merge rc1/master
11 years ago
Michael Peter Christen
87a956e881
calculating and showing the number of files and the average size of a
...
file in the HTCACHE in ConfigHTCache_p.html
11 years ago
Michael Peter Christen
acc1f8a749
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
81d9e23532
fixed another memory leak in the PDF parser:
...
the class org.apache.pdfbox.pdmodel.font.PDFont occupies 8MB of space
which cannot be cleaned if PDFont.clearResources is called.
The attempt to clean the class cache therefore causes that the class is
loaded and this cache is initialized with some rubbish. I tried to
prevent to instantiate this class by usage of a hacked findLoadedClass
call to the SystemClassLoader (which is protected ...).
Now, without using the PDF parser at all, 8MB of RAM space is not
occupied, however, when the first PDF arrives this space will be taken
and never given back to GC.
WAKE UP YOU LAZY PDFBOX HACKER AND FIX THIS SHIT!
11 years ago
Michael Peter Christen
c152d996e6
reduced footprint of BookmarksDB which can take quite a lot of memory if
...
the number of bookmarks is high (i.e. > 2000 URLs)
11 years ago
Michael Peter Christen
81bb50118e
found and fixed a huge memory leak in solr caching (inside Solr). The
...
not-flushed Solr cache is now handled in this way:
- it is smaller by default
- an Solr-internal process is started to flush the cache periodically
(this does NOT clean the cache, just removes old objects)
- a Solr-external process (the standard YaCy cleanup-process) now has
direct access to the solr internal cache and flushes them completely.
The time frame for such a flush is defined by the cleanup-process
frequency, by default 10 minutes.
11 years ago
reger
7b17cdf6dd
add content_type:image/* to image search
...
- see numerous idx entries with content_type image without url_file_ext_s (for various reason) which should be included in result
- try it yourself with following sample query
/solr/select?q=content_type:image/* AND -url_file_ext_s:[* TO *]&defType=edismax&fl=sku,url_file_ext_s,content_type
this also addresses URLs without an extension or with a deviating one.
11 years ago
reger
082c9a98c1
move writeHeaders from Jetty8 servlet to YaCyDefaultServlet
...
- after removing Jetty server dependency (of Response using HttpServletResponse only)
11 years ago
sixcooler
987f410011
URL-export:add query and fix for cast-class-exception
11 years ago
Michael Peter Christen
a8253ca49c
added missing unicode transformation in href link contents during
...
parsing
11 years ago
Michael Peter Christen
0cf9e9580b
added clickdepth and CR computation debug code to verify that the
...
process is complete
11 years ago
reger
b85f702f22
add AccessTracker logging to SolrServlet
11 years ago
reger
de1f02420b
implement HtmlResponseWriter in SolrServlet (and rss / opensearch response writers) as in the yacy select servlet.
...
- set contenttype of HTML/GrepHTML-ResponseWriter to "text/html"
- set a contenttype to GSAsearchServlet
11 years ago
Michael Peter Christen
234a974955
load image only if their parser flag is activated
11 years ago
Michael Peter Christen
b2c329929f
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
60187a4ec2
fix in html parser
11 years ago
Michael Peter Christen
e1c1e57877
less overhead calling exist() with only one hash
11 years ago
reger
3d5d366f1c
fix html header in Solr HTMLResponseWriter
...
- move 1st body content after </head> tag
- add closing <span> tag
11 years ago
reger
bfdb404867
implement a Jetty reconnect to work with Configbasic_p.html port change
...
- instead of shutting down the server it should be sufficient to manipulate the Jetty http connector
11 years ago
Michael Peter Christen
5a02d650ee
avoid cloning
11 years ago
reger
d6760df3e5
fix servlet class exist check to use default path only (in Jetty8YaCyDefaultServlet)
...
- del redundant doget code in yacydefaultservlet
- small declaration code opts
- del obsolete libt/proxyservlet.java
11 years ago
reger
b38de92a16
Merge origin/master into jetty
11 years ago
Michael Peter Christen
cc39667399
Speed enhancements and less CPU usage during Solr searches when using
...
the embedded Solr (the default). This was obtained by circumventing solrj
search encapsulation and the implementation of direct index access
methods to Solr.
The effect will not only be seen during search, but this has also a
strong effect on suggestions (much more) and less CPU power usage during
index distribution (which needs many search requests)
11 years ago
Michael Peter Christen
434e13b46d
in host browser also show the properties of failed documents including
...
referrer urls (this is a VERY USEFUL SEO and Web Admin feature!!)
11 years ago
reger
6944225037
- add GSA search /gsa/search servlet for Jetty to Server init
...
- include SecurityHandler check for /gsa/ /solr/
- change one more YaCyDefaultServlet dependency from Jetty to std. javax.Servlet
11 years ago
reger
53cb30a221
reduce logging (by assigning logger to existing logger)
...
- small additional cleanups
11 years ago
reger
332c6d4fe1
reactivate Domain handler for .yacy / .yacyh handling
11 years ago
reger
b1ce70434e
resolve merge conflict
...
- add missing import statement
11 years ago
reger
7869a4c070
Merge origin/master into jetty
...
- merge conflict resolve
11 years ago
reger
f017066197
Merge origin/master into jetty
11 years ago
reger
06da6f517c
add YaCyProxyServlet to handle /proxy.html?url=proxyurl
...
- based on Jetty ProxyServlet
- at this time use existing HTTPD ProxyHandler for url rewrite
- add jetty-client jar (dependency in Jetty ProxyServlet)
reuse ProxyHandler.convertHeaderFromJetty in YaCyDefaultServlet
11 years ago
reger
69599566f9
catch one more malformed url in proxy url rewrite
11 years ago
reger
605530fec5
catch proxy url rewrite exception
...
malformed url (" http:\/\/" ) may cause error response
testcase http://localhost:8090/proxy.html?url=http://dictionary.reference.com/browse/test
11 years ago
Michael Peter Christen
9bb7eab389
hacks to prevent storage of data longer than necessary during search and
...
some speed enhancements. This should reduce the memory usage during
heavy-load search a bit.
11 years ago
orbiter
3c3cb78555
- removed a lot of garbage and bloated code from GuiHandler.
...
- transformed log lines to String before they are stored because the
storage space is about 1:250 (45kb for one line before transformation,
180 bytes afterwards)
- this saves up to 10MB RAM so we can increase the number of lines to
1000 again.
11 years ago
Michael Peter Christen
5afa6e3aee
Automatically flush the log cache if a short memory status is reached.
...
For the default of 200 lines this can flush about 10MB.
11 years ago
Michael Peter Christen
030d0776ff
Enhanced crawl start for very, very large crawl lists (i.e. > 5000)
...
which had a problem because of badly used concurrency.
This fix also caused a redesign of the whole host deletion process.
This should fix bug http://bugs.yacy.net/view.php?id=250
11 years ago
Michael Peter Christen
6aabc4e5c8
reduced logging line memory, 10000 lines had filled up 450MB! grrr.
...
(thank you, a bomb from the past)
11 years ago
Michael Peter Christen
1a8783147b
enhanced computation of number of solr documents.
11 years ago
Michael Peter Christen
4948c39e48
added concurrency for mass crawl check
11 years ago
Michael Peter Christen
1b4fa2947d
- fixed a problem which occurred when a document was not recognized with
...
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together makes it now possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
11 years ago
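The first fix above boils down to trusting a known mime type over the file extension. A sketch of that decision order, with a hypothetical enum and extension lists that are not YaCy's own:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ContentDomainSketch {
    enum Domain { TEXT, IMAGE, OTHER }

    // Illustrative extension lists; the real ones are certainly longer.
    private static final Set<String> IMAGE_EXT = new HashSet<>(Arrays.asList("jpg", "jpeg", "png", "gif"));
    private static final Set<String> TEXT_EXT = new HashSet<>(Arrays.asList("html", "htm", "txt"));

    static Domain classify(String mimeType, String fileExtension) {
        // A known mime type wins, so "page.jpg" served as text/html is treated as text.
        if (mimeType != null && !mimeType.isEmpty()) {
            String mime = mimeType.toLowerCase(Locale.ROOT);
            if (mime.startsWith("image/")) return Domain.IMAGE;
            if (mime.startsWith("text/")) return Domain.TEXT;
            return Domain.OTHER;
        }
        // Only without a mime type does the extension decide.
        String ext = fileExtension == null ? "" : fileExtension.toLowerCase(Locale.ROOT);
        if (IMAGE_EXT.contains(ext)) return Domain.IMAGE;
        if (TEXT_EXT.contains(ext)) return Domain.TEXT;
        return Domain.OTHER;
    }
}
```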
Michael Peter Christen
82621bead0
When doing bootstrapping, always accept one seedlist-File without
...
checking the date of the file. This should help to start the peer in
case that the user has a completely wrong date setting.
11 years ago
Michael Peter Christen
691d7e70fa
added hint to development/commit rss feed
11 years ago
orbiter
20bbde8665
fix for mustmatch regex computation: result had correct semantic, but
...
may have contained multiple same expressions within the disjunction of
domain-restrictions. This fix removes the redundant restrictions and
makes the regex shorter.
11 years ago
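The mustmatch fix above removes repeated alternatives from a regex disjunction. The idea can be sketched with an insertion-ordered set; the shape of the per-host expression used here is an assumption, not the filter YaCy actually generates:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

final class MustMatchSketch {
    // Builds "alt1|alt2|..." with every alternative appearing only once.
    static String mustMatch(List<String> hosts) {
        // LinkedHashSet drops duplicates but keeps the first-seen order.
        Set<String> alternatives = new LinkedHashSet<>();
        for (String host : hosts) {
            // Assumed per-host restriction; Pattern.quote escapes the dots in the host name.
            alternatives.add("https?://" + Pattern.quote(host) + "/.*");
        }
        return String.join("|", alternatives);
    }
}
```

The resulting expression matches the same URLs as the version with duplicates, it is just shorter.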
reger
cb2dbcb843
add graceful Jetty shutdown option
...
- as Jetty stop is not synced, yet
- include jetty jars and servlet-3.0 api jar in Eclipse .classpath
11 years ago
reger
f46c723398
allow to choose used http server, YaCy-Anomic or Jetty
...
- defaults to Jetty (in this branch)
- add server version info & config option -> Admin Console -> Advanced Settings -> Http Networking
11 years ago
reger
da4ff5aefa
add YaCy HttpCommand "authenticate" check to DefaultServlet
11 years ago
Michael Peter Christen
c833d02cf5
fixed webgraph postprocessing (did nothing and repeated to do this...)
11 years ago
Michael Peter Christen
74d0256e93
enhanced postprocessing: fixed bugs, enable proper postprocessing also
...
without the harvestingkey, remove crawl profiles after postprocessing,
speed-up for clickdepth computation.
11 years ago
reger
1adb4b8741
merge rc1/master
11 years ago
reger
77a73c7475
add YaCy HttpCommand "location" check to DefaultServlet
11 years ago
Michael Peter Christen
7b69c438f7
more methods for the table class
11 years ago
Michael Peter Christen
820b896146
Replaced the inframe loading from yacy.net for donations with the
...
loading of this iframe from the local host. To make this more flexible,
this iframe is loaded once after startup from yacy.net.
11 years ago
reger
cc223b14a4
remove wrong content mod in SSI parser for virtual path /currentyacypeer/
...
(is handled on start of request handling)
11 years ago
reger
5606291574
fix last commit (not needed test of GZipInputStream)
11 years ago
reger
f9eed8cb44
add support for gzip encoded multipart forms (needed for transferRWI.html)
...
- quick and dirty reuse of existing HTTPDemon implementation
11 years ago
reger
cf32a92629
- add size check to multipart form data handling of YaCyDefaultServlet (same as in HTTPDemon.parseMultipart)
...
- reduce Jetty logging
- give build.run a bit more memory (set to YaCy.default 600m from 512m)
11 years ago
reger
705f147820
- add localpeername.yacy to list of local address detection for AbstractRemoteHandler
...
- use proxy via header info as in legacy proxy handler
11 years ago
reger
0d4efabaa8
fix YaCy version string in proxy headers
...
(config parameter vString no longer used)
11 years ago
reger
2226189743
disable domainhandler due to error
...
- domainhandler causes closed response output stream in following handlers
on addresses resolved to local peer (as in the hello protocol, preventing the peer from switching to senior peer)
11 years ago
reger
eea504c117
update Info.plist
...
small DefaultServlet refactoring
11 years ago
reger
a44eede8b8
merge rc1/master
11 years ago
sixcooler
d9a02ed277
NPE fix for my last commit
11 years ago
reger
54a0272338
searchpage javascript (latestinfo) causes reset of search statistic after moving to next page
...
- disabled call via setTimeout in yacysearch.html
11 years ago
sixcooler
61f627eb85
fix for ssl-connections from proxy-usage staying in close-wait-state
...
+ some extra 'close' in HttpClient
11 years ago
Michael Peter Christen
d328cc4a83
fix for didyoumean, added also more asian alphabets
11 years ago
Michael Peter Christen
90c8577840
enhanced ranking; patches to replace old ranking
11 years ago
reger
e74f548551
make legacy http server (serverCore) implement YaCyHttpServer interface
11 years ago
reger
71d2655c02
downgrade to Jetty 8 to assure support of JRE 1.6
...
- introduce a YaCyHttp interface to modularize/separate http server
- adjust the Jetty version specific implementation part (in package net.yacy.http)
- putting the version specific code in classes starting with Jetty8xxxx
- moved existing Jetty9xxx implementation into a test class (to keep the code)
- adjust build to the changed jars
- make use of the introduced YaCyHttpServer interface in related htroot servlets
- adjust other test cases/classes
11 years ago
Michael Peter Christen
1b61bd40ed
- Added new solr field url_file_name_tokens_t which stores the file name
...
tokens. This can be used to enhance the ranking.
- Added also a rating_i field as basis for later usage.
- enhanced the tokenization process.
11 years ago
orbiter
6efa7532d2
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
orbiter
5f5a97bafc
added the anchor text within web pages to the searchable entities of a
...
web page. This can be of benefit for the ranking if these fields are
used for boosts.
11 years ago
orbiter
705b3338ee
list more fields available for search and for ranking boosts
11 years ago
sixcooler
d536092fe4
fix false filling of the NAME_CACHE_MISS DNS cache in case of a timeout
...
e.g. caused by massive requests when crawling from file
11 years ago
Michael Peter Christen
78e7aadb26
removed unused initialization method
11 years ago
Michael Peter Christen
4fbc4740df
removed warnings
11 years ago
Michael Peter Christen
21aa6a0321
migration to Solr 4.5.0
11 years ago
Michael Peter Christen
ef31d0f279
fix for rss reader, see http://bugs.yacy.net/view.php?id=294
11 years ago
Michael Peter Christen
101a6e6e14
Patch the citation index for links with canonical tags.
...
This shall fulfill the following requirement:
If a document A links to B and B contains a 'canonical C', then the
citation rank computation shall consider that A links to C and B does
not link to C.
To do so, we first must collect all canonical links, find all references
to them, get the anchor list of the documents and patch the citation
reference of these links.
11 years ago
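The requirement above (A links to B, B declares canonical C, so A is counted as linking to C and B is not counted as linking to C) can be expressed as a small graph rewrite. The sketch below uses plain in-memory maps as a stand-in for the citation index, which is an assumption made purely for illustration:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class CanonicalCitationSketch {
    // links: source document -> documents it links to
    // canonical: document -> canonical URL it declares (absent if none)
    static Map<String, Set<String>> patch(Map<String, Set<String>> links, Map<String, String> canonical) {
        Map<String, Set<String>> patched = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : links.entrySet()) {
            String source = entry.getKey();
            Set<String> targets = new HashSet<>();
            for (String target : entry.getValue()) {
                // A link to B is redirected to B's canonical C, if B declares one.
                String resolved = canonical.getOrDefault(target, target);
                // B itself must not be counted as citing its own canonical C.
                if (resolved.equals(canonical.get(source))) continue;
                targets.add(resolved);
            }
            patched.put(source, targets);
        }
        return patched;
    }
}
```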
reger
daebeb93aa
add call to AccessTracker to jetty security handler
11 years ago
reger
172aefaeeb
adjust YaCySecurityHandler to Jetty 9 conventions
...
- mainly adjust prepareConstraintInfo to use the RoleInfo.setChecked as in Jetty Source distribution
- use constraint check behavior as in ConstraintSecurityHandler
see http://git.eclipse.org/c/jetty/org.eclipse.jetty.project.git/tree/jetty-security/src/main/java/org/eclipse/jetty/security/ConstraintSecurityHandler.java?id=jetty-9.0.5.v20130813
11 years ago
reger
6f9ed439d3
- expand localHostName check of AbstractRemoteHandler
...
to prevent the request from being handled as a proxy request
- make domain handler not rely on included path in resolved .yacy address
11 years ago
reger
561ea135af
fix : forgot adding security handler
11 years ago
reger
c7c706fd9f
merge with rc1/master
11 years ago
reger
272b196d05
update Jetty server init() to activate yacy-domain and transparent proxy handler
...
- adding domain & proxy handler to a context (as it was in initial design)
(context required for dispatcher)
- make handler context and servlet context parallel available
(to allow use of YaCyDefaultServlet to handle legacyServlets)
- set transparent proxy request handled after dispatch.forward to skip further handling for .yacy domain requests
11 years ago
reger
fd119deb00
fix NPE on modified since check ( Response.requestHeader allowed to be null)
11 years ago
reger
66145a0410
- add welcome file (index.html) support to YaCyDefaultServlet
...
- change SolrServlet default search field (&df) to text_t
11 years ago
Michael Peter Christen
b28d43decc
added two more fields source_cr_host_norm_i,target_cr_host_norm_i in
...
webgraph and an addition to postprocessing to copy all cr ranking
attributes to the link edges associated to the postprocessing documents
11 years ago
Michael Peter Christen
a52f3a597e
fix for canonical-from-http-header feature
11 years ago
Michael Peter Christen
2dd7c5be44
added parsing of http-canonical tags (untested, could not find an
...
example page)
11 years ago
Michael Peter Christen
4476dea5ba
do not fail if a wrong boost key is used; instead, print only a warning
...
See also: http://bugs.yacy.net/view.php?id=293
11 years ago
reger
ab9583d429
add default field (&df) to SolrServlet query if missing
11 years ago
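Adding a default search field only when the request lacks one is a one-line map operation. The sketch below uses a plain Map rather than Solr's own parameter classes, and takes text_t as the default because a newer entry in this log names it. Both choices are assumptions for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DefaultFieldSketch {
    // Default taken from a newer log entry ("change SolrServlet default search field (&df) to text_t").
    static final String DEFAULT_FIELD = "text_t";

    // Returns a copy of the query parameters with df set if the caller did not supply it.
    static Map<String, String> withDefaultField(Map<String, String> params) {
        Map<String, String> copy = new LinkedHashMap<>(params);
        copy.putIfAbsent("df", DEFAULT_FIELD); // an explicit df is left untouched
        return copy;
    }
}
```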
Michael Peter Christen
3bf0104199
fix for crawl domain counter limitation (limit was reached too early)
11 years ago
Michael Peter Christen
82bfd9e00a
- crawl profiles shall be deleted from active and passive stacks if they
...
are deleted to terminate the crawl because otherwise the crawl will go
on after the load-from-passive stack policy.
- better check if a crawl is terminated using the loader queue.
11 years ago
Michael Peter Christen
1b3d26dd23
hack to remove most of the warning: deprecated messages (but not all,
...
one is left)
11 years ago
Michael Peter Christen
a496313248
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
sixcooler
3c48fc65fd
reverted RemoteInstance to deprecated methods of httpClient-4.2
...
this should work with current remote-Solr-Instances
11 years ago
Michael Peter Christen
91a875dff5
self-healing of mistakenly deactivated crawl profiles. This fixes a bug
...
which can happen in rare cases when a crawl start and a cleanup process
happen at the same time.
11 years ago
Michael Peter Christen
095053a9b4
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
sixcooler
0cae420d8e
some dns-timing changes:
...
since httpclient uses the domain-cache it is useful not to clean the
domain cache while crawling is running (domains are filled into this
cache)
On huge crawl-starts (eg. from file) my DNS did not follow the high
rates - so I reduced the rate and give some more time(-out)
11 years ago
sixcooler
15b1bb2513
bump to httpClient-4.3
11 years ago
Michael Peter Christen
4f83d5f18c
added the new field harvestkey_s to the collection index and the
...
webgraph index which is temporarily filled with the crawl profile key.
This is used to select a set of documents for post-processing as soon as
a crawl is finished. Now the postprocessing for a specific crawl is
started when that specific crawl is finished and not at the end of all
post-processing steps.
11 years ago
orbiter
14442efa6d
when profiles are cleaned, there shall be first a callback showing which
...
profiles are cleaned. This shall enable a profile-termination-driven
postprocessing. To do this, index writings must carry the profile key
which will be implemented in another (next) step.
11 years ago
orbiter
0013d0d0bb
removed superfluous class
11 years ago
orbiter
f90d5296cb
Added new data structure to be used by the balancer (not used yet).
...
These data structures will enable the balancer to store the crawl queue
into individual queues, one each for a single host.
11 years ago
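Storing the crawl queue as one queue per host, as the entry above plans, could be sketched as follows. The class and its methods are hypothetical, and the real balancer data structure is presumably persistent rather than held in memory:

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

final class HostQueuesSketch {
    // One FIFO queue per host, created on first use.
    private final Map<String, Queue<String>> queues = new LinkedHashMap<>();

    void push(String url) {
        String host = URI.create(url).getHost();
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    // Next URL for the given host, or null if that host has nothing queued.
    String pop(String host) {
        Queue<String> queue = queues.get(host);
        return queue == null ? null : queue.poll();
    }

    int hostCount() {
        return queues.size();
    }
}
```

With this layout a balancer can pick the next host first (for example by politeness delay) and only then take a URL from that host's queue.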
orbiter
0e8d752462
refactoring
11 years ago
orbiter
8ac2e8c8c9
added location navigator which makes the image linking to the map search
...
visible whenever a location is available in the search result.
To activate this, the search.navigation property in yacy.conf must be
modified to the new default values.
11 years ago
orbiter
d86d2be5c3
automatically removed Places autotagging if no location library is
...
wanted
11 years ago
orbiter
214a087cdf
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
96ed0c980e
- added hosthash to all documents (also fail documents which is needed
...
there for deletion), this fixes a problem for the deletion of old
documents for new crawl starts
- added clickdepth and citation computation for fail documents
11 years ago
Michael Peter Christen
179ad281f9
close include byte buffer after usage
11 years ago
reger
52dd491c04
fix unnecessary use of DigestURL
11 years ago
reger
6b9a624808
remove double declaration of TLD_any_zone_filter
11 years ago
reger
5111841e5b
- reduce Jetty debug logging
- fix Context path initialization
11 years ago
reger
bc6ebb3c06
adjust to DigestURI changes from master to DigestURL
11 years ago
reger
561cbc7ee2
use more YaCy HeaderFramework constants (instead of Jetty's)
11 years ago
reger
5c4ba9b5db
merge rc1 master
11 years ago
reger
70c51775ae
Merge remote-tracking branch 'origin/master' into jetty
11 years ago
reger
4b77733e59
implement a YaCyDefaultServlet to handle YaCy-servlets within Jetty server
- the implementation is inspired by Jetty's DefaultServlet
- handles static html content and YaCy servlets
- translates between standard servlet request/response and YaCy request/response specification
Implementing YaCy servlets as a servlet instead of via a Jetty handler brings the code closer to the servlet standard and carries fewer Jetty-specific dependencies.
11 years ago
orbiter
d2effd21db
fix for npe during location search
11 years ago
orbiter
828603e4f1
fix for 100%CPU problem in error cache cleaning process
11 years ago
orbiter
c64b51134e
hack to add all tokens from the url to text_t. This was working for the
RWI index (and still is working) but not for solr-only search indexes.
Maybe we should find a solution using a separate search field instead.
11 years ago
orbiter
6e8377b8ad
do not check all words with synonym library if the library is empty
11 years ago
orbiter
70ba74b23a
disabled ipv4 preference to enable ipv6-only networks like freifunk
11 years ago
orbiter
f3be1930cb
CPU problem when pushing to the error cache; wrong class,
ConcurrentHashMap needed for concurrency
11 years ago
Michael Peter Christen
e40671ddb7
better and consistent deletions for error urls
11 years ago
Michael Peter Christen
2602be8d1e
- removed ZURL data structure; removed also the ZURL data file
- replaced load failure logging by information which is stored in Solr
- fixed a bug with crawling of feeds: added must-match pattern
application to feed urls to filter out such urls which shall not be in a
wanted domain
- delegatedURLs, which also used ZURLs, are now temporary objects in
memory
11 years ago
Michael Peter Christen
31920385f7
set anchor rel attribute of all links to "nofollow" if the html meta
contains a robots:nofollow or if the http header contains a
"X-Robots-Tag: nofollow"
11 years ago
reger
9619b8743c
add Solr Servlet
11 years ago
Michael Peter Christen
57e00baf26
fix for parsing of image links inside of anchor links (image-links)
11 years ago
Michael Peter Christen
61c5e40687
- replaced the properties object in AnchorURL with distinct variables
for anchor attributes.
- as a result, large portions of the parser code had to be adapted
as well
- added a counter target_order_i for anchor links in webgraph
computation
11 years ago
Michael Peter Christen
3ea9bb4427
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
5e31bad711
- the webgraph shall store all links which appear on a web page and not
all unique links! This made it necessary to adapt a large portion of the
parser and link processing classes to carry a different type of link
collection, one that carries property attributes attached to web
anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI, have been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
11 years ago
reger
13fc86c960
Merge remote-tracking branch 'origin/master' into jetty
11 years ago
reger
f7f86d8a5d
update to Jetty 9 jars
- include javax.servlet 3.0
11 years ago
reger
603368fc3e
remove redundant declaration of USER_AGENT
11 years ago
reger
bd71b14d25
add mandatory p2p parameter to templatePattern
11 years ago
reger
b8da176c5d
adjust setHandled to request of call parameter
11 years ago
reger
127adbf5cf
remove references to 10_http thread (legacy http server)
and add needed get/set function to jetty http server wrapper
11 years ago
Michael Peter Christen
1a8c64117f
decreased the responseHeaderDB database which is now flushed more
frequently. This will preserve more documents in the cache in case of a
crash.
11 years ago
reger
36b7159282
- remove double initialization of jetty
- refactor some var assignments
11 years ago
reger
63ed04260a
Merge remote-tracking branch 'origin/master' into jetty
11 years ago
Michael Peter Christen
35ab2cef7b
added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
html meta fields to get a correct (or: better) date timestamp. The
http:last-modified mostly does not work because it is set to the current
date by most CMS.
11 years ago
reger
2ee68f76f6
added reading of parameters from multi-part form fields (as a nasty quick-fix)
11 years ago
Michael Peter Christen
9cc8468b30
added tools to visualize image generation (i.e. during testing)
11 years ago
reger
105cf8f593
changes to adjust jetty to recent code changes
11 years ago
reger
aafef72a8a
merged current rc1/master into jetty branch to allow further development with latest version
ServerSideIncludes and servlet return values need further work (for working jetty integration)
- TODO: added nasty quickfix to allow SSI - needs further work
- TODO: YaCy servlet return values/parameters are not handled
11 years ago
Michael Peter Christen
dbef8ccfcb
forced deletion of ZURL entries for a specific host for each host that
appears in the crawl url list
11 years ago
Michael Peter Christen
e137ff4171
refactoring (in preparation for new removeHost method)
11 years ago
Michael Peter Christen
7a5574cd51
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
85456f46b2
added two new fields, exact_signature_copycount_i and
fuzzy_signature_copycount_i, which count the number of copies of
non-unique documents and assign this count to each document. Thus, each
document is assigned a number which shows how many copies of this
document exist.
These fields are disabled by default.
11 years ago
orbiter
26366596d9
fix for a problem which occurs when a site is crawled where the start
url is redirected.
11 years ago
Michael Peter Christen
a2511b5600
turned images_alt_txt back to images_alt_sxt because it is not necessary
to index the alt text. Indexed image Text is in images_text_t
11 years ago
Michael Peter Christen
85b1922244
activated image type navigation for image search
11 years ago
Michael Peter Christen
9e12fdff23
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
ab1201fdfd
fixed wrong facet count
11 years ago
Michael Peter Christen
049c3b3f2e
added an option to exclude image search results from text search. This
is on by default.
11 years ago
Michael Peter Christen
69f85265e1
added an option to put image links to the crawl queue and handle these
like normal documents. Using this option (by default on at this moment;
this might change soon) it is possible to get the exif data into the
search index to be used in image search.
11 years ago
Michael Peter Christen
e8e558a9b7
fix for content domain classification in URIMetadataNode
11 years ago
Michael Peter Christen
a8c5bfcf58
avoid creating unnecessary objects
11 years ago
Michael Peter Christen
5a0de1b77d
moving image description text to image text field
11 years ago
Michael Peter Christen
dc179bd61f
fix for catchall query goal for image search
11 years ago