orbiter
f4f6551c66
better handling of time-out at solrj in case that a commit is done in a
fail-over case during add
12 years ago
Michael Peter Christen
07261fe274
Merge remote-tracking branch 'nutomics/blacklist_structure'
12 years ago
Michael Peter Christen
dea71851d2
- better concurrency for network scanner
- network scanner can now start from the list of all hosts in the search
index
12 years ago
Michael Peter Christen
a34e137e27
fix for citation index generation in case that entry.referrerhash() is
null. This is especially the case if ftp sites are crawled
12 years ago
Michael Peter Christen
a2c8116a8f
accept (but ignore) a '+' sign in front of search words
12 years ago
orbiter
9f0cc9b401
enhanced network scanner
- textarea input field can now be used to paste in a large list of hosts
- /31 subnet is possible (only one host)
- auto-detect subdomains for ftp and www subdomains
12 years ago
sixcooler
308d73f855
do not use remote proxy if not switched on - regardless of the proto
12 years ago
sixcooler
69906b1d2e
Revert "do not use remote proxy if not switched on - regardless of the proto"
This reverts commit 20f452d228.
12 years ago
sixcooler
20f452d228
do not use remote proxy if not switched on - regardless of the proto
12 years ago
sixcooler
9551720d5c
re-enable saved setting for proxy-crawl-profile
12 years ago
sixcooler
d5d8936f9d
For indexes that are changing rapidly in NRT situations, fcs (stands for
Field Cache per Segment) may be a better choice than the default fc.
(saves memory)
see: http://wiki.apache.org/solr/SimpleFacetParameters#facet.method
12 years ago
Felix Ableitner
44f8fcf62e
Changed class structure of Blacklist.
12 years ago
Michael Peter Christen
57ffdfad4c
added a crawl option to obey html-meta-robots-noindex. This is on by
default.
12 years ago
Michael Peter Christen
5a5d411ec0
new robots_i attribute fields
12 years ago
Michael Peter Christen
fa08bd9d5a
hack to prevent long waiting times in crawler
12 years ago
Michael Peter Christen
f1c5338210
preparation for greedy crawl profiles and refactoring
12 years ago
Michael Peter Christen
e6f361f474
adding the canonical tag to crawl queues
12 years ago
reger
a6bf44212e
bugfix: location (lat/lon) meta data retrieval (Double.NaN check)
12 years ago
Michael Peter Christen
203921006a
redesign of citation index storage
12 years ago
reger
83763ee4a4
jpeg parser: extract GPS location from meta data
12 years ago
Michael Peter Christen
32aa1d4569
removed unused option for queries
12 years ago
Michael Peter Christen
9d291764d1
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
12 years ago
sixcooler
e5abccdfe4
added optimize-option
12 years ago
Michael Peter Christen
64140f35cd
fix for solr requests if no query part is given (prevent npe)
12 years ago
Michael Peter Christen
8caaf6203a
fixed false multiple-generation of remote facet search which
caused high cpu usage on remote side.
12 years ago
Michael Peter Christen
823ae4d6a7
added url_protocol_s to error documents
12 years ago
Michael Peter Christen
660a196989
refactoring
12 years ago
Michael Peter Christen
c4538d8d91
added metadata-extractor-2.6.2.jar to eclipse classpath, removed old lib
12 years ago
reger
3760e2616b
bump up lib/metadata-extractor-2.6.2.jar (used for image parser) with needed code adjustments
12 years ago
Michael Peter Christen
9a6fcdf597
npe fix
12 years ago
Michael Peter Christen
16d1d744fa
added url_file_name_s in default collection schema for the file name
without the file extension. This part of the file path is removed from
the multi-field url_paths_sxt, which no longer has the file name as the
last part of the path list.
The same applies to the new fields source_file_name_s and
target_file_name_s in the webgraph schema.
12 years ago
reger
8d1c4c423d
make imageparser file extension detection case-insensitive (extensions are often upper case)
12 years ago
Michael Peter Christen
f9d859f5dc
now writing image alt texts and (camelcase-)parsed urls into a text
search field for a better image retrieval
12 years ago
Michael Peter Christen
e441a9d4c8
to avoid confusion, the gsa api is available at /search? and
/searchresult?
12 years ago
orbiter
8792e6c6e9
stub for better image indexing
12 years ago
orbiter
97f2ac9091
added hint to gsa response writer that the result comes from a yacy peer
12 years ago
Michael Peter Christen
14186e815e
npe fix
12 years ago
Michael Peter Christen
bdf306e0a7
increased time-out for loading of seed-lists
12 years ago
Michael Peter Christen
374d2e2a52
removed warning message during crawling
12 years ago
Michael Peter Christen
570511f3c8
removed fields references_internal_id_sxt and
references_internal_url_sxt because they had been shown to be
superfluous. The citation of referrers in the host browser is possible
without them. Therefore the host browser now shows not only internal
but also external referrers for each link.
12 years ago
Michael Peter Christen
fd1776a3b0
added a new 'Citations' function: each search result item can now be
explored for citations within other documents. A click on the
'Citations' link shows an analysis with all text lines in the document
each with a complete list of documents which contain the same line. A
second section shows the linking documents in ascending order of number
of citations from the original document. Because documents from
different hosts are most interesting here, they are listed at the top of
the page as possible 'copypasta' source.
12 years ago
Michael Peter Christen
fc3ff92c69
npe fix
12 years ago
Michael Peter Christen
1762911f57
added synchronizations and timeouts in solr api; missing
synchronizations in index modification methods cause deadlocks inside
solr.
12 years ago
Michael Peter Christen
3e1e358fdc
calling pdf cache flush on class initialization because calling of the
methods during runtime can conflict with dynamic solr class loader and
cause a deadlock (seriously!)
12 years ago
Michael Peter Christen
291912ee52
removed misleading http accessGranted message (this is only for
debugging)
12 years ago
Michael Peter Christen
2fd7bbb450
reduced load on solr; no seed update in Status and no exists-check in
HTTPLoader in case of redirects, that can be done using the htcache.
12 years ago
Michael Peter Christen
2648b42b27
added fixed clear method as public method
12 years ago
Michael Peter Christen
ffc570f95f
removed forced soft commit since this may be the cause for a performance
problem
12 years ago
Michael Peter Christen
6115bef335
added a 'greedy learning' mechanism which causes a 'fresh' yacy to
load linked web pages from search results until the total number of
web pages reaches 15000. This shall give fresh peers a 'boost' to get
a personalized search index faster.
12 years ago
Michael Peter Christen
f24574b3da
use a greeting line which does not sound so beta
12 years ago