Michael Peter Christen
9c6228d948
fix for deadlocks in crawler
11 years ago
Michael Peter Christen
10cf8215bd
added crawl depth for failed documents
11 years ago
Michael Peter Christen
7fefebaeca
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
c2f62e783f
- better subgraph handling, less overhead for crawls without the
webgraph
- usage of crawler crawldepth cache for the linkgraph target depth
computation
11 years ago
Michael Peter Christen
06afb568e2
new Strategies in Balancer:
- doublecheck cache now records the crawl depth as well
- doublecheck cache is available from the outside (made static)
- no more need to crawl hosts with lowest depth first, instead all hosts
which have only singleton entries are preferred to reduce the number of
files.
11 years ago
Michael Peter Christen
1aea01fe5b
fix for Table in case that requested file does not exist and paths also
do not exist
11 years ago
reger
710054bb37
implement gzip input handling directly in defaultservlet
(making reference to legacy httpdemon obsolete)
11 years ago
Michael Peter Christen
9a5ab4e2c1
removed clickdepth_i field and related postprocessing. This information
is now available in the crawldepth_i field which is identical to
clickdepth_i because of a specific crawler strategy.
11 years ago
Michael Peter Christen
da86f150ab
- added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into its own queue. The primary
rule for urls taken from any queue is that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermore, the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create tens of thousands of files. To be able to use that many file
readers, it was necessary to implement a new index data structure which
opens the file only if an access is wanted (OnDemandOpenFileIndex). The
usage of such an on-demand file reader shall prevent the number of file
pointers from exceeding the system limit, which is usually about 10,000
open files. Some parts of YaCy had to be adapted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adapted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.
11 years ago
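The queue layout described in da86f150ab (one queue per host and crawl depth, always taking a url of minimal depth) can be sketched roughly as follows. This is an illustrative in-memory sketch only, not the HostBalancer code: the class HostQueuesSketch, its push/pop methods and the round-robin tie-breaking between hosts are assumptions, and the on-disk files and OnDemandOpenFileIndex are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not YaCy's HostBalancer: one FIFO per host and crawl depth.
class HostQueuesSketch {
    // "host:port" -> (crawl depth -> urls); the TreeMap keeps depths in ascending order
    private final Map<String, TreeMap<Integer, Deque<String>>> queues = new LinkedHashMap<>();

    void push(String hostPort, int depth, String url) {
        queues.computeIfAbsent(hostPort, h -> new TreeMap<>())
              .computeIfAbsent(depth, d -> new ArrayDeque<>())
              .addLast(url);
    }

    // Returns a url with the smallest crawl depth over all hosts, or null if nothing is queued.
    String pop() {
        String bestHost = null;
        int bestDepth = Integer.MAX_VALUE;
        for (Map.Entry<String, TreeMap<Integer, Deque<String>>> e : queues.entrySet()) {
            int depth = e.getValue().firstKey(); // host maps are never left empty
            if (depth < bestDepth) {
                bestDepth = depth;
                bestHost = e.getKey();
            }
        }
        if (bestHost == null) return null;
        TreeMap<Integer, Deque<String>> host = queues.remove(bestHost);
        Deque<String> queue = host.get(bestDepth);
        String url = queue.pollFirst();
        if (queue.isEmpty()) host.remove(bestDepth);
        // Assumption: re-append the host at the end so hosts of equal depth take turns.
        if (!host.isEmpty()) queues.put(bestHost, host);
        return url;
    }
}
```

Because no url of depth n+1 is handed out while a url of depth n is still queued, the depth at which a document is first loaded matches its click depth from the start url, which is the property the commit message relies on.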
Michael Peter Christen
075b6f9278
refactoring of the crawl balancer: the balancer is turned into an
interface and the old balancer class is moved into LegacyBalancer to
make room for a fresh implementation of a crawl balancer.
11 years ago
Michael Peter Christen
8470dfe3f8
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
reger
46016fa153
autoupdate fails to download latest release (1.71) due to default release blacklist
- removed the default version blacklist regex from init (for future versions)
!!! left existing update blacklist setting untouched !!!
(existing installation wanting autoupdate for 1.71 need to change blacklist in ConfigUpdate_p.html)
- moved old blacklist patch to migration.java
11 years ago
Michael Peter Christen
8aeef73d49
fix for virtual root nodes
11 years ago
Michael Peter Christen
7c7fbb9818
find depth-matches also for edge targets
11 years ago
Michael Peter Christen
dd12dd392f
introduction of a data structure for HyperlinkEdges which should use
less memory as it does no double-storage of source links for each edge
of the graph.
11 years ago
Michael Peter Christen
6ea8bb7348
using MultiProtocolURL for edge data which is faster (hash computation
is now much easier) and smaller in size
11 years ago
Michael Peter Christen
b21c208b4d
enhanced hashcode computation for MultiProtocolURL
11 years ago
Michael Peter Christen
ce1d1b2fa0
fix for maximum tag length in parser
11 years ago
Michael Peter Christen
17e0956312
refactoring of SystemLoad calls (only one backend tool)
11 years ago
Michael Peter Christen
a37d067692
refactoring
11 years ago
orbiter
95780eed32
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
Michael Peter Christen
67beef657f
strong redesign of html parser: object recursion is now made using a
stack on html tag objects, not using a recursive parse-again method
which may cause bad performance and huge memory allocation. The new
method also produced better parsed image objects with exact anchor text
references.
11 years ago
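The idea in 67beef657f of tracking nested tags on an explicit stack instead of re-parsing recursively can be illustrated with a small sketch. Everything here is hypothetical (the class TagStackSketch, the Tag holder, the simplistic token regex) and it is far simpler than the real html parser; it only shows how anchor text can be collected while tags are pushed and popped.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: collect the text of <a> elements with an explicit tag stack.
class TagStackSketch {
    static final class Tag {
        final String name;
        final StringBuilder text = new StringBuilder();
        Tag(String name) { this.name = name; }
    }

    // group 1: "/" for closing tags, group 2: tag name, group 3: plain text
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)");

    static List<String> anchorTexts(String html) {
        Deque<Tag> stack = new ArrayDeque<>();
        List<String> anchors = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(3) != null) {
                // text belongs to every tag that is currently open
                for (Tag t : stack) t.text.append(m.group(3));
            } else if (m.group(1).isEmpty()) {
                stack.push(new Tag(m.group(2).toLowerCase()));
            } else {
                String name = m.group(2).toLowerCase();
                boolean open = false;
                for (Tag t : stack) {
                    if (t.name.equals(name)) { open = true; break; }
                }
                if (!open) continue; // stray closing tag, ignore
                Tag t;
                do {
                    // unclosed inner tags (e.g. <img>) are popped along the way
                    t = stack.pop();
                    if (t.name.equals("a")) anchors.add(t.text.toString().trim());
                } while (!t.name.equals(name));
            }
        }
        return anchors;
    }
}
```

The stack grows only with the nesting depth of the document, so deeply nested pages do not cause repeated parsing of the same content.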
Michael Peter Christen
6bd8c6f195
fix for wrong status codes of error pages
11 years ago
Michael Peter Christen
9e503b3376
also delete the robots.txt file from the cache when a new crawl is
started
11 years ago
orbiter
67501c9dda
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
Michael Peter Christen
1c21b3256d
fix for robots.txt handling: delete old entry before starting a new
crawl.
11 years ago
orbiter
c250fac9f4
linkstructure refactoring to get more options for clickdepth analysis
11 years ago
Michael Peter Christen
8068e68474
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
bd886054cb
new structure and enhancements for link graph computation:
- added order option to solr queries to be able to retrieve document
lists in specific order, here: link length
- added HyperlinkEdge class which manages the link structure
- integrated the HyperlinkEdge class into clickdepth computation
- extended the linkstructure.json servlet to show also the clickdepth
and other statistic information
11 years ago
reger
f326a67561
fix: typo in default charset in metadata2solr
update pom and NB build to Solr 4.7.1 libs
11 years ago
Michael Peter Christen
df138084c0
do solr optimization independently from memory and load constraints:
- not doing an optimization will likely cause a too many files exception
- without optimization performance will be even worse which would
prevent optimization in the future as well (prevent a deadlock
situation)
11 years ago
Michael Peter Christen
ebd44a7080
replaced solr 4.6.1 with solr 4.7.1 and added index migration to
lucene_47
11 years ago
Michael Peter Christen
734778c0c8
fixed a time-out problem in the default servlet which is also a logging
problem because the error log showed the wrong reason (file not found)
instead of the actual reason (time-out).
11 years ago
Michael Peter Christen
466d90ad42
fixed a problem with resource observer; probably coming from uncaught
exceptions within the apache library which appear only in concurrency
environments.
11 years ago
Michael Peter Christen
e8ddd415a8
enhanced the new link structure graph
11 years ago
Michael Peter Christen
926d28dd3f
fixed a bug which prevented crawl starts after a network switch
11 years ago
Michael Peter Christen
3ce8eff21b
another fix for inbound/outbound detection
11 years ago
Michael Peter Christen
d4b5c457e4
NPE fix
11 years ago
Michael Peter Christen
36a66b0704
fix for parsing of numeric value in case that boolean values are given
11 years ago
orbiter
41730c8048
better logging in template engine: shows filename of servlets where
errors in templates occur
11 years ago
orbiter
3c1274057d
fixed thread dump in case of wrong seeds
11 years ago
orbiter
18f9c40302
moved Edge class out of linkstructure servlet as this does not work on
non-eclipse driven environments (all non-dev cases)
11 years ago
orbiter
de95e5e524
reduced search activity corona strength in network image
11 years ago
reger
da413af664
move baseurl after parsing orig source in urlproxyservlet
to calculate absolute href links for rewrite from unmodified source.
11 years ago
reger
af6ad20728
fix: remove obsolete ref to yacy.home
(use Switchboard instead)
11 years ago
Michael Peter Christen
74ab094587
fix for solr query size; too many documents had been retrieved in case
that less than _pagesize_ had been requested.
11 years ago
Michael Peter Christen
c64c10ef00
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
48fbfa60c1
bugfix to inbound/outbound identification
11 years ago
reger
227c42bc96
eliminate obsolete URIMetaDataRow class
by joining it with/into URIMetaDataNode.
11 years ago
Michael Peter Christen
cca851a417
introduced new solr field crawldepth_i which records the crawl depth of
a document. This is the upper limit for the clickdepth_i value which may
be shorter in case that the crawler did not take the shortest path to
the document.
11 years ago
orbiter
b1ba764d81
fix for first start options and added german translation for popup texts
11 years ago
orbiter
429a874222
- added COLS field in GSA response (non-gsa standard by customer
request)
- updated document link in GSA response writer
11 years ago
Michael Peter Christen
1b9ec9a1c5
- added popover to p2p/stealth mode button to explain the peer mode and
privacy issues.
- added popover to first-time use case to explain that specific servlets
are only visible after customization and/or crawl starts
11 years ago
Michael Peter Christen
62a36fa584
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
reger
c9f92abddc
fix: application link count
(URIMetadataNode)
11 years ago
Michael Peter Christen
a267c46e1a
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
5b83887da8
npe fix
11 years ago
Michael Peter Christen
63c9fcf3e0
free configuration of postprocessing clickdepth maximum depth and time
11 years ago
Michael Peter Christen
39b641d6cd
added tutorial mode - some menu items will only appear if you 'qualify'
for them. Thus, the first-time user will only see four menu items. The
other items will unfold as the user interacts.
11 years ago
sixcooler
f06775850f
fix receiving DHT / parse multipart
+ another close to fix possible resource leak warning
11 years ago
reger
49e76a1c55
make use of detected charset in htmlParser if none is given.
11 years ago
reger
e11504309f
adding a hint to javascript browser short cut on Url-Proxy page (AugmentedBrowsing_p.html)
11 years ago
reger
b12200cafe
alternative UrlProxyServlet (for /proxy.html) using different url rewrite rules
- use JSoup parser for selective rewrite of html body <a href= links only,
instead of regex which rewrites also header href/src links
- this improves display of pages which use header <base> tag
- tags with src attribute are taken from original location (like css) improving display and are not routed through the indexer
Disadvantage: scripting links will drop out of proxy
The servlet is selected exclusively through web.xml (so one can quickly switch back to the YaCyProxyServlet,
whose existing code is left untouched and available)
11 years ago
reger
2953ebe701
fix: port in local target address
& button style
11 years ago
Michael Peter Christen
fda591695c
fixed visibility of custom icon
11 years ago
Michael Peter Christen
a9b9950d7f
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
b488f33975
added close to fix possible resource leak warning
11 years ago
Michael Peter Christen
56710ecb26
prevent opening of new files as that could be a cause for the latest
too-many-open-files exception. The old file is just truncated if the
table is cleaned.
11 years ago
Michael Peter Christen
8b44fcf0f4
added missing @Override annotation
11 years ago
reger
d7055904a6
fix: proxyservlet path header setting
11 years ago
Michael Peter Christen
e515dd460d
added linkscount_i and linksnofollowcount_i to the default solr schema
11 years ago
Michael Peter Christen
1a764135be
one more Thread Dump fix for new bootstrap css style
11 years ago
Michael Peter Christen
bb21d825f9
fix for thread dump line spacing
11 years ago
Michael Peter Christen
cbdfef7ce1
changed protocol facet to show also all other counts if one facet is
selected
11 years ago
reger
b9056ef2db
remove unused private header entries (HeaderFramework)
X_YACY_ORIGINAL_REQUEST_LINE
X_YACY_KEEP_ALIVE_REQUEST_COUNT
CONNECTION_PROP_REQUESTLINE
11 years ago
sixcooler
6d16fa993d
make transparent proxy handle https-connections:
the implemented handle for connect did not work for me - so lets try the
connectHandler
11 years ago
Michael Peter Christen
61ad194065
fix for source and target clickdepth in webgraph index
11 years ago
Marc Nause
809b4e1fd9
Team added support for URLs with unicode characters in host part to
blacklist. Punycode is used to handle unicode characters.
11 years ago
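A minimal sketch of the Punycode handling mentioned in 809b4e1fd9, assuming host names are normalized before they are stored in or checked against a blacklist. The class PunycodeBlacklistSketch and its methods are hypothetical; only java.net.IDN.toASCII is a real JDK call, and the actual blacklist code may work differently.

```java
import java.net.IDN;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: store and compare blacklist hosts in their ASCII (Punycode) form.
class PunycodeBlacklistSketch {
    private final Set<String> hosts = new HashSet<>();

    // Converts a unicode host name to its ASCII-compatible encoding.
    static String normalize(String host) {
        return IDN.toASCII(host.trim()).toLowerCase(Locale.ROOT);
    }

    void add(String host) {
        hosts.add(normalize(host));
    }

    boolean contains(String host) {
        return hosts.contains(normalize(host));
    }
}
```

With this normalization an entry added in unicode form also matches the same host when a url carries it in its xn-- form, and vice versa.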
reger
b126b9ba17
add some InputFileStream close at end of reads
to make sure file is released
11 years ago
reger
ca7444dbdf
limit filetype nav to known extension also on image/media search
- on text search we limit filetype nav already to known extension, apply filter to image search
11 years ago
reger
651d057e93
surrogate import translate dc:language 3-char codes
OAI records often use 3-char language codes, start converting some 3-char lang's to the internal ISO639-1 2-char code
11 years ago
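The 3-char to 2-char language code translation from 651d057e93 could look roughly like the sketch below. It is an assumption-laden illustration, not the surrogate import code: the class LanguageCodeSketch and the method toIso2 are hypothetical, the table is derived from the JDK's Locale data (which uses the terminology codes such as "deu"), and the two bibliographic variants "ger" and "fre" are added by hand as examples.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

// Hypothetical sketch: map ISO 639-2 three-letter codes to ISO 639-1 two-letter codes.
class LanguageCodeSketch {
    private static final Map<String, String> ISO3_TO_ISO2 = new HashMap<>();

    static {
        for (String iso2 : Locale.getISOLanguages()) {
            try {
                ISO3_TO_ISO2.put(Locale.forLanguageTag(iso2).getISO3Language(), iso2);
            } catch (MissingResourceException e) {
                // no three-letter code known for this language, skip it
            }
        }
        // Assumption: a few bibliographic variants, as they may appear in OAI records
        ISO3_TO_ISO2.put("ger", "de");
        ISO3_TO_ISO2.put("fre", "fr");
    }

    // Returns the two-letter code, or null if the input cannot be translated.
    static String toIso2(String code) {
        if (code == null) return null;
        String c = code.trim().toLowerCase(Locale.ROOT);
        if (c.length() == 2) return c;
        return ISO3_TO_ISO2.get(c);
    }
}
```

Codes that are already two characters long pass through unchanged, and unknown three-letter codes yield null so the caller can decide how to handle them.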
orbiter
22618e3ba2
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
orbiter
01989f6af9
restrict write buffer size to a limit
11 years ago
Michael Peter Christen
d1091e79f8
- added stealth button to navigation menu
- more fixes to progress bar
11 years ago
reger
c297de5145
remove check for unused virtual path /currentyacypeer/
- del jqueryheader.template (not used)
11 years ago
orbiter
3c8d6e1eee
added adminAccount switch to ConfigAccounts_p servlet to switch on
protection of all pages; some refactoring as well
11 years ago
orbiter
7d24bcb98d
added flag to require that all web pages, even such without a "_p"
extension require authorization. (default off)
11 years ago
Michael Peter Christen
7a6658abec
removed synchronization in embedded solr connection (that was probably
a mistake?)
11 years ago
Michael Peter Christen
a7d4379ef9
fixed shutdown of solr cores in case that more than one local core is to
be closed (this happens if webgraph is enabled and the index is dumped
using /IndexControlURLs_p.html)
11 years ago
Michael Peter Christen
453bfd0f17
removed unused variables and warnings
11 years ago
Michael Peter Christen
05655d98df
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
reger
9f02d2c47b
fix: remove link to triplestore in Vocabulary_p (triplestore no longer exists)
- should be investigated in more detail to look for additional implications
Remove "yacyaction" from proxyservlet as it was only needed for removed interaction routines.
11 years ago
reger
81a846ec33
fix: set YaCy CONNECTION_PROP_HOST Header in ProxyServlet to host incl. port
11 years ago
reger
251be9ecfa
remove unused ProxySettings ref. from loader
clean unused whois test code
11 years ago
reger
82dc815af9
cleanup: remove unrelated and unused code
11 years ago
Michael Peter Christen
85a427ec54
support for multiple sitemaps in robots.txt
11 years ago
reger
a373fb717d
remove more unused from legacy server.http
- triggerOnlineAction not used
- useTemplateCache not used
11 years ago
reger
749d020aeb
remove redundant url string manipulation in HTTPDProxyHandler
(still used by ProxyServlet)
11 years ago
reger
612294cf84
use servletPath in ProxyServlet instead of fixed name
to allow servlet-mapping via web.xml
11 years ago
reger
1d01672bd3
fix DCEntry.getIdentifier
on successful url parameter
11 years ago
Michael Peter Christen
b08375da33
fix for bad/missing values of size_i
11 years ago
reger
6306d28a6a
OAI import get multivalued keywords (dc:subject)
11 years ago
reger
0a8c8102de
allow YaCy to start w/o ssl if JKS init fails
11 years ago
sixcooler
0b2101c59c
Speed up the ProxyHandler:
simplified cache-storing and make it concurrent in order to free the
clientconnection asap
let other processes wait on proxy access as before
11 years ago
reger
516f8c2489
fix: allow unix scripts (bin/*.sh) to always submit http admin apicalls
using auth via config hash (legacy requirement)
11 years ago
Michael Peter Christen
ea3aa30593
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
reger
dd5bf0b71b
cleanup old reference to HTTPDemon.setAlternativeResolver
optimize .yacyh check in AbstractRemoteHandler
11 years ago
Michael Peter Christen
51800007c4
- added concurrency to postprocessing of webgraph document
- bundled separate webgraph postprocessing steps into one
11 years ago
Michael Peter Christen
5f4a6892c1
enhanced RowSet re-sort limit for small sets
11 years ago
reger
351c2be68d
fix: make sure adminAccount changes made via ConfigAccounts_p are effective immediately
force to remove current credentials from knownuser cache
11 years ago
reger
5c9dcc269d
improve OAI-PMH import identifier recognition
- find best fitting identifier (url) by checking all given dc:identifier in record (many entries provide several identifiers)
as identifier is currently a multivalued field use "getParams" in preference of splitting the 1st string by ";"
- add resolve DOI:... identifier via http://dx.doi.org/
11 years ago
Michael Peter Christen
0e7d249a69
fixed another shutdown problem (only occurs if webgraph core is enabled)
11 years ago
Michael Peter Christen
e485fbd0ce
- let crawl loader jobs die after 10 seconds without new jobs
- corrected shutdown order to prevent a deadlock during shutdown
11 years ago
Michael Peter Christen
bcd9dd9e1d
enhanced concurrent loading by using a fixed set of concurrent loader
processes in favor of throwaway-processes. The control mechanism does
less often report a 'queue full' message to the busy loop which then
does not perform a long busy waiting; instead all requests are queued
and new loader processes are started if necessary up to a given limit
(as set before)
11 years ago
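The loader scheme from bcd9dd9e1d (requests are queued, loaders are started on demand up to a limit), combined with the 10 second idle timeout from e485fbd0ce, resembles what java.util.concurrent.ThreadPoolExecutor offers out of the box. The sketch below is only an analogy under that assumption; LoaderPoolSketch, newLoaderPool and runJobs are hypothetical names and the commits may implement the workers by hand.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a bounded set of reusable loader threads instead of throwaway threads.
class LoaderPoolSketch {
    static ThreadPoolExecutor newLoaderPool(int maxLoaders) {
        // Threads are created on demand up to maxLoaders; further jobs wait in the queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxLoaders, maxLoaders, 10L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>());
        // Let idle loaders terminate after 10 seconds without new jobs.
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }

    // Runs the given number of dummy jobs and returns how many completed.
    static int runJobs(int maxLoaders, int jobs) {
        ThreadPoolExecutor pool = newLoaderPool(maxLoaders);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < jobs; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }
}
```

Because the queue is unbounded, a submitter never has to spin on a 'queue full' condition; it simply hands over the job and returns.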
orbiter
051328271c
bugfix-bugfix
11 years ago
orbiter
eedcbcd906
bugfix to proxy handler: recognize the own yacyh-host
11 years ago
orbiter
d68e5ad0c4
NPE fix for Thread name (just committed yesterday, sorry)
11 years ago
reger
6878c90f99
fix: IPv6 INTRANET_PATTERNS for local ip (see http://bugs.yacy.net/view.php?id=378 )
requiring following ":" for fc and fd prefix and made pattern match case insensitive
- add some more ipv6 test cases to MultiProtocolURLTest.java
11 years ago
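To illustrate why 6878c90f99 requires a ":" after the fc and fd prefix, here is a hedged sketch of such a check. The pattern below is an assumption and not the INTRANET_PATTERNS entry itself; Ipv6IntranetSketch and isUniqueLocal are hypothetical names. It matches IPv6 unique local addresses, whose first group starts with fc or fd, regardless of case.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: detect IPv6 unique local addresses (first group fcxx or fdxx).
class Ipv6IntranetSketch {
    // The trailing ':' keeps host names that merely start with "fc"/"fd" from matching.
    private static final Pattern UNIQUE_LOCAL =
            Pattern.compile("^f[cd][0-9a-f]{2}:", Pattern.CASE_INSENSITIVE);

    static boolean isUniqueLocal(String host) {
        // strip the brackets used for IPv6 literals in urls, e.g. [fd00::1]
        String h = host.startsWith("[") && host.endsWith("]")
                ? host.substring(1, host.length() - 1)
                : host;
        return UNIQUE_LOCAL.matcher(h).find();
    }
}
```

Without the colon, a host name such as fcbarcelona.com would satisfy the prefix part of this particular pattern, since "ba" consists of valid hex digits.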
reger
a2e5ea2026
status panel link to set max mem
+url proxy same error text as in transparent
11 years ago
Michael Peter Christen
6ed9c0164e
attaching names to all Threads to get a better view in profiling tools
like VisualVM
11 years ago
Michael Peter Christen
fdaeac374a
- enhanced postprocessing speed and memory footprint (by using HashMaps
instead of TreeMaps)
- enhanced memory footprint of database indexes (by introduction of
optimize calls)
- optimize calls shrink the amount of used memory for index sets if they
are not changed afterwards any more
11 years ago
reger
ba49ff81ed
little more verbose proxy 403 error message
11 years ago
Michael Peter Christen
d325cb8912
fixes and enhancements for postprocessing
11 years ago
Michael Peter Christen
7c1b968378
another fix for the shutdown exceptions
11 years ago
orbiter
133d41386c
(again) full redesign of ConcurrentUpdateSolrConnector to remove
out-of-order transactions regarding add and delete operations. Now all
operations (add and delete) are executed concurrently in-order.
11 years ago
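One simple way to get the property described in 133d41386c (add and delete run in the background but never out of order) is to funnel both kinds of operation through a single worker. The sketch below assumes that approach; InOrderUpdaterSketch is a hypothetical class that uses an in-memory set in place of a Solr index and is not the ConcurrentUpdateSolrConnector code.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: asynchronous add/delete operations applied strictly in submission order.
class InOrderUpdaterSketch {
    // stands in for the index; only the worker thread touches it until close()
    private final Set<String> index = new TreeSet<>();
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    void add(String id) {
        worker.execute(() -> index.add(id));
    }

    void delete(String id) {
        worker.execute(() -> index.remove(id));
    }

    // Waits for all pending operations and returns a copy of the resulting index.
    Set<String> close() {
        worker.shutdown();
        try {
            worker.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new TreeSet<>(index);
    }
}
```

With two independent queues for adds and deletes, an "add a, delete a" sequence could end with a still present; a single ordered queue rules that out.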
Michael Peter Christen
a632b0d2a4
added a forced commit to index deletion to enable synchronized index
updates
11 years ago
Michael Peter Christen
1d069c5861
make sure that postprocessed documents are overwritten
11 years ago
Michael Peter Christen
0d2342575e
Merge branch 'master' of ssh://gitorious.org/yacy/rc1
11 years ago
Michael Peter Christen
3cc5c0ffdd
a concurrency enhancement which was not used because tests showed worse
indexing speed. I leave the code there since it may be useful in
SolrCloud environments.
11 years ago
Michael Peter Christen
e644981697
added one more postprocessing low memory check
11 years ago
reger
5e645f4449
Merge origin/master
11 years ago
reger
3b89176b9f
use config value htroot in Jetty init (was hardcoded)
- move htroot exist check from old httpdfilehandler to startup, remove from filehandler and legacy proxyhandler
- use SwitchboardConstant.htroot where appropriate
11 years ago
Michael Peter Christen
e1bf65c892
added short memory protection during postprocessing
11 years ago
Michael Peter Christen
90b47e83e6
fixed shutdown error when closing solr connectors
11 years ago
Michael Peter Christen
7640834b37
removed double concurrency to put Solr documents into the index. The
writings to the solr index are also buffered in
ConcurrentUpdateSolrConnector
11 years ago
Michael Peter Christen
0f6b72f24b
do not use luke requests for remote solr servers if the result is
different from normal requests. This happens if the remote solr is
actually a solrCloud; in such cases the luke request returns only the
result of the single solr peer, not the whole cloud.
also done: some refactoring.
11 years ago
Michael Peter Christen
c57026e242
recover from OOM
11 years ago
Michael Peter Christen
907db8b7a6
fix for bad query shortcut hack
11 years ago
Michael Peter Christen
a2b66fe2eb
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
9f6be762a6
- better logging for postprocessing
- fixed collection bug in postprocessing
11 years ago
orbiter
da5d4128bf
prevent npe
11 years ago
orbiter
a878c7982c
prevent npe
11 years ago
orbiter
e4eb87d924
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
orbiter
ced1a96f9c
fixed error cache
11 years ago
reger
3ba81bd08a
Merge origin/master
11 years ago
reger
4d896383db
fix: use timeout = proxy.ClientTimeout in ProxyHandler
(was fixed at 10 sec) see http://bugs.yacy.net/view.php?id=236
11 years ago
orbiter
cfb647db6e
- introduced a miss cache in ConcurrentUpdateSolrConnector
- better usage of cache
- bugfix for postprocessing
11 years ago
orbiter
a87d8e4a8e
changed caching of ConcurrentUpdateSolrConnector: it caches now also the
url along with the load date. While this takes much more memory, it
eliminates database lookups for getURL() requests, which happen equally
often. This speeds up remote solr configurations.
11 years ago
orbiter
f6e441dd77
refactoring
11 years ago
orbiter
76c53faeb2
removed unused code (HostStat)
11 years ago