Michael Peter Christen
075b6f9278
refactoring of the crawl balancer: the balancer is turned into an
interface and the old balancer class is moved into LegacyBalancer to
make room for a fresh implementation of a crawl balancer.
11 years ago
Michael Peter Christen
8aeef73d49
fix for virtual root nodes
11 years ago
Michael Peter Christen
7c7fbb9818
find depth-matches also for edge targets
11 years ago
Michael Peter Christen
dd12dd392f
introduction of a data structure for HyperlinkEdges which should use
less memory as it does no double-storage of source links for each edge
of the graph.
11 years ago
Michael Peter Christen
6ea8bb7348
using MultiProtocolURL for edge data which is faster (hash computation
is now much easier) and smaller in size
11 years ago
Michael Peter Christen
a37d067692
refactoring
11 years ago
orbiter
95780eed32
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
Michael Peter Christen
67beef657f
strong redesign of html parser: object recursion is now made using a
stack on html tag objects, not using a recursive parse-again method
which may cause bad performance and huge memory allocation. The new
method also produced better parsed image objects with exact anchor text
references.
11 years ago
Michael Peter Christen
6bd8c6f195
fix for wrong status codes of error pages
11 years ago
orbiter
67501c9dda
Merge branch 'master' of git@gitorious.org:yacy/rc1.git
11 years ago
Michael Peter Christen
1c21b3256d
fix for robots.txt handling: delete old entry before starting a new
crawl.
11 years ago
orbiter
c250fac9f4
linkstructure refactoring to get more options for clickdepth analysis
11 years ago
Michael Peter Christen
8068e68474
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
bd886054cb
new structure and enhancements for link graph computation:
- added order option to solr queries to be able to retrieve document
lists in specific order, here: link length
- added HyperlinkEdge class which manages the link structure
- integrated the HyperlinkEdge class into clickdepth computation
- extended the linkstructure.json servlet to show also the clickdepth
and other statistic information
11 years ago
reger
f326a67561
fix: typo in default charset in metadata2solr
update pom and NB build to Solr 4.7.1 libs
11 years ago
Michael Peter Christen
df138084c0
do solr optimization independently from memory and load constraints:
- not doing an optimization will likely cause a too many files exception
- without optimization performance will be even worse which would
prevent optimization in the future as well (prevent a deadlock
situation)
11 years ago
Michael Peter Christen
ebd44a7080
replaced solr 4.6.1 with solr 4.7.1 and added index migration to
lucene_47
11 years ago
Michael Peter Christen
466d90ad42
fixed a problem with resource observer; probably coming from uncaught
exceptions within the apache library which appear only in concurrent
environments.
11 years ago
Michael Peter Christen
e8ddd415a8
enhanced the new link structure graph
11 years ago
Michael Peter Christen
926d28dd3f
fixed a bug which prevented crawl starts after a network switch
11 years ago
Michael Peter Christen
3ce8eff21b
another fix for inbound/outbound detection
11 years ago
orbiter
3c1274057d
fixed thread dump in case of wrong seeds
11 years ago
orbiter
18f9c40302
moved Edge class out of linkstructure servlet as this does not work on
non-eclipse driven environments (all non-dev cases)
11 years ago
Michael Peter Christen
c64c10ef00
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
48fbfa60c1
bugfix to inbound/outbound identification
11 years ago
reger
227c42bc96
eliminate obsolete URIMetaDataRow class
by joining it with/into URIMetaDataNode.
11 years ago
Michael Peter Christen
cca851a417
introduced new solr field crawldepth_i which records the crawl depth of
a document. This is the upper limit for the clickdepth_i value, which may
be smaller if the crawler did not take the shortest path to the
document.
11 years ago
Michael Peter Christen
63c9fcf3e0
free configuration of postprocessing clickdepth maximum depth and time
11 years ago
Michael Peter Christen
8b44fcf0f4
added missing @Override annotation
11 years ago
Michael Peter Christen
e515dd460d
added linkscount_i and linksnofollowcount_i to the default solr schema
11 years ago
Michael Peter Christen
cbdfef7ce1
changed protocol facet to show also all other counts if one facet is
selected
11 years ago
Michael Peter Christen
61ad194065
fix for source and target clickdepth in webgraph index
11 years ago
Marc Nause
809b4e1fd9
Team added support for URLs with unicode characters in host part to
blacklist. Punycode is used to handle unicode characters.
11 years ago
reger
ca7444dbdf
limit filetype nav to known extension also on image/media search
- on text search we limit filetype nav already to known extension, apply filter to image search
11 years ago
Michael Peter Christen
d1091e79f8
- added stealth button to navigation menu
- more fixes to progress bar
11 years ago
orbiter
3c8d6e1eee
added adminAccount switch to ConfigAccounts_p servlet to switch on
protection of all pages; some refactoring as well
11 years ago
orbiter
7d24bcb98d
added flag to require authorization for all web pages, even those
without a "_p" extension. (default off)
11 years ago
Michael Peter Christen
b08375da33
fix for bad/missing values of size_i
11 years ago
Michael Peter Christen
51800007c4
- added concurrency to postprocessing of webgraph document
- bundled separate webgraph postprocessing steps into one
11 years ago
Michael Peter Christen
e485fbd0ce
- let crawl loader jobs die after 10 seconds without new jobs
- corrected shutdown order to prevent a deadlock during shutdown
11 years ago
Michael Peter Christen
bcd9dd9e1d
enhanced concurrent loading by using a fixed set of concurrent loader
processes instead of throwaway processes. The control mechanism now
reports a 'queue full' message to the busy loop less often, so the loop
no longer performs long busy waiting; instead all requests are queued
and new loader processes are started if necessary up to a given limit
(as set before)
11 years ago
Michael Peter Christen
6ed9c0164e
attaching names to all Threads to get a better view in profiling tools
like VisualVM
11 years ago
Michael Peter Christen
fdaeac374a
- enhanced postprocessing speed and memory footprint (by using HashMaps
instead of TreeMaps)
- enhanced memory footprint of database indexes (by introduction of
optimize calls)
- optimize calls shrink the amount of used memory for index sets if they
are not changed afterwards any more
11 years ago
Michael Peter Christen
d325cb8912
fixes and enhancements for postprocessing
11 years ago
Michael Peter Christen
7c1b968378
another fix for the shutdown exceptions
11 years ago
Michael Peter Christen
1d069c5861
make sure that postprocessed documents are overwritten
11 years ago
Michael Peter Christen
e644981697
added one more postprocessing low memory check
11 years ago
Michael Peter Christen
e1bf65c892
added short memory protection during postprocessing
11 years ago
Michael Peter Christen
7640834b37
removed double concurrency to put Solr documents into the index. The
writings to the solr index are also buffered in
ConcurrentUpdateSolrConnector
11 years ago
Michael Peter Christen
0f6b72f24b
do not use luke requests for remote solr servers if the result is
different from normal requests. This happens if the remote solr is
actually a solrCloud; in such cases the luke request returns only the
result of the single solr peer, not the whole cloud.
also done: some refactoring.
11 years ago