orbiter
39e1913585
next development step: migration to java 1.7
This also includes a small code change to test generic type inference, a
java 1.7 feature
11 years ago
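The generic type inference mentioned in this commit is the Java 1.7 "diamond" operator. A minimal sketch of what it looks like follows; the class and method names are hypothetical and not taken from the YaCy sources.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; not part of the YaCy code base.
class DiamondDemo {

    // Groups words by their first letter.
    static Map<Character, List<String>> groupByFirstLetter(String... words) {
        // Java 1.7: the compiler infers <Character, List<String>> from the declared type.
        Map<Character, List<String>> groups = new HashMap<>();
        for (String word : words) {
            if (word.isEmpty()) continue;
            char key = word.charAt(0);
            List<String> bucket = groups.get(key);
            if (bucket == null) {
                bucket = new ArrayList<>(); // inferred as ArrayList<String>
                groups.put(key, bucket);
            }
            bucket.add(word);
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(groupByFirstLetter("solr", "snippet", "crawler"));
    }
}
```

Before Java 1.7 the type arguments had to be repeated on the right-hand side, e.g. new HashMap<Character, List<String>>().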
Michael Peter Christen
4e734815e8
enhanced snippets: remove lines which are identical to the title and
choose longer versions if possible. Prefer the description part.
11 years ago
sixcooler
390f03e041
do not check for segments-count on optimize:
this is also done in Solr and our getSegmentsCount() does not return
up-to-date values
11 years ago
reger
78d08998db
throw MalformedURLException on unknown protocol
on other than the supported http https ftp file smb \\ mailto
11 years ago
reger
bb8181b2be
fix: resolve url without path but searchpart
e.g. http://yacy.net?q=test was resolved as host "yacy.net?q=test"; now host="yacy.net", path="/"
fixes http://mantis.tokeek.de/view.php?id=47
added test case for getHost
11 years ago
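The behaviour described in this fix can be illustrated with a small stand-alone parser. This is only a sketch under simplifying assumptions (no port, no user info); HostSplitDemo and hostAndPath are hypothetical names, and the code is not YaCy's MultiProtocolURL.

```java
// Hypothetical helper for illustration; this is not YaCy's MultiProtocolURL.
class HostSplitDemo {

    // Returns {host, path} for a URL of the form scheme://host[/path][?query][#fragment].
    // Ports and user info are deliberately not handled in this sketch.
    static String[] hostAndPath(String url) {
        int start = url.indexOf("://");
        if (start < 0) throw new IllegalArgumentException("no scheme separator in: " + url);
        start += 3;
        // The host ends at the first '/', '?' or '#', whichever comes first.
        int end = url.length();
        for (int i = start; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '/' || c == '?' || c == '#') { end = i; break; }
        }
        String host = url.substring(start, end);
        // Default to "/" when the host is followed directly by a search part.
        String path = "/";
        if (end < url.length() && url.charAt(end) == '/') {
            int pathEnd = url.length();
            for (int i = end; i < url.length(); i++) {
                char c = url.charAt(i);
                if (c == '?' || c == '#') { pathEnd = i; break; }
            }
            path = url.substring(end, pathEnd);
        }
        return new String[] { host, path };
    }

    public static void main(String[] args) {
        String[] parts = hostAndPath("http://yacy.net?q=test");
        System.out.println("host=" + parts[0] + " path=" + parts[1]);
    }
}
```

The key point is that '?' terminates the host just as '/' does, so a missing path no longer leaks the search part into the host name.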
reger
81dc2aa536
add current css to HTMLResponseWriter to fix metadata view
(using css from metas.template except js links)
11 years ago
orbiter
0c88a32c36
do not apply lazy value instantiation for numeric or boolean values
because that is misleading and confusing in case of 0- or false-values
and may cause NPEs in retrieval functions.
11 years ago
reger
79e7947442
- remove empty http0_9 status text array
and unused default_charset = ISO-8859-1
11 years ago
Michael Peter Christen
9a5ab4e2c1
removed clickdepth_i field and related postprocessing. This information
is now available in the crawldepth_i field which is identical to
clickdepth_i because of a specific crawler strategy.
11 years ago
Michael Peter Christen
da86f150ab
- added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all urls to be loaded in separate queues for each host.
Each host keeps a separate queue for each crawl depth. The primary
rule for urls taken from any queue is that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermore, the crawl is able to balance much better over all hosts,
in a way that is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create tens of thousands of files. To be able to use that many file
readers, it was necessary to implement a new index data structure which
opens the file only when access is needed (OnDemandOpenFileIndex). Using
such an on-demand file reader should keep the number of open file
pointers below the system limit, which is usually about 10,000 open
files. Some parts of YaCy had to be adapted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adapted to show the crawl queues differently, because the port is
appended to the host name to differentiate between http, https, and
ftp services.
11 years ago
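The queue selection rule from this commit message (one queue per host and crawl depth, always take a url of minimal depth, stay fair between hosts) can be sketched in memory as follows. This is an assumption-laden illustration: DepthFirstHostQueues, push and pop are hypothetical names, and the real HostBalancer and HostQueues are file-backed rather than in-memory.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch; the real HostBalancer/HostQueues store queues in files.
class DepthFirstHostQueues {

    // host -> (crawl depth -> FIFO queue of urls); the TreeMap keeps depths sorted ascending.
    private final Map<String, TreeMap<Integer, Deque<String>>> hosts = new LinkedHashMap<>();

    void push(String host, int depth, String url) {
        TreeMap<Integer, Deque<String>> depths = hosts.get(host);
        if (depths == null) { depths = new TreeMap<>(); hosts.put(host, depths); }
        Deque<String> queue = depths.get(depth);
        if (queue == null) { queue = new ArrayDeque<>(); depths.put(depth, queue); }
        queue.addLast(url);
    }

    // Returns a url with the smallest crawl depth over all hosts, or null if nothing is queued.
    String pop() {
        String bestHost = null;
        int bestDepth = Integer.MAX_VALUE;
        for (Map.Entry<String, TreeMap<Integer, Deque<String>>> e : hosts.entrySet()) {
            int depth = e.getValue().firstKey(); // every stored host has at least one queue
            if (depth < bestDepth) { bestDepth = depth; bestHost = e.getKey(); }
        }
        if (bestHost == null) return null;
        TreeMap<Integer, Deque<String>> depths = hosts.remove(bestHost);
        Deque<String> queue = depths.get(bestDepth);
        String url = queue.pollFirst();
        if (queue.isEmpty()) depths.remove(bestDepth);
        // Re-append the host at the end so that hosts with equal depth take turns.
        if (!depths.isEmpty()) hosts.put(bestHost, depths);
        return url;
    }
}
```

Because deeper urls are only served once no shallower url is queued anywhere, each url tends to be reached at its minimal depth, which is the property the commit relies on when it equates crawl depth with clickdepth.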
Michael Peter Christen
b21c208b4d
enhanced hashcode computation for MultiProtocolURL
11 years ago
Michael Peter Christen
bd886054cb
new structure and enhancements for link graph computation:
- added order option to solr queries to be able to retrieve document
lists in specific order, here: link length
- added HyperlinkEdge class which manages the link structure
- integrated the HyperlinkEdge class into clickdepth computation
- extended the linkstructure.json servlet to show also the clickdepth
and other statistic information
11 years ago
Michael Peter Christen
74ab094587
fix for solr query size; too many documents had been retrieved in case
that fewer than _pagesize_ had been requested.
11 years ago
orbiter
429a874222
- added COLS field in GSA response (non-gsa standard by customer
request)
- updated document link in GSA response writer
11 years ago
Michael Peter Christen
63c9fcf3e0
free configuration of postprocessing clickdepth maximum depth and time
11 years ago
Michael Peter Christen
8b44fcf0f4
added missing @Override annotation
11 years ago
reger
b9056ef2db
remove unused private header entries (HeaderFramework)
X_YACY_ORIGINAL_REQUEST_LINE
X_YACY_KEEP_ALIVE_REQUEST_COUNT
CONNECTION_PROP_REQUESTLINE
11 years ago
Michael Peter Christen
61ad194065
fix for source and target clickdepth in webgraph index
11 years ago
Marc Nause
809b4e1fd9
Team added support for URLs with unicode characters in host part to
blacklist. Punycode is used to handle unicode characters.
11 years ago
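How Punycode makes a Unicode host name comparable against an ASCII-only list can be shown with the JDK's java.net.IDN class. This is only a sketch: whether the blacklist code uses java.net.IDN is an assumption, and PunycodeHostDemo and normalizeHost are hypothetical names.

```java
import java.net.IDN;
import java.util.Locale;

// Hypothetical demo; the actual blacklist code may convert hosts differently.
class PunycodeHostDemo {

    // Converts a host name to its ASCII (Punycode) form and lowercases it,
    // so that Unicode and ASCII spellings of the same host compare equal.
    static String normalizeHost(String host) {
        return IDN.toASCII(host).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "\u00fc" is the letter u with diaeresis, written as an escape to stay encoding-safe.
        System.out.println(normalizeHost("b\u00fccher.de"));
    }
}
```

Plain ASCII host names pass through unchanged, so existing blacklist entries keep matching.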
reger
b126b9ba17
add some InputFileStream close at end of reads
to make sure file is released
11 years ago
orbiter
01989f6af9
restrict write buffer size to a limit
11 years ago
Michael Peter Christen
7a6658abec
removed synchronization in embedded solr connection (that was probably
a mistake?)
11 years ago
Michael Peter Christen
a7d4379ef9
fixed shutdown of solr cores in case that more than one local core is to
be closed (this happens if webgraph is enabled and the index is dumped
using /IndexControlURLs_p.html)
11 years ago
reger
82dc815af9
cleanup: remove unrelated and unused code
11 years ago
Michael Peter Christen
b08375da33
fix for bad/missing values of size_i
11 years ago
Michael Peter Christen
51800007c4
- added concurrency to postprocessing of webgraph document
- bundled separate webgraph postprocessing steps into one
11 years ago
Michael Peter Christen
0e7d249a69
fixed another shutdown problem (only occurs if webgraph core is enabled)
11 years ago
Michael Peter Christen
e485fbd0ce
- let crawl loader jobs die after 10 seconds without new jobs
- corrected shutdown order to prevent a deadlock during shutdown
11 years ago
reger
6878c90f99
fix: IPv6 INTRANET_PATTERNS for local ip (see http://bugs.yacy.net/view.php?id=378 )
requiring a following ":" for the fc and fd prefixes, and made the pattern match case-insensitive
- add some more ipv6 test cases to MultiProtocolURLTest.java
11 years ago
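The idea behind this fix, matching the fc and fd prefixes only when they are part of an IPv6 address and ignoring case, might look like the sketch below. The regular expression is an illustrative assumption and not the actual INTRANET_PATTERNS entry; UniqueLocalIpv6Demo and isUniqueLocal are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo; the pattern is an assumption, not YaCy's INTRANET_PATTERNS.
class UniqueLocalIpv6Demo {

    // Optional '[', then a first group fcXX or fdXX (hex digits), then a mandatory ':'.
    // Requiring the ':' keeps ordinary host names that start with "fc" or "fd" from matching.
    private static final Pattern UNIQUE_LOCAL =
            Pattern.compile("\\[?f[cd][0-9a-f]{2}:.*", Pattern.CASE_INSENSITIVE);

    static boolean isUniqueLocal(String host) {
        return UNIQUE_LOCAL.matcher(host).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUniqueLocal("FD00::1"));
        System.out.println(isUniqueLocal("fdomain.example"));
    }
}
```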
Michael Peter Christen
6ed9c0164e
attaching names to all Threads to get a better view in profiling tools
like VisualVM
11 years ago
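One common way to give every thread a readable name is a ThreadFactory that is handed to executors. The sketch below shows that technique; NamedThreadFactory is a hypothetical class, and the commit itself may simply pass names to the Thread constructor.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper; shows one way to name threads, not necessarily the one used in YaCy.
class NamedThreadFactory implements ThreadFactory {

    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger();

    NamedThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable job) {
        // The name appears in thread dumps and in profilers such as VisualVM.
        return new Thread(job, prefix + "-" + counter.getAndIncrement());
    }
}
```

Such a factory can be passed to Executors.newFixedThreadPool(n, factory), so that pool threads show up under a meaningful prefix instead of the default "pool-N-thread-M".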
Michael Peter Christen
fdaeac374a
- enhanced postprocessing speed and memory footprint (by using HashMaps
instead of TreeMaps)
- enhanced memory footprint of database indexes (by introduction of
optimize calls)
- optimize calls shrink the amount of used memory for index sets if they
are not changed afterwards any more
11 years ago
Michael Peter Christen
7c1b968378
another fix for the shutdown exceptions
11 years ago
orbiter
133d41386c
(again) full redesign of ConcurrentUpdateSolrConnector to remove
out-of-order transactions regarding add and delete operations. Now all
operations (add and delete) are executed concurrently in-order.
11 years ago
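A simple way to run add and delete operations in the background while preserving their order is to funnel both through a single worker thread. The sketch below shows that general technique only; it is not the ConcurrentUpdateSolrConnector design, and InOrderIndexWriter with its in-memory set standing in for the index is hypothetical.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; an in-memory set stands in for the Solr index.
class InOrderIndexWriter {

    private final Set<String> index =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    // A single worker thread: submitted operations are applied in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Queues an add; the caller does not wait for it to be applied.
    void add(final String id) {
        worker.execute(new Runnable() {
            @Override public void run() { index.add(id); }
        });
    }

    // Queues a delete in the same queue, so it cannot overtake an earlier add.
    void delete(final String id) {
        worker.execute(new Runnable() {
            @Override public void run() { index.remove(id); }
        });
    }

    // Waits until all queued operations are applied, then reports membership.
    boolean closeAndContains(String id) {
        worker.shutdown();
        try {
            worker.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return index.contains(id);
    }
}
```

With separate queues for adds and deletes, a delete could be applied before the add it was meant to follow; sharing one ordered queue rules that out at the cost of serializing the writes.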
Michael Peter Christen
a632b0d2a4
added a forced commit to index deletion to enable synchronized index
updates
11 years ago
Michael Peter Christen
3cc5c0ffdd
a concurrency enhancement which was not used because tests showed worse
indexing speed. I leave the code there since it may be useful in
SolrCloud environments.
11 years ago
Michael Peter Christen
90b47e83e6
fixed shutdown error when closing solr connectors
11 years ago
Michael Peter Christen
7640834b37
removed double concurrency to put Solr documents into the index. The
writings to the solr index are also buffered in
ConcurrentUpdateSolrConnector
11 years ago
Michael Peter Christen
0f6b72f24b
do not use luke requests for remote solr servers if the result is
different from normal requests. This happens if the remote solr is
actually a solrCloud; in such cases the luke request returns only the
result of the single solr peer, not the whole cloud.
also done: some refactoring.
11 years ago
Michael Peter Christen
c57026e242
recover from OOM
11 years ago
Michael Peter Christen
907db8b7a6
fix for bad query shortcut hack
11 years ago
orbiter
cfb647db6e
- introduced a miss cache in ConcurrentUpdateSolrConnector
- better usage of cache
- bugfix for postprocessing
11 years ago
orbiter
a87d8e4a8e
changed caching of ConcurrentUpdateSolrConnector: it caches now also the
url along with the load date. While this takes much more memory, it
eliminates database lookups for getURL() requests, which happen equally
often. This speeds up remote solr configurations.
11 years ago
orbiter
d3a88eaecb
introducing ConcurrentUpdateSolrServer for remote solr servers.
Scaling of write buffers and update queue size is made according to
assigned memory.
11 years ago
Michael Peter Christen
254a7ac66c
fixed cleaning of index
11 years ago
Michael Peter Christen
28a7b42e6b
removed warning "sun.misc.BASE64Encoder is internal proprietary API and
may be removed in a future release"
11 years ago
Michael Peter Christen
046f5a03cb
one more SolrIndexSearcher bugfix
11 years ago
sixcooler
78c01b3eff
fix for 'AlreadyClosedException: this IndexReader is closed'
11 years ago
Michael Peter Christen
1b5e3d523a
better control over close-state of remote solr connections
11 years ago
Michael Peter Christen
1a364572a5
fix for
"org.apache.solr.core.SolrCore Too many close [count:-1] on
org.apache.solr.core.SolrCore@51af7c57"
-error
11 years ago
Michael Peter Christen
69391e5d9e
changed strategy to test existence of documents in Solr: using the
update time. The reason for that is a better caching for the crawler
double-check, which needs the update time for crawler steering.
11 years ago