1. Improved and fixed language detection:
1.1 Identificator.java - recognition fix (improved)
1.2 DCEntry.java - fix (changed detection order due to detection from
tld in many cases is incorrect)
1.3 MultiProtocolURI.java - fixed and enhanced language from tld
detection (all currently used top-level domains; ccTLD added but not
tested).
2. Ukrainian language update.
3. Main Slavic languages langstats (tested and works fine).
ready-prepared crawl list but at the stacks of the domains that are
stored for balanced crawling. This affects also the balancer since that
does not need to prepare the pre-selected crawl list for monitoring. As
a effect:
- it is no more possible to see the correct order of next to-be-crawled
links, since that depends on the actual state of the balancer stack the
next time another url is requested for loading
- the balancer works better since the next url can be selected according
to the current situation and not according to a pre-selected order.
but this did not solve all problems because a bug in the apache http
client prevented that it worked. Thread dump:
Caused by: java.lang.NumberFormatException: For input string:
"1450:400c:c01:0:0:0:69"
at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Integer.parseInt(Integer.java:458)
at java.lang.Integer.parseInt(Integer.java:499)
at org.apache.http.client.utils.URIUtils.extractHost(URIUtils.java:310)
at
org.apache.http.impl.client.AbstractHttpClient.determineTarget(AbstractHttpClient.java:764)
at
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
at net.yacy.cora.protocol.http.HTTPClient.execute(HTTPClient.java:597)
at
net.yacy.cora.protocol.http.HTTPClient.getContentBytes(HTTPClient.java:558)
at net.yacy.cora.protocol.http.HTTPClient.GETbytes(HTTPClient.java:341)
at de.anomic.crawler.retrieval.HTTPLoader.load(HTTPLoader.java:131)
at de.anomic.crawler.retrieval.HTTPLoader.load(HTTPLoader.java:74)
at
net.yacy.repository.LoaderDispatcher.loadInternal(LoaderDispatcher.java:274)
at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:164)
at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:150)
at
net.yacy.repository.LoaderDispatcher.loadDocument(LoaderDispatcher.java:355)
at getpageinfo_p.respond(getpageinfo_p.java:97)
during receiving of DHT submissions and when answering remote search
requests. Both events together may have caused IO-deadlocking and this
commit shall fix that.