webgraph index which is temporary filled with the crawl profile key.
This is used to select a set of documents for post-processing as soon as
a crawl is finished. Now the postprocessing for a specific crawl is
started when that specific crawl is finished and not at the end of all
post-processing steps.
there for deletion), this fixes a problem for the deletion of old
documents for new crawl starts
- added clickdepth and citation computation for fail documents
- replaced load failure logging by information which is stored in Solr
- fixed a bug with crawling of feeds: added must-match pattern
application to feed urls to filter out such urls which shall not be in a
wanted domain
- delegatedURLs, which also used ZURLs are now temporary objects in
memory
for anchor attributes.
- this caused that large portions of the parser code had to be adopted
as well
- added a counter target_order_i for anchor links in webgraph
computation
all unique links! This made it necessary, that a large portion of the
parser and link processing classes must be adopted to carry a different
type of link collection which carry a property attribute which are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI had been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
html meta fields to get a correct (or: better) date timestamp. The
http:last-modified mostly does not work because it is set to the current
date from most CMS.
fuzzy_signature_copycount_i, which count the number of copies of
non-unique documents and assigns this to each document. Thus, each
document there is a number assigned which shows how many copies of this
document exists.
These fields are disabled by default.
regular expression on th url: the collection attribut for a crawl start
may be now either a token or a list of tokens, seperated by ',' where a
token is either a string or a pair <string,pattern> where the string is
separated to the pattern with a ':' and the string is assigned to the
document as collection only if the pattern matches with the url.
E 2013/07/26 20:29:29 BUSYTHREAD Runtime Error in
serverInstantThread.job, thread
'net.yacy.search.Switchboard.cleanupJob': null; target exception: null
java.lang.NullPointerException
at
net.yacy.search.schema.CollectionConfiguration.convergenceStep(CollectionConfiguration.java:1116)
at
net.yacy.search.schema.CollectionConfiguration.postprocessing(CollectionConfiguration.java:897)
at net.yacy.search.Switchboard.cleanupJob(Switchboard.java:2296)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
net.yacy.kelondro.workflow.InstantBusyThread.job(InstantBusyThread.java:107)
at
net.yacy.kelondro.workflow.AbstractBusyThread.run(AbstractBusyThread.java:165)
Conflicts:
source/net/yacy/search/schema/CollectionConfiguration.java
of the lazy instantiation rule this value was not actually written, but
if lazy instantiation is switched on, then this causes that all crawl
starts delete all crawl-start-hosts completely because this looks for
filled error reasons.
jdk-based logger tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a main performance issue. To overcome
this problem, this is a add-on to jdk logging to put log entries on a
concurrent message queue and log the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
without the file extension. This part of the file path is removed from
the multi-field url_paths_sxt, which has now not the file name as last
part of the path list.
The same applies to the new fields source_file_name_s and
target_file_name_s in the webgraph schema.
references_internal_url_sxt because they had been shown to be
superfluous. The citation of referrer in the host browser is possible
without them. Therefore now the host browser does not only show
internal, but also external referrer to each link.
While the values for the reference evaluation are computed, also a
backlink-structure can be discovered and written to the index as well.
The host browser has been extended to show such backlinks to each
presented links. The host browser therefore can now show an information
where an document is linked. The new citation reference is computed as
likelyhood for a random click path with recursive usage of previously
computed likelyhood. This process is repeated until the likelyhood
converges to a specific number. This number is then normalized to a
ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to
rank popularity within intra-domain link structures.