This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into it's own queue. The primary
rule for urls taken from any queue is, that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermorem the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create 10.000s of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only if an access is wanted (OnDemandOpenFileIndex). The usage of
such on-demand file reader shall prevent that the number of file
pointers is over the system limit, which is usually about 10.000 open
files. Some parts of YaCy had to be adopted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adopted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.
different from normal requests. This happens if the remote solr is
actually a solrCloud; in such cases the luke request returns only the
result of the single solr peer, not the whole cloud.
also done: some refactoring.
- selecting more than one nav combines the 2 selections (with AND)
- unselecting one nav clears all selected
(e.g. select filetype:pdf and /language/fr shows ~ french pdf's only)
works fine to restrict language for local solrSearches.
More work needs to be done to make rwi/remote searches respect the modifier.language restriction.
- refactored all code which uses URIMetadataRow as standard for word
hash length and word hash ordering and moved that to the class 'Word',
becuase the class URIMetadataRow defined the old metadata data structure
and should be superfluous in the future
- removed unused methods from URIMetadataRow as preparation for further
removal of that class
- since specific heuristic Twitter & Blekko is not longer available or redundant with OpenSearchHeuristic,
adjusted ConfigHeuristic to use OpensearchHeuristic settings only.
For this the default OSD search target list is made available (copied) by default and the other configs are removed.
- the return of QueryGoal.getOriginalQueryString includes the queryModifier, which are held separately in a modifier object,
but in most (all) cases just the query term is expected, clarified and renamed it to QueryGoal.getQueryString which returns
just the search term (if needed a .getOrigianlQueryString could be implemented in Queryparameters, adding the modifiers)
- started to adjust internal html href references from absolute to relative (currently it is mixed).
For future development we should prefer relative href targets (less trouble with context aware servlets)
- all non-dht targets (previously separated into 'robinson' for dht-like
queries and 'node' for solr queries) are non 'extra' peers, which are
queries using solr
- these extra-peers are now selected using a ranking on last-seen,
peer-tag-matches, node-peer flags, peer age, and link count. The ranking
is done using a weight and a random factor.
- the number of extra peers is 50% of the dht peers
- the dht peers now exclude too young peers to prevent bad results
during strong growth of the network
- the number of dht peers (and therefore extra-peers) is reduced when
the memory of the peer is low and/or some documents still appear in the
indexing-queue. This shall prevent a peer from deadlocks when p2p
queries are made in a fast sequence on weak hardware.
- added default filename filter to select field (as only addition to *.black list is permanent)
- modified Blacklist_p header/legend to show all active blacklists
(to support understanding that all configured lists are active)
- removed obsolete code in Blacklist_p servlet
- metatags my be null
Caused by: java.lang.NullPointerException
at net.yacy.search.query.QueryParams.getFacets(QueryParams.java:445)
at net.yacy.search.query.QueryParams.getBasicParams(QueryParams.java:400)
at net.yacy.search.query.QueryParams.solrTextQuery(QueryParams.java:345)
at net.yacy.search.query.QueryParams.solrQuery(QueryParams.java:334)
at net.yacy.search.query.SearchEvent.<init>(SearchEvent.java:290)
at net.yacy.search.query.SearchEventCache.getEvent(SearchEventCache.java:176)
at IndexControlRWIs_p.genSearchresult(IndexControlRWIs_p.java:641)
at IndexControlRWIs_p.respond(IndexControlRWIs_p.java:141)
regular expressions cause no results. Usage of '*' followed by a dot or
any expression will now cause that this expression is used as a filetype
search.
- see numerous idx entries with content_type image without url_file_ext_s (for various reason) which should be included in result
- try it yourself with following sample query
/solr/select?q=content_type:image/* AND -url_file_ext_s:[* TO *]&defType=edismax&fl=sku,url_file_ext_s,content_type
adresses also possible url without or deviating extension.
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together makes it now possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
is visible whenever a location is available in the search result.
To activate this, the search.navigation property in yacy.conf must be
modified to the new default values.
all unique links! This made it necessary, that a large portion of the
parser and link processing classes must be adopted to carry a different
type of link collection which carry a property attribute which are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI had been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
fuzzy_signature_copycount_i, which count the number of copies of
non-unique documents and assigns this to each document. Thus, each
document there is a number assigned which shows how many copies of this
document exists.
These fields are disabled by default.