that enables viewing of images for intranet SMB servers.
- added a filter search for protocol, tld and ext again; otherwise p2p
search produces a lot of rubbish
proper/known
- automatically show the protocol navigation if there is more than http
and https
- automatically show the extension navigation if there is some media
content
introduce a copy-field for the author field to be copied to a string
field. This field is then used to generate facets. Without this field,
the facet would consist only of the words of the author names, not of
the full author string.
INCOMING links to the corresponding web page. This information is taken
from the reverse link index (a 'little sister' of the RWI index).
- this field can be of use to enhance the ranking because a web page
with more incoming links can be more more important than others. But
this is not true for typical link pages like menues. Therefore the
number of outgoing links is needed.
- added a new solr attribute 'bf' to solr queries which is a boost
function extension. this field can contain a formula which comuptes the
boost according to given field values. After some experiments the
following forumla is now default:
div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4
This takes the number of references and the inbound links. Further
experiments are needed to enhance that forumula.
This uses an enhanced version of the Nutch/Solr TextProfileSignatue.
As a result, a signature of the document is written to the solr search
index. Additionally for each time when a signature is written, it is
checked if the singature exists already in the index. If the signature
does not exist, the document is marked as unique. The unique attribute
can now be used to sort document lists and bring duplicates to the end
of a result list.
To enable this, a large portion of the search api to Solr had to be
changed. This affected mainly caching of 'exists' searches to enhance
the check for existing signatures and do this without actually doing a
solr query.
Because here the first time a long number is used as value in the Solr
store, also the value naming in the YaCySchema had to be adopted and
normalized. This caused that many files had to be changed.
need regular expressions as search attributes. They had now been removed
from the advanced search page while they are still created internally.
The filter is then expressed against solr as regular expression filter
query. If the expression points out a selection of an specific protocol,
host or filetype this is then translated into a facetted query.
added solr highlighting / YaCy snippets to YaCy search results
- facets are now much more complete
- facets are computed and searched much faster
- snippet computation is done by solr if solr knows the snippet
- fixed type definition found by the verifier
- added multivalue-string fields for solr with extension 'sxt'
- added multivalue-integer fields for solr with extension 'val'
- renamed some solr attributes from txt to sxt
- changed solr query line to an explicit AND/OR structure
- added a country code second level domain list to Domains class; with
parser
- added a host string parser to get domain class name, country-code
second-level domain and subdomain out of it
- removed old coordinate attributes
- vocabulary annotation is not done any more into the metadata of urldb
- vocabularies are written into the jena triplestore using a rdf
vocabulary
- vocabularies for rdf tripel must be updated; refactoring done
- with the new navigation tags in the triplestore a faster
pre-urldb-lookup is possible: navigation is processed now within the RWI
during pre-ranking retrieval
- added also a Owl vocabulary stub to add the plain-text url to the
triplestore using the owl:sameas predicate