one request:
- allow larger match-fields in html interface
- delete all host hashes at once from zurl
- when deleting by host, do not count size of deleted entries since that
was the reason it took so long
4.0.0 there is a new softcommit feature which implements a
near-real-time (NRT) search option. The softcommit does not do IO and
does not cause performance issues.
YaCy has now an extension in its solr connectors to use the softcommit
feature. The softcommit call now replaces all places where a hard commit
was used. Furthermore the commit strategy in when doing a search from
the web interface was changed (it's done every time before a search is
done).
The softcommit feature was implemented because it was needed for the
following changes (customer demands), which is also included in this
git commit:
- added a feature to identify all documents which have unique titles
and/or unique descriptions. These unique flags are disabled by default.
- added also a feature to set a flag when the url from a canonical tag
is equal to the document url. This is also disabled by default.
To support the new softcommit strategy, the commitWithinMs option was
set to -1 do disable automatic commit based on document insert times. If
documents are inserted permanently then also a commit would happen
permanently whenever the commitWithinMs time is reached. This would
conflict with the regular autocommit of 10 minutes and the new
softcommit strategy.
- when using a site operator search for a domain where the domain has a
www prefix, also the domain without the www is enclosed
- when using a site operator search for a domain where the domain has no
www prefix, also the domain with the www in enclosed
- in the host navigator, all domains with and without a www prefix are
accumulated. That means that the host navigator does never show a host
with a www prefix.
This should prevent usage mistakes of the site operator.
data. This the NPE looked like:
Caused by: java.lang.NullPointerException
at net.yacy.search.query.SearchEvent.<init>(SearchEvent.java:279)
at
net.yacy.search.query.SearchEventCache.getEvent(SearchEventCache.java:155)
at search.respond(search.java:314)
... 12 more
forum at
http://forum.yacy-websuche.de/viewtopic.php?f=18&t=4572#p27410
The feature was not called 'haslink' but called 'inlink' to have a
analogous naming like 'inurl'. This causes now that you can search for
words in links of the document, like:
* inlink:yacy
searches all documents which link to pages which have an 'yacy' in the
url.
- migrates all entries in old urldb
Metadata coordinate (lat / lon) NumberFormatException still relative often (see excerpt below),
- added try/catch for URIMetadataRow (seems not to be needed in URIMetaDataNode, as Solr internally checks for number format)
- removed possible typ conversion for lat() / lon() comparison with 0.0f, changed to 0.0 (leaving it to the compiler/optimizer to choose number format)
current log excerpt for NumberFormatException:
W 2013/01/14 00:10:07 StackTrace For input string: "-"
java.lang.NumberFormatException: For input string: "-"
at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
at java.lang.Double.parseDouble(Unknown Source)
at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525)
at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279)
at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277)
at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329)
at transferURL.respond(transferURL.java:152)
...
Caused by: java.lang.NumberFormatException: For input string: "-"
at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
at java.lang.Double.parseDouble(Unknown Source)
at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525)
at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279)
at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277)
at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329)
at transferURL.respond(transferURL.java:152)
- added additional config page (ConfigSearchPage_p) for easy setup of search page layout (to not overload ConfigPortal page)
- currently redundant setting with part of ConfigPortal page
- added missing config for filetype and protocol navigator
- adjusted init of SearchEvent to check navigation config setting
- renamed RankigProcess.getTopicNavigator to getTopics (to distiguish between added SearchEvent.getTopicNavigator)
metadata and old rwi and for the citation index. The important
advancement is the separation of the citation index deletion because
that index is responsible for the linkdepth calculation. Now a search
index can be deleted without the citation index and that should cause
that less clickdepths must be post-processed.
This attribute can be used for ranking and for other purpose (demand by
customer)
The click depth is computed in two steps:
- during indexing the current fill-state of the reverse link index is
used to backtrack the current page to the root page. The length of that
backtrack is the clickdepth. But this does not discover the shortest
click depth. To get this, a second process to check again is needed
- added a process tag that can be used to do operations on the existing
index after a crawl; i.e. calculation the shortest clickpath. Added a
field to control this operation but not a method to operate on this.
- added a visualization of the clickpath length in the host browser
- fixed conflict in htroot/yacysearch.java
- removed nedres check because that causes that the remote server is not
called at all in most cases (local index has already results but we want
more)
- fixed a regex bug (a '=' too much)
- urlcitationindex != null check added to ResultEntry.referencesCount
- plus other places where conflicting procedure was used (and urlcitationindex not already checked != null)
- any system supporting opensearch specification can be configured
- search query is only forwarded to remote system if not enough results available on local peer
- discover function provided, checking the local Solr index for links to opensearchdescription files, to add to the config
- sample config file with some general search engines with opensearch support
that enables viewing of images for intranet SMB servers.
- added a filter search for protocol, tld and ext again; otherwise p2p
search produces a lot of rubbish
- an event type (once, regular) can be selected
- for this event type, a fixed time can be selected. This may be either
directly after startup or at one of the full hours at a day (==25
options)
The main point about this feature is the opportunity to start an action
directly after startup. That makes it possible to create YaCy
distributions which, after started at the first time, start to index
parts of the intranet/internet by itself.
proper/known
- automatically show the protocol navigation if there is more than http
and https
- automatically show the extension navigation if there is some media
content
introduce a copy-field for the author field to be copied to a string
field. This field is then used to generate facets. Without this field,
the facet would consist only of the words of the author names, not of
the full author string.
clicks which are necessary to get from the portal of a host to a
specific document. At this time, only the start document is flagged with
clickdepth '0', all other with '-1'. To get the actual clickdepth, a
process must use crawled information to collect the actual number of
clicks. This will be added in another/next step.
INCOMING links to the corresponding web page. This information is taken
from the reverse link index (a 'little sister' of the RWI index).
- this field can be of use to enhance the ranking because a web page
with more incoming links can be more more important than others. But
this is not true for typical link pages like menues. Therefore the
number of outgoing links is needed.
- added a new solr attribute 'bf' to solr queries which is a boost
function extension. this field can contain a formula which comuptes the
boost according to given field values. After some experiments the
following forumla is now default:
div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4
This takes the number of references and the inbound links. Further
experiments are needed to enhance that forumula.
I don't actually know if this is correct. It is a desperate try to get
YaCy running on production servers which must get alive even with
strange hacks like this. This is also related to a forum posting in
http://forum.yacy-websuche.de/viewtopic.php?t=4528&p=27135#p27135
This uses an enhanced version of the Nutch/Solr TextProfileSignatue.
As a result, a signature of the document is written to the solr search
index. Additionally for each time when a signature is written, it is
checked if the singature exists already in the index. If the signature
does not exist, the document is marked as unique. The unique attribute
can now be used to sort document lists and bring duplicates to the end
of a result list.
To enable this, a large portion of the search api to Solr had to be
changed. This affected mainly caching of 'exists' searches to enhance
the check for existing signatures and do this without actually doing a
solr query.
Because here the first time a long number is used as value in the Solr
store, also the value naming in the YaCySchema had to be adopted and
normalized. This caused that many files had to be changed.
the actual data which is fetched from solr.
- used the new field options to reduce generic options like getting the
load date or the count of search results. should increase overall speed
- used the new field options to reduce overhead in the host browser
during aquisition of links.
- used the field options to make checking of links in crawler faster
- if the crawler is paused, the crawl queue is not cleaned
need regular expressions as search attributes. They had now been removed
from the advanced search page while they are still created internally.
The filter is then expressed against solr as regular expression filter
query. If the expression points out a selection of an specific protocol,
host or filetype this is then translated into a facetted query.
added solr highlighting / YaCy snippets to YaCy search results
- facets are now much more complete
- facets are computed and searched much faster
- snippet computation is done by solr if solr knows the snippet
- only visible for admins
- a faceted search generates a huge list for all hosts in the host list
- the faceted search algorithms had to be modified for that
- within the browsing of the directory path, the error cause is written
to the url which is presented as error-url
- the errors are also accumulated for directory sums
deleted which are selected by a crawl filter (host or subpath)
- site crawl used this option be default now
- made option to deleteDomain() concurrency
- language default on missing lang property to "uk" (fix set to nothing)
- language set to TLD (added call to existing language calculation from TLD)
- coordinate number exception on possible lat/lon content of "NaN,NaN"
adjust Netbeans IDE classpath (for Solr/Lucene 4.0.0 jars)
- time-out after 3 seconds to speed up display (may be incomplete)
- showing also all links from the balancer queue in the host list (after
the '/') and in the result browser view with tag 'loading'