for internal Solr indexes. This is mainly caused by the fact that an
external Solr may also be a SolrCloud, which does not support
LukeRequests, which are needed to request the old schema.
was optional and an alternative metadata store was available. Since
that store is now removed, Solr is always available (internally or
externally).
and adjusted the XSLT for Solr snippets (&hl=true) to decode the
XML-encoded HTML <b> tag by adding disable-output-escaping
(still open: the item description may appear twice, as dc: tag and as
rss.description tag)
- Web servers may now deliver YaCy-specific HTTP header fields with a
title and keywords. The new HTTP header fields are:
X-YaCy-Media-Title - to be used for media (image, audio, video) titles
X-YaCy-Media-Keywords - to be used for media (image, audio, video)
keywords
- both fields are written to the document fields title and keywords and
are also searched during image search.
- to make the usage of arbitrary HTTP header fields (including these
new fields) possible in the /api/push_p.json servlet, a new POST
argument is introduced to push HTTP header fields. The new POST
attribute is named "responseHeader-X" (where X is a counter). It is
allowed to use this attribute as a multi-attribute several times; each
instance can be filled with one HTTP header line.
- see /api/push_p.html for examples, and the sketch below
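For illustration, a hedged sketch of such a push request in Java (the
host, port and exact multipart layout are assumptions; the parameter
names responseHeader-0/1 follow the scheme described above):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class PushHeaderSketch {
        public static void main(String[] args) throws Exception {
            String boundary = "----yacy" + System.nanoTime();
            StringBuilder body = new StringBuilder();
            // each responseHeader-X attribute carries one http header line
            String[][] fields = {
                {"url", "http://example.org/media/video.mp4"},
                {"responseHeader-0", "X-YaCy-Media-Title: Example Video"},
                {"responseHeader-1", "X-YaCy-Media-Keywords: example, demo"}};
            for (String[] f : fields) {
                body.append("--").append(boundary).append("\r\n")
                    .append("Content-Disposition: form-data; name=\"")
                    .append(f[0]).append("\"\r\n\r\n")
                    .append(f[1]).append("\r\n");
            }
            body.append("--").append(boundary).append("--\r\n");
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8090/api/push_p.json"))
                .header("Content-Type",
                        "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofString(
                        body.toString(), StandardCharsets.UTF_8))
                .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }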
force page navigation in image search to be displayed below the results
for any number of displayed images, instead of to the right of the last
image.
Filter queries are not actually related to ranking, but user requests
have pointed out that specific boost queries to move results to the end
of the result list are not sufficient. Such boost filters may be better
executed as an actual filter, and therefore such a filter can now be
statically applied to every search request. A typical use could be the
expression "http_unique_b:true AND www_unique_b:true", which uses the
recently introduced fields http_unique_b and www_unique_b; these are
true only for one of the alternatives with/without http(s) and
with/without the prefix 'www.' in host names.
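As an illustration, a minimal sketch of how such a static filter could
be attached to a query using the SolrJ API (assuming SolrJ on the
classpath; the query string is an example):

    import org.apache.solr.client.solrj.SolrQuery;

    public class StaticFilterSketch {
        public static void main(String[] args) {
            SolrQuery query = new SolrQuery("example search words");
            // a filter query (fq) restricts the result set without
            // influencing the ranking of the remaining documents
            query.addFilterQuery("http_unique_b:true AND www_unique_b:true");
            System.out.println(query); // q=example+search+words&fq=...
        }
    }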
The collection field (which can be filled, e.g., in Crawl Start) can be
used to add categories to YaCy index entries. The usage of that field
was restricted to Solr searches and POST argument filters as
implemented in commit f7571386a3.
This commit extends collections to a full navigation option in the
standard YaCy search interface. The field is not active by default but
can be activated easily in the /ConfigSearchPage_p.html servlet (just
check the 'Collection' facet field). Collections can now be used for (at
least) two purposes:
- to provide search tenants (through post argument collection)
- to provide self-made category navigation
Search requests may now have (independently of the switched on or off
collection facet) a "collection:<collection-name>" modifier attached;
furthermore, collection names may use disjunctions with the '|' pipe
symbol. For example, this is a valid search request:
www collection:user|proxy
crawling to the YaCy indexer. Files are uploaded using POST multipart
requests; multiple file uploads are possible as well. Each file has
the file date and MIME type attached, which are used to get the right
parser for the submitted data. Also a URL is submitted which is
assigned to the document.
The CrawlSwitchboard has a new option for default Crawl Profiles which
are assigned dynamically from the new push interface.
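Extending the multipart sketch shown earlier, a file part would
additionally carry a filename and a Content-Type so the indexer can
pick the right parser (a sketch; 'data-0' is an assumed field name, not
a confirmed parameter):

    public class PushFilePartSketch {
        public static void main(String[] args) {
            String boundary = "----yacy-example";
            StringBuilder body = new StringBuilder();
            // one file part: filename and Content-Type let the indexer
            // choose the right parser for the submitted data
            body.append("--").append(boundary).append("\r\n")
                .append("Content-Disposition: form-data; name=\"data-0\"; ")
                .append("filename=\"page.html\"\r\n")
                .append("Content-Type: text/html\r\n\r\n")
                .append("<html><body>pushed content</body></html>\r\n")
                .append("--").append(boundary).append("--\r\n");
            System.out.println(body);
        }
    }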
- unique-postprocessing was destroying results from other
postprocessings; removed cross-updates as they were not necessary
- unique-postprocessing did not restrict on same protocol
- inefficient concurrent update cache was redesigned completely
- increased limits for concurrent blocking queues to prevent early
time-out
Local files can be crawled (intranet mode); URL parsing was fixed
according to RFC 1738 (for Unix and Windows):
for Windows like file:///c:/tmp or file://localhost/c:/tmp
for Linux like file:///tmp or file://localhost/tmp
The host is ignored and the path must be absolute.
Transparently adding text_t as a pseudo default search field if no
field name (no ':') is included in the query,
addressing bug report http://mantis.tokeek.de/view.php?id=274.
Besides adjustments in code, this makes the servlet settings in web.xml
significant. This applies to the Solr, GSA and proxy servlets. There is
no longer a default setup in code during init (as Jetty 9 checks for
double definitions).
*) added JSON support
*) fixed Exception in case of missing parameters
*) renamed parameter for items in "add entry" and "delete entry" from
"entry" to "item" to match term in XML
This organizes all URLs to be loaded in separate queues for each host.
Each host separates the crawl depth into its own queue. The primary
rule for URLs taken from any queue is that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermore, the crawl is able to create a much better balancing over
all hosts, which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create tens of thousands of files. To be able to use that many file
readers, it was necessary to implement a new index data structure which
opens the file only if an access is wanted (OnDemandOpenFileIndex). The
usage of such an on-demand file reader shall prevent that the number of
file pointers exceeds the system limit, which is usually about 10,000
open files. Some parts of YaCy had to be adapted to handle the crawl
depth number correctly. The logging and the IndexCreateQueues servlet
had to be adapted to show the crawl queues differently, because the
host name is attached to the port on the host to differentiate between
http, https, and ftp services.
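The on-demand idea can be illustrated with a minimal sketch (this is
not the actual OnDemandOpenFileIndex code; only the concept is taken
from the text): the file handle is opened per access and closed again
immediately, so idle queue files occupy no file descriptors.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class OnDemandFileReader {
        private final String path;

        public OnDemandFileReader(String path) {
            this.path = path;
        }

        public byte[] read(long offset, int length) throws IOException {
            // open the backing file only for the duration of one access
            try (RandomAccessFile raf = new RandomAccessFile(this.path, "r")) {
                byte[] buffer = new byte[length];
                raf.seek(offset);
                raf.readFully(buffer);
                return buffer;
            } // closed again here; no file pointer stays occupied
        }
    }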
- added order option to solr queries to be able to retrieve document
lists in specific order, here: link length
- added HyperlinkEdge class which manages the link structure
- integrated the HyperlinkEdge class into clickdepth computation
- extended the linkstructure.json servlet to show also the clickdepth
and other statistic information
- added a servlet api/linkstructure.json which generates a link graph
information in json
- added a javascript link graph renderer hypertree.js using d3 and the
new servlet linkstructure.json
- embedded the new link graph in the crawler monitor and the host
browser
a document. This is the upper limit for the clickdepth_i value, which
may be smaller in case the crawler did not take the shortest path to
the document.
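For illustration, a hedged sketch of a clickdepth computation as a
breadth-first search over the link graph (simplified; YaCy's actual
implementation works on its own graph classes such as HyperlinkEdge):

    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Queue;

    public class ClickdepthSketch {
        // illustrative link graph: url -> outgoing links
        static final Map<String, List<String>> LINKS = Map.of(
            "http://example.org/",
                List.of("http://example.org/a", "http://example.org/b"),
            "http://example.org/a", List.of("http://example.org/deep"),
            "http://example.org/b", List.of());

        // the first time a url is reached, its depth is the length of
        // the shortest click path from the host root
        static Map<String, Integer> clickdepth(String root) {
            Map<String, Integer> depth = new HashMap<>();
            Queue<String> queue = new ArrayDeque<>();
            depth.put(root, 0);
            queue.add(root);
            while (!queue.isEmpty()) {
                String url = queue.remove();
                for (String target : LINKS.getOrDefault(url, List.of())) {
                    if (!depth.containsKey(target)) {
                        depth.put(target, depth.get(url) + 1);
                        queue.add(target);
                    }
                }
            }
            return depth;
        }

        public static void main(String[] args) {
            System.out.println(clickdepth("http://example.org/"));
            // the /deep page gets clickdepth 2
        }
    }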
principles:
- provide an easy tutorial-like "what should I do first" menu
- provide all elements which are subject to most first questions of
YaCy exhibition visitors on the top level: resource limitation, parser
and ranking settings
I apologize to everyone who is used to the old style and needs to find
the menu items (again) after this change. I hope that this will make
the interface more usable for new users who see a web indexer/crawler
for the first time.
- they are incomplete and buggy
- it was not easy to explain
- it did not comply with a KISS strategy
- setting performance to low priority actually caused a peer to crash
- there was nobody who would maintain that functionality
which is the location for the search input frame. The new embedded
header is a clone of the simpleHeader with removed menus and without a
link to the index administration.
possible
- re-implemented formerly jquery-based tagcloud with simple html
- fixed minimized menu for mobile/minimized view (try that: make window
very narrow...)
- adopted all YaCy administration pages to new framework
- created new search page layout (working, but still work in progress)
- old skin files are fully applicable! (and looking good)
- target is a new style based on bootstrap examples, see /test.html
- icons in YaCy may be replaced by glyphicons (to be done)
Config settings from the page were also removed from yacy.init:
augmentation.reflect
augmentation.addDoctype
augmentation.reparse
interaction.overlayinteraction.enabled
- should be investigated in more detail to look for additional implications
Remove "yacyaction" from proxyservlet as it was only needed for removed interaction routines.
- find the best fitting identifier (URL) by checking all given
dc:identifier entries in the record (many entries provide several
identifiers)
- as identifier is currently a multi-valued field, use "getParams" in
preference to splitting the first string by ";"
- add resolving of DOI:... identifiers via http://dx.doi.org/
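A minimal sketch of the DOI resolution step, assuming identifiers of
the form "DOI:10.xxxx/yyyy" (the exact prefix handling is illustrative):

    public class DoiResolverSketch {
        // turn an identifier like "DOI:10.1000/182" into a resolvable url
        static String resolveDoi(String identifier) {
            if (identifier.regionMatches(true, 0, "DOI:", 0, 4)) {
                return "http://dx.doi.org/" + identifier.substring(4).trim();
            }
            return identifier; // already a url or another identifier type
        }

        public static void main(String[] args) {
            System.out.println(resolveDoi("DOI:10.1000/182"));
            // prints http://dx.doi.org/10.1000/182
        }
    }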
processes in favor of throwaway processes. The control mechanism
reports a 'queue full' message to the busy loop less often, so the loop
does not perform a long busy wait; instead all requests are queued and
new loader processes are started if necessary, up to a given limit (as
set before).
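The queueing pattern can be illustrated with a fixed-size pool (a
sketch, not YaCy's actual loader code; the limit is an example value):
all requests are queued, and worker threads are started lazily up to
the limit.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class LoaderPoolSketch {
        public static void main(String[] args) {
            int limit = 16; // assumed maximum number of concurrent loaders
            ExecutorService loaders = Executors.newFixedThreadPool(limit);
            for (int i = 0; i < 100; i++) {
                final int job = i;
                // requests are queued instead of busy-waiting; threads
                // are created on demand, never more than 'limit'
                loaders.submit(() -> System.out.println("loading job " + job));
            }
            loaders.shutdown();
        }
    }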
instead of TreeMaps)
- reduced the memory footprint of database indexes (by introducing
optimize calls)
- optimize calls shrink the amount of memory used for index sets if
they are not changed afterwards any more
different from normal requests. This happens if the remote Solr is
actually a SolrCloud; in such cases the Luke request returns only the
result of the single Solr peer, not that of the whole cloud.
Also done: some refactoring.
The resource observer is now able to recognize free disk space AND the
space available for YaCy. The amount of space which is assigned for
YaCy is defined in new settings in the configuration file.
Furthermore, there is now a cleanup process which deletes files in case
autodelete is activated. Autodelete is now BY DEFAULT ON if the disk
space is low, which means that YaCy starts to delete documents when the
disk is full!
admin, we expect that the user wants to check recently crawled
documents to ensure that recent crawl results are inside the search
results, so we do a soft commit here.
occupied disk space. These values are also shown on the status page.
The disk space calculation shall be used for a disk limitation of the
search index.
introduced, it was also used for search facets. The generic search
facets are now deduced from generic Solr fields, which makes Jena as a
tool for facet semantics superfluous.
This works fine to restrict the language for local Solr searches.
More work needs to be done to make RWI/remote searches respect the
modifier.language restriction.
It is now possible to assign other YaCy peers to every peer as remote
Solr servers, which are only used for read operations during search.
This also affects crawling: it will exclude URLs from crawls which
already exist on remote Solr/remote YaCy peers.
- refactored all code which uses URIMetadataRow as the standard for
word hash length and word hash ordering and moved that to the class
'Word', because the class URIMetadataRow defines the old metadata data
structure and should be superfluous in the future
- removed unused methods from URIMetadataRow as preparation for further
removal of that class
- since the specific heuristics Twitter & Blekko are no longer
available or are redundant with OpenSearchHeuristic, adjusted
ConfigHeuristic to use OpenSearchHeuristic settings only.
For this the default OSD search target list is made available (copied)
by default and the other configs are removed.
- the return value of QueryGoal.getOriginalQueryString included the
query modifiers, which are held separately in a modifier object, but in
most (all) cases just the query term is expected; clarified and renamed
it to QueryGoal.getQueryString, which returns just the search term (if
needed, a .getOriginalQueryString could be implemented in
QueryParameters, adding the modifiers)
- started to adjust internal HTML href references from absolute to
relative (currently it is mixed).
For future development we should prefer relative href targets (less
trouble with context-aware servlets)
as BASIC, where the password is transmitted in near clear text
(Base64-encoded).
This has some implications, as RFC 2617 requires and recommends a
password hash MD5(user:realm:pwd) for DIGEST.
!!! before activating DIGEST you have to reassign all passwords !!! to
allow a new calculation of the hash
- default authentication is still BASIC
- configuration at this time only manually in (DATA/settings) or
defaults/web.xml (<auth-method>)
- the realm name is in defaults/yacy.init: adminRealm=YaCy-AdminUI
- fyi: the realm name is shown on the login screen
- changing the realm name invalidates all passwords - but for security
you are encouraged to do so (as localhostadmin)
- implemented to support both old hashes for BASIC and new hashes for
BASIC and DIGEST
- to differentiate old/new hashes, the hash prefix "MD5:" used in Jetty
is used for new password hashes ("MD5:hash")
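The RFC 2617 hash mentioned above can be computed as follows (a sketch;
the user name, realm and password are example values):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class DigestHashSketch {
        public static void main(String[] args) throws Exception {
            String user = "admin", realm = "YaCy-AdminUI", pwd = "secret";
            // RFC 2617 HA1: MD5 over "user:realm:password"
            byte[] digest = MessageDigest.getInstance("MD5")
                .digest((user + ":" + realm + ":" + pwd)
                    .getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            // stored with the Jetty-style prefix marking the new format
            System.out.println("MD5:" + hex);
        }
    }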
- all non-dht targets (previously separated into 'robinson' for
dht-like queries and 'node' for solr queries) are now 'extra' peers,
which are queried using Solr
- these extra-peers are now selected using a ranking on last-seen,
peer-tag-matches, node-peer flags, peer age, and link count. The ranking
is done using a weight and a random factor.
- the number of extra peers is 50% of the dht peers
- the dht peers now exclude too young peers to prevent bad results
during strong growth of the network
- the number of dht peers (and therefore extra peers) is reduced when
the memory of the peer is low and/or some documents still appear in the
indexing queue. This shall prevent deadlocks on a peer when p2p queries
are made in a fast sequence on weak hardware.
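A hedged sketch of the selection idea (the field names and weights are
purely illustrative; the text above only states that a weighted ranking
with a random factor over last-seen, tag matches, node flags, age and
link count is used):

    import java.util.Comparator;
    import java.util.List;
    import java.util.Random;

    public class ExtraPeerSelectionSketch {
        // hypothetical peer record; not YaCy's actual Seed class
        record Peer(String name, long minutesSinceSeen, boolean tagMatch,
                    boolean nodeFlag, long ageDays, long linkCount) {}

        public static void main(String[] args) {
            Random rnd = new Random();
            List<Peer> candidates = List.of(
                new Peer("peer-a", 5, true, true, 400, 1_000_000),
                new Peer("peer-b", 600, false, false, 2, 1_000));
            int dhtPeerCount = 4;
            int extraCount = dhtPeerCount / 2; // 50% of the dht peers
            List<Peer> extra = candidates.stream()
                .sorted(Comparator.comparingDouble((Peer p) ->
                        -p.minutesSinceSeen()           // prefer recently seen
                        + (p.tagMatch() ? 500 : 0)      // prefer tag matches
                        + (p.nodeFlag() ? 200 : 0)      // prefer node peers
                        + Math.min(p.ageDays(), 100)    // prefer older peers
                        + Math.log(1 + p.linkCount()) * 10
                        + rnd.nextDouble() * 100        // random factor
                ).reversed())
                .limit(extraCount)
                .toList();
            extra.forEach(p -> System.out.println("selected: " + p.name()));
        }
    }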
- the admin user name can be configured, in apiExec calls the default "admin" username is used.
TODO: the bin/apicall.sh script should likely take that into account.
as path for Solr index dumps (instead of the SEGMENTS path). This will
make maintenance of index backups easier. It will also provide a tool
to migrate from a freeworld index to a webportal index.
via the Jetty IPAccessHandler to allow only configured IPs to access.
The handler is only loaded if a restriction is configured.
Since IPAccessHandler (Jetty 8) does not support IPv6, the system
property java.net.preferIPv4Stack=true is set.
Testing showed that System.setProperty seems to be sensitive to the
point of calling (the earliest possible time seems to be best = early
in yacy.main).
Moved the "is running..." just-open-browser check also to the new
routine, to pre-read yacy.config only once.
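A sketch of the handler wiring (based on the Jetty 8 IPAccessHandler
API; the port and the whitelisted address are example values):

    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.handler.IPAccessHandler;

    public class AccessRestrictionSketch {
        public static void main(String[] args) throws Exception {
            // IPAccessHandler is IPv4-only, hence the property mentioned
            // above; set it as early as possible
            System.setProperty("java.net.preferIPv4Stack", "true");
            Server server = new Server(8090); // example port
            IPAccessHandler access = new IPAccessHandler();
            access.addWhite("127.0.0.1"); // only configured IPs may connect
            // access.setHandler(...);    // wrap the real handler chain here
            server.setHandler(access);
            server.start();
        }
    }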
- added a default filename filter to the select field (as only
additions to the *.black list are permanent)
- modified the Blacklist_p header/legend to show all active blacklists
(to support the understanding that all configured lists are active)
- removed obsolete code in the Blacklist_p servlet
- a user entry in UserDB with admin right can log in to access
protected pages
- ditto for the admin user; the chosen username is stored in the config
(adminAccountUserName=)
regular expressions cause no results. Usage of '*' followed by a dot
and any expression will now cause that expression to be used as a
filetype search.
!!! attention !!! to make sure YaCy can start, https will be disabled
if port 8443 is already in use
- added a ping test for the above to the migration
- as of now the port for https is hardcoded to the default 8443
- if not urgently required I'd leave it this way (it's standard) to use
different ports for http and https
- post the https port on ConfigBasic.html (if active)