the parser initialization. To make the apk parser usable, the handling
of application type links had to be modified. Now all documents which
do not have a parser attached are placed into the noload-queue, while all
other documents are parsed using the associated parser class. This may
have side effects on other parsers and the display of different file
classes (images, apps, videos).
solr to the YaCy built-in solr search servlet. It is not complete and not
fully correct (there is still a UTF-8 encoding problem), but it is a way
to easily get requests forwarded through YaCy to an external Solr.
it is now possible to get the results in two steps:
- first retrieve all IDs as given for a query
- then retrieve each document individually
This was necessary for very large result sets where a query may run for
hours and is possibly terminated by a Solr-internal timeout. This occurs
regularly during postprocessing, therefore this commit may fix
unwanted postprocessing terminations.
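A minimal sketch of this two-step retrieval with a plain SolrJ client; the field name "id", the rows value, and the method name are illustrative and not taken from this commit:

    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.common.SolrDocument;

    // step 1: fetch only the IDs, step 2: load each document individually
    static void twoStepRetrieval(final SolrClient solrClient) throws SolrServerException, IOException {
        final SolrQuery idQuery = new SolrQuery("*:*");
        idQuery.setFields("id");      // no document payload, IDs only
        idQuery.setRows(100000);      // illustrative upper bound
        for (final SolrDocument idDoc : solrClient.query(idQuery).getResults()) {
            final String id = (String) idDoc.getFieldValue("id");
            final SolrDocument doc =
                solrClient.query(new SolrQuery("id:\"" + id + "\"")).getResults().get(0);
            // ... postprocess doc here; each request is small and cannot hit a long query timeout ...
        }
    }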
hour. The table can be shown with
http://localhost:8090/Tables_p.html?table=stats
The entries have the following meaning:
aM: activeLastMonth
aW: activeLastWeek
aD: activeLastDay
aH: activeLastHour
cC: countConnected (Active Senior)
cD: countDisconnected (Passive Senior)
cP: countPotential (Junior)
cR: count of the RWI entries
cI: size of the index (number of documents)
The entry keys are abbreviated to save space in the table, because the
key name is written again for every row.
This is the beginning of a 'yacystats' micro-alternative as a built-in
function in YaCy. Graphics may follow after some time if enough test
data is available.
Found a situation after a crash (reboot) with an existing running semaphore but YaCy not running.
The ping then generated an exception which finally deleted the conf file (during the pre-read procedure).
- changed the ping (catching the exception solved it)
- additionally removed the deletion of the yacy.conf file (if needed we should make a backup first)
when the buffer is cleaned after a buffer flush whose content is not yet
available in Solr because Solr is waiting for a commit. In such cases the
counter would run backwards, which is now prevented by ignoring the buffer
size.
solr+buffer and the hit cache. This shall help during first crawls to
see a running document counter even if there has been no commit to Solr
in the meantime. To support that strategy, the hit cache must be written earlier.
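A minimal sketch of the intended counter behaviour, under the assumption that the visible size is the committed Solr count plus the uncommitted buffer and must never run backwards; the names are made up, not the actual YaCy methods:

    // sketch: a document counter that never runs backwards between commits
    static long visibleDocumentCount(final long committedSolrCount,
                                     final long bufferedDocumentCount,
                                     final long lastShownCount) {
        // committed documents plus documents still waiting in the buffer
        final long visible = committedSolrCount + bufferedDocumentCount;
        // if a buffer flush removed documents that are not yet searchable in Solr,
        // keep the previous value so the counter never runs backwards
        return Math.max(lastShownCount, visible);
    }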
during document parsing; instead use the same references that would also
be written into the webgraph. This should ensure that the webgraph and
the citation index express exactly the same semantics.
string which is visible in the browser. This makes it possible for the
browser to instruct the user how to change a forgotten admin password
(at runtime).
attributes are attached to artificially constructed index.html files which
list directories. Such files are naturally rejected by the crawler and
should not appear in the error log, because these files are part of the
construction of file crawls and confuse users if they see them in the
error log.
- default mySeed.ip to hostip in SeedDB.initMySeed() in Intranet mode
this allows a peer to reach senior status in an intranet-hosted search network with few peers;
otherwise the peer would stay junior because of the default init with the loopback IP as public (dna) IP.
- use CommonParams and DisMaxParams constants
- fix typo in get sort parameter
- getDocumentCountByParams was a redundant implementation with the risk of a non-optimized call (rows parameter unspecified) -> as it is only used from getCountByQuery it was removed from the interface
see http://mantis.tokeek.de/view.php?id=437
test result (concurrency=7)
2000 docs = OOM always
1000 docs = OOM always
100 docs = OOM never
chosen -> 200 docs (no OOM encountered during the test with a 1GB memory setting)
- type detection (RSS/Atom)
- the init type parameter was overwritten during parsing; the parameter is obsolete
- detection by end tag changed to a simpler first-tag evaluation
- channel image not used, removed the related extra parser handling
- removed unused code (set/getImage) in rssfeed
- atom link extraction now accounts for possibly multiple link tags (see the sketch after this list)
- the spec limits the link to one with rel="alternate" or one without a rel attribute,
not accounting for the following type & hreflang exception yet:
o atom:entry elements MUST NOT contain more than one atom:link
element with a rel attribute value of "alternate" that has the
same combination of type and hreflang attribute values.
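A rough sketch of that selection rule using plain DOM elements; the helper name and types are illustrative only:

    import java.util.List;
    import org.w3c.dom.Element;

    // prefer the atom:link with rel="alternate"; otherwise fall back to a link without a rel attribute
    static String selectAtomLink(final List<Element> linkElements) {
        String withoutRel = null;
        for (final Element link : linkElements) {
            final String rel = link.getAttribute("rel");   // empty string if the attribute is absent
            if ("alternate".equals(rel)) return link.getAttribute("href");
            if (rel.isEmpty() && withoutRel == null) withoutRel = link.getAttribute("href");
        }
        return withoutRel;
    }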
work now, at least it does on my network. The UPnP code in YaCy can still
be improved though (see TODO comment: make the port on the gateway configurable
or find a free one).
*) removed old code
*) added new lib
*) changed code to work with new lib
formulated as an edismax query, but this was not set as a query attribute. The
defType=edismax property needs a qf field, so this was added as well. Do
not remove that field again! This also fixes a problem with title-unique
computation.
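For illustration, the parameter combination with SolrJ and the DisMaxParams constant; the boost field list is just an example, not the peer's actual configuration:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.common.params.DisMaxParams;

    SolrQuery query = new SolrQuery("yacy");
    query.set("defType", "edismax");                      // formulate the request as an edismax query
    query.set(DisMaxParams.QF, "text_t^1.0 title^5.0");   // edismax needs a qf field list (example values)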
filled with the date at which the URL is to be recognized as outdated. That
field was partly misinterpreted and a time interval was filled in instead. In
case all the URLs which are in the index shall be treated as
outdated, the field is now filled with Long.MAX_VALUE, because then all
crawl dates are before that date and therefore outdated.
according to the ranking profile's boost fields defined by the peer (if df/qf is not specified in the query).
This allows for pretty simple queries (q=word) without the need to know about the specific index configuration.
It makes sure all relevant fields (as determined by the index owner) are searched, still maintains the option to query specific fields,
and does not rely on the duplication of text to text_t.
- add author to the reset-default boost fields (to support results for the author navigation)
- with the search modifier /heuristic a request is sent to all configured opensearch target systems (the old /heuristic/blekko modifier is no longer valid)
- this allows using the opensearch heuristic on individual search requests (in contrast to the configuration HEURISTIC_OPENSEARCH=true which sends an OSD request on all global searches)
- the index.html search option text is adjusted to be displayed only if the option is configured
- add Archive-It to the predefined systems
attribute in the <a> tag for each crawl. This introduces a lot of
changes because it extends the usage of the AnchorURL object type, which
now also has a different toString method than the underlying
DigestURL.toString. It is therefore not advised to use .toString at all
for URLs; just use toNormalform(false) instead.
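For illustration (anchorUrl stands for any AnchorURL or DigestURL instance):

    // do not use .toString() on URL objects any more;
    // AnchorURL.toString() no longer matches DigestURL.toString()
    String urlString = anchorUrl.toNormalform(false);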
with metadata retrieval from connectors directly. This should lead to
better usage of the cache. The metadata cache is automatically increased
if more memory is available.
'cloud' was a bad idea. Also changed the accumulation process for peer
targets so that a DHT chunk is no longer assigned the whole set of redundant
targets; instead chunks are assigned to redundant targets individually. This
enhances the granularity of the target accumulation and should improve
the efficiency of the process. Finally, the DHT protocol client was
enriched with the ability to remove the 'accept remote index' flag from
peers or to remove peers completely if they do not answer at all.
generic parser but extracts links like the htmlParser. This should be
used for ASCII documents without a known text format annotation, like
source code files or JSON documents. Probably also good for XML files
without a known schema.
- close the stackfile inputstream at the end of ChunkIterator
This should solve the startup delay while unfinished crawl jobs exist (maybe also the 'too many open files' situation)
was optional and an alternative metadata store was available. Since
that store is now removed, Solr is always available (internally or
externally).
and adjusted the xslt for solr snippets (&hl=true) to decode the xml-encoded html <b> tag by adding disable-output-escaping
(still open: the item description may appear twice, as dc: tag and as rss.description tag)
- the author facet is based on the omitted author_sxt field
- adjusted to make the author navigation available if the author field exists, but keep using author_sxt to construct the facet (why!?)
- added a check for the author query modifier in SearchEvent
- Web servers may now deliver YaCy-specific HTTP header fields with a
title and keywords. The new HTTP header fields are:
X-YaCy-Media-Title - to be used for media (image, audio, video) titles
X-YaCy-Media-Keywords - to be used for media (image, audio, video)
keywords
- both fields are written to the document fields title and keywords and
are also searched during image search (a small sketch follows after this list)
- to make the usage of arbitrary HTTP header fields (including these new
fields) possible in the /api/push_p.json servlet, a new POST argument is
also introduced to push HTTP header fields. The new POST attribute is
named "responseHeader-X" (where X is the counter). It is allowed to use
this attribute as a multi-attribute several times; each can be filled with
an HTTP header line.
- see /api/push_p.html for examples
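A minimal sketch of how a web server could deliver the new media fields, e.g. from a Java servlet; the title and keyword values are placeholders:

    import javax.servlet.http.HttpServletResponse;

    static void addYaCyMediaHints(final HttpServletResponse response) {
        response.setHeader("X-YaCy-Media-Title", "Sunset over the harbour");
        response.setHeader("X-YaCy-Media-Keywords", "sunset harbour sea");
    }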
- to allow manually renewing the index content for this URL (e.g. in case it is a remote search result with metadata only)
- simply use a QuickCrawlLink_p javascript snippet (minimalistic first solution)
Filter queries are not actually related to ranking, but user requests
have pointed out that specific boost queries to move results to the end
of the result list are not sufficient. Such boost filters may be better
executed as an actual filter, and therefore such a filter can now be
statically applied to every search request. A typical use could be the
expression "http_unique_b:true AND www_unique_b:true", which uses the
recently introduced fields http_unique_b and www_unique_b; these are true
only for one of the alternatives with/without http(s) and with/without
the prefix 'www.' in host names.
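Expressed as a SolrJ sketch, such a static filter is simply attached as a filter query to every request (the fields are the ones named above; the query string is just an example):

    import org.apache.solr.client.solrj.SolrQuery;

    SolrQuery query = new SolrQuery("example search");
    // keep only documents that are unique with respect to http(s) and the 'www.' prefix
    query.addFilterQuery("http_unique_b:true AND www_unique_b:true");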
The collection field (which can be filled e.g. in Crawl Start) can be used to
add categories to YaCy index entries. The usage of that field was
restricted to solr searches and post argument filters as implemented in
commit f7571386a3.
This commit extends collections to a full navigation option in the
standard YaCy search interface. The field is not active by default but
can be activated easily in the /ConfigSearchPage_p.html servlet (just
check the 'Collection' facet field). Collections can now be used for (at
least) two purposes:
- to provide search tenants (through post argument collection)
- to provide self-made category navigation
Search requests may now have (independently of whether the collection
facet is switched on or off) a "collection:<collection-name>" modifier attached;
furthermore, collection names may be combined disjunctively using the '|' pipe
symbol. For example, this is a valid search request:
www collection:user|proxy
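For illustration only, such a modifier roughly corresponds to a disjunctive Solr filter on the collection field; the field name collection_sxt and the exact translation are assumptions, not taken from this commit:

    import org.apache.solr.client.solrj.SolrQuery;

    // "www collection:user|proxy": search for "www", restricted to either collection
    SolrQuery query = new SolrQuery("www");
    query.addFilterQuery("collection_sxt:(user OR proxy)");   // the '|' in the modifier becomes a disjunction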
crawling to the YaCy indexer. Files are uploaded using POST multipart
requests; multiple file uploads are possible as well. Each file has the
file date and MIME type attached, which is used to select the right
parser for the submitted data. Also a URL is submitted which is
assigned to the document.
The CrawlSwitchboard has a new option for default Crawl Profiles which
are assigned dynamically from the new push interface.
requests. The internal representation of post-arguments is String and
therefore not appropriate for byte[] objects as submitted by file pushes.
Therefore all pushed files are encoded to base64 _after_ uploading with
an HTTP form (you do not need to do that encoding yourself) to hand
over the byte[] as a String in the post argument.
Servlets which read such files must decode the base64 data to get the
original byte[] array.
This is considered a temporary solution for file uploads; a proper
implementation would need to hand over all attributes as objects with
either String or byte[] instances. That would be a major code change
and is not done here at this time. This was committed to enable the
feature pushed with the next commit.
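A minimal sketch of the receiving side, assuming the pushed file arrives as a base64 String in a post argument; the argument name "data" and the post accessor are only examples:

    import java.util.Base64;

    // the HTTP form delivers the file content base64-encoded as a String post argument
    final String encoded = post.get("data", "");                    // example argument name
    final byte[] fileContent = Base64.getDecoder().decode(encoded); // restore the original byte[] array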
use an alternative delete to fight the symptom (and fix the deletion of host dirs on startup)
Root cause (which class holds a lock on .stack) not found.
http://mantis.tokeek.de/view.php?id=404
- unique-postprocessing was destroying results from other
postprocessings; removed cross-updates as they were not necessary
- unique-postprocessing did not restrict to the same protocol
- the inefficient concurrent update cache was redesigned completely
- increased limits for concurrent blocking queues to prevent early
time-outs
local files can be crawled (intranet mode); URL parsing fixed according to RFC 1738 (for Unix and Windows):
for Windows, e.g. file:///c:/tmp or file://localhost/c:/tmp
for Linux, e.g. file:///tmp or file://localhost/tmp
The host is ignored and the path must be absolute.
in peer hash hashing. This should not change anything if Java casts long
to int by masking with 0xFFFFFFFFL, but you never know. The important
thing is that hashCode() should not return numbers in the same order as
the hash code order, because hashing of seeds is used to remove the
order in some places.
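For illustration, the explicit masking cast in question (method and variable names are made up):

    // explicit masking before the narrowing cast; the result intentionally does not
    // preserve the order of the peer hashes, since seed hashing is used to remove that order
    static int toIntHash(final long seedHash) {
        return (int) (seedHash & 0xFFFFFFFFL);
    }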
Besides adjustments in code, this makes the servlet settings in web.xml significant.
This applies to the solr, gsa and proxy servlets. There is no longer a default setup in code during init (as Jetty 9 checks for double definitions).
After import the content is never accessed but may take up a lot of disk space;
also getLoadedOAIServer (which lists the files in surrogate out) is not used.
This makes the surrogate.out obsolete. Removed keeping of the XMLs after import.
On proxied pages a link to exit the proxy is added to the top of the page.
The link text can be configured via a web.xml init-parameter (see default/web.xml). If it is missing, no link is displayed.
- some outboundlinks_anchortext_txt entries in the index contain e.g. <span>text</span> or more tags;
remove all tags for the text property (inline img tags are still parsed)
- added a test case for the above (to htmlParserTest)
- fix solr test case
- the doublecheck cache now records the crawl depth as well
- the doublecheck cache is available from the outside (made static)
- no more need to crawl hosts with the lowest depth first; instead all hosts
which have only singleton entries are preferred to reduce the number of
files.
This organizes all URLs to be loaded into separate queues for each host.
Each host separates the crawl depth into its own queue. The primary
rule for URLs taken from any queue is that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermore, the crawl is able to achieve a much better balance over
all hosts, which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create tens of thousands of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only if an access is wanted (OnDemandOpenFileIndex). The usage of
such an on-demand file reader shall prevent the number of file handles
from exceeding the system limit, which is usually about 10,000 open
files. Some parts of YaCy had to be adapted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adapted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.
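A rough sketch of the on-demand idea behind OnDemandOpenFileIndex, not the actual YaCy class: the object keeps only a file reference and opens and closes the file for every access, so thousands of queue files do not consume thousands of file handles.

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    class OnDemandFileSketch {
        private final File stackFile;

        OnDemandFileSketch(final File stackFile) {
            this.stackFile = stackFile;
        }

        // read a record at a given position, opening the file only for this single call
        synchronized byte[] readRecord(final long position, final int recordSize) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile(this.stackFile, "r")) {
                final byte[] record = new byte[recordSize];
                raf.seek(position);
                raf.readFully(record);
                return record;   // the file handle is released immediately by try-with-resources
            }
        }
    }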
- removed the default version blacklist regex from init (for future versions)
!!! left the existing update blacklist setting untouched !!!
(existing installations wanting autoupdate to 1.71 need to change the blacklist in ConfigUpdate_p.html)
- moved the old blacklist patch to migration.java
stack on html tag objects, not using a recursive parse-again method
which may cause bad performance and huge memory allocation. The new
method also produced better parsed image objects with exact anchor text
references.
- added an order option to solr queries to be able to retrieve document
lists in a specific order, here: link length
- added a HyperlinkEdge class which manages the link structure
- integrated the HyperlinkEdge class into the clickdepth computation
- extended the linkstructure.json servlet to also show the clickdepth
and other statistical information
- not doing an optimization will likely cause a 'too many open files' exception
- without optimization, performance would be even worse, which would
prevent optimization in the future as well (a deadlock
situation)
a document. This is the upper limit for the clickdepth_i value, which may
be smaller in case the crawler did not take the shortest path to
the document.
- use the JSoup parser for a selective rewrite of html body <a href= links only,
instead of a regex which also rewrites header href/src links (see the sketch below)
- this improves the display of pages which use the header <base> tag
- tags with a src attribute are taken from the original location (like css), improving display, and are not routed through the indexer
Disadvantage: scripting links will drop out of the proxy
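A minimal sketch of that selective rewrite with JSoup; the rewrite rule itself is only an example:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // rewrite only <a href=...> links inside the body; header href/src links stay untouched
    static String rewriteBodyAnchors(final String html, final String proxyPrefix) {
        final Document doc = Jsoup.parse(html);
        for (final Element a : doc.body().select("a[href]")) {
            a.attr("href", proxyPrefix + a.attr("href"));   // example rewrite rule only
        }
        return doc.outerHtml();
    }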
The servlet is now set up exclusively through web.xml (in case one would like to quickly switch back to the YaCyProxyServlet,
the existing code of YaCyProxyServlet is left untouched and available).
- should be investigated in more detail to look for additional implications
Removed "yacyaction" from the proxy servlet as it was only needed for removed interaction routines.
- find the best fitting identifier (url) by checking all given dc:identifier entries in the record (many entries provide several identifiers);
as identifier is currently a multivalued field, use "getParams" in preference to splitting the first string by ";"
- add resolving of DOI:... identifiers via http://dx.doi.org/
processes in favor of throwaway processes. The control mechanism reports
a 'queue full' message to the busy loop less often, so that loop does
not perform long busy waiting; instead all requests are queued and new
loader processes are started if necessary, up to a given limit
(as set before).
instead of TreeMaps)
- reduced the memory footprint of database indexes (by introducing
optimize calls)
- optimize calls shrink the amount of memory used for index sets if they
are not changed any more afterwards