skip reading bz2 file magicbyte to identify bz2 format as inputstream reset would be required. Common compress reads and checks the magicbytes internally and throws ioexception if wrong, making preread obsolete.
see freshly deprecated https://developers.google.com/webmasters/ajax-crawling/
Implementation improves parsing of the homepage (ajax page) which uses metatag "fragment" in header and parses supplied html snapshot instead of mostly empty ajax/scripted page.
Implementation supports also hash-bang urls (url with anchor starting with ! like ...path#!hashfragment) but our crawler filters it
(use of hash-bang is controversly discussed and proposal is deprecated, makes no sense to adjust the crawler, but as long as it is used by some sites the minor change/improvement in htmlparser is good for some time).
Quick - how does it work
- if metatag fragment with content "!" is found
- htmlparser tries to get content of htmls snapshot (using a different url)
- htmlparser returns 2 documents (original url and snapshot content - but using same original url)
- after parsing result documents are joined (and stored to index containing content also from snapshot page... as the original ajax page contains typically no parseable html content)
and allow injection of recrawl urls before queue is empty
During recrawl the balancer hangs on the very last urls often on hosts with huge delay time,
by allowing injection earlier progress is more balanced. Max number of injected crawl urls by recrawl job is 2 * max loader.
use getAttributes() to get query parameters as clear text (w/o url encoding)
use getSearchpartMap() to get in internal format (url encoded)
fix for http://mantis.tokeek.de/view.php?id=606
This is a workaround for existing index (not fully reindexed) since intro of schema with docvalues
to prevent solr exception causing recrawljob to fail with
org.apache.solr.core.SolrCore java.lang.IllegalStateException: unexpected docvalues type NONE for field 'load_date_dt' (expected=NUMERIC). Use UninvertingReader or index with docvalues.
1-char tokens and also more-than-1-char tokens, then remove the 1-char
tokens to prevent that we are to strict. This will make it possible to
be a bit more fuzzy in the search where it is appropriate.
Reads document level included title and description and skips the graphic content to save bandwidth.
svg metadata element is not interpreted
- remove rdfParser from init (current function identical with genericParser)
to prevent blank thumbnail display in image search because of not handled source which don't load on click.
Now the cross icon indicates the problem (inlcuding not supported format)
function with a numeric date field:
"unexpected docvalues type NUMERIC for field 'last_modified' (expected
one of [SORTED, SORTED_SET]). Use UninvertingReader or index with
docvalues."
This is a well-known bug inside solr which prevents that now the 'sort
by date' in the YaCy search interface can be used. Without this patch no
results at all is displayed (since the exception prevents that). Now
there is at least a result but it is not ordered properly.
(regardless if these fields part of update).
Switch partial update option off in postprocessing if schema contains *_dts (multivalued date field).
see http://mantis.tokeek.de/view.php?id=601
In case of error we deleted the original document and added the new doc to the index.
This is not valid for partial update documents (which contain only a subset of the fields).
Remove the "delete" error handling step.
as simple string and no more as regular expressions.
Updated all locale files to adapt to refectored Translator : removed
useless escaped characters and did minor corrections.
Performed minor syntax corrections on some html source files.
Added an util to translate all html source files with all locales
without launching full YaCy application.
Corrected main arguments parsing on other translation utils.
caused as concurrency issue:
W 2015/09/05 14:09:10 ConcurrentLog java.lang.NullPointerException
java.lang.NullPointerException
at java.util.TreeMap.rotateRight(TreeMap.java:2239)
at java.util.TreeMap.fixAfterInsertion(TreeMap.java:2271)
at java.util.TreeMap.put(TreeMap.java:582)
at net.yacy.kelondro.table.Table.<init>(Table.java:235)
at net.yacy.crawler.HostQueue.openStack(HostQueue.java:229)
at net.yacy.crawler.HostQueue.getStack(HostQueue.java:204)
at net.yacy.crawler.HostQueue.push(HostQueue.java:397)
at net.yacy.crawler.HostBalancer.push(HostBalancer.java:237)
at net.yacy.crawler.data.NoticedURL.push(NoticedURL.java:184)
at net.yacy.crawler.CrawlStacker.stackCrawl(CrawlStacker.java:355)
at net.yacy.crawler.CrawlStacker.job(CrawlStacker.java:134)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at
net.yacy.kelondro.workflow.InstantBlockingThread.job(InstantBlockingThread.java:101)
at
net.yacy.kelondro.workflow.AbstractBlockingThread.run(AbstractBlockingThread.java:82)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
moved and was not cleared anymore. This results in an huge fieldcache.
(http://lucene.apache.org/#highlights-of-the-lucene-release-includehttps://issues.apache.org/jira/browse/LUCENE-5666)
Here I try to use DovValues where it is possible.
For this I used the Api-Scheme as new basis für the Solr-Schema.
This needs at least a complete optimization of the Solr-Index to get a
smaller FieldCache.
Everything that is indexed with these setting will not use the
Fieldcache at all.
bayesian filters. This can be used to classify documents during
indexing-time using a pre-definied bayesian filter.
New wordings:
- a context is a class where different categories are possible. The
context name is equal to a facet name.
- a category is a facet type within a facet navigation. Each context
must have several categories, at least one custom name (things you want
to discover) and one with the exact name "negative".
To use this, you must do:
- for each context, you must create a directory within
DATA/CLASSIFICATION with the name of the context (the facet name)
- within each context directory, you must create text files with one
document each per line for every categroy. One of these categories MUST
have the name 'negative.txt'.
Then, each new document is classified to match within one of the given
categories for each context.
- fixed superfluous space in query field list
- fixed filter query logic
- removed look-ahead query which caused that each new search page
submitted two solr queries
- fixed random solr result orders in case that the solr score was equal:
this was then re-ordered by YaCy using the document hash which came from
the solr object and that appeared to be random. Now the hash of the url
is used and the score is additionally modified by the url length to
prevent that this particular case appears at all.
Background: some user report problem with connecting/crawling some sites via https which require SNI support (by default switched off in YaCy). On the other hand systems not demanding SNI support are sometimes not properly configured and due to a bug/feature in java 1.7 connection is aborted. The later is more often the case, so the default is still fine. With the java start parameter expert user can no alter the startparameter to -Djsse.enableSNIExtension=true (java default) if they crawl more hosts requiring SNI support.
The alternative to let YaCy try both during https handshake (deep inside the httpclient) is not pursut at this time.
during surrogate reading: those attributes from the dump are removed
during the import process and replaced by new detected attributes
according to the setting of the YaCy peer.
This may cause that all such attributes are removed if the importing
peer has no synonyms and/or no vocabularies defined.
index has more than two million documents. This protects the index from
beeing flooded with search requests that cannot be resolved before the
real search query has to be computet.
This is a very complex migration: many classes had been renamed or
removed, dependencies changed and the solr index type is now aligned to
be a solr cloud repository.
Together with the Solr 5.2 library update, one other dependent library
had been updated as well: httpclient 4.4->4.4.1
Older indexes are migrated from 4_10 to 5_2. However, the new index
structure is more efficient and we recommend to re-index everything.
Please use the index export before you do the update to a large
surrogate xml file. After the update, start with an empty index and then
initialize this with your dump.
This was caused by reading via GZIP from close-wait connection an caused
high cpu- and system-loads.
Instat of implementing handling of the RedListener now I found a
timelimeted 'get' "realy" solving this problem.
This was caused by reading via GZIP from close-wait connection an caused
high cpu- and system-loads.
Solved by implementing handling of the RedListener.
reason: experimental implementatin of RDFa parser not executed (limited to special urls) but may cause error on normal html parsing due to a inputstream.reset
export function is also now the default export option. The export file
format for a full solr export is very similar to a solr search result
xml, only the <lst name="responseHeader"> tag is missing.
The exported xml has a special line termination feature: all documents
will be exported into a single line without any CR in between. That
means that every document is completely inside a single line. While this
is not readable at all for humans, it is very useful for linux line
processing scripts, like grep. Using grep it will be easy to select
single documents which match for a given pattern.
Such dumps shall be importable with the DATA/SURROGATE/in import
function, but that import is not yet adopted to the new file format.
by instatiation from URIMetadataNode, by eleminating differentiation of ResultEntry/URIMetadataNode.
- moved remaining ResultEntry functionallity to URIMetadataNode
- for 1:1 functionallity added a function makeResultEntry()
- removed ResultEntry
- refactored related code
Main difference is after makeResultEntry the text_t content is removed and alternative title/url strings for display are calculated.
Main difference left is, that
If remote crawl option is not activated, skip init of remoteCrawlJob to save the resources of queue and ideling thread.
Deploy of the remoteCrawlJob deferred on activation of the option.
added to IndexReIndexMonitor_p.html
Selects existing documents from index and feeds it to the crawler.
currently only the field fresh_date_dt is used determine documents for recrawl (fresh_date_dt:[* TO NOW-1DAY]
Documents are added in small chunks (200) to the crawler, only if no other crawl is running.
keeping surrogates after processing is essential for some users. If the
space they are taking is too high, please set up an automatic deletion
process (like a cronjob).
This assigns priorities to incoming requests. Higher priority numbers are served before lower.
(disabled by default in defaults/web.xml,
uncomment or copy entry to DATA/Settings/web.xml)
as we flag if a peer is accesible via https, we need to know the port if we want to use is (e.g. for interYaCy communication)
start to provide / tansport the port by recording it in peers dna.
- add https link on the Network.html lock symbol
to support the new time parser and search functions in YaCy a high
precision detection of date and time on the day is necessary. That
requires that the time zone of the document content and the time zone of
the user, doing a search, is detected. The time zone of the search
request is done automatically using the browsers time zone offset which
is delivered to the search request automatically and invisible to the
user. The time zone for the content of web pages cannot be detected
automatically and must be an attribute of crawl starts. The advanced
crawl start now provides an input field to set the time zone in minutes
as an offset number. All parsers must get a time zone offset passed, so
this required the change of the parser java api. A lot of other changes
had been made which corrects the wrong handling of dates in YaCy which
was to add a correction based on the time zone of the server. Now no
correction is added and all dates in YaCy are UTC/GMT time zone, a
normalized time zone for all peers.
collections in search result. When selecting one of them in another
search, switch off the previously selected collection. This actually
turns the collection navigation modifier into a radio-button like
behaviour
- if less or equal of 8 facet options are present, they are shown by
default
- if more facet options are present, they are hidden
To view or hide all facets, just click on the facet header bar
for location nav facet of field coordinate_p does not return results, now using coordinate_p_0_coordinate as alternative to get facet counts. As the actual facet value is not used this should not harm any analysis (even if facet is a incomplete location).
If facet value is used in future likely *_geohash field could be introduced (for facet and other ... as transport value)