- adjust check of data available according to doc
- return null on no recognized content (to not exit TextParser next parser try)
- use commons.compress directly
skip reading bz2 file magicbyte to identify bz2 format as inputstream reset would be required. Common compress reads and checks the magicbytes internally and throws ioexception if wrong, making preread obsolete.
see freshly deprecated https://developers.google.com/webmasters/ajax-crawling/
Implementation improves parsing of the homepage (ajax page) which uses metatag "fragment" in header and parses supplied html snapshot instead of mostly empty ajax/scripted page.
Implementation supports also hash-bang urls (url with anchor starting with ! like ...path#!hashfragment) but our crawler filters it
(use of hash-bang is controversly discussed and proposal is deprecated, makes no sense to adjust the crawler, but as long as it is used by some sites the minor change/improvement in htmlparser is good for some time).
Quick - how does it work
- if metatag fragment with content "!" is found
- htmlparser tries to get content of htmls snapshot (using a different url)
- htmlparser returns 2 documents (original url and snapshot content - but using same original url)
- after parsing result documents are joined (and stored to index containing content also from snapshot page... as the original ajax page contains typically no parseable html content)
and allow injection of recrawl urls before queue is empty
During recrawl the balancer hangs on the very last urls often on hosts with huge delay time,
by allowing injection earlier progress is more balanced. Max number of injected crawl urls by recrawl job is 2 * max loader.
use getAttributes() to get query parameters as clear text (w/o url encoding)
use getSearchpartMap() to get in internal format (url encoded)
fix for http://mantis.tokeek.de/view.php?id=606
This is a workaround for existing index (not fully reindexed) since intro of schema with docvalues
to prevent solr exception causing recrawljob to fail with
org.apache.solr.core.SolrCore java.lang.IllegalStateException: unexpected docvalues type NONE for field 'load_date_dt' (expected=NUMERIC). Use UninvertingReader or index with docvalues.
1-char tokens and also more-than-1-char tokens, then remove the 1-char
tokens to prevent that we are to strict. This will make it possible to
be a bit more fuzzy in the search where it is appropriate.
Reads document level included title and description and skips the graphic content to save bandwidth.
svg metadata element is not interpreted
- remove rdfParser from init (current function identical with genericParser)
to prevent blank thumbnail display in image search because of not handled source which don't load on click.
Now the cross icon indicates the problem (inlcuding not supported format)
function with a numeric date field:
"unexpected docvalues type NUMERIC for field 'last_modified' (expected
one of [SORTED, SORTED_SET]). Use UninvertingReader or index with
docvalues."
This is a well-known bug inside solr which prevents that now the 'sort
by date' in the YaCy search interface can be used. Without this patch no
results at all is displayed (since the exception prevents that). Now
there is at least a result but it is not ordered properly.
(regardless if these fields part of update).
Switch partial update option off in postprocessing if schema contains *_dts (multivalued date field).
see http://mantis.tokeek.de/view.php?id=601
In case of error we deleted the original document and added the new doc to the index.
This is not valid for partial update documents (which contain only a subset of the fields).
Remove the "delete" error handling step.