As reported by edycop in mantis 765 (
http://mantis.tokeek.de/view.php?id=765 ), parsing of xlsx files was
quite incomplete.
Now properly support the "Shared String Table" entry in Office Open XML
spreadsheets, and also detect embedded URLs.
Integrating the Apache poi-ooxml library could be an option for finer
OOXML formats support, but their SAX style parsing example (
http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api ) tends to
show that a custom SAX handler is still efficient for lightweight and
low memory footprint processing.
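
For illustration only, here is a minimal sketch of how a custom SAX handler
could collect the entries of a "Shared String Table" part. It is not the
parser code from this commit: the class name SharedStringsSketch and its
parse() helper are hypothetical, and it assumes the usual SpreadsheetML
layout in which each si element holds one string, either in a single t
element or split over several r/t runs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects one string per <si> entry of a shared string table.
class SharedStringsSketch extends DefaultHandler {
    private final List<String> strings = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private boolean inText = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("si".equals(qName)) {
            current.setLength(0);   // a new shared string entry begins
        } else if ("t".equals(qName)) {
            inText = true;          // only text inside <t> is kept
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // may be called several times per element, so append rather than assign
        if (inText) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("t".equals(qName)) {
            inText = false;
        } else if ("si".equals(qName)) {
            strings.add(current.toString());   // rich text runs are concatenated
        }
    }

    // Assumption: the default (non namespace-aware) parser is used and the part
    // declares SpreadsheetML as its default namespace, so qName carries no prefix.
    static List<String> parse(String xml) throws Exception {
        SharedStringsSketch handler = new SharedStringsSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.strings;
    }
}
```

The handler keeps only the current entry in memory while it is being read,
which is the low memory footprint argument made above. Cells in the sheet
parts refer to these strings by index, and URLs can then be detected in the
collected text.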
This parser adds support for any XML based format other than the already
supported XML vocabularies such as XHTML, RSS/Atom feeds... It will
eventually be used as a fallback if one of these specific parsers fails,
before falling back to the existing genericParser, which extracts little
useful information beyond URL tokens.
These 3 files contain the same text in different HTML encodings. We use these documents to test whether the parser and indexer create the same set of word hashes for all three texts.
To use these files, run an indexing/crawling on them. To get the files inside the localhost-path, do the following:
cd <yacy-home>
rmdir DATA/HTDOCS/repository
ln -s ../../test/parsertest DATA/HTDOCS/repository
You have now linked the test directory as the repository directory, which you can reach in yacy if you switch to intranet indexing mode. The next step is to start yacy, then
- switch to intranet use case
- go to the crawl start page
- the repository directory should be the default path as crawl start
- start the crawl
- search for any word that appears in the demo texts
- search not only for words with umlaute but also for words without umlaute to ensure that you find _all_ three documents
- see how yacy presents the snippet with the text containing umlaute
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5293 6c8d7289-2bf4-0310-a012-ef5d649a1542