yacy_search_server

Commit Graph

Author	SHA1	Message	Date
reger24	91a2ad1457	fix test xlsx file with correct anchor as test anchor in source changed	3 years ago
Michael Peter Christen	f0f12f875b	fix for failing parser test: new forum link	5 years ago
luccioman	e90405b6f0	Support parsing audio URLs without file extension. Also added a JUnit test for the audio tag parser.	6 years ago
luccioman	685122363d	Added a parser for XZ compressed archives. As suggested by LA_FORGE on mantis 781 (http://mantis.tokeek.de/view.php?id=781)	6 years ago
luccioman	32c9dfa768	Added partial bzip2 stream parsing support and bzipParser Junit test	7 years ago
luccioman	c6ae87168a	Added unit tests on the gzip parser.	7 years ago
luccioman	169ffdd1c7	Finer control on max links to parse in the html parser.	7 years ago
luccioman	e41d046a9d	Improved parsing support for OOXML spreadsheets (.xlsx). As reported by edycop in mantis 765 ( http://mantis.tokeek.de/view.php?id=765 ), parsing of xlsx files was quite incomplete. Now properly supports the "Shared String Table" entry in Office Open XML spreadsheets, and also detects embedded URLs. Integrating the Apache poi-ooxml library could be an option for finer OOXML format support, but their SAX style parsing example ( http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api ) tends to show that a custom SAX handler is still efficient for lightweight, low memory footprint processing.	7 years ago
luccioman	780173008e	Implemented partial stream parsing of tar archives. Also added JUnit tests for the tar parser and fixed unwanted use of the tar parser as a fallback on files included in a tar archive.	7 years ago
luccioman	319231a458	Added a generic XML parser, able to parse element text and URLs. This parser adds support for any XML based format other than the already supported XML vocabularies such as XHTML, RSS/Atom feeds... It will eventually be used as a fallback if one of these specific parsers fails, before falling back to the existing genericParser, which extracts little useful information other than URL tokens.	7 years ago
luccioman	1acb7005d0	Added a basic JUnit test with test gz files for the gzip parser	8 years ago
reger	9edc7308aa	update to metadata-extractor-2.7.0.jar add 2 simple JUnit test cases for jpeg and tif parsing	10 years ago
reger	aa2e15d846	allow url parameter in worktable apicall allow url=wwwl?param=a&param=b (with ?, & encoded) fix: http://mantis.tokeek.de/view.php?id=100 fix double adding of '&' in MultiProtocolURL.escape()	10 years ago
orbiter	3528b970d6	- refactoring - added new experimental (not-yet-working) image parser - added new test image git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6431 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	65b1d51e70	added xml version of windows office test files git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6244 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
f1ori	67da20647f	* add new odf parser based on sax-xml-parser * remove odf_utils-jar * test metadata in ParserTest git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6231 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	d553e4ff39	added visio test files and mime types git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6165 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
lotus	bb570716e6	added more testfiles git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5347 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	84185baa81	added more test files for windows from lulabad git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5340 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	3246358485	mistake -> rename git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5336 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	55ec57d27f	added linux umlaute test files from low012 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5335 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	e9262b3890	re-named old test files added more mac test files git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5333 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	ff2a54da68	added more umlaute test files: mac git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5332 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
orbiter	204220ecd5	added test files for UTF-8 / Umlaute testing: These 3 files contain the same text in different HTML encodings. We use these documents to test whether the parser and indexer create the same set of word hashes for all three texts. To use these files, run an indexing/crawl on them. To get the files inside the localhost path, run these commands: cd <yacy-home>; rmdir DATA/HTDOCS/repository; ln -s test/parsertest DATA/HTDOCS/repository (this links the test directory as the repository directory, which you can reach in yacy if you switch to intranet indexing mode). The next step is to start yacy, then: switch to the intranet use case; go to the crawl start page (the repository directory should be the default path as crawl start); start the crawl; search for any word that appears in the demo texts; search not only for words with umlaute but also for words without umlaute to ensure that you find _all_ three documents; see how yacy presents the snippet with the text containing umlaute. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5293 6c8d7289-2bf4-0310-a012-ef5d649a1542	16 years ago
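
The oldest commit in the list describes test files that hold the same text in different encodings and should produce the same set of word hashes. The idea behind that check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample sentence, the three encodings (UTF-8, ISO-8859-1, numeric HTML entities), the tokenizer and the MD5-based hash are all hypothetical stand-ins, not YaCy's actual test files, parser or word-hash function.

```python
import hashlib
import html
import re

# Hypothetical sample text; the real test files are not reproduced here.
TEXT = "Über den Wolken grüßt die Möwe"


def words(text):
    """Split text into a set of lower-cased word tokens (simplified tokenizer)."""
    return {w.lower() for w in re.findall(r"\w+", text)}


def word_hashes(text):
    """Hash every word; MD5 is a stand-in, not YaCy's word-hash scheme."""
    return {hashlib.md5(w.encode("utf-8")).hexdigest() for w in words(text)}


# Three byte-level representations of the same text (assumed encodings).
utf8_bytes = TEXT.encode("utf-8")
latin1_bytes = TEXT.encode("iso-8859-1")
entity_bytes = TEXT.encode("ascii", errors="xmlcharrefreplace")

# Decode each variant with its matching charset, resolving HTML entities last.
decoded = [
    utf8_bytes.decode("utf-8"),
    latin1_bytes.decode("iso-8859-1"),
    html.unescape(entity_bytes.decode("ascii")),
]

hash_sets = [word_hashes(t) for t in decoded]
all_equal = all(h == hash_sets[0] for h in hash_sets)

# Decoding with the wrong charset yields different words, hence different hashes.
mismatched = word_hashes(utf8_bytes.decode("iso-8859-1"))

print("all three variants agree:", all_equal)
print("wrong charset agrees:", mismatched == hash_sets[0])
```

If a parser picks the wrong charset for one of the files, the hash sets diverge and a search for a word with umlauts would miss that document, which is what the manual crawl-and-search procedure in the commit message is meant to reveal.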

24 Commits (d5d4e8fe3a76fadbf8e939ff13cce5944740e80f)