yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	34a9fc1a07	bugfixes to zim reader:	1 year ago
Michael Peter Christen	7db0534d8a	Added a zim parser to the surrogate import option. You can now import zim files into YaCy by simply moving them to the DATA/SURROGATE/IN folder. They will be fetched and after parsing moved to DATA/SURROGATE/OUT. There are exceptions where the parser is not able to identify the original URL of the documents in the zim file. In that case the file is simply ignored. This commit also carries an important fix to the pdf parser and an increase of the maximum parsing speed to 60000 PPM which should make it possible to index up to 1000 files in one second.	1 year ago
Michael Peter Christen	496f768c44	modified cache strategy for zim clusters	1 year ago
Michael Peter Christen	fdc6311dc7	added parsing rules for wikibooks and wikinews in zim reader	1 year ago
Michael Peter Christen	2ea54b3503	fixed blob iterator in zim cluster definition	1 year ago
Michael Peter Christen	54fa5d3c2e	added a cluster cache but it requires more testing	1 year ago
Michael Peter Christen	53b01dbf2e	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	1 year ago
Michael Peter Christen	41856e9f34	added an optimized zim file entry iterator	1 year ago
Michael Peter Christen	1c0df28bfb	added a zim importer that can be used for surrogate imports. Cannot be used yet because it requires some security additions to verify that the given urls actually work.	1 year ago
Michael Peter Christen	e2c86a8eba	added a ZIM cluster pointer cache	1 year ago
Michael Peter Christen	9c8fb97985	introduced url list and title list caching and enhanced input stream performance in ZIM reader	1 year ago
Michael Peter Christen	b0ae660790	added Zstandard compressed data decompression for ZIM files type 5. Also: more generalization and performance enhancements	1 year ago
Michael Peter Christen	ad8ee3a0b6	fixed typo in class name	1 year ago
Michael Peter Christen	c4082c4ff2	refactoring of ZIM reader, simplification, removed unnecessary code	1 year ago
Michael Peter Christen	c2b6b6e7b9	Fixed a large number of problems in the ZIM reader. This library was not prepared for large data because it was missing long data types for pointers. I had to modify the code-base in a fundamental way: proof-reading; unclustering; refactoring; naming adapted to https://wiki.openzim.org/wiki/ZIM_file_format; change of exception handling; extension to more attributes as defined in the spec (bugfix for mime type loading); bugfix to long parsing (prevented reading of large files). The code is furthermore very inefficient and requires more attention. However, the format is very useful for YaCy as there are numerous data sources for ZIM files.	1 year ago
Michael Peter Christen	1fefae9baf	integrated the source code of an openzim file format reader. These are the raw format reader files with no integration in YaCy yet, which will maybe follow as a next step. The zim file format is documented in https://openzim.org and the reader code was taken from the archived, non-maintained repository at https://github.com/openzim/zimreader-java	1 year ago

16 Commits (master)
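
Commit 7db0534d8a describes the import workflow as moving a zim file into the DATA/SURROGATE/IN folder. A minimal Java sketch of that hand-off might look like the following. The class and method names (`SurrogateDrop`, `dropIntoSurrogateIn`) are hypothetical and not part of YaCy, and the installation root is assumed to be supplied by the caller; only the folder name comes from the commit message.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of the YaCy code base.
public class SurrogateDrop {

    // Moves zimFile into <yacyRoot>/DATA/SURROGATE/IN and returns its new path.
    // Throws FileAlreadyExistsException if a file of the same name is already queued.
    public static Path dropIntoSurrogateIn(Path yacyRoot, Path zimFile) throws IOException {
        Path inDir = yacyRoot.resolve("DATA").resolve("SURROGATE").resolve("IN");
        // Assumption: create the folder if it is missing, so the sketch also
        // works against a fresh directory.
        Files.createDirectories(inDir);
        Path target = inDir.resolve(zimFile.getFileName());
        return Files.move(zimFile, target);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: SurrogateDrop <yacy-root> <file.zim>");
            return;
        }
        Path moved = dropIntoSurrogateIn(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("queued for import: " + moved);
    }
}
```

According to the same commit message, a file queued this way is moved to DATA/SURROGATE/OUT after parsing, and it is ignored if the parser cannot identify the original URLs of its documents.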