yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	3d3bdb0f5f	added zim importer rule for mdwiki	1 year ago
Michael Peter Christen	4a611ac6a3	another possible fix for https://github.com/yacy/yacy_search_server/issues/500	1 year ago
Michael Peter Christen	cff0991d85	test if this is helpful for https://github.com/yacy/yacy_search_server/issues/500	1 year ago
Michael Peter Christen	ceb07a5218	fixed problem with zim importer which crashed when non-valid urls appeared	1 year ago
Michael Peter Christen	34a9fc1a07	bugfixes to zim reader:	1 year ago
Michael Peter Christen	7db0534d8a	Added a zim parser to the surrogate import option. You can now import zim files into YaCy by simply moving them to the DATA/SURROGATE/IN folder. They will be fetched and after parsing moved to DATA/SURROGATE/OUT. There are exceptions where the parser is not able to identify the original URL of the documents in the zim file. In that case the file is simply ignored. This commit also carries an important fix to the pdf parser and an increase of the maximum parsing speed to 60000 PPM which should make it possible to index up to 1000 files in one second.	1 year ago
Michael Peter Christen	70e29937ef	added a check in zim importer which tests if import URLs actually exist	1 year ago
Michael Peter Christen	fdc6311dc7	added parsing rules for wikibooks and wikinews in zim reader	1 year ago
Michael Peter Christen	53b01dbf2e	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	1 year ago
Michael Peter Christen	1c0df28bfb	added a zim importer that can be used for surrogate imports. Can not be used yet because it requires some security additions to verify that the given urls actually work.	1 year ago
Michael Peter Christen	5ba5fb5d23	upgraded pdfbox to 3.0.0	1 year ago
Michael Peter Christen	0689f4f0ae	Check if the character is a minus sign and is followed by a letter or a digit. Treat it as part of the word/number.	1 year ago
Michael Peter Christen	5db97a8928	parser can now separate numbers from words also when they are not separated by space, i.e. 4.7Ohm	1 year ago
Michael Peter Christen	e3797de7de	enhanced the word tokenizer to recognize numbers in a proper way	1 year ago
Michael Peter Christen	8285fe715a	tab to spaces for classes supporting the condenser. This is a preparation step to make changes in condenser and parser more visible; no functional changes so far.	1 year ago
Michael Peter Christen	92dad3ed49	removed 7Zip parser because the old library could not be replaced by a maven repository	1 year ago
Michael Peter Christen	1c0f50985c	fixed documentation and some details of handling of keywords	2 years ago
Michael Christen	3472bcb4d3	patched a 'java.lang.NoSuchMethodError: com.twelvemonkeys.imageio.util.IIOUtil.lookupProviderByName' problem which occurred only on ARM	2 years ago
Michael Peter Christen	9fcd8f1bda	added canonical filter. Attention: this is on by default! (it should do the right thing)	2 years ago
Michael Christen	4304e07e6f	crawl profile adoption to new tag valency attribute	2 years ago
Michael Peter Christen	5acd98f4da	introduction of tag-to-indexing relation TagValency	2 years ago
Michael Peter Christen	309adb814e	fixed jsonlist import from searchlab.eu using a direct URL	2 years ago
Michael Peter Christen	62d177bf59	stub for jsonlist index importer web page	2 years ago
Michael Peter Christen	efa0425f00	refactoring: moved jsonlist importer to importer class	2 years ago
Michael Peter Christen	d49f937b98	added iso, apk, dmg to extension-deny list; see also https://github.com/yacy/yacy_search_server/issues/510 (zip is not on the list because it can be parsed)	2 years ago
Michael Christen	867f96a32b	removed warnings	2 years ago
Michael Christen	8a06beaf24	removed finalize() methods, deprecated	2 years ago
Daleth Darko	3ced06c731	Various javadoc fixes	3 years ago
reger24	eae16287e9	Added epub (ebook) format to existing zipParser *.epub files are zip files containing xhtml files with content and other artifact files, which the zipParser can already feed to index - extension "epub" - mime "epub+zip"	3 years ago
sgaebel	cdf901270c	always use HTTPClient by 'try with resources' pattern to free up resources	3 years ago
sgaebel	69adaa9f55	makes our HTTPClient closable	3 years ago
Michael Peter Christen	552ab7051b	fix for warc importer	3 years ago
Michael Peter Christen	e9c5e78868	replaced new Number(Number) with Number.valueOf to remove deprecation warnings for Java 9	3 years ago
Michael Peter Christen	9ef4503672	fixed some newInstance() warnings .. by adding .getDeclaredConstructor()	3 years ago
jfhs	10bddc2c2d	Decode HTML entities in all property values by default	4 years ago
jfhs	2135d259e3	Replace hardcoded html/xml entities with a file, support decoding all defined HTML entities	4 years ago
Michael Peter Christen	d3526c52af	fixed a problem in warc importer: do not fail if single WARC entries are faulty	4 years ago
Michael Peter Christen	d359d521a1	fixed warc importer: the importer tried to import gzipped files as plain warc. It now checks the file extension and unzips automatically on the fly.	4 years ago
sgaebel	fc03c4b4fe	removes some warning and unused objects	4 years ago
sgaebel	df9ea0a42a	removes some warnings: unused imports, params	4 years ago
Michael Peter Christen	e0ad8ca9da	replaced json library from JSON.org with libandroid-json-java This fixes https://github.com/yacy/yacy_search_server/issues/347	5 years ago
luccioman	e90405b6f0	Support parsing audio URLs without file extension Added also a Junit for the audio tag parser	6 years ago
sgaebel	c2398fd890	remove warnings: 'Statement unnecessarily nested within else clause'	6 years ago
sgaebel	811d40a6c4	taking care of closing inputstreams, HTTPClient	6 years ago
luccioman	3fb449b3b6	Properly resolve relative URLs against document URL in html base tags Fixes issue #256	6 years ago
luccioman	fcf6b16db4	Added new crawler attribute for finer control over Media Type detection New "Media Type detection" section in the advanced crawl start page allow to choose between : - not loading URLs with unknown or unsupported file extension without checking the actual Media Type (relying Content-Type header for now). This was the old default behavior, faster, but not really accurate. - always cross check URL file extension against the actual Media Type. This lets properly parse URLs ending with an apparently odd file extension, but which have actually a supported Media Type such as text/html. Sample URLs with misleading file extensions added as documentation in the crawl start page. fixes issue #244	6 years ago
luccioman	54fbe166ba	Updated pdf cache clear steps consistently with current pdfbox version - Removed calls to no more existing clearResources functions (on PDFont class and its children) since upgrade to pdfbox 2.n.n - Removed hacky usage of protected internal ClassLoader function. This removes the warnings displayed when running with JDK9 or JDK10 : [java] WARNING: Illegal reflective access by net.yacy.document.parser.pdfParser$ResourceCleaner (file:<path>) to method java.lang.ClassLoader.findLoadedClass(java.lang.String) [java] WARNING: Please consider reporting this to the maintainers of net.yacy.document.parser.pdfParser$ResourceCleaner [java] WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations [java] WARNING: All illegal access operations will be denied in a future release Crawling thousands of pdf documents from various sources after modifications applied, revealed no new memory leak related to pdfbox (measurements done with JVisualVM).	6 years ago
luccioman	685122363d	Added a parser for XZ compressed archives. As suggested by LA_FORGE on mantis 781 (http://mantis.tokeek.de/view.php?id=781)	6 years ago
luccioman	8a29551c54	Upgraded the OpenGeoDB dump URL. The status of the library in the DictionaryLoader_p.html page now also informs the user that an upgrade can be applied when an older dump is already loaded. Upgrade applied as suggested by Niklas Andrus @fapth_gitlab on Gitter chat.	6 years ago
luccioman	bb51555830	Removed remaining unsafe accesses to SimpleDateFormat instances. SimpleDateFormat must not be used by concurrent threads without synchronization for parsing or formatting dates as it is not thread-safe (internally holds a calendar instance that is not synchronized). Now prefer DateTimeFormatter when possible as it is thread-safe without concurrent access performance bottleneck (does not internally use synchronization locks).	6 years ago
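
The message of commit bb51555830 describes replacing shared SimpleDateFormat instances with DateTimeFormatter. The following is a minimal sketch of that general pattern, not code taken from YaCy: the class name SharedFormatterSketch, the helper methods and the date pattern are all assumptions made for illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class, for illustration only; not part of the YaCy code base.
public class SharedFormatterSketch {

    // DateTimeFormatter is immutable and thread-safe, so a single shared
    // instance can be used by all threads. The pattern is an assumption.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("uuuu-MM-dd");

    public static LocalDate parseDay(String text) {
        return LocalDate.parse(text, DAY);
    }

    public static String formatDay(LocalDate date) {
        return DAY.format(date);
    }

    // Parses and re-formats the same input in many concurrent tasks and
    // returns how many results differ from the input.
    public static int roundTripConcurrently(String text, int tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            Callable<String> task = () -> formatDay(parseDay(text));
            List<Future<String>> results = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                results.add(pool.submit(task));
            }
            int mismatches = 0;
            for (Future<String> result : results) {
                if (!text.equals(result.get())) {
                    mismatches++;
                }
            }
            return mismatches;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("mismatches: " + roundTripConcurrently("2018-06-30", 1000));
    }
}
```

With a shared SimpleDateFormat the same round trip would need external synchronization or one instance per thread; the shared DateTimeFormatter constant needs neither.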

780 Commits (6d5e9ff53f4090e24a0cbe601df5665fb10b6ddf)