780 Commits (8eb0d490aa486e4ade9dd40f4da433e3dc5f4ed6)

Author SHA1 Message Date
Michael Peter Christen 3d3bdb0f5f added zim importer rule for mdwiki 1 year ago
Michael Peter Christen 4a611ac6a3 another possible fix for 1 year ago
Michael Peter Christen cff0991d85 test if this is helpful for https://github.com/yacy/yacy_search_server/issues/500 1 year ago
Michael Peter Christen ceb07a5218 fixed problem with zim importer which crashed when non-valid urls appeared 1 year ago
Michael Peter Christen 34a9fc1a07 bugfixes to zim reader: 1 year ago
Michael Peter Christen 7db0534d8a Added a zim parser to the surrogate import option. 1 year ago
Michael Peter Christen 70e29937ef added a check in zim importer which tests if import URLs actually exist 1 year ago
Michael Peter Christen fdc6311dc7 added parsing rules for wikibooks and wikinews in zim reader 1 year ago
Michael Peter Christen 53b01dbf2e Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 1 year ago
Michael Peter Christen 1c0df28bfb added a zim importer that can be used for surrogate imports. 1 year ago
Michael Peter Christen 5ba5fb5d23 upgraded pdfbox to 3.0.0 1 year ago
Michael Peter Christen 0689f4f0ae Check if the character is a minus sign and is followed by a letter or a 2 years ago
Michael Peter Christen 5db97a8928 parser can now separate numbers from words also when they are not 2 years ago
Michael Peter Christen e3797de7de enhanced the word tokenizer to recognize numbers in a proper way 2 years ago
Michael Peter Christen 8285fe715a tab to spaces for classes supporting the condenser. 2 years ago
Michael Peter Christen 92dad3ed49 removed 7Zip parser because the old library could not be replaced by a maven repository 2 years ago
Michael Peter Christen 1c0f50985c fixed documentation and some details of handling of keywords 2 years ago
Michael Christen 3472bcb4d3 patched a 'java.lang.NoSuchMethodError: com.twelvemonkeys.imageio.util.IIOUtil.lookupProviderByName' problem which occurred only on ARM 2 years ago
Michael Peter Christen 9fcd8f1bda added canonical filter 2 years ago
Michael Christen 4304e07e6f crawl profile adoption to new tag valency attribute 2 years ago
Michael Peter Christen 5acd98f4da introduction of tag-to-indexing relation TagValency 2 years ago
Michael Peter Christen 309adb814e fixed import of jsonlist imort from searchlab.eu using a direct URL 2 years ago
Michael Peter Christen 62d177bf59 stub for jsonlist index importer web page 2 years ago
Michael Peter Christen efa0425f00 refactoring: moved jsonlist importer to importer class 2 years ago
Michael Peter Christen d49f937b98 added iso,apk,dmg to extension-deny list 2 years ago
Michael Christen 867f96a32b removed warnings 2 years ago
Michael Christen 8a06beaf24 removed finalize() methods, deprecated 2 years ago
Daleth Darko 3ced06c731 Various javadoc fixes 3 years ago
reger24 eae16287e9 Added epub (ebook) format to existing zipParser 3 years ago
sgaebel cdf901270c always use HTTPClient by 'try with resources' pattern to free up 3 years ago
sgaebel 69adaa9f55 makes our HTTPClient closable 3 years ago
Michael Peter Christen 552ab7051b fix for warc importer 3 years ago
Michael Peter Christen e9c5e78868 replaced new Number(Number) with Number.instanceOf 4 years ago
Michael Peter Christen 9ef4503672 fixed some newInstance() warnings 4 years ago
jfhs 10bddc2c2d Decode HTML entities in all property values by default 4 years ago
jfhs 2135d259e3 Replace hardcoded html/xml entities with a file, support decoding all defined HTML entities 4 years ago
Michael Peter Christen d3526c52af fixed a problem in warc importer: do not fail if single WARC entries are 4 years ago
Michael Peter Christen d359d521a1 fixed warc importer 4 years ago
sgaebel fc03c4b4fe removes some warning and unused objects 5 years ago
sgaebel df9ea0a42a removes some warnings: unused imports, params 5 years ago
Michael Peter Christen e0ad8ca9da replaced json library from JSON.org with libandroid-json-java 5 years ago
luccioman e90405b6f0 Support parsing audio URLs without file extension 6 years ago
sgaebel c2398fd890 remove warnings: 'Statement unnecessarily nested within else clause' 6 years ago
sgaebel 811d40a6c4 taking care of closing inputstreams, HTTPClient 6 years ago
luccioman 3fb449b3b6 Properly resolve relative URLs against document URL in html base tags 6 years ago
luccioman fcf6b16db4 Added new crawler attribute for finer control over Media Type detection 6 years ago
luccioman 54fbe166ba Updated pdf cache clear steps consistently with current pdfbox version 7 years ago
luccioman 685122363d Added a parser for XZ compressed archives. 7 years ago
luccioman 8a29551c54 Upgraded the OpenGeoDB dump URL 7 years ago
luccioman bb51555830 Removed remaining unsafe accesses to SimpleDateFormat instances. 7 years ago