Michael Peter Christen
3944984840
added snippet extraction with synonym matching
8 months ago
Michael Peter Christen
910a496c9f
replaced http links with https
9 months ago
Michael Peter Christen
833d720989
upgraded ppt parser by migration of org.apache.poi from 3.17 to 5.3.0
...
This also fixes the security warning
https://github.com/yacy/yacy_search_server/security/dependabot/37
9 months ago
Michael Peter Christen
b295e38969
fine-tuned the import process of jsonl files which had been missing
...
to actually be able to make searches and browse the index with the host
browser
11 months ago
Michael Peter Christen
3d3bdb0f5f
added zim importer rule for mdwiki
1 year ago
Michael Peter Christen
4a611ac6a3
another possible fix for
...
https://github.com/yacy/yacy_search_server/issues/500
1 year ago
Michael Peter Christen
cff0991d85
test if this is helpful for https://github.com/yacy/yacy_search_server/issues/500
1 year ago
Michael Peter Christen
ceb07a5218
fixed problem with zim importer which crashed when non-valid urls appeared
1 year ago
Michael Peter Christen
34a9fc1a07
bugfixes to zim reader:
1 year ago
Michael Peter Christen
7db0534d8a
Added a zim parser to the surrogate import option.
...
You can now import zim files into YaCy by simply moving them
to the DATA/SURROGATE/IN folder. They will be fetched and after
parsing moved to DATA/SURROGATE/OUT.
There are exceptions where the parser is not able to identify the
original URL of the documents in the zim file. In that case the file
is simply ignored.
This commit also carries an important fix to the pdf parser and an
increase of the maximum parsing speed to 60000 PPM which should make it
possible to index up to 1000 files in one second.
1 year ago
Michael Peter Christen
70e29937ef
added a check in zim importer which tests if import URLs actually exist
1 year ago
Michael Peter Christen
fdc6311dc7
added parsing rules for wikibooks and wikinews in zim reader
1 year ago
Michael Peter Christen
53b01dbf2e
Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
1 year ago
Michael Peter Christen
1c0df28bfb
added a zim importer that can be used for surrogate imports.
...
Cannot be used yet because it requires some security additions
to verify that the given urls actually work.
1 year ago
Michael Peter Christen
5ba5fb5d23
upgraded pdfbox to 3.0.0
1 year ago
Michael Peter Christen
0689f4f0ae
Check if the character is a minus sign and is followed by a letter or a
...
digit. Treat it as part of the word/number.
2 years ago
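The minus-sign rule from commit 0689f4f0ae above can be illustrated with a small tokenizer. This is only a sketch under stated assumptions, not YaCy's actual tokenizer: the class name `MinusAwareTokenizer` is hypothetical, and every character other than a letter, a digit, or a qualifying minus sign is treated as a separator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not YaCy's tokenizer.
final class MinusAwareTokenizer {

    // Splits text into tokens of letters and digits. A minus sign is kept
    // inside the token only when a letter or digit follows it directly.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean minusBeforeWordChar = c == '-'
                    && i + 1 < text.length()
                    && Character.isLetterOrDigit(text.charAt(i + 1));
            if (Character.isLetterOrDigit(c) || minusBeforeWordChar) {
                current.append(c);
            } else if (current.length() > 0) {
                // Any other character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With this sketch, `MinusAwareTokenizer.tokenize("a well-known value of -5 - done")` keeps `well-known` and `-5` as single tokens and drops the isolated minus sign. Decimal points are not handled here.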
Michael Peter Christen
5db97a8928
parser can now separate numbers from words also when they are not
...
separated by space, e.g. 4.7Ohm
2 years ago
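The behavior described in commit 5db97a8928 above, splitting `4.7Ohm` into a number and a word, can be sketched with a regular expression. The class name `NumberWordSplitter` and the token pattern are assumptions made for illustration; the parser in the repository may work quite differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the parser from the repository.
final class NumberWordSplitter {

    // A token is either a number with an optional decimal part
    // or a run of letters. Everything else is skipped.
    private static final Pattern TOKEN = Pattern.compile("\\d+(?:\\.\\d+)?|\\p{L}+");

    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

Calling `NumberWordSplitter.split("4.7Ohm resistor")` yields the three tokens `4.7`, `Ohm` and `resistor`, even though no space separates the number from the unit.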
Michael Peter Christen
e3797de7de
enhanced the word tokenizer to recognize numbers in a proper way
2 years ago
Michael Peter Christen
8285fe715a
tab to spaces for classes supporting the condenser.
...
This is a preparation step to make changes in condenser and parser more
visible; no functional changes so far.
2 years ago
Michael Peter Christen
92dad3ed49
removed 7Zip parser because the old library could not be replaced by a maven repository
2 years ago
Michael Peter Christen
1c0f50985c
fixed documentation and some details of handling of keywords
2 years ago
Michael Christen
3472bcb4d3
patched a 'java.lang.NoSuchMethodError: com.twelvemonkeys.imageio.util.IIOUtil.lookupProviderByName' problem which occurred only on ARM
2 years ago
Michael Peter Christen
9fcd8f1bda
added canonical filter
...
attention: this is on by default!
(it should do the right thing)
2 years ago
Michael Christen
4304e07e6f
crawl profile adaptation to new tag valency attribute
2 years ago
Michael Peter Christen
5acd98f4da
introduction of tag-to-indexing relation TagValency
2 years ago
Michael Peter Christen
309adb814e
fixed jsonlist import from searchlab.eu using a direct URL
2 years ago
Michael Peter Christen
62d177bf59
stub for jsonlist index importer web page
3 years ago
Michael Peter Christen
efa0425f00
refactoring: moved jsonlist importer to importer class
3 years ago
Michael Peter Christen
d49f937b98
added iso,apk,dmg to extension-deny list
...
see also https://github.com/yacy/yacy_search_server/issues/510
zip is not on the list because it can be parsed
3 years ago
Michael Christen
867f96a32b
removed warnings
3 years ago
Michael Christen
8a06beaf24
removed finalize() methods, deprecated
3 years ago
Daleth Darko
3ced06c731
Various javadoc fixes
3 years ago
reger24
eae16287e9
Added epub (ebook) format to existing zipParser
...
*.epub files are zip files containing xhtml files with content and other artifact files,
which the zipParser can already feed to index
- extension "epub"
- mime "epub+zip"
3 years ago
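Commit eae16287e9 above relies on the fact that an epub file is a zip archive holding xhtml content files. A minimal sketch of that idea, using only `java.util.zip`, lists the content entries of such an archive. The class `EpubEntries` and the choice of file suffixes are assumptions for illustration and do not reproduce the zipParser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch, not YaCy's zipParser.
final class EpubEntries {

    // Assumption: content files are recognized by these suffixes only.
    static boolean isContentEntry(String entryName) {
        String name = entryName.toLowerCase(Locale.ROOT);
        return name.endsWith(".xhtml") || name.endsWith(".html") || name.endsWith(".htm");
    }

    // Reads the epub as an ordinary zip stream and collects the names
    // of the entries that look like content files.
    static List<String> contentEntries(InputStream epub) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(epub)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && isContentEntry(entry.getName())) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }
}
```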
sgaebel
cdf901270c
always use HTTPClient with the 'try-with-resources' pattern to free up
...
resources
3 years ago
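The 'try-with-resources' pattern named in commit cdf901270c above closes a resource automatically at the end of the block. The sketch below uses a hypothetical stand-in class `DemoClient`, because the API of YaCy's HTTPClient is not reproduced here. It only shows the language mechanism.

```java
// Hypothetical stand-in for a closable HTTP client.
final class DemoClient implements AutoCloseable {
    private boolean closed = false;

    String fetch(String url) {
        if (closed) {
            throw new IllegalStateException("client already closed");
        }
        return "response for " + url;
    }

    boolean isClosed() {
        return closed;
    }

    @Override
    public void close() {
        // Declares no checked exception, so callers need no catch block.
        closed = true;
    }
}

final class TryWithResourcesDemo {
    // Kept only so that the state can be inspected after load() returns.
    static DemoClient lastClient;

    static String load(String url) {
        try (DemoClient client = new DemoClient()) {
            lastClient = client;
            return client.fetch(url);
        } // close() runs here, also when fetch() throws
    }
}
```

After `TryWithResourcesDemo.load("https://example.org/")` returns, `lastClient.isClosed()` is true without any explicit `close()` call in `load`.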
sgaebel
69adaa9f55
makes our HTTPClient closable
3 years ago
Michael Peter Christen
552ab7051b
fix for warc importer
4 years ago
Michael Peter Christen
e9c5e78868
replaced new Number(Number) with Number.valueOf
...
to remove deprecation warnings for Java 9
4 years ago
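The replacement described in commit e9c5e78868 above concerns the wrapper-class constructors such as `new Integer(int)`, which are deprecated since Java 9 in favor of the static `valueOf` factories. A minimal sketch with a hypothetical class `BoxingDemo`:

```java
// Hypothetical sketch of the constructor-to-factory replacement.
final class BoxingDemo {

    static Integer box(int value) {
        // Deprecated form: new Integer(value)
        // Integer.valueOf may return a cached instance instead of allocating.
        return Integer.valueOf(value);
    }

    static Long box(long value) {
        // Deprecated form: new Long(value)
        return Long.valueOf(value);
    }
}
```

Besides removing the deprecation warning, `Integer.valueOf` is documented to cache values from -128 to 127, so repeated boxing of small numbers avoids new allocations.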
Michael Peter Christen
9ef4503672
fixed some newInstance() warnings
...
.. by adding .getDeclaredConstructor()
4 years ago
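The fix in commit 9ef4503672 above replaces the deprecated `Class.newInstance()` with `getDeclaredConstructor().newInstance()`. The sketch below shows the pattern in a hypothetical helper `ReflectiveFactory`. Wrapping the reflective exceptions in an unchecked exception is a choice made for this example only.

```java
// Hypothetical helper showing the non-deprecated reflective instantiation.
final class ReflectiveFactory {

    static <T> T create(Class<T> type) {
        try {
            // Deprecated form: type.newInstance()
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // Covers a missing no-arg constructor, access problems,
            // and exceptions thrown by the constructor itself.
            throw new IllegalStateException("cannot instantiate " + type.getName(), e);
        }
    }
}
```

Unlike the deprecated call, this form wraps any exception thrown by the constructor in an `InvocationTargetException` rather than propagating it undeclared.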
jfhs
10bddc2c2d
Decode HTML entities in all property values by default
4 years ago
jfhs
2135d259e3
Replace hardcoded html/xml entities with a file, support decoding all defined HTML entities
4 years ago
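The two commits above (10bddc2c2d and 2135d259e3) deal with decoding HTML entities from a table of definitions. The sketch below illustrates the decoding step only. The class `EntityDecoder` is hypothetical, and its small in-code map is an assumption standing in for the entity file mentioned in the commit message, whose format is not shown here.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the decoder from the repository.
final class EntityDecoder {

    // Tiny illustrative table; a real one would be loaded from a file.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "apos", "'");

    private static final Pattern ENTITY = Pattern.compile(
            "&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement = m.group(); // unknown entities stay unchanged
            if (body.startsWith("#")) {
                boolean hex = body.length() > 1 && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
                int codePoint = hex
                        ? Integer.parseInt(body.substring(2), 16)
                        : Integer.parseInt(body.substring(1));
                if (Character.isValidCodePoint(codePoint)) {
                    replacement = new String(Character.toChars(codePoint));
                }
            } else if (NAMED.containsKey(body)) {
                replacement = NAMED.get(body);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, `EntityDecoder.decode("Tom &amp; Jerry &#65;&#x42; &unknown;")` decodes the named and numeric entities and leaves `&unknown;` as it is.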
Michael Peter Christen
d3526c52af
fixed a problem in warc importer: do not fail if single WARC entries are
...
faulty
4 years ago
Michael Peter Christen
d359d521a1
fixed warc importer
...
The importer tried to import gzipped files as plain WARC.
It now checks the file extension and decompresses such files automatically
on the fly.
4 years ago
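The fix in commit d359d521a1 above, checking the file extension and decompressing on the fly, can be sketched with `java.util.zip.GZIPInputStream`. The class `WarcStreams` and the assumption that gzipped files end in `.gz` are illustrative and not taken from the importer itself.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch, not the WARC importer from the repository.
final class WarcStreams {

    // Assumption: a gzipped WARC file is recognized by a ".gz" suffix.
    static boolean looksGzipped(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".gz");
    }

    // Wraps the raw stream in a decompressing stream when needed,
    // so the caller always reads plain WARC data.
    static InputStream open(String fileName, InputStream raw) throws IOException {
        return looksGzipped(fileName) ? new GZIPInputStream(raw) : raw;
    }
}
```

A caller would pass the file name together with its raw stream, for instance `WarcStreams.open("crawl.warc.gz", in)`, and read the result as plain WARC without a separate unpacking step.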
sgaebel
fc03c4b4fe
removes some warnings and unused objects
5 years ago
sgaebel
df9ea0a42a
removes some warnings: unused imports, params
5 years ago
Michael Peter Christen
e0ad8ca9da
replaced json library from JSON.org with libandroid-json-java
...
This fixes https://github.com/yacy/yacy_search_server/issues/347
5 years ago
luccioman
e90405b6f0
Support parsing audio URLs without file extension
...
Also added a JUnit test for the audio tag parser
6 years ago
sgaebel
c2398fd890
remove warnings: 'Statement unnecessarily nested within else clause'
6 years ago
sgaebel
811d40a6c4
taking care of closing inputstreams, HTTPClient
6 years ago
luccioman
3fb449b3b6
Properly resolve relative URLs against document URL in html base tags
...
Fixes issue #256
6 years ago
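The behavior named in commit 3fb449b3b6 above can be illustrated with `java.net.URI`: a base tag's href may itself be relative, so it is first resolved against the document URL, and links are then resolved against the result. The class `BaseHrefResolver` is hypothetical and does not reproduce the YaCy HTML parser.

```java
import java.net.URI;

// Hypothetical sketch, not the HTML parser from the repository.
final class BaseHrefResolver {

    // baseHref may be null or empty (no base tag), relative, or absolute.
    static URI effectiveBase(URI documentUrl, String baseHref) {
        if (baseHref == null || baseHref.isEmpty()) {
            return documentUrl;
        }
        // A relative base href is interpreted relative to the document URL.
        return documentUrl.resolve(baseHref);
    }

    static URI resolveLink(URI documentUrl, String baseHref, String link) {
        return effectiveBase(documentUrl, baseHref).resolve(link);
    }
}
```

For a document at `https://example.org/docs/page.html` with base href `../assets/`, the link `img/logo.png` resolves to `https://example.org/assets/img/logo.png`.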
luccioman
fcf6b16db4
Added new crawler attribute for finer control over Media Type detection
...
New "Media Type detection" section in the advanced crawl start page
allows choosing between:
- not loading URLs with an unknown or unsupported file extension, without
  checking the actual Media Type (which relies on the Content-Type header
  for now). This was the old default behavior: faster, but not really
  accurate.
- always cross-checking the URL file extension against the actual Media Type.
  This allows URLs that end with an apparently odd file extension, but
  actually have a supported Media Type such as text/html, to be parsed
  properly.
Sample URLs with misleading file extensions were added as documentation in
the crawl start page.
Fixes issue #244
7 years ago
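The two detection modes described in commit fcf6b16db4 above can be sketched as a single decision function. Everything here is an assumption for illustration: the class `MediaTypeCheck`, the short lists of supported extensions and media types, and the choice to accept paths without any extension in extension-only mode. The crawler's real rules are not reproduced.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not the crawler logic from the repository.
final class MediaTypeCheck {

    // Illustrative lists only.
    static final Set<String> SUPPORTED_EXTENSIONS = Set.of("html", "htm", "txt", "pdf");
    static final Set<String> SUPPORTED_MEDIA_TYPES =
            Set.of("text/html", "text/plain", "application/pdf");

    static String extensionOf(String path) {
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        return dot > slash ? path.substring(dot + 1).toLowerCase(Locale.ROOT) : "";
    }

    // crossCheck == false: decide by file extension alone (the old default).
    // crossCheck == true: the Content-Type header value decides.
    static boolean shouldParse(String path, String contentType, boolean crossCheck) {
        if (!crossCheck) {
            String ext = extensionOf(path);
            // Assumption: a path without an extension is accepted.
            return ext.isEmpty() || SUPPORTED_EXTENSIONS.contains(ext);
        }
        if (contentType == null) {
            return false;
        }
        // Strip parameters such as "; charset=UTF-8".
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return SUPPORTED_MEDIA_TYPES.contains(mediaType);
    }
}
```

A URL path such as `/article.aspx` served as `text/html; charset=UTF-8` is rejected in extension-only mode and accepted when cross-checking is enabled, which is the kind of misleading extension the commit message refers to.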