Commit Graph

115 Commits (fd45ccf76ebb7f18f63ce508e43ff66d118f2cc3)

Author SHA1 Message Date
Michael Peter Christen b295e38969 fine-tuned the import process of jsonl files which had been missing
10 months ago
Michael Peter Christen 3d3bdb0f5f added zim importer rule for mdwiki
1 year ago
Michael Peter Christen ceb07a5218 fixed problem with zim importer which crashed when non-valid urls appeared
1 year ago
Michael Peter Christen 34a9fc1a07 bugfixes to zim reader:
1 year ago
Michael Peter Christen 7db0534d8a Added a zim parser to the surrogate import option.
1 year ago
Michael Peter Christen 70e29937ef added a check in zim importer which tests if import URLs actually exist
1 year ago
Michael Peter Christen fdc6311dc7 added parsing rules for wikibooks and wikinews in zim reader
1 year ago
Michael Peter Christen 53b01dbf2e Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
1 year ago
Michael Peter Christen 1c0df28bfb added a zim importer that can be used for surrogate imports.
1 year ago
Michael Christen 4304e07e6f crawl profile adaptation to new tag valency attribute
2 years ago
Michael Peter Christen 309adb814e fixed jsonlist import from searchlab.eu using a direct URL
2 years ago
Michael Peter Christen 62d177bf59 stub for jsonlist index importer web page
2 years ago
Michael Peter Christen efa0425f00 refactoring: moved jsonlist importer to importer class
2 years ago
Michael Christen 867f96a32b removed warnings
2 years ago
Michael Peter Christen 552ab7051b fix for warc importer
3 years ago
Michael Peter Christen d3526c52af fixed a problem in warc importer: do not fail if single WARC entries are
4 years ago
Michael Peter Christen d359d521a1 fixed warc importer
4 years ago
sgaebel fc03c4b4fe removes some warnings and unused objects
5 years ago
sgaebel df9ea0a42a removes some warnings: unused imports, params
5 years ago
sgaebel c2398fd890 remove warnings: 'Statement unnecessarily nested within else clause'
6 years ago
Michael Peter Christen 25573bd5ab added a crawl filter based on <div> tag class names
7 years ago
luccioman 46f37e38dc Customized Threads with generic name for easier monitoring.
7 years ago
reger 51a4e03c93 Allow to stop currently running warc import (stop button)
8 years ago
reger 1737af37cf Set request originator to own peer in warc importer
8 years ago
reger 039162fbf0 Change warc importer to use defaultsurrogate-crawl profile, as reported
8 years ago
luccioman edd7ccac40 Added some JavaDoc
8 years ago
reger bec34d3546 Add url input field as source for WarcImporter
8 years ago
luccioman f66438442e Extended Mediawiki dump import to remote URLs.
8 years ago
reger ba339a2a45 Add servlet to import warc file from filesystem IndexImportWarc_p.html.
8 years ago
reger 510f11d374 Implement surrogate import from Warc archives (as first option handle
8 years ago
luccioman 6a4d51d8f9 Cleaned up some Javadoc warnings.
8 years ago
luccioman eec5779889 Added a name prefix to pooled threads for easier monitoring.
8 years ago
luccioman f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
9 years ago
luc 571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
9 years ago
reger 46ac0867ff fix poison mediawikiimporter output queue also after ExecutionException
9 years ago
reger a7591d3ed0 fix mediawikiimporter number format exception on coordinate parsing
9 years ago
reger 6b7c10cef8 fix dc:date in mediawikiimporter/document.writexml to use lastmodified
9 years ago
luc 7736ee5a42 Updated MediaWimporter main() : display usage in console and stop
9 years ago
reger bbe9df2bb3 fix MediawikiImporter for bz2 dump
10 years ago
Michael Peter Christen 6f4fe4b175 revert of 8a7c68e4c7
10 years ago
Michael Peter Christen fed26f33a8 enhanced timezone management for indexed data:
10 years ago
Michael Peter Christen b5ac29c9a5 added a html field scraper which reads text from html entities of a
10 years ago
Michael Peter Christen 321840fde3 Replaced all fixed thread pools with cached thread pools. The cached
10 years ago
orbiter 97983ba89f fixed generics warnings for generic array instantiation that appeared
11 years ago
reger 8a7c68e4c7 content of surrogates/out never accessed (remove)
11 years ago
reger 121d25be38 recover sax fatal error on OAI-PMH import of xml with entity error
11 years ago
Michael Peter Christen da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
11 years ago
Michael Peter Christen 8b44fcf0f4 added missing @Override annotation
11 years ago
Michael Peter Christen 61c5e40687 - replaced the properties object in AnchorURL with distinct variables
12 years ago
Michael Peter Christen 5e31bad711 - the webgraph shall store all links which appear on a web page and not
12 years ago