From 89aeb318d3bdf7c61f8d735d8488de1925485b46 Mon Sep 17 00:00:00 2001
From: orbiter
Date: Fri, 8 May 2009 10:36:13 +0000
Subject: [PATCH] enhanced the wikimedia dump import process enhanced the wiki parser and condenser speed

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5931 6c8d7289-2bf4-0310-a012-ef5d649a1542
---
 htroot/IndexImportWikimedia_p.html            | 24 +++++++-----
 htroot/IndexImportWikimedia_p.java            | 12 ++++++
 source/de/anomic/data/wiki/wikiCode.java      |  1 +
 source/de/anomic/plasma/parser/Condenser.java |  8 +++-
 source/de/anomic/plasma/plasmaWordIndex.java  |  8 ++--
 source/de/anomic/tools/mediawikiIndex.java    | 37 +++++++++++++++++--
 6 files changed, 70 insertions(+), 20 deletions(-)

diff --git a/htroot/IndexImportWikimedia_p.html b/htroot/IndexImportWikimedia_p.html
index 104fb84f3..853bfb85c 100644
--- a/htroot/IndexImportWikimedia_p.html
+++ b/htroot/IndexImportWikimedia_p.html
@@ -12,23 +12,23 @@
#(import)#

#(status)#No import thread is running, you can start a new thread here::Bad input data: #[message]# #(/status)#

-
+
Wikimedia Dump File Selection: select a 'bz2' file
You can import Wikipedia dumps here. An example is the file
http://download.wikimedia.org/dewiki/20090311/dewiki-20090311-pages-articles.xml.bz2.
-
+
Dumps must be in XML format and must be encoded in bz2. Do not decompress the file after downloading! -
- +
+

When the import is started, the following happens: -