<!DOCTYPE html>
<html lang="en">
<head>
<title>YaCy '#[clientname]#': MediaWiki Dump Import</title>
#%env/templates/metas.template%#
#(import)#::<meta http-equiv="REFRESH" content="10;url=IndexImportMediawiki_p.html" />
<!-- the url= removes http get parameters on refresh, preventing restart of import -->
#(/import)#
</head>
<body id="IndexImportMediawiki">
#%env/templates/header.template%#
#%env/templates/submenuIndexImport.template%#
<h2>MediaWiki Dump Import</h2>
#(import)#
<p>#(status)#<div class="alert alert-info" role="alert">No import thread is running, you can start a new thread here</div>
::<div class="alert alert-danger" role="alert">Error: the file argument must be a path to a document in the local file system</div>
::<div class="alert alert-danger" role="alert">Error: file not found "#[sourceFile]#"</div>
::<div class="alert alert-danger" role="alert">Error: cannot read file "#[sourceFile]#"</div>
::<div class="alert alert-danger" role="alert">Error: you selected a directory ("#[sourceFile]#")</div>
#(/status)#</p>
<form action="IndexImportMediawiki_p.html" method="post" accept-charset="UTF-8" class="form-horizontal">
<fieldset>
<legend>MediaWiki Dump File Selection: select an XML file (which may be bz2- or gz-encoded)</legend>
<p>
You can import MediaWiki dumps here. An example is the file
<a href="http://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2">
http://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2</a>.
</p>
<p>
Dumps must be stored in the local file system in XML format and may be compressed with gzip or bzip2.
</p>
<div class="form-group">
<div class="col-sm-3 col-md-2 col-lg-2">
<label for="file" class="control-label" >Dump file path</label>
</div>
<div class="col-sm-9 col-md-8 col-lg-8">
<input id="file" class="form-control" name="file" type="text" title="Dump file path on this YaCy server file system" required="required"/>
</div>
</div>
<input name="submit" class="btn btn-primary" type="submit" value="Import MediaWiki Dump" />
</fieldset>
</form>
<p>
When the import is started, the following happens:
</p><ul>
<li>The dump is extracted on the fly and wiki entries are translated into Dublin Core data format. The output looks like this:
<pre>
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;surrogates xmlns:dc="http://purl.org/dc/elements/1.1/"&gt;
  &lt;record&gt;
    &lt;dc:Title&gt;&lt;![CDATA[Alan Smithee]]&gt;&lt;/dc:Title&gt;
    &lt;dc:Identifier&gt;http://de.wikipedia.org/wiki/Alan%20Smithee&lt;/dc:Identifier&gt;
    &lt;dc:Description&gt;&lt;![CDATA[Der als Filmregisseur oft genannte Alan Smithee ist ein Anagramm]]&gt;&lt;/dc:Description&gt;
    &lt;dc:Language&gt;de&lt;/dc:Language&gt;
    &lt;dc:Date&gt;2009-05-07T06:03:48Z&lt;/dc:Date&gt;
  &lt;/record&gt;
  &lt;record&gt;
    ...
  &lt;/record&gt;
&lt;/surrogates&gt;
</pre>
</li>
<li>Every 10000 wiki records are combined into one output file, which is first written to /DATA/SURROGATES/in as a temporary file.</li>
<li>When a generated output file is complete, it is renamed to a .xml file.</li>
<li>Each time an XML surrogate file appears in /DATA/SURROGATES/in, the YaCy indexer fetches the file and indexes its record entries.</li>
<li>When a surrogate file has been indexed, it is moved to /DATA/SURROGATES/out.</li>
<li>You can recycle processed surrogate files by moving them from /DATA/SURROGATES/out to /DATA/SURROGATES/in.</li>
</ul>
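<p>
The recycling step above can be scripted. The following is a minimal Python sketch and not part of YaCy: the function name <code>recycle_surrogates</code> and its <code>data_dir</code> argument are hypothetical, and the sketch assumes that processed surrogates are kept in /DATA/SURROGATES/out as plain .xml files.
</p>
<pre>
import shutil
from pathlib import Path


def recycle_surrogates(data_dir):
    """Move processed .xml surrogate files from SURROGATES/out back to SURROGATES/in.

    data_dir is the path of the YaCy DATA directory (an assumption of this sketch).
    Returns the sorted names of the files that were moved.
    """
    out_dir = Path(data_dir) / "SURROGATES" / "out"
    in_dir = Path(data_dir) / "SURROGATES" / "in"
    moved = []
    # Only plain .xml files are handled; anything else in out/ is left untouched.
    for source in sorted(out_dir.glob("*.xml")):
        shutil.move(str(source), str(in_dir / source.name))
        moved.append(source.name)
    return moved
</pre>
<p>
Calling <code>recycle_surrogates("/path/to/yacy/DATA")</code>, where the path is a placeholder for your own installation, would queue every processed .xml surrogate for indexing again.
</p>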
<br />
::
<form><fieldset><legend>Import Process</legend>
<dl>
<dt>Thread:</dt><dd>#[thread]#</dd>
<dt>Dump:</dt><dd>#[dump]#</dd>
<dt>Processed:</dt><dd>#[count]# Wiki Entries</dd>
<dt>Speed:</dt><dd>#[speed]# articles per second</dd>
<dt>Running Time:</dt><dd>#[runningHours]# hours, #[runningMinutes]# minutes</dd>
<dt>Remaining Time:</dt><dd>#[remainingHours]# hours, #[remainingMinutes]# minutes</dd>
</dl>
</fieldset></form>
#(/import)#
#%env/templates/footer.template%#
</body>
</html>