<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>YaCy '#[clientname]#': Wikimedia Dump Import</title>
#%env/templates/metas.template%#
#(import)#::<meta http-equiv="REFRESH" content="10" />#(/import)#
</head>
<body id="IndexImportWikimedia">
#%env/templates/header.template%#
#%env/templates/submenuIndexCreate.template%#
<h2>Wikimedia Dump Import</h2>
#(import)#
<p>#(status)#No import thread is running; you can start a new thread here::Bad input data: #[message]# #(/status)#</p>
<form action="IndexImportWikimedia_p.html" method="get" accept-charset="UTF-8">
<!-- no post method here, we don't want to transmit the whole file, only the path-->
<fieldset>
<legend>Wikimedia Dump File Selection: select a 'bz2' file</legend>
You can import Wikipedia dumps here. An example is the file
<a href="http://download.wikimedia.org/dewiki/20090311/dewiki-20090311-pages-articles.xml.bz2">
http://download.wikimedia.org/dewiki/20090311/dewiki-20090311-pages-articles.xml.bz2</a>.
<br />
The dump must be in XML format and compressed with bz2. Do not decompress the file after downloading!
<br />
<input name="file" type="text" value="DATA/HTCACHE/dewiki-20090311-pages-articles.xml.bz2" size="80" />
<input name="submit" type="submit" value="Import Wikimedia Dump" />
</fieldset>
</form>
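<p>
The importer reads the compressed dump on the fly rather than unpacking it first. The same streaming idea can be sketched in Python (illustrative only; YaCy's importer is written in Java, and the function name here is not part of YaCy):
</p>

```python
import bz2
import io
import xml.etree.ElementTree as ET

def iter_pages(dump):
    """Yield (title, text) pairs from a bz2-compressed MediaWiki XML dump,
    decompressing incrementally instead of unpacking the whole file first."""
    with bz2.open(dump, "rb") as stream:
        for _event, elem in ET.iterparse(stream):
            # MediaWiki export XML is namespaced; match on the local tag name.
            if elem.tag.rsplit("}", 1)[-1] == "page":
                title = elem.findtext("{*}title", default="")
                text = elem.findtext("{*}revision/{*}text", default="")
                yield title, text
                elem.clear()  # release memory held by already-processed pages
```

<p>
Because only one <code>page</code> element is kept in memory at a time, even multi-gigabyte dumps can be processed with a small, constant memory footprint.
</p>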
<p>
When the import is started, the following happens:
</p><ul>
<li>The dump is extracted on the fly and wiki entries are translated into Dublin Core data format. The output looks like this:
<pre>
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;surrogates xmlns:dc="http://purl.org/dc/elements/1.1/"&gt;
&lt;record&gt;
  &lt;dc:Title&gt;&lt;![CDATA[Alan Smithee]]&gt;&lt;/dc:Title&gt;
  &lt;dc:Identifier&gt;http://de.wikipedia.org/wiki/Alan%20Smithee&lt;/dc:Identifier&gt;
  &lt;dc:Description&gt;&lt;![CDATA[Der als Filmregisseur oft genannte Alan Smithee ist ein Anagramm]]&gt;&lt;/dc:Description&gt;
  &lt;dc:Language&gt;de&lt;/dc:Language&gt;
  &lt;dc:Date&gt;2009-05-07T06:03:48Z&lt;/dc:Date&gt;
&lt;/record&gt;
&lt;record&gt;
  ...
&lt;/record&gt;
&lt;/surrogates&gt;
</pre>
</li>
<li>Every 10,000 wiki records are combined into one output file, which is first written as a temporary file to /DATA/SURROGATES/in.</li>
<li>When a generated output file is complete, it is renamed to a .xml file.</li>
<li>Each time an XML surrogate file appears in /DATA/SURROGATES/in, the YaCy indexer fetches the file and indexes its record entries.</li>
<li>When a surrogate file has been fully indexed, it is moved to /DATA/SURROGATES/out.</li>
<li>You can recycle processed surrogate files by moving them from /DATA/SURROGATES/out back to /DATA/SURROGATES/in.</li>
</ul>
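<p>
The temporary-file-then-rename step above is what keeps the indexer from picking up half-written surrogates: a rename within one directory is atomic, so a file only ever appears under its final .xml name once it is complete. A minimal sketch of that pattern (in Python for illustration; record fields, the .prt suffix, and function names are assumptions, not YaCy's actual code):
</p>

```python
import os
import tempfile

def write_surrogate(records, out_dir):
    """Write a batch of Dublin Core records to out_dir, making the file
    visible under its final .xml name only once it is fully written."""
    # Write under a temporary name first so a watching indexer never
    # sees a partial file.
    fd, tmp_path = tempfile.mkstemp(suffix=".prt", dir=out_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<surrogates xmlns:dc="http://purl.org/dc/elements/1.1/">\n')
        for rec in records:
            f.write("<record>\n")
            f.write("  <dc:Title><![CDATA[%s]]></dc:Title>\n" % rec["title"])
            f.write("  <dc:Identifier>%s</dc:Identifier>\n" % rec["identifier"])
            f.write("</record>\n")
        f.write("</surrogates>\n")
    # Atomic within one filesystem: the .xml name appears all at once.
    final_path = tmp_path[: -len(".prt")] + ".xml"
    os.replace(tmp_path, final_path)
    return final_path
```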
<br />
::
<form><fieldset><legend>Import Process</legend>
<dl>
<dt>Thread:</dt><dd>#[thread]#</dd>
<dt>Dump:</dt><dd>#[dump]#</dd>
<dt>Processed:</dt><dd>#[count]# Wiki Entries</dd>
<dt>Speed:</dt><dd>#[speed]# articles per second</dd>
<dt>Running Time:</dt><dd>#[runningHours]# hours, #[runningMinutes]# minutes</dd>
<dt>Remaining Time:</dt><dd>#[remainingHours]# hours, #[remainingMinutes]# minutes</dd>
</dl>
</fieldset></form>
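<p>
The remaining-time figure follows directly from the numbers displayed above: the difference between the total and processed article counts, divided by the measured speed. A sketch of that arithmetic (the total article count is an assumed input here; YaCy derives its own estimate internally):
</p>

```python
def eta(processed, total, articles_per_second):
    """Return (hours, minutes) remaining, given throughput so far."""
    if articles_per_second <= 0:
        raise ValueError("speed must be positive")
    remaining_seconds = (total - processed) / articles_per_second
    minutes = int(remaining_seconds // 60)
    return minutes // 60, minutes % 60
```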
#(/import)#
#%env/templates/footer.template%#
</body>
</html>