yacy_search_server/source/net/yacy/document/parser/augment/AugmentParser.java

package net.yacy.document.parser.augment;

import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import net.yacy.cora.util.ConcurrentLog;
import net.yacy.data.ymark.YMarkUtil;
import net.yacy.document.AbstractParser;
import net.yacy.document.Document;
import net.yacy.document.Parser;
import net.yacy.document.parser.rdfa.impl.RDFaParser;
import net.yacy.kelondro.data.meta.DigestURI;
import net.yacy.search.Switchboard;


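/**
 * Parser that delegates HTML parsing to the {@link RDFaParser} and then
 * enriches each resulting document with tags from the "aggregatedtags" table.
 */
public class AugmentParser extends AbstractParser implements Parser {

    private final RDFaParser rdfaParser;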

    public AugmentParser() {
        super("AugmentParser");
        this.rdfaParser = new RDFaParser();

        ConcurrentLog.info("AugmentedParser", "augmented parser was initialized");

        this.SUPPORTED_EXTENSIONS.add("html");
        this.SUPPORTED_EXTENSIONS.add("htm");
        this.SUPPORTED_EXTENSIONS.add("xhtml");
        this.SUPPORTED_EXTENSIONS.add("php");
        this.SUPPORTED_MIME_TYPES.add("text/html");
        this.SUPPORTED_MIME_TYPES.add("text/xhtml+xml");
    }

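    /**
     * Parses the source with the RDFa parser, then adds stored tags to every
     * document it returns.
     */
    @Override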
    public Document[] parse(DigestURI url, String mimeType, String charset, InputStream source) throws Parser.Failure, InterruptedException {

        Document[] htmlDocs = this.rdfaParser.parse(url, mimeType, charset, source);

        for (final Document doc : htmlDocs) {
            /* analyze(doc, url, mimeType, charset);  // enrich document text */
            parseAndAugment(doc, url, mimeType, charset); // enrich document with additional tags
        }
        return htmlDocs;
    }

    /* TODO: not implemented yet
     *
    private void analyze(Document origDoc, DigestURI url, String mimeType, String charset) {
        // if the magic word appears in the document, perform extra actions
        if (origDoc.getKeywords().contains("magicword")) {
            String all = "yacylatest";
            // TODO: append the content of `all` to origDoc.text, maybe use Document.mergeDocuments() to do so
        }
    }
    */

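    /**
     * Adds the tags stored in the "aggregatedtags" table to the document
     * for every table row whose URL matches the document URL.
     */
    private void parseAndAugment(Document origDoc, DigestURI url, @SuppressWarnings("unused") String mimeType, @SuppressWarnings("unused") String charset) {

        // normalize once; the value does not change while iterating
        final String normalizedUrl = url.toNormalform(false);
        try {
            // iterate the rows ordered by creation timestamp
            Iterator<net.yacy.kelondro.blob.Tables.Row> it = Switchboard.getSwitchboard().tables.iterator("aggregatedtags");
            it = Switchboard.getSwitchboard().tables.orderBy(it, -1, "timestamp_creation").iterator();
            while (it.hasNext()) {
                final net.yacy.kelondro.blob.Tables.Row r = it.next();
                if (r.get("url", "").equals(normalizedUrl)) {
                    // copy the row's tags into a fresh set before handing them to the document
                    final Set<String> tags = new HashSet<String>();
                    for (final String s : YMarkUtil.keysStringToSet(r.get("scitag", ""))) {
                        tags.add(s);
                    }
                    origDoc.addTags(tags);
                }
            }
        } catch (final IOException e) {
            ConcurrentLog.logException(e);
        }
    }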


}