yacy_search_server/source/de/anomic/plasma/plasmaSearchImages.java

// plasmaSearchImages.java 
// -----------------------
// part of YACY
// (C) by Michael Peter Christen; mc@yacy.net
// first published on http://www.anomic.de
// Frankfurt, Germany, 2006
// Created: 04.04.2006
//
// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation; either version 2 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA

package de.anomic.plasma;

import java.io.InputStream;
import java.net.MalformedURLException;
import java.util.HashMap;
import java.util.Iterator;

import de.anomic.htmlFilter.htmlFilterContentScraper;
import de.anomic.htmlFilter.htmlFilterImageEntry;
import de.anomic.plasma.parser.ParserException;
import de.anomic.server.serverDate;
import de.anomic.yacy.yacyURL;

public final class plasmaSearchImages {

    private final HashMap<String, htmlFilterImageEntry> images;
    
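    // Collects the images of the document at url and, while depth > 0, of the
    // pages it links to; maxTime is the time budget in milliseconds.
    public plasmaSearchImages(final long maxTime, final yacyURL url, final int depth, final boolean indexing) {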
        final long start = System.currentTimeMillis();
        this.images = new HashMap<String, htmlFilterImageEntry>();
        // fetch only while more than 10 ms of the time budget remain
        if (maxTime > 10) {
            final Object[] resource = plasmaSnippetCache.getResource(url, true, (int) maxTime, false, indexing);
            final InputStream res = (InputStream) resource[0];
            final Long resLength = (Long) resource[1];
            if (res != null) {
                plasmaParserDocument document = null;
                try {
                    // parse the document
                    document = plasmaSnippetCache.parseDocument(url, resLength.longValue(), res);
                } catch (final ParserException e) {
                    // parsing failed: document stays null and the constructor returns below
                } finally {
                    try { res.close(); } catch (final Exception e) {/* ignore this */}
                }
                if (document == null) return;
                
                // add the image links
                htmlFilterContentScraper.addAllImages(this.images, document.getImages());

                // also add the images from pages one link deeper, if depth > 0
                if (depth > 0) {
                    final Iterator<yacyURL> i = document.getHyperlinks().keySet().iterator();
                    String nexturlstring;
                    while (i.hasNext()) {
                        try {
                            nexturlstring = i.next().toNormalform(true, true);
                            addAll(new plasmaSearchImages(serverDate.remainingTime(start, maxTime, 10), new yacyURL(nexturlstring, null), depth - 1, indexing));
                        } catch (final MalformedURLException e1) {
                            e1.printStackTrace();
                        }
                    }
                }
                document.close();
            }
        }
    }
    
    public void addAll(final plasmaSearchImages m) {
        synchronized (m.images) {
            htmlFilterContentScraper.addAllImages(this.images, m.images);
        }
    }
    
    public Iterator<htmlFilterImageEntry> entries() {
        // returns an iterator over the collected htmlFilterImageEntry objects
        return images.values().iterator();
    }
    
}