// plasmaSwitchboard.java
// (C) 2004-2007 by Michael Peter Christen; mc@yacy.net, Frankfurt a. M., Germany
// first published 2004 on http://yacy.net
//
// This is a part of YaCy, a peer-to-peer based web search engine
//
// $LastChangedDate$
// $LastChangedRevision$
// $LastChangedBy$
//
// LICENSE
//
// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation; either version 2 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

/*
This class holds the run-time environment of the plasma
Search Engine. Its data forms a blackboard which can be used
to organize running jobs around the indexing algorithm.
The blackboard consists of the following entities:
- storage: one plasmaStore object with the url-based database
- configuration: initialized by properties once, then by external functions
- job queues: for parsing, condensing, indexing
- black/blue/whitelists: controls input and output to the index

This class is also the core of the http crawling.
There are some items that need to be respected when crawling the web:
1) respect robots.txt
2) do not access one domain too frequently, wait between accesses
3) remember crawled URLs and do not access them again too early
4) prioritization of specific links should be possible (hot-lists)
5) attributes for crawling (depth, filters, hot/black-lists, priority)
6) different crawling jobs with different attributes ('Orders') simultaneously

We implement some specific tasks and use different databases to achieve these goals:
- a database 'crawlerDisallow.db' contains all URLs that shall not be crawled
- a database 'crawlerDomain.db' holds all domains and access times, where we loaded the disallow tables
  this table contains the following entities:
  <flag: robots exist/not exist, last access of robots.txt, last access of domain (for access scheduling)>
- four databases for scheduled access: crawlerScheduledHotText.db, crawlerScheduledColdText.db,
  crawlerScheduledHotMedia.db and crawlerScheduledColdMedia.db
- two stacks for new URLs: newText.stack and newMedia.stack
- two databases for URL double-check: knownText.db and knownMedia.db
- one database with crawling orders: crawlerOrders.db

The information flow of a single URL that is crawled is as follows:
- a html file is loaded from a specific URL within the module httpdProxyServlet as
  a process of the proxy.
- the file is passed to httpdProxyCache. Here its processing is delayed until the proxy is idle.
- The cache entry is passed on to the plasmaSwitchboard. There the URL is stored into plasmaLURL where
  the URL is stored under a specific hash. The URLs from the content are extracted, stored in plasmaLURL
  with a 'wrong' date (the dates of the URLs are not known at this time, only after fetching) and stacked with
  plasmaCrawlerTextStack. The content is read and split into rated words in plasmaCondenser.
  The split words are then integrated into the index with plasmaSearch.
- In plasmaSearch the words are indexed by reversing the relation between URL and words: one URL points
  to many words, the words within the document at the URL. After reversing, one word points
  to many URLs, all the URLs where the word occurs. One single word->URL-hash relation is stored in
  plasmaIndexEntry. A set of plasmaIndexEntries is a reverse word index.
  This reverse word index is stored temporarily in plasmaIndexCache.
- In plasmaIndexCache the single plasmaIndexEntry objects are collected and stored into a plasmaIndex entry.
  These plasmaIndex objects are the true reverse word indexes.
- In plasmaIndex the plasmaIndexEntry objects are stored in a kelondroTree; an indexed file in the file system.

The information flow of a search request is as follows:
- in httpdFileServlet the user enters a search query, which is passed to plasmaSwitchboard
- in plasmaSwitchboard, the query is passed to plasmaSearch.
- in plasmaSearch, the plasmaSearch.result object is generated by simultaneous enumeration of
  URL hashes in the reverse word indexes plasmaIndex
- (future: the plasmaSearch.result object is used to identify more key words for a new search)
*/
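/*
A minimal sketch (illustrative only, not part of the original source) of the
reverse word index described above, using hypothetical names: the URL->words
relation is inverted so that one word hash maps to the set of URL hashes of
all documents containing that word.

    import java.util.*;

    final class ReverseIndexSketch {
        // word hash -> set of URL hashes (the reverse word index)
        private final Map<String, Set<String>> index = new HashMap<String, Set<String>>();

        // register every word of a document under the document's URL hash
        void addDocument(final String urlHash, final Collection<String> wordHashes) {
            for (final String w : wordHashes) {
                Set<String> urls = index.get(w);
                if (urls == null) { urls = new TreeSet<String>(); index.put(w, urls); }
                urls.add(urlHash);
            }
        }

        // enumerate all URL hashes that carry the given word hash
        Set<String> urlsFor(final String wordHash) {
            final Set<String> urls = index.get(wordHash);
            return (urls == null) ? Collections.<String>emptySet() : urls;
        }
    }
*/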
package de.anomic.plasma;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.lang.reflect.Constructor;
import java.net.MalformedURLException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Iterator;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.Timer;
import java.util.TimerTask;
import java.util.TreeMap;
import java.util.TreeSet;

import de.anomic.crawler.CrawlEntry;
import de.anomic.crawler.CrawlProfile;
import de.anomic.crawler.CrawlQueues;
import de.anomic.crawler.CrawlStacker;
import de.anomic.crawler.ErrorURL;
import de.anomic.crawler.HTTPLoader;
import de.anomic.crawler.ImporterManager;
import de.anomic.crawler.IndexingStack;
import de.anomic.crawler.NoticedURL;
import de.anomic.crawler.ResourceObserver;
import de.anomic.crawler.ResultImages;
import de.anomic.crawler.ResultURLs;
import de.anomic.crawler.RobotsTxt;
import de.anomic.crawler.ZURL;
import de.anomic.data.URLLicense;
import de.anomic.data.blogBoard;
import de.anomic.data.blogBoardComments;
import de.anomic.data.bookmarksDB;
import de.anomic.data.listManager;
import de.anomic.data.messageBoard;
import de.anomic.data.userDB;
import de.anomic.data.wikiBoard;
import de.anomic.data.wiki.wikiParser;
import de.anomic.http.HttpClient;
import de.anomic.http.JakartaCommonsHttpClient;
import de.anomic.http.httpHeader;
import de.anomic.http.httpRemoteProxyConfig;
import de.anomic.http.httpd;
import de.anomic.http.httpdRobotsTxtConfig;
import de.anomic.index.indexReferenceBlacklist;
import de.anomic.index.indexURLReference;
import de.anomic.kelondro.kelondroCache;
import de.anomic.kelondro.kelondroCachedRecords;
import de.anomic.kelondro.kelondroMSetTools;
import de.anomic.kelondro.kelondroNaturalOrder;
import de.anomic.plasma.parser.ParserException;
import de.anomic.server.serverAbstractSwitch;
import de.anomic.server.serverBusyThread;
import de.anomic.server.serverCodings;
import de.anomic.server.serverCore;
import de.anomic.server.serverDate;
import de.anomic.server.serverDomains;
import de.anomic.server.serverFileUtils;
import de.anomic.server.serverInstantBusyThread;
import de.anomic.server.serverMemory;
import de.anomic.server.serverObjects;
import de.anomic.server.serverProcessor;
import de.anomic.server.serverProcessorJob;
import de.anomic.server.serverProfiling;
import de.anomic.server.serverSemaphore;
import de.anomic.server.serverSwitch;
import de.anomic.server.serverSystem;
import de.anomic.server.serverThread;
import de.anomic.server.logging.serverLog;
import de.anomic.tools.crypt;
import de.anomic.tools.nxTools;
import de.anomic.yacy.yacyClient;
import de.anomic.yacy.yacyCore;
import de.anomic.yacy.yacyNewsPool;
import de.anomic.yacy.yacyNewsRecord;
import de.anomic.yacy.yacySeed;
import de.anomic.yacy.yacyTray;
import de.anomic.yacy.yacyURL;
import de.anomic.yacy.yacyVersion;

public final class plasmaSwitchboard extends serverAbstractSwitch<IndexingStack.QueueEntry> implements serverSwitch<IndexingStack.QueueEntry> {

    // load slots
    public static int xstackCrawlSlots = 2000;

    private int dhtTransferIndexCount = 100;
    public static long lastPPMUpdate = System.currentTimeMillis() - 30000;

    // coloured list management
    public static TreeSet<String> badwords = null;
    public static TreeSet<String> blueList = null;
    public static TreeSet<String> stopwords = null;
    public static indexReferenceBlacklist urlBlacklist = null;

    public static wikiParser wikiParser = null;

    public yacyTray yacytray;

    // storage management
    public File htCachePath;
    public File plasmaPath;
    public File listsPath;
    public File htDocsPath;
    public File rankingPath;
    public File workPath;
    public File releasePath;
    public Map<String, String> rankingPermissions;
    public plasmaWordIndex webIndex;
    public CrawlQueues crawlQueues;
    public ResultURLs crawlResults;
    public CrawlStacker crawlStacker;
    public messageBoard messageDB;
    public wikiBoard wikiDB;
    public blogBoard blogDB;
    public blogBoardComments blogCommentDB;
    public RobotsTxt robots;
    public boolean rankingOn;
    public plasmaRankingDistribution rankingOwnDistribution;
    public plasmaRankingDistribution rankingOtherDistribution;
    public HashMap<String, Object[]> outgoingCookies, incomingCookies;
    public plasmaParser parser;
    public volatile long proxyLastAccess, localSearchLastAccess, remoteSearchLastAccess;
    public yacyCore yc;
    public ResourceObserver observer;
    public userDB userDB;
    public bookmarksDB bookmarksDB;
    public plasmaWebStructure webStructure;
    public ImporterManager dbImportManager;
    public plasmaDHTFlush transferIdxThread = null;
    private plasmaDHTChunk dhtTransferChunk = null;
    public ArrayList<plasmaSearchQuery> localSearches; // array of search result properties as HashMaps
    public ArrayList<plasmaSearchQuery> remoteSearches; // array of search result properties as HashMaps
    public HashMap<String, TreeSet<Long>> localSearchTracker, remoteSearchTracker; // mappings from requesting host to a TreeSet of Long(access time)
    public long lastseedcheckuptime = -1;
    public long indexedPages = 0;
    public long lastindexedPages = 0;
    public double requestedQueries = 0d;
    public double lastrequestedQueries = 0d;
    public int totalPPM = 0;
    public double totalQPM = 0d;
    public TreeMap<String, String> clusterhashes; // map of peerhash(String)/alternative-local-address as ip:port or only ip (String) or null if address in seed should be used
    public boolean acceptLocalURLs, acceptGlobalURLs;
    public URLLicense licensedURLs;
    public Timer moreMemory;

    public serverProcessor<indexingQueueEntry> indexingDocumentProcessor;
    public serverProcessor<indexingQueueEntry> indexingCondensementProcessor;
    public serverProcessor<indexingQueueEntry> indexingAnalysisProcessor;
    public serverProcessor<indexingQueueEntry> indexingStorageProcessor;

    public httpdRobotsTxtConfig robotstxtConfig = null;

    private final serverSemaphore shutdownSync = new serverSemaphore(0);
    private boolean terminate = false;

    //private Object crawlingPausedSync = new Object();
    //private boolean crawlingIsPaused = false;

    public Hashtable<String, Object[]> crawlJobsStatus = new Hashtable<String, Object[]>();

    private static plasmaSwitchboard sb = null;

    public plasmaSwitchboard(final File rootPath, final String initPath, final String configPath, final boolean applyPro) {
        super(rootPath, initPath, configPath, applyPro);
        serverProfiling.startSystemProfiling();
        sb = this;

        // set loglevel and log
        setLog(new serverLog("PLASMA"));
        if (applyPro) this.log.logInfo("This is the pro-version of YaCy");

        initSystemTray();

        // remote proxy configuration
        httpRemoteProxyConfig.init(this);

        // load the network definition
        overwriteNetworkDefinition(this);

        // load values from configs
        this.plasmaPath = getConfigPath(plasmaSwitchboardConstants.PLASMA_PATH, plasmaSwitchboardConstants.PLASMA_PATH_DEFAULT);
        this.log.logConfig("Plasma DB Path: " + this.plasmaPath.toString());
        final File indexPrimaryPath = getConfigPath(plasmaSwitchboardConstants.INDEX_PRIMARY_PATH, plasmaSwitchboardConstants.INDEX_PATH_DEFAULT);
        this.log.logConfig("Index Primary Path: " + indexPrimaryPath.toString());
        final File indexSecondaryPath = (getConfig(plasmaSwitchboardConstants.INDEX_SECONDARY_PATH, "").length() == 0) ? indexPrimaryPath : new File(getConfig(plasmaSwitchboardConstants.INDEX_SECONDARY_PATH, ""));
        this.log.logConfig("Index Secondary Path: " + indexSecondaryPath.toString());
        this.listsPath = getConfigPath(plasmaSwitchboardConstants.LISTS_PATH, plasmaSwitchboardConstants.LISTS_PATH_DEFAULT);
        this.log.logConfig("Lists Path: " + this.listsPath.toString());
        this.htDocsPath = getConfigPath(plasmaSwitchboardConstants.HTDOCS_PATH, plasmaSwitchboardConstants.HTDOCS_PATH_DEFAULT);
        this.log.logConfig("HTDOCS Path: " + this.htDocsPath.toString());
        this.rankingPath = getConfigPath(plasmaSwitchboardConstants.RANKING_PATH, plasmaSwitchboardConstants.RANKING_PATH_DEFAULT);
        this.log.logConfig("Ranking Path: " + this.rankingPath.toString());
        this.rankingPermissions = new HashMap<String, String>(); // mapping of permission - to filename.
        this.workPath = getConfigPath(plasmaSwitchboardConstants.WORK_PATH, plasmaSwitchboardConstants.WORK_PATH_DEFAULT);
        this.log.logConfig("Work Path: " + this.workPath.toString());

        // set a high maximum cache size to current size; this is adapted later automatically
        final int wordCacheMaxCount = Math.max((int) getConfigLong(plasmaSwitchboardConstants.WORDCACHE_INIT_COUNT, 30000),
                (int) getConfigLong(plasmaSwitchboardConstants.WORDCACHE_MAX_COUNT, 20000));
        setConfig(plasmaSwitchboardConstants.WORDCACHE_MAX_COUNT, Integer.toString(wordCacheMaxCount));

        // start indexing management
        log.logConfig("Starting Indexing Management");
        final String networkName = getConfig("network.unit.name", "");
        webIndex = new plasmaWordIndex(networkName, log, indexPrimaryPath, indexSecondaryPath, wordCacheMaxCount);
        crawlResults = new ResultURLs();

        // start yacy core
        log.logConfig("Starting YaCy Protocol Core");
        this.yc = new yacyCore(this);
        serverInstantBusyThread.oneTimeJob(this, "loadSeedLists", yacyCore.log, 0);
        final long startedSeedListAquisition = System.currentTimeMillis();

        // set up local robots.txt
        this.robotstxtConfig = httpdRobotsTxtConfig.init(this);

        // setting timestamp of last proxy access
        this.proxyLastAccess = System.currentTimeMillis() - 10000;
        this.localSearchLastAccess = System.currentTimeMillis() - 10000;
        this.remoteSearchLastAccess = System.currentTimeMillis() - 10000;
        this.webStructure = new plasmaWebStructure(log, rankingPath, "LOCAL/010_cr/", getConfig("CRDist0Path", plasmaRankingDistribution.CR_OWN), new File(plasmaPath, "webStructure.map"));

        // configuring list path
        if (!(listsPath.exists())) listsPath.mkdirs();

        // load coloured lists
        if (blueList == null) {
            // read only once upon first instantiation of this class
            final String f = getConfig(plasmaSwitchboardConstants.LIST_BLUE, plasmaSwitchboardConstants.LIST_BLUE_DEFAULT);
            final File plasmaBlueListFile = new File(f);
            if (f != null) blueList = kelondroMSetTools.loadList(plasmaBlueListFile, kelondroNaturalOrder.naturalComparator); else blueList = new TreeSet<String>();
            this.log.logConfig("loaded blue-list from file " + plasmaBlueListFile.getName() + ", " +
                    blueList.size() + " entries, " +
                    ppRamString(plasmaBlueListFile.length()/1024));
        }

        // load the black-list / inspired by [AS]
        final File blacklistsPath = getConfigPath(plasmaSwitchboardConstants.LISTS_PATH, plasmaSwitchboardConstants.LISTS_PATH_DEFAULT);
        String blacklistClassName = getConfig(plasmaSwitchboardConstants.BLACKLIST_CLASS, plasmaSwitchboardConstants.BLACKLIST_CLASS_DEFAULT);
        if (blacklistClassName.equals("de.anomic.plasma.urlPattern.defaultURLPattern")) {
            // patch old class location
            blacklistClassName = plasmaSwitchboardConstants.BLACKLIST_CLASS_DEFAULT;
            setConfig(plasmaSwitchboardConstants.BLACKLIST_CLASS, blacklistClassName);
        }

        this.log.logConfig("Starting blacklist engine ...");
        try {
            final Class<?> blacklistClass = Class.forName(blacklistClassName);
            final Constructor<?> blacklistClassConstr = blacklistClass.getConstructor(new Class[] { File.class });
            urlBlacklist = (indexReferenceBlacklist) blacklistClassConstr.newInstance(new Object[] { blacklistsPath });
            this.log.logFine("Used blacklist engine class: " + blacklistClassName);
            this.log.logConfig("Using blacklist engine: " + urlBlacklist.getEngineInfo());
        } catch (final Exception e) {
            this.log.logSevere("Unable to load the blacklist engine", e);
            System.exit(-1);
        } catch (final Error e) {
            this.log.logSevere("Unable to load the blacklist engine", e);
            System.exit(-1);
        }

        this.log.logConfig("Loading blacklist data ...");
        listManager.switchboard = this;
        listManager.listsPath = blacklistsPath;
        listManager.reloadBlacklists();

        // load badwords (to filter the topwords)
        if (badwords == null) {
            final File badwordsFile = new File(rootPath, plasmaSwitchboardConstants.LIST_BADWORDS_DEFAULT);
            badwords = kelondroMSetTools.loadList(badwordsFile, kelondroNaturalOrder.naturalComparator);
            this.log.logConfig("loaded badwords from file " + badwordsFile.getName() +
                    ", " + badwords.size() + " entries, " +
                    ppRamString(badwordsFile.length()/1024));
        }

        // load stopwords
        if (stopwords == null) {
            final File stopwordsFile = new File(rootPath, plasmaSwitchboardConstants.LIST_STOPWORDS_DEFAULT);
            stopwords = kelondroMSetTools.loadList(stopwordsFile, kelondroNaturalOrder.naturalComparator);
            this.log.logConfig("loaded stopwords from file " + stopwordsFile.getName() + ", " +
                    stopwords.size() + " entries, " +
                    ppRamString(stopwordsFile.length()/1024));
        }

        // load ranking tables
        final File YBRPath = new File(rootPath, "ranking/YBR");
        if (YBRPath.exists()) {
            plasmaSearchRankingProcess.loadYBR(YBRPath, 15);
        }

        // loading the robots.txt db
        this.log.logConfig("Initializing robots.txt DB");
        final File robotsDBFile = new File(this.plasmaPath, plasmaSwitchboardConstants.DBFILE_CRAWL_ROBOTS);
        robots = new RobotsTxt(robotsDBFile);
        this.log.logConfig("Loaded robots.txt DB from file " + robotsDBFile.getName() +
                ", " + robots.size() + " entries" +
                ", " + ppRamString(robotsDBFile.length()/1024));

        // start a cache manager
        log.logConfig("Starting HT Cache Manager");

        // create the cache directory
        htCachePath = getConfigPath(plasmaSwitchboardConstants.HTCACHE_PATH, plasmaSwitchboardConstants.HTCACHE_PATH_DEFAULT);
        this.log.logInfo("HTCACHE Path = " + htCachePath.getAbsolutePath());
        final long maxCacheSize = 1024 * 1024 * Long.parseLong(getConfig(plasmaSwitchboardConstants.PROXY_CACHE_SIZE, "2")); // the configured value is in megabytes
        plasmaHTCache.init(htCachePath, maxCacheSize);

        // create the release download directory
        releasePath = getConfigPath(plasmaSwitchboardConstants.RELEASE_PATH, plasmaSwitchboardConstants.RELEASE_PATH_DEFAULT);
        releasePath.mkdirs();
        this.log.logInfo("RELEASE Path = " + releasePath.getAbsolutePath());

        // starting message board
        initMessages();

        // starting wiki
        initWiki();

        // starting blog
        initBlog();

        // Init User DB
        this.log.logConfig("Loading User DB");
        final File userDbFile = new File(getRootPath(), plasmaSwitchboardConstants.DBFILE_USER);
        this.userDB = new userDB(userDbFile);
        this.log.logConfig("Loaded User DB from file " + userDbFile.getName() +
                ", " + this.userDB.size() + " entries" +
                ", " + ppRamString(userDbFile.length()/1024));

        // Init bookmarks DB
        initBookmarks();

        // set a maximum amount of memory for the caches
        // long memprereq = Math.max(getConfigLong(INDEXER_MEMPREREQ, 0), wordIndex.minMem());
        // setConfig(INDEXER_MEMPREREQ, memprereq);
        // setThreadPerformance(INDEXER, getConfigLong(INDEXER_IDLESLEEP, 0), getConfigLong(INDEXER_BUSYSLEEP, 0), memprereq);
        kelondroCachedRecords.setCacheGrowStati(40 * 1024 * 1024, 20 * 1024 * 1024);
        kelondroCache.setCacheGrowStati(40 * 1024 * 1024, 20 * 1024 * 1024);

        // make parser
        log.logConfig("Starting Parser");
        this.parser = new plasmaParser();

        // define an extension-blacklist
        log.logConfig("Parser: Initializing Extension Mappings for Media/Parser");
        plasmaParser.initMediaExt(plasmaParser.extString2extList(getConfig(plasmaSwitchboardConstants.PARSER_MEDIA_EXT, "")));
        plasmaParser.initSupportedHTMLFileExt(plasmaParser.extString2extList(getConfig(plasmaSwitchboardConstants.PARSER_MEDIA_EXT_PARSEABLE, "")));

        // define a realtime parsable mimetype list
        log.logConfig("Parser: Initializing Mime Types");
        plasmaParser.initHTMLParsableMimeTypes(getConfig(plasmaSwitchboardConstants.PARSER_MIMETYPES_HTML, "application/xhtml+xml,text/html,text/plain"));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_PROXY, getConfig(plasmaSwitchboardConstants.PARSER_MIMETYPES_PROXY, null));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_CRAWLER, getConfig(plasmaSwitchboardConstants.PARSER_MIMETYPES_CRAWLER, null));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_ICAP, getConfig(plasmaSwitchboardConstants.PARSER_MIMETYPES_ICAP, null));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_URLREDIRECTOR, getConfig(plasmaSwitchboardConstants.PARSER_MIMETYPES_URLREDIRECTOR, null));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_IMAGE, getConfig(plasmaSwitchboardConstants.PARSER_MIMETYPES_IMAGE, null));

        // start a loader
        log.logConfig("Starting Crawl Loader");
        this.crawlQueues = new CrawlQueues(this, plasmaPath);
        this.crawlQueues.noticeURL.setMinimumLocalDelta(this.getConfigLong("minimumLocalDelta", this.crawlQueues.noticeURL.getMinimumLocalDelta()));
        this.crawlQueues.noticeURL.setMinimumGlobalDelta(this.getConfigLong("minimumGlobalDelta", this.crawlQueues.noticeURL.getMinimumGlobalDelta()));

        /*
         * Creating sync objects and loading status for the crawl jobs
         * a) local crawl
         * b) remote triggered crawl
         * c) global crawl trigger
         */
        this.crawlJobsStatus.put(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL, new Object[]{
                new Object(),
                Boolean.valueOf(getConfig(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL + "_isPaused", "false"))});
        this.crawlJobsStatus.put(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL, new Object[]{
                new Object(),
                Boolean.valueOf(getConfig(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL + "_isPaused", "false"))});
        this.crawlJobsStatus.put(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_CRAWL_LOADER, new Object[]{
                new Object(),
                Boolean.valueOf(getConfig(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_CRAWL_LOADER + "_isPaused", "false"))});

        // init cookie-Monitor
        this.log.logConfig("Starting Cookie Monitor");
        this.outgoingCookies = new HashMap<String, Object[]>();
        this.incomingCookies = new HashMap<String, Object[]>();

        // init search history trackers
        this.localSearchTracker = new HashMap<String, TreeSet<Long>>(); // String:TreeSet - IP:set of Long(accessTime)
        this.remoteSearchTracker = new HashMap<String, TreeSet<Long>>();
        this.localSearches = new ArrayList<plasmaSearchQuery>(); // contains search result properties as HashMaps
        this.remoteSearches = new ArrayList<plasmaSearchQuery>();

        // init messages: clean up message symbol
        final File notifierSource = new File(getRootPath(), getConfig(plasmaSwitchboardConstants.HTROOT_PATH, plasmaSwitchboardConstants.HTROOT_PATH_DEFAULT) + "/env/grafics/empty.gif");
        final File notifierDest = new File(getConfigPath(plasmaSwitchboardConstants.HTDOCS_PATH, plasmaSwitchboardConstants.HTDOCS_PATH_DEFAULT), "notifier.gif");
        try {
            serverFileUtils.copy(notifierSource, notifierDest);
        } catch (final IOException e) {
        }

        // clean up profiles
        this.log.logConfig("Cleaning Profiles");
        try { cleanProfiles(); } catch (final InterruptedException e) { /* Ignore this here */ }

        // init ranking transmission
        /*
        CRDistOn = true/false
        CRDist0Path = GLOBAL/010_owncr
        CRDist0Method = 1
        CRDist0Percent = 0
        CRDist0Target =
        CRDist1Path = GLOBAL/014_othercr/1
        CRDist1Method = 9
        CRDist1Percent = 30
        CRDist1Target = kaskelix.de:8080,yacy.dyndns.org:8000,suma-lab.de:8080
        */
        rankingOn = getConfig(plasmaSwitchboardConstants.RANKING_DIST_ON, "true").equals("true") && networkName.equals("freeworld");
        rankingOwnDistribution = new plasmaRankingDistribution(log, webIndex.seedDB, new File(rankingPath, getConfig(plasmaSwitchboardConstants.RANKING_DIST_0_PATH, plasmaRankingDistribution.CR_OWN)), (int) getConfigLong(plasmaSwitchboardConstants.RANKING_DIST_0_METHOD, plasmaRankingDistribution.METHOD_ANYSENIOR), (int) getConfigLong(plasmaSwitchboardConstants.RANKING_DIST_0_METHOD, 0), getConfig(plasmaSwitchboardConstants.RANKING_DIST_0_TARGET, ""));
        rankingOtherDistribution = new plasmaRankingDistribution(log, webIndex.seedDB, new File(rankingPath, getConfig(plasmaSwitchboardConstants.RANKING_DIST_1_PATH, plasmaRankingDistribution.CR_OTHER)), (int) getConfigLong(plasmaSwitchboardConstants.RANKING_DIST_1_METHOD, plasmaRankingDistribution.METHOD_MIXEDSENIOR), (int) getConfigLong(plasmaSwitchboardConstants.RANKING_DIST_1_METHOD, 30), getConfig(plasmaSwitchboardConstants.RANKING_DIST_1_TARGET, "kaskelix.de:8080,yacy.dyndns.org:8000"));

        // init facility DB
        /*
        log.logSystem("Starting Facility Database");
        File facilityDBpath = new File(getRootPath(), "DATA/SETTINGS/");
        facilityDB = new kelondroTables(facilityDBpath);
        facilityDB.declareMaps("backlinks", 250, 500, new String[] {"date"}, null);
        log.logSystem("..opened backlinks");
        facilityDB.declareMaps("zeitgeist", 40, 500);
        log.logSystem("..opened zeitgeist");
        facilityDB.declareTree("statistik", new int[]{11, 8, 8, 8, 8, 8, 8}, 0x400);
        log.logSystem("..opened statistik");
        facilityDB.update("statistik", (new serverDate()).toShortString(false).substring(0, 11), new long[]{1,2,3,4,5,6});
        long[] testresult = facilityDB.selectLong("statistik", "yyyyMMddHHm");
        testresult = facilityDB.selectLong("statistik", (new serverDate()).toShortString(false).substring(0, 11));
        */

        // init nameCacheNoCachingList
        final String noCachingList = getConfig(plasmaSwitchboardConstants.HTTPC_NAME_CACHE_CACHING_PATTERNS_NO, "");
        final String[] noCachingEntries = noCachingList.split(",");
        for (int i = 0; i < noCachingEntries.length; i++) {
            final String entry = noCachingEntries[i].trim();
            serverDomains.nameCacheNoCachingPatterns.add(entry);
        }

        // generate snippets cache
        log.logConfig("Initializing Snippet Cache");
        plasmaSnippetCache.init(parser, log);

        final String wikiParserClassName = getConfig(plasmaSwitchboardConstants.WIKIPARSER_CLASS, plasmaSwitchboardConstants.WIKIPARSER_CLASS_DEFAULT);
        this.log.logConfig("Loading wiki parser " + wikiParserClassName + " ...");
        try {
            final Class<?> wikiParserClass = Class.forName(wikiParserClassName);
            final Constructor<?> wikiParserClassConstr = wikiParserClass.getConstructor(new Class[] { plasmaSwitchboard.class });
            wikiParser = (wikiParser) wikiParserClassConstr.newInstance(new Object[] { this });
        } catch (final Exception e) {
            this.log.logSevere("Unable to load wiki parser, the wiki won't work", e);
        }

        // initializing the resourceObserver
        this.observer = new ResourceObserver(this);
        // run the observer here a first time
        this.observer.resourceObserverJob();

        // initializing the stackCrawlThread
        this.crawlStacker = new CrawlStacker(this, this.plasmaPath, (int) getConfigLong("tableTypeForPreNURL", 0), (((int) getConfigLong("tableTypeForPreNURL", 0) == 0) && (getConfigLong(plasmaSwitchboardConstants.CRAWLSTACK_BUSYSLEEP, 0) <= 100)));
        //this.sbStackCrawlThread = new plasmaStackCrawlThread(this, this.plasmaPath, ramPreNURL);
        //this.sbStackCrawlThread.start();

        // initializing dht chunk generation
        this.dhtTransferChunk = null;
        this.dhtTransferIndexCount = (int) getConfigLong(plasmaSwitchboardConstants.INDEX_DIST_CHUNK_SIZE_START, 50);

        // init robinson cluster
        // before we do that, we wait some time until the seed list is loaded.
        while (((System.currentTimeMillis() - startedSeedListAquisition) < 8000) && (this.webIndex.seedDB.sizeConnected() == 0)) try { Thread.sleep(1000); } catch (final InterruptedException e) {}
        try { Thread.sleep(1000); } catch (final InterruptedException e) {}
        this.clusterhashes = this.webIndex.seedDB.clusterHashes(getConfig("cluster.peers.yacydomain", ""));

        // deploy blocking threads
        indexingStorageProcessor = new serverProcessor<indexingQueueEntry>(this, "storeDocumentIndex", 1, null);
        indexingAnalysisProcessor = new serverProcessor<indexingQueueEntry>(this, "webStructureAnalysis", serverProcessor.useCPU + 1, indexingStorageProcessor);
        indexingCondensementProcessor = new serverProcessor<indexingQueueEntry>(this, "condenseDocument", serverProcessor.useCPU + 1, indexingAnalysisProcessor);
        indexingDocumentProcessor = new serverProcessor<indexingQueueEntry>(this, "parseDocument", serverProcessor.useCPU + 1, indexingCondensementProcessor);

        // deploy busy threads
        log.logConfig("Starting Threads");
        serverMemory.gc(1000, "plasmaSwitchboard, help for profiler"); // help for profiler - thq

        moreMemory = new Timer(); // init GC Thread - thq
        moreMemory.schedule(new MoreMemory(), 300000, 600000);

        deployThread(plasmaSwitchboardConstants.CLEANUP, "Cleanup", "simple cleaning process for monitoring information", null,
                new serverInstantBusyThread(this, plasmaSwitchboardConstants.CLEANUP_METHOD_START, plasmaSwitchboardConstants.CLEANUP_METHOD_JOBCOUNT, plasmaSwitchboardConstants.CLEANUP_METHOD_FREEMEM), 600000); // every 5 minutes; wait 10 minutes until the first run
        deployThread(plasmaSwitchboardConstants.CRAWLSTACK, "Crawl URL Stacker", "process that checks url for double-occurrences and for allowance/disallowance by robots.txt", null,
                new serverInstantBusyThread(crawlStacker, plasmaSwitchboardConstants.CRAWLSTACK_METHOD_START, plasmaSwitchboardConstants.CRAWLSTACK_METHOD_JOBCOUNT, plasmaSwitchboardConstants.CRAWLSTACK_METHOD_FREEMEM), 8000);
        deployThread(plasmaSwitchboardConstants.INDEXER, "Indexing", "thread that either initiates a parsing/indexing queue, distributes the index into the DHT, stores parsed documents or flushes the index cache", "/IndexCreateIndexingQueue_p.html",
                new serverInstantBusyThread(this, plasmaSwitchboardConstants.INDEXER_METHOD_START, plasmaSwitchboardConstants.INDEXER_METHOD_JOBCOUNT, plasmaSwitchboardConstants.INDEXER_METHOD_FREEMEM), 10000);
        deployThread(plasmaSwitchboardConstants.PROXY_CACHE_ENQUEUE, "Proxy Cache Enqueue", "job takes new input files from RAM stack, stores them, and hands over to the Indexing Stack", null,
                new serverInstantBusyThread(this, plasmaSwitchboardConstants.PROXY_CACHE_ENQUEUE_METHOD_START, plasmaSwitchboardConstants.PROXY_CACHE_ENQUEUE_METHOD_JOBCOUNT, plasmaSwitchboardConstants.PROXY_CACHE_ENQUEUE_METHOD_FREEMEM), 10000);
        deployThread(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL, "Remote Crawl Job", "thread that performs a single crawl/indexing step triggered by a remote peer", null,
                new serverInstantBusyThread(crawlQueues, plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL_METHOD_START, plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL_METHOD_JOBCOUNT, plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL_METHOD_FREEMEM), 30000);
        deployThread(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_CRAWL_LOADER, "Remote Crawl URL Loader", "thread that loads remote crawl lists from other peers", "",
                new serverInstantBusyThread(crawlQueues, plasmaSwitchboardConstants.CRAWLJOB_REMOTE_CRAWL_LOADER_METHOD_START, plasmaSwitchboardConstants.CRAWLJOB_REMOTE_CRAWL_LOADER_METHOD_JOBCOUNT, plasmaSwitchboardConstants.CRAWLJOB_REMOTE_CRAWL_LOADER_METHOD_FREEMEM), 30000); // error here?
        deployThread(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL, "Local Crawl", "thread that performs a single crawl step from the local crawl queue", "/IndexCreateWWWLocalQueue_p.html",
                new serverInstantBusyThread(crawlQueues, plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL_METHOD_START, plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL_METHOD_JOBCOUNT, plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL_METHOD_FREEMEM), 10000);
        deployThread(plasmaSwitchboardConstants.SEED_UPLOAD, "Seed-List Upload", "task that a principal peer performs to generate and upload a seed-list to an ftp account", null,
                new serverInstantBusyThread(yc, plasmaSwitchboardConstants.SEED_UPLOAD_METHOD_START, plasmaSwitchboardConstants.SEED_UPLOAD_METHOD_JOBCOUNT, plasmaSwitchboardConstants.SEED_UPLOAD_METHOD_FREEMEM), 180000);
        deployThread(plasmaSwitchboardConstants.PEER_PING, "YaCy Core", "this is the p2p-control and peer-ping task", null,
                new serverInstantBusyThread(yc, plasmaSwitchboardConstants.PEER_PING_METHOD_START, plasmaSwitchboardConstants.PEER_PING_METHOD_JOBCOUNT, plasmaSwitchboardConstants.PEER_PING_METHOD_FREEMEM), 2000);

        deployThread(plasmaSwitchboardConstants.INDEX_DIST, "DHT Distribution", "selection, transfer and deletion of index entries that are not searched on your peer, but on others", null,
                new serverInstantBusyThread(this, plasmaSwitchboardConstants.INDEX_DIST_METHOD_START, plasmaSwitchboardConstants.INDEX_DIST_METHOD_JOBCOUNT, plasmaSwitchboardConstants.INDEX_DIST_METHOD_FREEMEM), 60000,
                Long.parseLong(getConfig(plasmaSwitchboardConstants.INDEX_DIST_IDLESLEEP, "5000")),
                Long.parseLong(getConfig(plasmaSwitchboardConstants.INDEX_DIST_BUSYSLEEP, "0")),
                Long.parseLong(getConfig(plasmaSwitchboardConstants.INDEX_DIST_MEMPREREQ, "1000000")));

        // test routine for snippet fetch
        //Set query = new HashSet();
        //query.add(plasmaWordIndexEntry.word2hash("Weitergabe"));
        //query.add(plasmaWordIndexEntry.word2hash("Zahl"));
        //plasmaSnippetCache.result scr = snippetCache.retrieve(new URL("http://www.heise.de/mobil/newsticker/meldung/mail/54980"), query, true);
        //plasmaSnippetCache.result scr = snippetCache.retrieve(new URL("http://www.heise.de/security/news/foren/go.shtml?read=1&msg_id=7301419&forum_id=72721"), query, true);
        //plasmaSnippetCache.result scr = snippetCache.retrieve(new URL("http://www.heise.de/kiosk/archiv/ct/2003/4/20"), query, true, 260);

        this.dbImportManager = new ImporterManager();

        log.logConfig("Finished Switchboard Initialization");
    }

    /**
     * Initializes the system tray icon, if enabled in the configuration
     * (currently only on Windows; see the TODO below).
     */
    private void initSystemTray() {
        // make system tray
        // TODO: make tray on linux
        try {
            final boolean trayIcon = getConfig("trayIcon", "false").equals("true");
            if (trayIcon && serverSystem.isWindows) {
                System.setProperty("java.awt.headless", "false");
                yacytray = new yacyTray(this, false);
            }
        } catch (final Exception e) {
            System.setProperty("java.awt.headless", "true");
        }
    }

    public static void overwriteNetworkDefinition(final plasmaSwitchboard sb) {

        // load network configuration into settings
        String networkUnitDefinition = sb.getConfig("network.unit.definition", "defaults/yacy.network.freeworld.unit");
        final String networkGroupDefinition = sb.getConfig("network.group.definition", "yacy.network.group");

        // patch old values
        if (networkUnitDefinition.equals("yacy.network.unit")) {
            networkUnitDefinition = "defaults/yacy.network.freeworld.unit";
            sb.setConfig("network.unit.definition", networkUnitDefinition);
        }

        // remove old release locations
        int i = 0;
        String location;
        while (true) {
            location = sb.getConfig("network.unit.update.location" + i, "");
            if (location.length() == 0) break;
            sb.removeConfig("network.unit.update.location" + i);
            i++;
        }

        // include additional network definition properties into our settings
        // note that these properties cannot be set in the application because they are
        // _always_ overwritten each time with the default values. This is done so on purpose.
        // the network definition should be made either consistent for all peers,
        // or independently using a bootstrap URL
        Map<String, String> initProps;
        if (networkUnitDefinition.startsWith("http://")) {
            try {
                sb.setConfig(plasmaSwitchboard.loadHashMap(new yacyURL(networkUnitDefinition, null)));
            } catch (final MalformedURLException e) { }
        } else {
            final File networkUnitDefinitionFile = (networkUnitDefinition.startsWith("/")) ? new File(networkUnitDefinition) : new File(sb.getRootPath(), networkUnitDefinition);
            if (networkUnitDefinitionFile.exists()) {
                initProps = serverFileUtils.loadHashMap(networkUnitDefinitionFile);
                sb.setConfig(initProps);
            }
        }
        if (networkGroupDefinition.startsWith("http://")) {
            try {
                sb.setConfig(plasmaSwitchboard.loadHashMap(new yacyURL(networkGroupDefinition, null)));
            } catch (final MalformedURLException e) { }
        } else {
            final File networkGroupDefinitionFile = new File(sb.getRootPath(), networkGroupDefinition);
            if (networkGroupDefinitionFile.exists()) {
                initProps = serverFileUtils.loadHashMap(networkGroupDefinitionFile);
                sb.setConfig(initProps);
            }
        }

        // set release locations
        i = 0; // restart at the first index: the definitions loaded above may have re-added the location entries
        while (true) {
            location = sb.getConfig("network.unit.update.location" + i, "");
            if (location.length() == 0) break;
            try {
                yacyVersion.latestReleaseLocations.add(new yacyURL(location, null));
            } catch (final MalformedURLException e) {
                break;
            }
            i++;
        }

        // initiate url license object
        sb.licensedURLs = new URLLicense(8);

        // set URL domain acceptance
        sb.acceptGlobalURLs = "global.any".indexOf(sb.getConfig("network.unit.domain", "global")) >= 0;
        sb.acceptLocalURLs = "local.any".indexOf(sb.getConfig("network.unit.domain", "global")) >= 0;

    }

    public void switchNetwork(final String networkDefinition) {
        // pause crawls
        final boolean lcp = crawlJobIsPaused(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL);
        if (!lcp) pauseCrawlJob(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL);
        final boolean rcp = crawlJobIsPaused(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL);
        if (!rcp) pauseCrawlJob(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL);
        // trigger online caution
        proxyLastAccess = System.currentTimeMillis() + 10000; // at least 10 seconds online caution to prevent unnecessary action on the database meanwhile
        // clean search events which have cached relations to the old index
        plasmaSearchEvent.cleanupEvents(true);
        // switch the networks
        synchronized (this.webIndex) {
            this.webIndex.close();
        }
        synchronized (this) {
            setConfig("network.unit.definition", networkDefinition);
            overwriteNetworkDefinition(this);
            final File indexPrimaryPath = getConfigPath(plasmaSwitchboardConstants.INDEX_PRIMARY_PATH, plasmaSwitchboardConstants.INDEX_PATH_DEFAULT);
            final File indexSecondaryPath = (getConfig(plasmaSwitchboardConstants.INDEX_SECONDARY_PATH, "").length() == 0) ? indexPrimaryPath : new File(getConfig(plasmaSwitchboardConstants.INDEX_SECONDARY_PATH, ""));
            final int wordCacheMaxCount = (int) getConfigLong(plasmaSwitchboardConstants.WORDCACHE_MAX_COUNT, 20000);
            this.webIndex = new plasmaWordIndex(getConfig("network.unit.name", ""), getLog(), indexPrimaryPath, indexSecondaryPath, wordCacheMaxCount);
        }
        // start up crawl jobs
        continueCrawlJob(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL);
        continueCrawlJob(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL);
        this.log.logInfo("switched network to " + networkDefinition);
        // check status of account configuration: when local url crawling is allowed, an automatic
        // authorization of localhost must not be done, because in this case crawls from local
        // addresses are blocked to prevent attack scenarios where remote pages contain links to localhost
        // addresses that can steer a YaCy peer
        if ((this.acceptLocalURLs) && (getConfigBool("adminAccountForLocalhost", false))) {
            setConfig("adminAccountForLocalhost", false);
            if (getConfig(httpd.ADMIN_ACCOUNT_B64MD5, "").startsWith("0000")) {
                // the password was set automatically with a random value.
                // We must remove it here so that the user can log in again
                setConfig(httpd.ADMIN_ACCOUNT_B64MD5, "");
                // after this a message must be generated to alert the user to set a new password
                log.logInfo("RANDOM PASSWORD REMOVED! User must set a new password");
            }
        }
    }

    public void initMessages() {
        this.log.logConfig("Starting Message Board");
        final File messageDbFile = new File(workPath, plasmaSwitchboardConstants.DBFILE_MESSAGE);
        this.messageDB = new messageBoard(messageDbFile);
        this.log.logConfig("Loaded Message Board DB from file " + messageDbFile.getName() +
                ", " + this.messageDB.size() + " entries" +
                ", " + ppRamString(messageDbFile.length()/1024));
    }

    public void initWiki() {
        this.log.logConfig("Starting Wiki Board");
        final File wikiDbFile = new File(workPath, plasmaSwitchboardConstants.DBFILE_WIKI);
        this.wikiDB = new wikiBoard(wikiDbFile, new File(workPath, plasmaSwitchboardConstants.DBFILE_WIKI_BKP));
        this.log.logConfig("Loaded Wiki Board DB from file " + wikiDbFile.getName() +
                ", " + this.wikiDB.size() + " entries" +
                ", " + ppRamString(wikiDbFile.length()/1024));
    }

    public void initBlog() {
        this.log.logConfig("Starting Blog");
        final File blogDbFile = new File(workPath, plasmaSwitchboardConstants.DBFILE_BLOG);
        this.blogDB = new blogBoard(blogDbFile);
        this.log.logConfig("Loaded Blog DB from file " + blogDbFile.getName() +
                ", " + this.blogDB.size() + " entries" +
                ", " + ppRamString(blogDbFile.length()/1024));

        final File blogCommentDbFile = new File(workPath, plasmaSwitchboardConstants.DBFILE_BLOGCOMMENTS);
        this.blogCommentDB = new blogBoardComments(blogCommentDbFile);
        this.log.logConfig("Loaded Blog-Comment DB from file " + blogCommentDbFile.getName() +
                ", " + this.blogCommentDB.size() + " entries" +
                ", " + ppRamString(blogCommentDbFile.length()/1024));
    }

    public void initBookmarks() {
        this.log.logConfig("Loading Bookmarks DB");
        final File bookmarksFile = new File(workPath, plasmaSwitchboardConstants.DBFILE_BOOKMARKS);
        final File tagsFile = new File(workPath, plasmaSwitchboardConstants.DBFILE_BOOKMARKS_TAGS);
        final File datesFile = new File(workPath, plasmaSwitchboardConstants.DBFILE_BOOKMARKS_DATES);
        this.bookmarksDB = new bookmarksDB(bookmarksFile, tagsFile, datesFile);
        this.log.logConfig("Loaded Bookmarks DB from files " + bookmarksFile.getName() + ", " + tagsFile.getName());
        this.log.logConfig(this.bookmarksDB.tagsSize() + " Tags, " + this.bookmarksDB.bookmarksSize() + " Bookmarks");
    }

    public static plasmaSwitchboard getSwitchboard() {
        return sb;
    }

    public boolean isRobinsonMode() {
        // we are in robinson mode if we do not exchange the index by dht distribution.
        // we need to take care that search requests and remote indexing requests go only
        // to the peers in the same cluster, if we run a robinson cluster.
        return !getConfigBool(plasmaSwitchboardConstants.INDEX_DIST_ALLOW, false) && !getConfigBool(plasmaSwitchboardConstants.INDEX_RECEIVE_ALLOW, false);
    }

    public boolean isPublicRobinson() {
        // robinson peers may be members of robinson clusters, which can be public or private
        // this does not check the robinson attribute, only the specific subtype of the cluster
        final String clustermode = getConfig(plasmaSwitchboardConstants.CLUSTER_MODE, plasmaSwitchboardConstants.CLUSTER_MODE_PUBLIC_PEER);
        return (clustermode.equals(plasmaSwitchboardConstants.CLUSTER_MODE_PUBLIC_CLUSTER)) || (clustermode.equals(plasmaSwitchboardConstants.CLUSTER_MODE_PUBLIC_PEER));
    }

    public boolean isInMyCluster(final String peer) {
        // check if the given peer is in the own network, if this is a robinson cluster.
        // depending on the robinson cluster type, the peer String may be a peer hash (b64-hash),
        // an ip:port String or simply an IP String.
        // if this robinson mode does not define a cluster membership, false is returned
        if (peer == null) return false;
        if (!isRobinsonMode()) return false;
        final String clustermode = getConfig(plasmaSwitchboardConstants.CLUSTER_MODE, plasmaSwitchboardConstants.CLUSTER_MODE_PUBLIC_PEER);
        if (clustermode.equals(plasmaSwitchboardConstants.CLUSTER_MODE_PRIVATE_CLUSTER)) {
            // check if we got the request from a peer in the private cluster
            final String network = getConfig(plasmaSwitchboardConstants.CLUSTER_PEERS_IPPORT, "");
            return network.indexOf(peer) >= 0;
        } else if (clustermode.equals(plasmaSwitchboardConstants.CLUSTER_MODE_PUBLIC_CLUSTER)) {
            // check if we got the request from a peer in the public cluster
            return this.clusterhashes.containsKey(peer);
        } else {
            return false;
        }
    }

    public boolean isInMyCluster(final yacySeed seed) {
        // check if the given peer is in the own network, if this is a robinson cluster.
        // if this robinson mode does not define a cluster membership, false is returned
        if (seed == null) return false;
        if (!isRobinsonMode()) return false;
        final String clustermode = getConfig(plasmaSwitchboardConstants.CLUSTER_MODE, plasmaSwitchboardConstants.CLUSTER_MODE_PUBLIC_PEER);
        if (clustermode.equals(plasmaSwitchboardConstants.CLUSTER_MODE_PRIVATE_CLUSTER)) {
            // check if we got the request from a peer in the private cluster
            final String network = getConfig(plasmaSwitchboardConstants.CLUSTER_PEERS_IPPORT, "");
            return network.indexOf(seed.getPublicAddress()) >= 0;
        } else if (clustermode.equals(plasmaSwitchboardConstants.CLUSTER_MODE_PUBLIC_CLUSTER)) {
            // check if we got the request from a peer in the public cluster
            return this.clusterhashes.containsKey(seed.hash);
        } else {
            return false;
        }
    }

    /**
     * Tests whether a url can be used for crawling/indexing.
     * This mainly checks if the url is in the declared domain (local/global).
     * @param url the url to test
     * @return null if the url can be accepted, a string containing a rejection reason if the url cannot be accepted
     */
    public String acceptURL(final yacyURL url) {
        // returns null if the url can be accepted according to network.unit.domain
        if (url == null) return "url is null";
        final String host = url.getHost();
        if (host == null) return "url.host is null";
        if (this.acceptGlobalURLs && this.acceptLocalURLs) return null; // fast shortcut to avoid dnsResolve
        /*
        InetAddress hostAddress = serverDomains.dnsResolve(host);
        // if we don't know the host, we cannot load that resource anyway.
        // But in case we use a proxy, it is possible that we don't have a DNS service.
        final httpRemoteProxyConfig remoteProxyConfig = httpdProxyHandler.getRemoteProxyConfig();
        if (hostAddress == null) {
            if ((remoteProxyConfig != null) && (remoteProxyConfig.useProxy())) return null; else return "the dns of the host '" + host + "' cannot be resolved";
        }
        */
        // check if this is a local address and we are allowed to index local pages:
        //boolean local = hostAddress.isSiteLocalAddress() || hostAddress.isLoopbackAddress();
        final boolean local = url.isLocal();
        //assert local == yacyURL.isLocalDomain(url.hash()); // TODO: remove the dnsResolve above!
        if ((this.acceptGlobalURLs && !local) || (this.acceptLocalURLs && local)) return null;
        return (local) ?
                ("the host '" + host + "' is local, but local addresses are not accepted") :
                ("the host '" + host + "' is global, but global addresses are not accepted");
    }
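
    /*
     * Illustrative usage (hypothetical values, not part of the original source):
     * with network.unit.domain=global only global addresses pass, so
     *
     *     acceptURL(new yacyURL("http://www.example.org/", null))  // returns null (accepted)
     *     acceptURL(new yacyURL("http://192.168.1.1/", null))      // returns a rejection reason (host is local)
     *
     * while with network.unit.domain=any both calls return null via the fast shortcut above.
     */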
    public String urlExists(final String hash) {
        // tests if the hash occurs in any database;
        // if it exists, the name of the database is returned,
        // if it does not exist, null is returned
        if (webIndex.existsURL(hash)) return "loaded";
        return this.crawlQueues.urlExists(hash);
    }

    public void urlRemove(final String hash) {
        webIndex.removeURL(hash);
        crawlResults.remove(hash);
        crawlQueues.urlRemove(hash);
    }

    public yacyURL getURL(final String urlhash) {
        if (urlhash == null) return null;
        if (urlhash.length() == 0) return null;
        final yacyURL ne = crawlQueues.getURL(urlhash);
        if (ne != null) return ne;
        final indexURLReference le = webIndex.getURL(urlhash, null, 0);
        if (le != null) return le.comp().url();
        return null;
    }

    public plasmaSearchRankingProfile getRanking() {
        return (getConfig("rankingProfile", "").length() == 0) ?
                new plasmaSearchRankingProfile(plasmaSearchQuery.CONTENTDOM_TEXT) :
                new plasmaSearchRankingProfile("", crypt.simpleDecode(sb.getConfig("rankingProfile", ""), null));
    }

    /**
     * This method changes the HTCache size.<br>
     * @param newCacheSize in MB
     */
    public final void setCacheSize(final long newCacheSize) {
        plasmaHTCache.setCacheSize(1048576 * newCacheSize);
    }

    public boolean onlineCaution() {
        return
            (System.currentTimeMillis() - this.proxyLastAccess < Integer.parseInt(getConfig(plasmaSwitchboardConstants.PROXY_ONLINE_CAUTION_DELAY, "30000"))) ||
            (System.currentTimeMillis() - this.localSearchLastAccess < Integer.parseInt(getConfig(plasmaSwitchboardConstants.LOCALSEACH_ONLINE_CAUTION_DELAY, "30000"))) ||
            (System.currentTimeMillis() - this.remoteSearchLastAccess < Integer.parseInt(getConfig(plasmaSwitchboardConstants.REMOTESEARCH_ONLINE_CAUTION_DELAY, "30000")));
    }

    private static String ppRamString(long bytes) {
        if (bytes < 1024) return bytes + " KByte";
        bytes = bytes / 1024;
        if (bytes < 1024) return bytes + " MByte";
        bytes = bytes / 1024;
        if (bytes < 1024) return bytes + " GByte";
        return (bytes / 1024) + " TByte";
    }
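
    /*
     * Example (illustrative, not part of the original source): callers pass a size
     * that is already divided by 1024, so for a 2 MB file
     *
     *     ppRamString(file.length()/1024)  // argument 2048 -> returns "2 MByte"
     */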
    /**
     * {@link CrawlProfile Crawl Profiles} are saved independently from the queues themselves
     * and therefore have to be cleaned up from time to time. This method only performs the clean-up
     * if - and only if - the {@link IndexingStack switchboard},
     * {@link ProtocolLoader loader} and {@link plasmaCrawlNURL local crawl} queues are all empty.
     * <p>
     * Then it iterates through all existing {@link CrawlProfile crawl profiles} and removes
     * all profiles which are not hardcoded.
     * </p>
     * <p>
     * <i>If this method encounters DB-failures, the profile DB will be reset and</i>
     * <code>true</code><i> will be returned</i>
     * </p>
     * @see #CRAWL_PROFILE_PROXY hardcoded
     * @see #CRAWL_PROFILE_REMOTE hardcoded
     * @see #CRAWL_PROFILE_SNIPPET_TEXT hardcoded
     * @see #CRAWL_PROFILE_SNIPPET_MEDIA hardcoded
     * @return whether this method has done something or not (i.e. because the queues have been filled
     *         or there are no profiles left to clean up)
     * @throws <b>InterruptedException</b> if the current thread has been interrupted, i.e. by the
     *         shutdown procedure
     */
    public boolean cleanProfiles() throws InterruptedException {
        if ((crawlQueues.size() > 0) ||
            (crawlStacker != null && crawlStacker.size() > 0) ||
            (crawlQueues.noticeURL.notEmpty()))
            return false;
        return this.webIndex.cleanProfiles();
    }

public boolean htEntryStoreProcess(final plasmaHTCache.Entry entry) {
|
|
|
|
if (entry == null) return false;
|
|
|
|
/* =========================================================================
|
|
* PARSER SUPPORT
|
|
*
|
|
* Testing if the content type is supported by the available parsers
|
|
* ========================================================================= */
|
|
final boolean isSupportedContent = plasmaParser.supportedContent(entry.url(),entry.getMimeType());
|
|
log.logFinest(entry.url() +" content of type "+ entry.getMimeType() +" is supported: "+ isSupportedContent);
|
|
|
|
/* =========================================================================
|
|
* INDEX CONTROL HEADER
|
|
*
|
|
* With the X-YACY-Index-Control header set to "no-index" a client could disallow
|
|
* yacy to index the response returned as answer to a request
|
|
* ========================================================================= */
|
|
boolean doIndexing = true;
|
|
if (entry.requestProhibitsIndexing()) {
|
|
doIndexing = false;
|
|
if (this.log.isFine())
|
|
this.log.logFine("Crawling of " + entry.url() + " prohibited by request.");
|
|
}
|
|
|
|
/* =========================================================================
|
|
* LOCAL IP ADDRESS CHECK
|
|
*
|
|
* check if ip is local ip address // TODO: remove this procotol specific code here
|
|
* ========================================================================= */
|
|
final String urlRejectReason = acceptURL(entry.url());
|
|
if (urlRejectReason != null) {
|
|
if (this.log.isFine()) this.log.logFine("Rejected URL '" + entry.url() + "': " + urlRejectReason);
|
|
doIndexing = false;
|
|
}
|
|
|
|
synchronized (webIndex.queuePreStack) {
|
|
/* =========================================================================
|
|
* STORING DATA
|
|
*
|
|
* Now we store the response header and response content if
|
|
* a) the user has configured to use the htcache or
|
|
* b) the content should be indexed
|
|
* ========================================================================= */
|
|
if (((entry.profile() != null) && (entry.profile().storeHTCache())) || (doIndexing && isSupportedContent)) {
|
|
// store response header
|
|
/*
|
|
if (entry.writeResourceInfo()) {
|
|
this.log.logInfo("WROTE HEADER for " + entry.cacheFile());
|
|
}
|
|
*/
|
|
|
|
// work off unwritten files
|
|
if (entry.cacheArray() != null) {
|
|
final String error = entry.shallStoreCacheForProxy();
|
|
if (error == null) {
|
|
plasmaHTCache.writeResourceContent(entry.url(), entry.cacheArray());
|
|
if (this.log.isFine()) this.log.logFine("WROTE FILE (" + entry.cacheArray().length + " bytes) for " + entry.cacheFile());
|
|
} else {
|
|
if (this.log.isFine()) this.log.logFine("WRITE OF FILE " + entry.cacheFile() + " FORBIDDEN: " + error);
|
|
}
|
|
//} else {
|
|
//this.log.logFine("EXISTING FILE (" + entry.cacheFile.length() + " bytes) for " + entry.cacheFile);
|
|
}
|
|
}
|
|
|
|
/* =========================================================================
|
|
* INDEXING
|
|
* ========================================================================= */
|
|
if (doIndexing && isSupportedContent) {
|
|
|
|
// enqueue for further crawling
|
|
enQueue(this.webIndex.queuePreStack.newEntry(
|
|
entry.url(),
|
|
(entry.referrerURL() == null) ? null : entry.referrerURL().hash(),
|
|
entry.ifModifiedSince(),
|
|
entry.requestWithCookie(),
|
|
entry.initiator(),
|
|
entry.depth(),
|
|
entry.profile().handle(),
|
|
entry.name()
|
|
));
|
|
} else {
|
|
if (!entry.profile().storeHTCache() && entry.cacheFile().exists()) {
|
|
plasmaHTCache.deleteURLfromCache(entry.url());
|
|
}
|
|
}
|
|
}
|
|
|
|
return true;
|
|
}

    public boolean htEntryStoreJob() {
        if (plasmaHTCache.empty()) return false;
        return htEntryStoreProcess(plasmaHTCache.pop());
    }

    public int htEntrySize() {
        return plasmaHTCache.size();
    }

    public void close() {
        log.logConfig("SWITCHBOARD SHUTDOWN STEP 1: sending termination signal to managed threads:");
        serverProfiling.stopSystemProfiling();
        moreMemory.cancel();
        terminateAllThreads(true);
        if (transferIdxThread != null) stopTransferWholeIndex(false);
        log.logConfig("SWITCHBOARD SHUTDOWN STEP 2: sending termination signal to threaded indexing");
        // closing all still running db importer jobs
        indexingDocumentProcessor.shutdown(4000);
        indexingCondensementProcessor.shutdown(3000);
        indexingAnalysisProcessor.shutdown(2000);
        indexingStorageProcessor.shutdown(1000);
        this.dbImportManager.close();
        JakartaCommonsHttpClient.closeAllConnections();
        wikiDB.close();
        blogDB.close();
        blogCommentDB.close();
        userDB.close();
        bookmarksDB.close();
        messageDB.close();
        crawlStacker.close();
        robots.close();
        parser.close();
        plasmaHTCache.close();
        webStructure.flushCitationReference("crg");
        webStructure.close();
        crawlQueues.close();
        log.logConfig("SWITCHBOARD SHUTDOWN STEP 3: sending termination signal to database manager (stand by...)");
        webIndex.close();
        if (yacyTray.isShown) yacyTray.removeTray();
        log.logConfig("SWITCHBOARD SHUTDOWN TERMINATED");
    }

    public int queueSize() {
        return webIndex.queuePreStack.size();
    }

    public void enQueue(final IndexingStack.QueueEntry job) {
        assert job != null;
        try {
            webIndex.queuePreStack.push(job);
        } catch (final IOException e) {
            log.logSevere("IOError in plasmaSwitchboard.enQueue: " + e.getMessage(), e);
        }
    }

    public void deQueueFreeMem() {
        // flush some entries from the RAM cache
        webIndex.flushCacheSome();
        // empty some caches
        webIndex.clearCache();
        plasmaSearchEvent.cleanupEvents(true);
        // adapt the maximum cache size to the current size to prevent further OutOfMemoryErrors
        /* int newMaxCount = Math.max(1200, Math.min((int) getConfigLong(WORDCACHE_MAX_COUNT, 1200), wordIndex.dhtOutCacheSize()));
        setConfig(WORDCACHE_MAX_COUNT, Integer.toString(newMaxCount));
        wordIndex.setMaxWordCount(newMaxCount); */
    }

    public IndexingStack.QueueEntry deQueue() {
        // getting the next entry from the indexing queue
        IndexingStack.QueueEntry nextentry = null;
        synchronized (webIndex.queuePreStack) {
            // do one processing step
            if (this.log.isFine()) log.logFine("DEQUEUE: sbQueueSize=" + webIndex.queuePreStack.size() +
                    ", coreStackSize=" + crawlQueues.noticeURL.stackSize(NoticedURL.STACK_TYPE_CORE) +
                    ", limitStackSize=" + crawlQueues.noticeURL.stackSize(NoticedURL.STACK_TYPE_LIMIT) +
                    ", overhangStackSize=" + crawlQueues.noticeURL.stackSize(NoticedURL.STACK_TYPE_OVERHANG) +
                    ", remoteStackSize=" + crawlQueues.noticeURL.stackSize(NoticedURL.STACK_TYPE_REMOTE));
            try {
                final int sizeBefore = webIndex.queuePreStack.size();
                nextentry = webIndex.queuePreStack.pop();
                if (nextentry == null) {
                    log.logWarning("deQueue: null entry on queue stack.");
                    if (webIndex.queuePreStack.size() == sizeBefore) {
                        // this is a severe problem: pop() returned null without shrinking the queue,
                        // so this state would last forever. To re-enable use of the sbQueue,
                        // it must be emptied completely.
                        log.logSevere("deQueue: does not shrink after pop() == null. Emergency reset.");
                        webIndex.queuePreStack.clear();
                    }
                    return null;
                }
            } catch (final IOException e) {
                log.logSevere("IOError in plasmaSwitchboard.deQueue: " + e.getMessage(), e);
                return null;
            }
            return nextentry;
        }
    }

    public boolean deQueueProcess() {
        try {
            // work off fresh entries from the proxy or from the crawler
            if (onlineCaution()) {
                if (this.log.isFine()) log.logFine("deQueue: online caution, omitting resource stack processing");
                return false;
            }

            boolean doneSomething = false;

            // flush some entries from the RAM cache
            if (webIndex.queuePreStack.size() == 0) {
                doneSomething = webIndex.flushCacheSome() > 0; // permanent flushing only if we are not busy
            }

            // possibly delete entries from the last chunk
            if ((this.dhtTransferChunk != null) && (this.dhtTransferChunk.getStatus() == plasmaDHTChunk.chunkStatus_COMPLETE)) {
                final String deletedURLs = this.dhtTransferChunk.deleteTransferIndexes();
                if (this.log.isFine()) this.log.logFine("Deleted from " + this.dhtTransferChunk.containers().length + " transferred RWIs locally, removed " + deletedURLs + " URL references");
                this.dhtTransferChunk = null;
            }

            // generate a dht chunk
            if ((dhtShallTransfer() == null) && (
                    (this.dhtTransferChunk == null) ||
                    (this.dhtTransferChunk.getStatus() == plasmaDHTChunk.chunkStatus_UNDEFINED) ||
                    // (this.dhtTransferChunk.getStatus() == plasmaDHTChunk.chunkStatus_COMPLETE) ||
                    (this.dhtTransferChunk.getStatus() == plasmaDHTChunk.chunkStatus_FAILED)
            )) {
                // generate new chunk
                final int minChunkSize = (int) getConfigLong(plasmaSwitchboardConstants.INDEX_DIST_CHUNK_SIZE_MIN, 30);
                dhtTransferChunk = new plasmaDHTChunk(this.log, webIndex, minChunkSize, dhtTransferIndexCount, 5000);
                doneSomething = true;
            }

            // check for interruption
            checkInterruption();

            // getting the next entry from the indexing queue
            if (webIndex.queuePreStack.size() == 0) {
                //log.logFine("deQueue: nothing to do, queue is empty");
                return doneSomething; // nothing to do
            }

            if (crawlStacker.size() >= getConfigLong(plasmaSwitchboardConstants.CRAWLSTACK_SLOTS, 2000)) {
                if (this.log.isFine()) log.logFine("deQueue: too many processes in stack crawl thread queue (" + "stackCrawlQueue=" + crawlStacker.size() + ")");
                return doneSomething;
            }

            // if we were interrupted we should return now
            if (Thread.currentThread().isInterrupted()) {
                if (this.log.isFine()) log.logFine("deQueue: thread was interrupted");
                return false;
            }

            // get the next queue entry and start a queue processing
            final IndexingStack.QueueEntry queueEntry = deQueue();
            assert queueEntry != null;
            if (queueEntry == null) return true;
            if (queueEntry.profile() == null) {
                queueEntry.close();
                return true;
            }
            webIndex.queuePreStack.enQueueToActive(queueEntry);

            // check for interruption
            checkInterruption();

            this.indexingDocumentProcessor.enQueue(new indexingQueueEntry(queueEntry, null, null));
            /*

            // THE FOLLOWING CAN BE CONCURRENT ->

            // parse and index the resource
            indexingQueueEntry document = parseDocument(new indexingQueueEntry(queueEntry, null, null));

            // do condensing
            indexingQueueEntry condensement = condenseDocument(document);

            // do a web structure analysis
            indexingQueueEntry analysis = webStructureAnalysis(condensement);

            // <- CONCURRENT UNTIL HERE, THEN SERIALIZE AGAIN

            // store the result
            storeDocumentIndex(analysis);
            */
            return true;
        } catch (final InterruptedException e) {
            log.logInfo("DEQUEUE: Shutdown detected.");
            return false;
        }
    }
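
    /* Pipeline overview, derived from the code above: instead of the commented-out
     * serial processing, entries pushed into indexingDocumentProcessor flow through
     * the stages
     *
     *   parseDocument -> condenseDocument -> webStructureAnalysis -> storeDocumentIndex
     *
     * A minimal sketch of such a hand-off between two stages, assuming a plain
     * java.util.concurrent queue rather than the serverProcessor machinery
     * (names and wiring here are illustrative only, not the original code):
     *
     *   BlockingQueue<indexingQueueEntry> parsed = new LinkedBlockingQueue<indexingQueueEntry>();
     *   // stage 1: parse, then hand over to the condenser stage
     *   indexingQueueEntry doc = parseDocument(new indexingQueueEntry(queueEntry, null, null));
     *   if (doc != null) parsed.put(doc);
     *   // stage 2 (running in a separate thread): condense whatever stage 1 produced
     *   indexingQueueEntry condensed = condenseDocument(parsed.take());
     */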

    public static class indexingQueueEntry extends serverProcessorJob {
        public IndexingStack.QueueEntry queueEntry;
        public plasmaParserDocument document;
        public plasmaCondenser condenser;

        public indexingQueueEntry(
                final IndexingStack.QueueEntry queueEntry,
                final plasmaParserDocument document,
                final plasmaCondenser condenser) {
            super();
            this.queueEntry = queueEntry;
            this.document = document;
            this.condenser = condenser;
        }
    }

    public int cleanupJobSize() {
        int c = 0;
        if (crawlQueues.delegatedURL.stackSize() > 1000) c++;
        if (crawlQueues.errorURL.stackSize() > 1000) c++;
        for (int i = 1; i <= 6; i++) {
            if (crawlResults.getStackSize(i) > 1000) c++;
        }
        return c;
    }

    public boolean cleanupJob() {
        try {
            boolean hasDoneSomething = false;

            // clear caches if necessary
            if (!serverMemory.request(8000000L, false)) {
                webIndex.clearCache();
                plasmaSearchEvent.cleanupEvents(true);
            }

            // set a random password if no password is configured
            if (!this.acceptLocalURLs && getConfigBool("adminAccountForLocalhost", false) && getConfig(httpd.ADMIN_ACCOUNT_B64MD5, "").length() == 0) {
                // make a 'random' password
                setConfig(httpd.ADMIN_ACCOUNT_B64MD5, "0000" + serverCodings.encodeMD5Hex(System.getProperties().toString() + System.currentTimeMillis()));
                setConfig("adminAccount", "");
            }

            // close unused connections
            JakartaCommonsHttpClient.cleanup();

            // clean up connection information that is too old
            super.cleanupAccessTracker(1000 * 60 * 60);

            // do transmission of CR-files
            checkInterruption();
            int count = rankingOwnDistribution.size() / 100;
            if (count == 0) count = 1;
            if (count > 5) count = 5;
            if (rankingOn) {
                rankingOwnDistribution.transferRanking(count);
                rankingOtherDistribution.transferRanking(1);
            }

            // clean up delegated stack
            checkInterruption();
            if (crawlQueues.delegatedURL.stackSize() > 1000) {
                if (this.log.isFine()) log.logFine("Cleaning Delegated-URLs report stack, " + crawlQueues.delegatedURL.stackSize() + " entries on stack");
                crawlQueues.delegatedURL.clearStack();
                hasDoneSomething = true;
            }

            // clean up error stack
            checkInterruption();
            if (crawlQueues.errorURL.stackSize() > 1000) {
                if (this.log.isFine()) log.logFine("Cleaning Error-URLs report stack, " + crawlQueues.errorURL.stackSize() + " entries on stack");
                crawlQueues.errorURL.clearStack();
                hasDoneSomething = true;
            }

            // clean up loadedURL stack
            for (int i = 1; i <= 6; i++) {
                checkInterruption();
                if (crawlResults.getStackSize(i) > 1000) {
                    if (this.log.isFine()) log.logFine("Cleaning Loaded-URLs report stack, " + crawlResults.getStackSize(i) + " entries on stack " + i);
                    crawlResults.clearStack(i);
                    hasDoneSomething = true;
                }
            }
            // clean up image stack
            ResultImages.clearQueues();

            // clean up profiles
            checkInterruption();
            if (cleanProfiles()) hasDoneSomething = true;

            // clean up news
            checkInterruption();
            try {
                if (this.log.isFine()) log.logFine("Cleaning Incoming News, " + this.webIndex.newsPool.size(yacyNewsPool.INCOMING_DB) + " entries on stack");
                if (this.webIndex.newsPool.automaticProcess(webIndex.seedDB) > 0) hasDoneSomething = true;
            } catch (final IOException e) {}
            if (getConfigBool("cleanup.deletionProcessedNews", true)) {
                this.webIndex.newsPool.clear(yacyNewsPool.PROCESSED_DB);
            }
            if (getConfigBool("cleanup.deletionPublishedNews", true)) {
                this.webIndex.newsPool.clear(yacyNewsPool.PUBLISHED_DB);
            }

            // clean up seed-dbs
            if (getConfigBool("routing.deleteOldSeeds.permission", true)) {
                final long deleteOldSeedsTime = getConfigLong("routing.deleteOldSeeds.time", 7) * 24 * 3600000;
                Iterator<yacySeed> e = this.webIndex.seedDB.seedsSortedDisconnected(true, yacySeed.LASTSEEN);
                yacySeed seed = null;
                final ArrayList<String> deleteQueue = new ArrayList<String>();
                checkInterruption();
                // clean passive seeds
                while (e.hasNext()) {
                    seed = e.next();
                    if (seed != null) {
                        // the list is sorted -> break when peers are too young to delete
                        if (seed.getLastSeenUTC() > (System.currentTimeMillis() - deleteOldSeedsTime))
                            break;
                        deleteQueue.add(seed.hash);
                    }
                }
                for (int i = 0; i < deleteQueue.size(); ++i) this.webIndex.seedDB.removeDisconnected(deleteQueue.get(i));
                deleteQueue.clear();
                e = this.webIndex.seedDB.seedsSortedPotential(true, yacySeed.LASTSEEN);
                checkInterruption();
                // clean potential seeds
                while (e.hasNext()) {
                    seed = e.next();
                    if (seed != null) {
                        // the list is sorted -> break when peers are too young to delete
                        if (seed.getLastSeenUTC() > (System.currentTimeMillis() - deleteOldSeedsTime))
                            break;
                        deleteQueue.add(seed.hash);
                    }
                }
                for (int i = 0; i < deleteQueue.size(); ++i) this.webIndex.seedDB.removePotential(deleteQueue.get(i));
            }

            // check if an update is available and,
            // if auto-update is activated, perform an automatic installation and restart
            final yacyVersion updateVersion = yacyVersion.rulebasedUpdateInfo(false);
            if (updateVersion != null) {
                // there is a version that is more recent. Load it and re-start with it
                log.logInfo("AUTO-UPDATE: downloading more recent release " + updateVersion.url);
                final File downloaded = yacyVersion.downloadRelease(updateVersion);
                final boolean devenvironment = yacyVersion.combined2prettyVersion(sb.getConfig("version", "0.1")).startsWith("dev");
                if (devenvironment) {
                    log.logInfo("AUTO-UPDATE: omitting update because this is a development environment");
                } else if ((downloaded == null) || (!downloaded.exists()) || (downloaded.length() == 0)) {
                    log.logInfo("AUTO-UPDATE: omitting update because download failed (file cannot be found or is too small)");
                } else {
                    yacyVersion.deployRelease(downloaded);
                    terminate(5000);
                    log.logInfo("AUTO-UPDATE: deploy and restart initiated");
                }
            }

            // initiate broadcast about peer startup to spread the supporter url
            if (this.webIndex.newsPool.size(yacyNewsPool.OUTGOING_DB) == 0) {
                // read profile
                final Properties profile = new Properties();
                FileInputStream fileIn = null;
                try {
                    fileIn = new FileInputStream(new File("DATA/SETTINGS/profile.txt"));
                    profile.load(fileIn);
                } catch (final IOException e) {
                } finally {
                    if (fileIn != null) try { fileIn.close(); } catch (final Exception e) {}
                }
                final String homepage = (String) profile.get("homepage");
                if ((homepage != null) && (homepage.length() > 10)) {
                    final Properties news = new Properties();
                    news.put("homepage", profile.get("homepage"));
                    this.webIndex.newsPool.publishMyNews(yacyNewsRecord.newRecord(webIndex.seedDB.mySeed(), yacyNewsPool.CATEGORY_PROFILE_BROADCAST, news));
                }
            }
            /*
            // set a maximum amount of memory for the caches
            // long memprereq = Math.max(getConfigLong(INDEXER_MEMPREREQ, 0), wordIndex.minMem());
            // setConfig(INDEXER_MEMPREREQ, memprereq);
            // setThreadPerformance(INDEXER, getConfigLong(INDEXER_IDLESLEEP, 0), getConfigLong(INDEXER_BUSYSLEEP, 0), memprereq);
            kelondroCachedRecords.setCacheGrowStati(40 * 1024 * 1024, 20 * 1024 * 1024);
            kelondroCache.setCacheGrowStati(40 * 1024 * 1024, 20 * 1024 * 1024);
            */
            // update the cluster set
            this.clusterhashes = this.webIndex.seedDB.clusterHashes(getConfig("cluster.peers.yacydomain", ""));


            // after all clean up is done, check the resource usage
            observer.resourceObserverJob();

            return hasDoneSomething;
        } catch (final InterruptedException e) {
            this.log.logInfo("cleanupJob: Shutdown detected");
            return false;
        }
    }

    /**
     * With this function the crawling process can be paused
     * @param jobType the name of the crawl job to pause
     */
    public void pauseCrawlJob(final String jobType) {
        final Object[] status = this.crawlJobsStatus.get(jobType);
        synchronized (status[plasmaSwitchboardConstants.CRAWLJOB_SYNC]) {
            status[plasmaSwitchboardConstants.CRAWLJOB_STATUS] = Boolean.TRUE;
        }
        setConfig(jobType + "_isPaused", "true");
    }

    /**
     * Continue the previously paused crawling
     * @param jobType the name of the crawl job to continue
     */
    public void continueCrawlJob(final String jobType) {
        final Object[] status = this.crawlJobsStatus.get(jobType);
        synchronized (status[plasmaSwitchboardConstants.CRAWLJOB_SYNC]) {
            if (((Boolean) status[plasmaSwitchboardConstants.CRAWLJOB_STATUS]).booleanValue()) {
                status[plasmaSwitchboardConstants.CRAWLJOB_STATUS] = Boolean.FALSE;
                status[plasmaSwitchboardConstants.CRAWLJOB_SYNC].notifyAll();
            }
        }
        setConfig(jobType + "_isPaused", "false");
    }

    /**
     * @param jobType the name of the crawl job to query
     * @return <code>true</code> if crawling was paused or <code>false</code> otherwise
     */
    public boolean crawlJobIsPaused(final String jobType) {
        final Object[] status = this.crawlJobsStatus.get(jobType);
        synchronized (status[plasmaSwitchboardConstants.CRAWLJOB_SYNC]) {
            return ((Boolean) status[plasmaSwitchboardConstants.CRAWLJOB_STATUS]).booleanValue();
        }
    }
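
    /* Usage sketch (illustrative; assumes the CRAWLJOB_LOCAL_CRAWL constant used
     * elsewhere in this class names a registered crawl job):
     *
     *   sb.pauseCrawlJob(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL);
     *   // ... maintenance work ...
     *   if (sb.crawlJobIsPaused(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL))
     *       sb.continueCrawlJob(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL);
     *
     * continueCrawlJob() also notifyAll()s threads that wait on the job's
     * CRAWLJOB_SYNC object, so a paused crawl loop can resume immediately.
     */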

    public indexingQueueEntry parseDocument(final indexingQueueEntry in) {
        in.queueEntry.updateStatus(IndexingStack.QUEUE_STATE_PARSING);
        plasmaParserDocument document = null;
        try {
            document = parseDocument(in.queueEntry);
        } catch (final InterruptedException e) {
            document = null;
        }
        if (document == null) {
            in.queueEntry.close();
            return null;
        }
        return new indexingQueueEntry(in.queueEntry, document, null);
    }

    private plasmaParserDocument parseDocument(final IndexingStack.QueueEntry entry) throws InterruptedException {
        plasmaParserDocument document = null;
        final int processCase = entry.processCase();

        if (this.log.isFine()) log.logFine("processResourceStack processCase=" + processCase +
                ", depth=" + entry.depth() +
                ", maxDepth=" + ((entry.profile() == null) ? "null" : Integer.toString(entry.profile().generalDepth())) +
                ", filter=" + ((entry.profile() == null) ? "null" : entry.profile().generalFilter()) +
                ", initiatorHash=" + entry.initiator() +
                //", responseHeader=" + ((entry.responseHeader() == null) ? "null" : entry.responseHeader().toString()) +
                ", url=" + entry.url()); // DEBUG

        // PARSE CONTENT
        final long parsingStartTime = System.currentTimeMillis();

        try {
            // parse the document
            document = parser.parseSource(entry.url(), entry.getMimeType(), entry.getCharacterEncoding(), entry.cacheFile());
            assert (document != null) : "Unexpected error. Parser returned null.";
            if (document == null) return null;
        } catch (final ParserException e) {
            this.log.logInfo("Unable to parse the resource '" + entry.url() + "'. " + e.getMessage());
            addURLtoErrorDB(entry.url(), entry.referrerHash(), entry.initiator(), entry.anchorName(), e.getErrorCode());
            if (document != null) {
                document.close();
                document = null;
            }
            return null;
        }

        final long parsingEndTime = System.currentTimeMillis();

        // get the document date
        final Date docDate = entry.getModificationDate();

        // put anchors on crawl stack
        final long stackStartTime = System.currentTimeMillis();
        if (
                ((processCase == plasmaSwitchboardConstants.PROCESSCASE_4_PROXY_LOAD) || (processCase == plasmaSwitchboardConstants.PROCESSCASE_5_LOCAL_CRAWLING)) &&
                ((entry.profile() == null) || (entry.depth() < entry.profile().generalDepth()))
        ) {
            final Map<yacyURL, String> hl = document.getHyperlinks();
            final Iterator<Map.Entry<yacyURL, String>> i = hl.entrySet().iterator();
            yacyURL nextUrl;
            Map.Entry<yacyURL, String> nextEntry;
            while (i.hasNext()) {
                // check for interruption
                checkInterruption();

                // fetching the next hyperlink
                nextEntry = i.next();
                nextUrl = nextEntry.getKey();
                // enqueue the hyperlink into the pre-notice-url db
                crawlStacker.enqueueEntry(nextUrl, entry.urlHash(), entry.initiator(), nextEntry.getValue(), docDate, entry.depth() + 1, entry.profile());
            }
            final long stackEndTime = System.currentTimeMillis();
            if (log.isInfo()) log.logInfo("CRAWL: ADDED " + hl.size() + " LINKS FROM " + entry.url().toNormalform(false, true) +
                    ", NEW CRAWL STACK SIZE IS " + crawlQueues.noticeURL.stackSize(NoticedURL.STACK_TYPE_CORE) +
                    ", STACKING TIME = " + (stackEndTime - stackStartTime) +
                    ", PARSING TIME = " + (parsingEndTime - parsingStartTime));
        }
        return document;
    }
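
    /* Example of the depth accounting above (illustrative numbers): a page
     * fetched at depth 2 under a profile with generalDepth() == 3 still has its
     * hyperlinks stacked (2 < 3); each link is enqueued with depth 2 + 1 = 3 and
     * will itself no longer be expanded, because 3 < 3 is false.
     */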

    public indexingQueueEntry condenseDocument(final indexingQueueEntry in) {
        in.queueEntry.updateStatus(IndexingStack.QUEUE_STATE_CONDENSING);
        plasmaCondenser condenser = null;
        try {
            condenser = condenseDocument(in.queueEntry, in.document);
        } catch (final InterruptedException e) {
            condenser = null;
        }
        if (condenser == null) {
            in.queueEntry.close();
            return null;
        }

        // update image result list statistics
        // it's good to do this concurrently here, because it needs a DNS lookup
        // to compute a URL hash which is necessary for a double-check
        final CrawlProfile.entry profile = in.queueEntry.profile();
        ResultImages.registerImages(in.document, (profile == null) ? true : !profile.remoteIndexing());

        return new indexingQueueEntry(in.queueEntry, in.document, condenser);
    }

    private plasmaCondenser condenseDocument(final IndexingStack.QueueEntry entry, plasmaParserDocument document) throws InterruptedException {
        // CREATE INDEX
        final String dc_title = document.dc_title();
        final yacyURL referrerURL = entry.referrerURL();
        final int processCase = entry.processCase();

        String noIndexReason = ErrorURL.DENIED_UNSPECIFIED_INDEXING_ERROR;
        if (processCase == plasmaSwitchboardConstants.PROCESSCASE_4_PROXY_LOAD) {
            // proxy-load
            noIndexReason = entry.shallIndexCacheForProxy();
        } else {
            // normal crawling
            noIndexReason = entry.shallIndexCacheForCrawler();
        }

        if (noIndexReason != null) {
            // check for interruption
            checkInterruption();

            log.logFine("Not indexing any word in URL " + entry.url() + "; cause: " + noIndexReason);
            addURLtoErrorDB(entry.url(), (referrerURL == null) ? "" : referrerURL.hash(), entry.initiator(), dc_title, noIndexReason);
            /*
            if ((processCase == PROCESSCASE_6_GLOBAL_CRAWLING) && (initiatorPeer != null)) {
                if (clusterhashes != null) initiatorPeer.setAlternativeAddress((String) clusterhashes.get(initiatorPeer.hash));
                yacyClient.crawlReceipt(initiatorPeer, "crawl", "rejected", noIndexReason, null, "");
            }
            */
            document.close();
            document = null;
            return null;
        }

        // strip out words
        checkInterruption();
        if (this.log.isFine()) log.logFine("Condensing for '" + entry.url().toNormalform(false, true) + "'");
        plasmaCondenser condenser;
        try {
            condenser = new plasmaCondenser(document, entry.profile().indexText(), entry.profile().indexMedia());
        } catch (final UnsupportedEncodingException e) {
            return null;
        }
        return condenser;
    }

    public indexingQueueEntry webStructureAnalysis(final indexingQueueEntry in) {
        in.queueEntry.updateStatus(IndexingStack.QUEUE_STATE_STRUCTUREANALYSIS);
        in.document.notifyWebStructure(webStructure, in.condenser, in.queueEntry.getModificationDate());
        return in;
    }

    public void storeDocumentIndex(final indexingQueueEntry in) {
        in.queueEntry.updateStatus(IndexingStack.QUEUE_STATE_INDEXSTORAGE);
        storeDocumentIndex(in.queueEntry, in.document, in.condenser);
        in.queueEntry.updateStatus(IndexingStack.QUEUE_STATE_FINISHED);
        in.queueEntry.close();
    }

    private void storeDocumentIndex(final IndexingStack.QueueEntry queueEntry, final plasmaParserDocument document, final plasmaCondenser condenser) {

        // CREATE INDEX
        final String dc_title = document.dc_title();
        final yacyURL referrerURL = queueEntry.referrerURL();
        final int processCase = queueEntry.processCase();

        // remove stopwords
        log.logInfo("Excluded " + condenser.excludeWords(stopwords) + " words in URL " + queueEntry.url());

        // STORE URL TO LOADED-URL-DB
        indexURLReference newEntry = null;
        try {
            newEntry = webIndex.storeDocument(queueEntry, document, condenser);
        } catch (final IOException e) {
            if (this.log.isFine()) log.logFine("Not Indexed Resource '" + queueEntry.url().toNormalform(false, true) + "': process case=" + processCase);
            addURLtoErrorDB(queueEntry.url(), (referrerURL == null) ? "" : referrerURL.hash(), queueEntry.initiator(), dc_title, "error storing url: " + e.getMessage());
            return;
        }

        // update url result list statistics
        crawlResults.stack(
                newEntry,                           // loaded url db entry
                queueEntry.initiator(),             // initiator peer hash
                this.webIndex.seedDB.mySeed().hash, // executor peer hash
                processCase                         // process case
                );

        // STORE WORD INDEX
        if ((!queueEntry.profile().indexText()) && (!queueEntry.profile().indexMedia())) {
            if (this.log.isFine()) log.logFine("Not Indexed Resource '" + queueEntry.url().toNormalform(false, true) + "': process case=" + processCase);
            addURLtoErrorDB(queueEntry.url(), (referrerURL == null) ? "" : referrerURL.hash(), queueEntry.initiator(), dc_title, ErrorURL.DENIED_UNKNOWN_INDEXING_PROCESS_CASE);
            return;
        }

        // increment number of indexed urls
        indexedPages++;

        // update profiling info
        if (System.currentTimeMillis() - lastPPMUpdate > 30000) {
            // we don't want to do this too often
            updateMySeed();
            serverProfiling.update("ppm", Long.valueOf(currentPPM()));
            serverProfiling.update("wordcache", Long.valueOf(webIndex.cacheSize()));
            lastPPMUpdate = System.currentTimeMillis();
        }
        serverProfiling.update("indexed", queueEntry.url().toNormalform(true, false));

        // if this was performed for a remote crawl request, notify the requester
        final yacySeed initiatorPeer = queueEntry.initiatorPeer();
        if ((processCase == plasmaSwitchboardConstants.PROCESSCASE_6_GLOBAL_CRAWLING) && (initiatorPeer != null)) {
            log.logInfo("Sending crawl receipt for '" + queueEntry.url().toNormalform(false, true) + "' to " + initiatorPeer.getName());
            if (clusterhashes != null) initiatorPeer.setAlternativeAddress(clusterhashes.get(initiatorPeer.hash));
            // start a thread for receipt sending to avoid blocking here
            new Thread(new receiptSending(initiatorPeer, newEntry)).start();
        }
    }

    public class receiptSending implements Runnable {
        yacySeed initiatorPeer;
        indexURLReference reference;

        public receiptSending(final yacySeed initiatorPeer, final indexURLReference reference) {
            this.initiatorPeer = initiatorPeer;
            this.reference = reference;
        }
        public void run() {
            yacyClient.crawlReceipt(webIndex.seedDB.mySeed(), initiatorPeer, "crawl", "fill", "indexed", reference, "");
        }
    }

    private static SimpleDateFormat DateFormatter = new SimpleDateFormat("EEE, dd MMM yyyy");
    public static String dateString(final Date date) {
        if (date == null) return "";
        return DateFormatter.format(date);
    }

    // we need locale-independent RFC-822 dates at some places
    private static SimpleDateFormat DateFormatter822 = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss Z", Locale.US);
    public static String dateString822(final Date date) {
        if (date == null) return "";
        return DateFormatter822.format(date);
    }
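
    /* Example output (illustrative): in a UTC environment,
     * dateString822(new Date(0L)) yields "Thu, 01 Jan 1970 00:00:00 +0000",
     * the locale-independent RFC-822 form used in HTTP headers. Note that
     * SimpleDateFormat is not thread-safe, so these shared static formatters
     * assume callers do not format concurrently; concurrent use would need
     * external synchronization.
     */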

    public serverObjects action(final String actionName, final serverObjects actionInput) {
        // perform an action. (not used)
        return null;
    }

    public String toString() {
        // it is possible to use this method in the cgi pages.
        // actually it is used there for testing purposes
        return "PROPS: " + super.toString() + "; QUEUE: " + webIndex.queuePreStack.toString();
    }

    // method for index deletion
    public int removeAllUrlReferences(final yacyURL url, final boolean fetchOnline) {
        return removeAllUrlReferences(url.hash(), fetchOnline);
    }

    public int removeAllUrlReferences(final String urlhash, final boolean fetchOnline) {
        // find all the words in a specific resource and remove the url reference from every word index
        // finally, delete the url entry

        // determine the url string
        final indexURLReference entry = webIndex.getURL(urlhash, null, 0);
        if (entry == null) return 0;
        final indexURLReference.Components comp = entry.comp();
        if (comp.url() == null) return 0;

        InputStream resourceContent = null;
        try {
            // get the resource content
            final Object[] resource = plasmaSnippetCache.getResource(comp.url(), fetchOnline, 10000, true, false);
            resourceContent = (InputStream) resource[0];
            final Long resourceContentLength = (Long) resource[1];

            // parse the resource
            final plasmaParserDocument document = plasmaSnippetCache.parseDocument(comp.url(), resourceContentLength.longValue(), resourceContent);

            // get the word set
            Set<String> words = null;
            try {
                words = new plasmaCondenser(document, true, true).words().keySet();
            } catch (final UnsupportedEncodingException e) {
                e.printStackTrace();
            }

            // delete all word references
            int count = 0;
            if (words != null) count = webIndex.removeWordReferences(words, urlhash);

            // finally delete the url entry itself
            webIndex.removeURL(urlhash);
            return count;
        } catch (final ParserException e) {
            return 0;
        } finally {
            if (resourceContent != null) try { resourceContent.close(); } catch (final Exception e) {/* ignore this */}
        }
    }

    public int adminAuthenticated(final httpHeader header) {

        // authorization for localhost, only if flag is set to grant localhost access as admin
        final String clientIP = (String) header.get(httpHeader.CONNECTION_PROP_CLIENTIP, "");
        final String refererHost = header.refererHost();
        final boolean accessFromLocalhost = serverCore.isLocalhost(clientIP) && (refererHost.length() == 0 || serverCore.isLocalhost(refererHost));
        if (getConfigBool("adminAccountForLocalhost", false) && accessFromLocalhost) return 3; // soft-authenticated for localhost

        // get the authorization string from the header
        final String authorization = ((String) header.get(httpHeader.AUTHORIZATION, "xxxxxx")).trim().substring(6);

        // security check against too long authorization strings
        if (authorization.length() > 256) return 0;

        // authorization by encoded password, only for localhost access
        final String adminAccountBase64MD5 = getConfig(httpd.ADMIN_ACCOUNT_B64MD5, "");
        if (accessFromLocalhost && (adminAccountBase64MD5.equals(authorization))) return 3; // soft-authenticated for localhost

        // authorization by hit in userDB
        if (userDB.hasAdminRight((String) header.get(httpHeader.AUTHORIZATION, "xxxxxx"), ((String) header.get(httpHeader.CONNECTION_PROP_CLIENTIP, "")), header.getHeaderCookies())) return 4; // return, because 4 = max

        // authorization with admin keyword in configuration
        return httpd.staticAdminAuthenticated(authorization, this);
    }
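
    /* Summary of the return codes above, as they are interpreted by
     * verifyAuthentication() below: 0 = wrong password given, 1 = no password
     * given, 2 = no password stored, 3 = soft-authenticated (localhost),
     * 4 = hard-authenticated (userDB hit or static admin keyword).
     */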

    public boolean verifyAuthentication(final httpHeader header, final boolean strict) {
        // handle access rights
        switch (adminAuthenticated(header)) {
        case 0: // wrong password given
            try { Thread.sleep(3000); } catch (final InterruptedException e) { } // prevent brute-force
            return false;
        case 1: // no password given
            return false;
        case 2: // no password stored
            return !strict;
        case 3: // soft-authenticated for localhost only
            return true;
        case 4: // hard-authenticated, all ok
            return true;
        }
        return false;
    }

    public void setPerformance(int wantedPPM) {
        // we consider 3 cases here
        // wantedPPM <= 10: low performance
        // 10 < wantedPPM < 1000: custom performance
        // 1000 <= wantedPPM : maximum performance
        if (wantedPPM <= 10) wantedPPM = 10;
        if (wantedPPM >= 6000) wantedPPM = 6000;
        final int newBusySleep = 60000 / wantedPPM; // for wantedPPM = 10: 6000; for wantedPPM = 1000: 60

        serverBusyThread thread;

        thread = getThread(plasmaSwitchboardConstants.INDEX_DIST);
        if (thread != null) {
            setConfig(plasmaSwitchboardConstants.INDEX_DIST_BUSYSLEEP, thread.setBusySleep(Math.max(2000, newBusySleep * 2)));
            thread.setIdleSleep(30000);
        }

        thread = getThread(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL);
        if (thread != null) {
            setConfig(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL_BUSYSLEEP, thread.setBusySleep(newBusySleep));
            thread.setIdleSleep(2000);
        }

        thread = getThread(plasmaSwitchboardConstants.PROXY_CACHE_ENQUEUE);
        if (thread != null) {
            setConfig(plasmaSwitchboardConstants.PROXY_CACHE_ENQUEUE_BUSYSLEEP, thread.setBusySleep(0));
            thread.setIdleSleep(2000);
        }

        thread = getThread(plasmaSwitchboardConstants.INDEXER);
        if (thread != null) {
            setConfig(plasmaSwitchboardConstants.INDEXER_BUSYSLEEP, thread.setBusySleep(newBusySleep / 8));
            thread.setIdleSleep(2000);
        }

    }
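
    /* Worked example for the formula above: wantedPPM = 600 gives
     * newBusySleep = 60000 / 600 = 100 ms, so the local crawler sleeps 100 ms
     * between busy cycles, the indexer 100 / 8 = 12 ms (integer division), and
     * index distribution max(2000, 2 * 100) = 2000 ms.
     */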

    public static int accessFrequency(final HashMap<String, TreeSet<Long>> tracker, final String host) {
        // returns the access frequency in queries per hour for a given host and a specific tracker
        final long timeInterval = 1000 * 60 * 60;
        final TreeSet<Long> accessSet = tracker.get(host);
        if (accessSet == null) return 0;
        return accessSet.tailSet(Long.valueOf(System.currentTimeMillis() - timeInterval)).size();
    }
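
    /* Illustrative example: if the tracker recorded accesses for "example.net"
     * 30 and 90 minutes ago, accessFrequency(tracker, "example.net") returns 1,
     * because only accesses within the last hour (timeInterval) are counted.
     */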

    public void startTransferWholeIndex(final yacySeed seed, final boolean delete) {
        if (transferIdxThread == null) {
            this.transferIdxThread = new plasmaDHTFlush(this.log, this.webIndex, seed, delete,
                    "true".equalsIgnoreCase(getConfig(plasmaSwitchboardConstants.INDEX_TRANSFER_GZIP_BODY, "false")),
                    (int) getConfigLong(plasmaSwitchboardConstants.INDEX_TRANSFER_TIMEOUT, 60000));
            this.transferIdxThread.start();
        }
    }

    public void stopTransferWholeIndex(final boolean wait) {
        if ((transferIdxThread != null) && (transferIdxThread.isAlive()) && (!transferIdxThread.isFinished())) {
            try {
                this.transferIdxThread.stopIt(wait);
            } catch (final InterruptedException e) { }
        }
    }

    public void abortTransferWholeIndex(final boolean wait) {
        if (transferIdxThread != null) {
            if (!transferIdxThread.isFinished())
                try {
                    this.transferIdxThread.stopIt(wait);
                } catch (final InterruptedException e) { }
            transferIdxThread = null;
        }
    }

    public String dhtShallTransfer() {
        if (this.webIndex.seedDB == null) {
            return "no DHT distribution: seedDB == null";
        }
        if (this.webIndex.seedDB.mySeed() == null) {
            return "no DHT distribution: mySeed == null";
        }
        if (this.webIndex.seedDB.mySeed().isVirgin()) {
            return "no DHT distribution: status is virgin";
        }
        if (this.webIndex.seedDB.noDHTActivity()) {
            return "no DHT distribution: network too small";
        }
        if (!this.getConfigBool("network.unit.dht", true)) {
            return "no DHT distribution: disabled by network.unit.dht";
        }
        if (getConfig(plasmaSwitchboardConstants.INDEX_DIST_ALLOW, "false").equalsIgnoreCase("false")) {
            return "no DHT distribution: not enabled (see setting)";
        }
        if (webIndex.countURL() < 10) {
            return "no DHT distribution: loadedURL.size() = " + webIndex.countURL();
        }
        if (webIndex.size() < 100) {
            return "no DHT distribution: not enough words - wordIndex.size() = " + webIndex.size();
        }
        if ((getConfig(plasmaSwitchboardConstants.INDEX_DIST_ALLOW_WHILE_CRAWLING, "false").equalsIgnoreCase("false")) && (crawlQueues.noticeURL.notEmpty())) {
            return "no DHT distribution: crawl in progress: noticeURL.stackSize() = " + crawlQueues.noticeURL.size() + ", sbQueue.size() = " + webIndex.queuePreStack.size();
        }
        if ((getConfig(plasmaSwitchboardConstants.INDEX_DIST_ALLOW_WHILE_INDEXING, "false").equalsIgnoreCase("false")) && (webIndex.queuePreStack.size() > 1)) {
            return "no DHT distribution: indexing in progress: noticeURL.stackSize() = " + crawlQueues.noticeURL.size() + ", sbQueue.size() = " + webIndex.queuePreStack.size();
        }
        return null; // this means: yes, please do the dht transfer
    }

    public boolean dhtTransferJob() {
        final String rejectReason = dhtShallTransfer();
        if (rejectReason != null) {
            if (this.log.isFine()) log.logFine(rejectReason);
            return false;
        }
        if (this.dhtTransferChunk == null) {
            if (this.log.isFine()) log.logFine("no DHT distribution: no transfer chunk defined");
            return false;
        }
        if (this.dhtTransferChunk.getStatus() != plasmaDHTChunk.chunkStatus_FILLED) {
            if (this.log.isFine()) log.logFine("no DHT distribution: index distribution is in progress, status=" + this.dhtTransferChunk.getStatus());
            return false;
        }

        // do the transfer
        final int peerCount = Math.max(1, (this.webIndex.seedDB.mySeed().isJunior()) ?
                (int) getConfigLong("network.unit.dhtredundancy.junior", 1) :
                (int) getConfigLong("network.unit.dhtredundancy.senior", 1)); // set redundancy factor
        final long starttime = System.currentTimeMillis();

        final boolean ok = dhtTransferProcess(dhtTransferChunk, peerCount);

        final boolean success;
        if (ok) {
            dhtTransferChunk.setStatus(plasmaDHTChunk.chunkStatus_COMPLETE);
            if (this.log.isFine()) log.logFine("DHT distribution: transfer COMPLETE");
            // adapt the transfer count
            if ((System.currentTimeMillis() - starttime) > (10000 * peerCount)) {
                dhtTransferIndexCount--;
            } else {
                if (dhtTransferChunk.indexCount() >= dhtTransferIndexCount) dhtTransferIndexCount++;
            }
            final int minChunkSize = (int) getConfigLong(plasmaSwitchboardConstants.INDEX_DIST_CHUNK_SIZE_MIN, 30);
            final int maxChunkSize = (int) getConfigLong(plasmaSwitchboardConstants.INDEX_DIST_CHUNK_SIZE_MAX, 3000);
            if (dhtTransferIndexCount < minChunkSize) dhtTransferIndexCount = minChunkSize;
            if (dhtTransferIndexCount > maxChunkSize) dhtTransferIndexCount = maxChunkSize;

            // show success
            success = true;
        } else {
            dhtTransferChunk.incTransferFailedCounter();
            final int maxChunkFails = (int) getConfigLong(plasmaSwitchboardConstants.INDEX_DIST_CHUNK_FAILS_MAX, 1);
            if (dhtTransferChunk.getTransferFailedCounter() >= maxChunkFails) {
                //System.out.println("DEBUG: " + dhtTransferChunk.getTransferFailedCounter() + " of " + maxChunkFails + " sendings failed for this chunk, aborting!");
                dhtTransferChunk.setStatus(plasmaDHTChunk.chunkStatus_FAILED);
                if (this.log.isFine()) log.logFine("DHT distribution: transfer FAILED");
            } else {
                //System.out.println("DEBUG: " + dhtTransferChunk.getTransferFailedCounter() + " of " + maxChunkFails + " sendings failed for this chunk, retrying!");
                if (this.log.isFine()) log.logFine("DHT distribution: transfer FAILED, sending this chunk again");
            }
            success = false;
        }
        return success;
    }
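
    /* Note on the adaptive chunk sizing above: if a transfer round took longer
     * than 10 s per target peer, dhtTransferIndexCount shrinks by one, otherwise
     * it may grow by one; the result is clamped to the configured
     * [INDEX_DIST_CHUNK_SIZE_MIN, INDEX_DIST_CHUNK_SIZE_MAX] range (defaults 30
     * and 3000). For example, with peerCount = 3 a round slower than 30 s makes
     * the next chunk smaller.
     */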

    public boolean dhtTransferProcess(final plasmaDHTChunk dhtChunk, final int peerCount) {
        if ((this.webIndex.seedDB == null) || (this.webIndex.seedDB.sizeConnected() == 0)) return false;

        try {
            // find a list of DHT-peers
            final double maxDist = 0.2;
            final ArrayList<yacySeed> seeds = webIndex.peerActions.dhtAction.getDHTTargets(webIndex.seedDB, log, peerCount, Math.min(8, (int) (this.webIndex.seedDB.sizeConnected() * maxDist)), dhtChunk.firstContainer().getWordHash(), dhtChunk.lastContainer().getWordHash(), maxDist);
            if (seeds.size() < peerCount) {
                log.logWarning("found only " + seeds.size() + " peers, not enough for distribution of dhtchunk [" + dhtChunk.firstContainer().getWordHash() + " .. " + dhtChunk.lastContainer().getWordHash() + "]");
                return false;
            }

            // send away the indexes to all these peers
            int hc1 = 0;

            // getting distribution configuration values
            final boolean gzipBody = getConfig(plasmaSwitchboardConstants.INDEX_DIST_GZIP_BODY, "false").equalsIgnoreCase("true");
            final int timeout = (int) getConfigLong(plasmaSwitchboardConstants.INDEX_DIST_TIMEOUT, 60000);
            final int retries = 0;

            // starting up multiple DHT transfer threads
            final Iterator<yacySeed> seedIter = seeds.iterator();
            final ArrayList<plasmaDHTTransfer> transfer = new ArrayList<plasmaDHTTransfer>(peerCount);
            while (hc1 < peerCount && (transfer.size() > 0 || seedIter.hasNext())) {

                // starting up some transfer threads
                final int transferThreadCount = transfer.size();
                for (int i = 0; i < peerCount - hc1 - transferThreadCount; i++) {
                    // check for interruption
                    checkInterruption();

                    if (seedIter.hasNext()) {
                        final plasmaDHTTransfer t = new plasmaDHTTransfer(log, webIndex.seedDB, webIndex.peerActions, seedIter.next(), dhtChunk, gzipBody, timeout, retries);
                        t.start();
                        transfer.add(t);
                    } else {
                        break;
                    }
                }

                // waiting for the transfer threads to finish
                final Iterator<plasmaDHTTransfer> transferIter = transfer.iterator();
                while (transferIter.hasNext()) {
                    // check for interruption
                    checkInterruption();

                    final plasmaDHTTransfer t = transferIter.next();
                    if (!t.isAlive()) {
                        // remove finished thread from the list
                        transferIter.remove();

                        // count successful transfers
                        if (t.getStatus() == plasmaDHTChunk.chunkStatus_COMPLETE) {
                            this.log.logInfo("DHT distribution: transfer to peer " + t.getSeed().getName() + " finished.");
                            hc1++;
                        }
                    }
                }

                if (hc1 < peerCount) Thread.sleep(100);
            }

            // clean up and finish with deletion of indexes
            if (hc1 >= peerCount) {
                // success
                return true;
            }
            this.log.logSevere("Index distribution failed. Too few peers (" + hc1 + ") received the index, not deleted locally.");
            return false;
        } catch (final InterruptedException e) {
            return false;
        }
    }

    private void addURLtoErrorDB(
            final yacyURL url,
            final String referrerHash,
            final String initiator,
            final String name,
            final String failreason
    ) {
        assert initiator != null;
        // create a new errorURL DB entry
        final CrawlEntry bentry = new CrawlEntry(
                initiator,
                url,
                referrerHash,
                (name == null) ? "" : name,
                new Date(),
                null,
                0,
                0,
                0);
        final ZURL.Entry ee = crawlQueues.errorURL.newEntry(
                bentry, initiator, new Date(),
                0, failreason);
        // store the entry
        ee.store();
        // push it onto the stack
        crawlQueues.errorURL.push(ee);
    }

    public int currentPPM() {
        final long uptime = (System.currentTimeMillis() - serverCore.startupTime) / 1000;
        final long uptimediff = uptime - lastseedcheckuptime;
        final long indexedcdiff = indexedPages - lastindexedPages;
        totalPPM = (int) (indexedPages * 60 / Math.max(uptime, 1));
        return Math.round(Math.max(indexedcdiff, 0f) * 60f / Math.max(uptimediff, 1f));
    }
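
    /* Note: totalPPM is the pages-per-minute average over the whole uptime,
     * while the returned value averages only over the interval since the last
     * seed check (uptimediff), i.e. a recent-activity figure. Example: 120
     * pages indexed during a 60 s interval yields 120 * 60 / 60 = 120 PPM.
     */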

    public void updateMySeed() {
        if (getConfig("peerName", "anomic").equals("anomic")) {
            // generate new peer name
            setConfig("peerName", yacySeed.makeDefaultPeerName());
        }
        webIndex.seedDB.mySeed().put(yacySeed.NAME, getConfig("peerName", "nameless"));
        webIndex.seedDB.mySeed().put(yacySeed.PORT, Integer.toString(serverCore.getPortNr(getConfig("port", "8080"))));

        final long uptime = (System.currentTimeMillis() - serverCore.startupTime) / 1000;
        final long uptimediff = uptime - lastseedcheckuptime;
        final long indexedcdiff = indexedPages - lastindexedPages;
        //double requestcdiff = requestedQueries - lastrequestedQueries;
        if (uptimediff > 300 || uptimediff <= 0 || lastseedcheckuptime == -1) {
            lastseedcheckuptime = uptime;
            lastindexedPages = indexedPages;
            lastrequestedQueries = requestedQueries;
        }

        // the speed of indexing (pages/minute) of the peer
        totalPPM = (int) (indexedPages * 60 / Math.max(uptime, 1));
        webIndex.seedDB.mySeed().put(yacySeed.ISPEED, Long.toString(Math.round(Math.max(indexedcdiff, 0f) * 60f / Math.max(uptimediff, 1f))));
        totalQPM = requestedQueries * 60d / Math.max(uptime, 1d);
        webIndex.seedDB.mySeed().put(yacySeed.RSPEED, Double.toString(totalQPM /*Math.max((float) requestcdiff, 0f) * 60f / Math.max((float) uptimediff, 1f)*/ ));

        webIndex.seedDB.mySeed().put(yacySeed.UPTIME, Long.toString(uptime / 60)); // the number of minutes that the peer has been up
        webIndex.seedDB.mySeed().put(yacySeed.LCOUNT, Integer.toString(webIndex.countURL())); // the number of links that the peer has stored (LURL's)
        webIndex.seedDB.mySeed().put(yacySeed.NCOUNT, Integer.toString(crawlQueues.noticeURL.size())); // the number of links that the peer has noticed, but not loaded (NURL's)
        webIndex.seedDB.mySeed().put(yacySeed.RCOUNT, Integer.toString(crawlQueues.noticeURL.stackSize(NoticedURL.STACK_TYPE_LIMIT))); // the number of links that the peer provides for remote crawling (ZURL's)
        webIndex.seedDB.mySeed().put(yacySeed.ICOUNT, Integer.toString(webIndex.size())); // the minimum number of words that the peer has indexed (as it says)
        webIndex.seedDB.mySeed().put(yacySeed.SCOUNT, Integer.toString(webIndex.seedDB.sizeConnected())); // the number of seeds that the peer has stored
        webIndex.seedDB.mySeed().put(yacySeed.CCOUNT, Double.toString(((int) ((webIndex.seedDB.sizeConnected() + webIndex.seedDB.sizeDisconnected() + webIndex.seedDB.sizePotential()) * 60.0 / (uptime + 1.01)) * 100) / 100.0)); // the number of clients that the peer connects (as connects/hour)
        webIndex.seedDB.mySeed().put(yacySeed.VERSION, getConfig("version", ""));
        webIndex.seedDB.mySeed().setFlagDirectConnect(true);
        webIndex.seedDB.mySeed().setLastSeenUTC();
        webIndex.seedDB.mySeed().put(yacySeed.UTC, serverDate.UTCDiffString());
        webIndex.seedDB.mySeed().setFlagAcceptRemoteCrawl(getConfig("crawlResponse", "").equals("true"));
        webIndex.seedDB.mySeed().setFlagAcceptRemoteIndex(getConfig("allowReceiveIndex", "").equals("true"));
        //mySeed.setFlagAcceptRemoteIndex(true);
    }

    public void loadSeedLists() {
        // uses the superseed to initialize the database with known seeds

        yacySeed ys;
        String seedListFileURL;
        yacyURL url;
        ArrayList<String> seedList;
        Iterator<String> enu;
        int lc;
        final int sc = webIndex.seedDB.sizeConnected();
        httpHeader header;

        yacyCore.log.logInfo("BOOTSTRAP: " + sc + " seeds known from previous run");

        // - use the superseed to further fill up the seedDB
        int ssc = 0, c = 0;
        while (true) {
            if (Thread.currentThread().isInterrupted()) break;
            seedListFileURL = sb.getConfig("network.unit.bootstrap.seedlist" + c, "");
            if (seedListFileURL.length() == 0) break;
            c++;
            if (
                    seedListFileURL.startsWith("http://") ||
                    seedListFileURL.startsWith("https://")
            ) {
                // load the seed list
                try {
                    final httpHeader reqHeader = new httpHeader();
                    reqHeader.put(httpHeader.PRAGMA, "no-cache");
                    reqHeader.put(httpHeader.CACHE_CONTROL, "no-cache");
                    reqHeader.put(httpHeader.USER_AGENT, HTTPLoader.yacyUserAgent);

                    url = new yacyURL(seedListFileURL, null);
                    final long start = System.currentTimeMillis();
                    header = HttpClient.whead(url.toString(), reqHeader);
                    final long loadtime = System.currentTimeMillis() - start;
                    if (header == null) {
                        if (loadtime > getConfigLong("bootstrapLoadTimeout", 6000)) {
                            yacyCore.log.logWarning("BOOTSTRAP: seed-list URL " + seedListFileURL + " not available, time-out after " + loadtime + " milliseconds");
                        } else {
                            yacyCore.log.logWarning("BOOTSTRAP: seed-list URL " + seedListFileURL + " not available, no content");
                        }
                    } else if (header.lastModified() == null) {
                        yacyCore.log.logWarning("BOOTSTRAP: seed-list URL " + seedListFileURL + " not usable, last-modified is missing");
                    } else if ((header.age() > 86400000) && (ssc > 0)) {
                        yacyCore.log.logInfo("BOOTSTRAP: seed-list URL " + seedListFileURL + " too old (" + (header.age() / 86400000) + " days)");
                    } else {
                        ssc++;
                        final byte[] content = HttpClient.wget(url.toString(), reqHeader, (int) getConfigLong("bootstrapLoadTimeout", 20000));
                        seedList = nxTools.strings(content, "UTF-8");
                        enu = seedList.iterator();
                        lc = 0;
                        while (enu.hasNext()) {
                            ys = yacySeed.genRemoteSeed(enu.next(), null, false);
                            if ((ys != null) &&
                                ((!webIndex.seedDB.mySeedIsDefined()) || (!webIndex.seedDB.mySeed().hash.equals(ys.hash)))) {
                                if (webIndex.peerActions.connectPeer(ys, false)) lc++;
                                //seedDB.writeMap(ys.hash, ys.getMap(), "init");
                                //System.out.println("BOOTSTRAP: received peer " + ys.get(yacySeed.NAME, "anonymous") + "/" + ys.getAddress());
                                //lc++;
                            }
                        }
                        yacyCore.log.logInfo("BOOTSTRAP: " + lc + " seeds from seed-list URL " + seedListFileURL + ", AGE=" + (header.age() / 3600000) + "h");
                    }

                } catch (final IOException e) {
                    // this is when wget fails, commonly because of a timeout
                    yacyCore.log.logWarning("BOOTSTRAP: failed (1) to load seeds from seed-list URL " + seedListFileURL + ": " + e.getMessage());
                } catch (final Exception e) {
                    // this is when wget fails; may be because of a missing internet connection
                    yacyCore.log.logSevere("BOOTSTRAP: failed (2) to load seeds from seed-list URL " + seedListFileURL + ": " + e.getMessage(), e);
                }
            }
        }
        yacyCore.log.logInfo("BOOTSTRAP: " + (webIndex.seedDB.sizeConnected() - sc) + " new seeds while bootstrapping.");
    }
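
    /* Note on the freshness check above: once one seed list has been loaded
     * (ssc > 0), further lists older than 86400000 ms (one day) are skipped;
     * the first list is accepted regardless of age, so bootstrapping still
     * works when all published lists are stale.
     */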

    public void checkInterruption() throws InterruptedException {
        final Thread curThread = Thread.currentThread();
        if ((curThread instanceof serverThread) && ((serverThread) curThread).shutdownInProgress()) throw new InterruptedException("Shutdown in progress ...");
        else if (this.terminate || curThread.isInterrupted()) throw new InterruptedException("Shutdown in progress ...");
    }

    public void terminate(final long delay) {
        if (delay <= 0) throw new IllegalArgumentException("The shutdown delay must be greater than 0.");
        (new delayedShutdown(this, delay)).start();
    }

    public void terminate() {
        this.terminate = true;
        this.shutdownSync.V();
    }

    public boolean isTerminated() {
        return this.terminate;
    }

    public boolean waitForShutdown() throws InterruptedException {
        this.shutdownSync.P();
        return this.terminate;
    }

    /**
     * loads the content of the given URL into a Map
     *
     * Lines like abc=123 are parsed as the pair: abc => 123
     *
     * @param url the URL of the resource to load
     * @return the parsed key/value map; an empty map if loading or parsing fails
     */
    public static Map<String, String> loadHashMap(final yacyURL url) {
        try {
            // sending request
            final httpHeader reqHeader = new httpHeader();
            reqHeader.put(httpHeader.USER_AGENT, HTTPLoader.yacyUserAgent);
            final HashMap<String, String> result = nxTools.table(HttpClient.wget(url.toString(), reqHeader, 10000), "UTF-8");
            if (result == null) return new HashMap<String, String>();
            return result;
        } catch (final Exception e) {
            return new HashMap<String, String>();
        }
    }
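
    /* Usage sketch (illustrative; the URL is hypothetical):
     *
     *   final Map<String, String> remoteProps =
     *       plasmaSwitchboard.loadHashMap(new yacyURL("http://example.net/props.txt", null));
     *   final String version = remoteProps.get("version"); // null if the key is absent
     *
     * A fetched line "version=0.57" becomes the map entry version => 0.57.
     */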
}

class MoreMemory extends TimerTask {
    public final void run() {
        serverMemory.gc(10000, "MoreMemory()");
    }
}

class delayedShutdown extends Thread {
    private final plasmaSwitchboard sb;
    private final long delay;
    public delayedShutdown(final plasmaSwitchboard sb, final long delay) {
        this.sb = sb;
        this.delay = delay;
    }

    public void run() {
        try {
            Thread.sleep(delay);
        } catch (final InterruptedException e) {
            sb.getLog().logInfo("interrupted delayed shutdown");
        }
        this.sb.terminate();
    }
}