fixed news; added news appearance on Network and IndexCreate page; added intention string to global crawl

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@466 6c8d7289-2bf4-0310-a012-ef5d649a1542
orbiter 20 years ago
parent 5672709ef3
commit d34eb23e4e

@ -18,15 +18,13 @@ You can define URLs as start points for Web page crawling and start that crawlin
<form action="IndexCreate_p.html" method="post" enctype="multipart/form-data">
<tr valign="top" class="TableCellDark">
<td width="120"></td>
<td></td>
<td width="120"></td>
<td></td>
<td width="120" colspan="2"></td>
<td></td>
</tr>
<tr valign="top" class="TableCellDark">
<td class=small>Crawling Depth:</td>
<td class=small><input name="crawlingDepth" type="text" size="2" maxlength="2" value="#[crawlingDepth]#"></td>
<td class=small colspan="3">
<td class=small colspan="2">
A minimum of 1 is recommended.
Be careful with the prefetch number: consider an average branching factor of 20;
a prefetch-depth of 8 would index 25,600,000,000 pages, maybe the whole WWW.
@ -35,7 +33,7 @@ You can define URLs as start points for Web page crawling and start that crawlin
<tr valign="top" class="TableCellDark">
<td class=small>Crawling Filter:</td>
<td class=small><input name="crawlingFilter" type="text" size="20" maxlength="100" value="#[crawlingFilter]#"></td>
<td class=small colspan="3">
<td class=small colspan="2">
This is an emacs-like regular expression that must match the crawled URL.
Use this e.g. to crawl a single domain. If you set this filter it would make sense to increase
the crawl depth.
@ -44,7 +42,7 @@ You can define URLs as start points for Web page crawling and start that crawlin
<tr valign="top" class="TableCellDark">
<td class=small>Accept URLs with '?' / dynamic URLs:</td>
<td class=small><input type="checkbox" name="crawlingQ" align="top" #(crawlingQChecked)#::checked#(/crawlingQChecked)#></td>
<td class=small colspan="3">
<td class=small colspan="2">
URLs pointing to dynamic content should usually not be crawled. However, there are sometimes web pages with static content that
are accessed with URLs containing question marks. If you are unsure, do not check this to avoid crawl loops.
</td>
@ -52,7 +50,7 @@ You can define URLs as start points for Web page crawling and start that crawlin
<tr valign="top" class="TableCellDark">
<td class=small>Store to Proxy Cache:</td>
<td class=small><input type="checkbox" name="storeHTCache" align="top" #(storeHTCacheChecked)#::checked#(/storeHTCacheChecked)#></td>
<td class=small colspan="3">
<td class=small colspan="2">
This option is used by default for proxy prefetch, but is not needed for explicit crawling.
We recommend leaving this switched off unless you want to control the crawl results with the
<a href="CacheAdmin_p.html" class=small>Cache Monitor</a>.
@ -61,25 +59,29 @@ You can define URLs as start points for Web page crawling and start that crawlin
<tr valign="top" class="TableCellDark">
<td class=small>Do Local Indexing:</td>
<td class=small><input type="checkbox" name="localIndexing" align="top" #(localIndexingChecked)#::checked#(/localIndexingChecked)#></td>
<td class=small colspan="3">
<td class=small colspan="2">
This should be switched on by default, unless you want to crawl only to fill the
<a href="CacheAdmin_p.html" class=small>Proxy Cache</a> without indexing.
</td>
</tr>
<tr valign="top" class="TableCellDark">
<td class=small>Do Remote Indexing:</td>
<td class=small><input type="checkbox" name="crawlOrder" align="top" #(crawlOrderChecked)#::checked#(/crawlOrderChecked)#></td>
<td class=small colspan="3">
<td class=small><input type="checkbox" name="crawlOrder" align="top" #(crawlOrderChecked)#::checked#(/crawlOrderChecked)#><br>
Describe your intention to start this global crawl (optional):<br>
<input name="intention" type="text" size="40" maxlength="100" value=""><br>
This message will appear in the 'Other Peer Crawl Start' table of other peers.
</td>
<td class=small colspan="2">
If checked, the crawl will try to assign the leaf nodes of the search tree to remote peers.
If you need your crawling results locally, you must switch this off.
Only senior and principal peers can initiate or receive remote crawls.
A News message will be created to inform all peers about a global crawl, so they can omit starting a crawl with the same start point.
<b>A YaCyNews message will be created to inform all peers about a global crawl</b>, so they can avoid starting a crawl with the same start point.
</td>
</tr>
<tr valign="top" class="TableCellDark">
<td class=small>Exclude <i>static</i> Stop-Words</td>
<td class=small><input type="checkbox" name="xsstopw" align="top" #(xsstopwChecked)#::checked#(/xsstopwChecked)#></td>
<td class=small colspan="3">
<td class=small colspan="2">
To exclude all words given in the file <tt class=small>yacy.stopwords</tt> from indexing,
check this box.
</td>
@ -103,8 +105,8 @@ You can define URLs as start points for Web page crawling and start that crawlin
</tr>
-->
<tr valign="top" class="TableCellLight">
<td class="small" rowspan="3">Starting Point:</td>
<td class="small">
<td class="small">Starting Point:</td>
<td class="small" colspan="2">
<table cellpadding="0" cellspacing="0">
<tr><td class="small">From&nbsp;File:</td>
<td class="small"><input type="radio" name="crawlingMode" value="file"></td>
@ -134,22 +136,22 @@ Crawling and indexing can be done by remote peers.
Your peer can search and index for other peers and they can search for you.</div>
<table border="0" cellpadding="5" cellspacing="0" width="100%">
<tr valign="top" class="TableCellDark">
<td width="10%">
<td class=small width="10%">
<input type="radio" name="dcr" value="acceptCrawlMax" align="top" #(acceptCrawlMaxChecked)#::checked#(/acceptCrawlMaxChecked)#>
</td><td>
</td><td class=small>
Accept remote crawling requests and perform crawl at maximum load
</td>
</tr><tr valign="top" class="TableCellDark">
<td width="10%">
<td class=small width="10%">
<input type="radio" name="dcr" value="acceptCrawlLimited" align="top" #(acceptCrawlLimitedChecked)#::checked#(/acceptCrawlLimitedChecked)#>
</td><td>
</td><td class=small>
Accept remote crawling requests and perform crawl at a maximum of
<input name="acceptCrawlLimit" type="text" size="4" maxlength="4" value="#[PPM]#"> Pages Per Minute (minimum is 1, low system load at PPM <= 30)
</td>
</tr><tr valign="top" class="TableCellDark">
<td width="10%">
<td class=small width="10%">
<input type="radio" name="dcr" value="acceptCrawlDenied" align="top" #(acceptCrawlDeniedChecked)#::checked#(/acceptCrawlDeniedChecked)#>
</td><td>
</td><td class=small>
Do not accept remote crawling requests (please set this only if you cannot accept crawling even one page per minute; see the option above)</td>
</tr><tr valign="top" class="TableCellLight">
@ -210,6 +212,14 @@ Continue crawling.
</form>
<br>
#(/refreshbutton)#
<br>
<form action="IndexCreate_p.html" method="post" enctype="multipart/form-data">
#(crawler-paused)#
<input type="submit" name="continuecrawlqueue" value="continue crawling">
::
<input type="submit" name="pausecrawlqueue" value="pause crawling">
#(/crawler-paused)#
</form>
<b id="crawlingProfiles">Crawl Profile List:</b><br>
<table border="0" cellpadding="2" cellspacing="1" width="100%">
<tr class="TableHeader">
@ -236,6 +246,28 @@ Continue crawling.
#{/crawlProfiles}#
</table>
<br>
<b id="crawlingStarts">Other Peer Crawl Starts (recently started with remote indexing; this is a YaCyNews Service):</b><br>
<table border="0" cellpadding="2" cellspacing="1" width="100%">
<tr class="TableHeader">
<td class="small"><b>Start Time</b></td>
<td class="small"><b>Peer Name</b></td>
<td class="small"><b>Start URL</b></td>
<td class="small"><b>Intention/Description</b></td>
<td class="small"><b>Depth</b></td>
<td class="small"><b>Accept '?'</b></td>
</tr>
#{otherCrawlStart}#
<tr class="TableCell#(dark)#Light::Dark#(/dark)#" class="small">
<td class="small">#[cre]#</td>
<td class="small">#[peername]#</td>
<td class="small"><a class="small" href="#[startURL]#">#[startURL]#</a></td>
<td class="small">#[intention]#</td>
<td class="small">#[generalDepth]#</td>
<td class="small">#(crawlingQ)#no::yes#(/withQuery)#</td>
</tr>
#{/otherCrawlStart}#
</table>
<br>
<b id="remoteCrawlPeers">Remote Crawling Peers:</b>&nbsp;
#(remoteCrawlPeers)#
No remote crawl peers available.<br>
@ -256,14 +288,6 @@ No remote crawl peers availible.<br>
</tr>
</table>
#(/remoteCrawlPeers)#
<br>
<form action="IndexCreate_p.html" method="post" enctype="multipart/form-data">
#(crawler-paused)#
<input type="submit" name="continuecrawlqueue" value="continue crawling">
::
<input type="submit" name="pausecrawlqueue" value="pause crawling">
#(/crawler-paused)#
</form>
</p>
#[footer]#

@ -45,6 +45,7 @@
import java.io.File;
import java.io.OutputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.text.SimpleDateFormat;
@ -72,6 +73,7 @@ import de.anomic.tools.bitfield;
import de.anomic.yacy.yacyCore;
import de.anomic.yacy.yacySeed;
import de.anomic.yacy.yacyNewsRecord;
import de.anomic.yacy.yacyNewsPool;
public class IndexCreate_p {
@ -168,6 +170,7 @@ public class IndexCreate_p {
m.remove("storeHTCache");
m.remove("generalFilter");
m.remove("specificFilter");
m.put("intention", ((String) post.get("intention", "")));
yacyCore.newsPool.publishMyNews(new yacyNewsRecord("crwlstrt", m));
}
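For reference, the table-filling loop below reads four attributes back out of each "crwlstrt" record (startURL, intention, generalDepth, crawlingQ). A minimal sketch of publishing such a record directly, with placeholder values; the helper name is hypothetical, and the throws clause is declared defensively since publishMyNews's signature is not visible in this diff:

public class CrawlStartNewsSketch {
    public static void announceCrawl(String startURL, String intention, String depth, boolean crawlingQ) throws java.io.IOException {
        java.util.HashMap m = new java.util.HashMap();
        // these four attributes are the ones the 'Other Peer Crawl Start'
        // table reads back out of the record (see the loop below)
        m.put("startURL", startURL);
        m.put("intention", intention);
        m.put("generalDepth", depth);
        m.put("crawlingQ", crawlingQ ? "true" : "false");
        de.anomic.yacy.yacyCore.newsPool.publishMyNews(
            new de.anomic.yacy.yacyNewsRecord("crwlstrt", m));
    }
}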
@ -354,6 +357,35 @@ public class IndexCreate_p {
//}catch(IOException e){};
prop.put("crawlProfiles", count);
// create other peer crawl table using YaCyNews
int availableNews = yacyCore.newsPool.size(yacyNewsPool.INCOMING_DB);
int showedCrawl = 0;
yacyNewsRecord record;
yacySeed peer;
String peername;
try {
for (int c = 0; c < availableNews; c++) {
record = yacyCore.newsPool.get(yacyNewsPool.INCOMING_DB, c);
if (record.category().equals("crwlstrt")) {
peer = yacyCore.seedDB.get(record.originator());
if (peer == null) peername = record.originator(); else peername = peer.getName();
prop.put("otherCrawlStart_" + showedCrawl + "_dark", ((dark) ? 1 : 0));
prop.put("otherCrawlStart_" + showedCrawl + "_cre", record.created());
prop.put("otherCrawlStart_" + showedCrawl + "_peername", peername);
prop.put("otherCrawlStart_" + showedCrawl + "_startURL", record.attributes().get("startURL"));
prop.put("otherCrawlStart_" + showedCrawl + "_intention", record.attributes().get("intention"));
prop.put("otherCrawlStart_" + showedCrawl + "_generalDepth", record.attributes().get("generalDepth"));
prop.put("otherCrawlStart_" + showedCrawl + "_crawlingQ", (record.attributes().get("crawlingQ").equals("true")) ? 1 : 0);
showedCrawl++;
if (showedCrawl > 20) break;
}
}
} catch (IOException e) {}
prop.put("otherCrawlStart", showedCrawl);
// remote crawl peers
if (yacyCore.seedDB == null) {
//table += "Sorry, cannot show any crawl output now because the system is not completely initialised. Please re-try.";

@ -38,33 +38,32 @@
<p>Showing #[num]# entries from a total of #[total]# peers.<br>
<table border="0" cellpadding="2" cellspacing="1">
<tr class="TableHeader" valign="bottom">
<td class="small">Profile<br>&nbsp;</td>
<td class="small">Message<br>&nbsp;</td>
<td class="small">Name<br>&nbsp;</td>
<td class="small">send&nbsp;<b>M</b>essage/<br>show&nbsp;<b>P</b>rofile/<br>edit&nbsp;<b>W</b>iki<br>(* = updated)</td>
<td class="small"><b>Name</b><br>&nbsp;</td>
#(complete)#::
<td class="small">Address<br>&nbsp;</td>
<td class="small">Hash<br>&nbsp;</td>
<td class="small"><b>Address</b><br>&nbsp;</td>
<td class="small"><b>Hash</b><br>&nbsp;</td>
#(/complete)#
<td class="small">Type<br>&nbsp;</td>
<td class="small">Release/<br>SVN<br>&nbsp;</td>
<td class="small">Contact<br>&nbsp;</td>
<td class="small">Last Seen<br>&nbsp;&nbsp;<a href="/Network.html?page=#[page]#&sort=LastSeen&order=up">&lt;</a>&nbsp;<a href="/Network.html?page=#[page]#&sort=LastSeen&order=down">&gt;</a></td>
<td class="small">Uptime<br>&nbsp;&nbsp;<a href="/Network.html?page=#[page]#&sort=Uptime&order=up">&lt;</a>&nbsp;<a href="/Network.html?page=#[page]#&sort=Uptime&order=down">&gt;</a></td>
<td class="small">#Links<br>&nbsp;&nbsp;<a href="/Network.html?page=#[page]#&sort=LCount&order=up">&lt;</a>&nbsp;<a href="/Network.html?page=#[page]#&sort=LCount&order=down">&gt;</a></td>
<td class="small">#RWIs<br>&nbsp;&nbsp;<a href="/Network.html?page=#[page]#&sort=ICount&order=up">&lt;</a>&nbsp;<a href="/Network.html?page=#[page]#&sort=ICount&order=down">&gt;</a></td>
<td class="small">Accept<br>Crawl<br>&nbsp;</td>
<td class="small">Accept<br>Index<br>&nbsp;</td>
<td class="small">Sent<br>Words<br>&nbsp;</td>
<td class="small">Sent<br>URLs<br>&nbsp;</td>
<td class="small">Received<br>Words<br>&nbsp;</td>
<td class="small">Received<br>URLs<br>&nbsp;</td>
<td class="small">Indexed Pages<br>per Minute<br>&nbsp;</td>
<td class="small">#Seeds<br>&nbsp;</td>
<td class="small">#Connects<br>per hour<br>&nbsp;</td></tr>
<td class="small"><b>Type</b><br>&nbsp;</td>
<td class="small"><b>Release/<br>SVN</b><br>&nbsp;</td>
<td class="small"><b>Contact</b><br>&nbsp;</td>
<td class="small"><b>Last Seen</b><br>&nbsp;&nbsp;<a href="/Network.html?page=#[page]#&sort=LastSeen&order=up">&lt;</a>&nbsp;<a href="/Network.html?page=#[page]#&sort=LastSeen&order=down">&gt;</a></td>
<td class="small"><b>Uptime</b><br>&nbsp;&nbsp;<a href="/Network.html?page=#[page]#&sort=Uptime&order=up">&lt;</a>&nbsp;<a href="/Network.html?page=#[page]#&sort=Uptime&order=down">&gt;</a></td>
<td class="small"><b>#Links</b><br>&nbsp;&nbsp;<a href="/Network.html?page=#[page]#&sort=LCount&order=up">&lt;</a>&nbsp;<a href="/Network.html?page=#[page]#&sort=LCount&order=down">&gt;</a></td>
<td class="small"><b>#RWIs</b><br>&nbsp;&nbsp;<a href="/Network.html?page=#[page]#&sort=ICount&order=up">&lt;</a>&nbsp;<a href="/Network.html?page=#[page]#&sort=ICount&order=down">&gt;</a></td>
<td class="small">accept<br><b>Crawl</b>/<br><b>Index</b><br>&nbsp;</td>
<td class="small"><b>Sent<br>Words</b><br>&nbsp;</td>
<td class="small"><b>Sent<br>URLs</b><br>&nbsp;</td>
<td class="small"><b>Received<br>Words</b><br>&nbsp;</td>
<td class="small"><b>Received<br>URLs</b><br>&nbsp;</td>
<td class="small"><b>PPM</b><br>&nbsp;</td>
<td class="small"><b>#Seeds</b><br>&nbsp;</td>
<td class="small"><b>#Connects<br>per hour</b><br>&nbsp;</td></tr>
#{list}#
<tr class="TableCell#(dark)#Light::Dark::Summary#(/dark)#">
<td class="small"><a href="ViewProfile.html?hash=#[hash]#" class="small">view</a></td>
<td class="small"><a href="MessageSend_p.html?hash=#[hash]#" class="small">send</a></td>
<td class="small"><a href="MessageSend_p.html?hash=#[hash]#" class="small">m</a>&nbsp;&nbsp;&nbsp;
<a href="ViewProfile.html?hash=#[hash]#" class="small">p</a>#(updatedProfile)#&nbsp;::*#(/updatedProfile)#&nbsp;&nbsp;
<a href="http://#[fullname]#.yacy/Wiki.html" class="small">w</a></td>
<td class="small"><a href="http://www.#[fullname]#.yacy" class="small">#[shortname]#</a></td>
#(complete)#
::
@ -78,8 +77,8 @@
<td class="small" align="right"><NOBR>#[uptime]#</NOBR></td>
<td class="small" align="right">#[links]#</td>
<td class="small" align="right">#[words]#</td>
<td class="small" align="right">#(acceptcrawl)#no::yes#(/acceptcrawl)#</td>
<td class="small" align="right">#(acceptindex)#no::yes#(/acceptindex)#</td>
<td class="small" align="right">#(acceptcrawl)#-::C#(/acceptcrawl)#&nbsp;/&nbsp;
#(acceptindex)#-::I#(/acceptindex)#</td>
<td class="small" align="right">#[sI]#</td>
<td class="small" align="right">#[sU]#</td>
<td class="small" align="right">#[rI]#</td>

@ -45,6 +45,8 @@
import java.util.Enumeration;
import java.util.HashMap;
import java.util.HashSet;
import java.io.IOException;
import de.anomic.http.httpHeader;
import de.anomic.server.serverObjects;
@ -53,6 +55,8 @@ import de.anomic.server.serverDate;
import de.anomic.yacy.yacyClient;
import de.anomic.yacy.yacyCore;
import de.anomic.yacy.yacySeed;
import de.anomic.yacy.yacyNewsRecord;
import de.anomic.yacy.yacyNewsPool;
public class Network {
@ -211,6 +215,22 @@ public class Network {
yacyCore.peerActions.updateMySeed();
yacyCore.seedDB.addConnected(yacyCore.seedDB.mySeed);
}
// find updated information using YaCyNews
HashSet updatedProfile = new HashSet();
int availableNews = yacyCore.newsPool.size(yacyNewsPool.INCOMING_DB);
if (availableNews > 500) availableNews = 500; // scan at most 500 news records
yacyNewsRecord record;
try {
for (int c = 0; c < availableNews; c++) {
record = yacyCore.newsPool.get(yacyNewsPool.INCOMING_DB, c);
if (record.category().equals("prfleupd")) {
updatedProfile.add(record.originator());
}
}
} catch (IOException e) {}
boolean dark = true;
yacySeed seed;
boolean complete = post.containsKey("ip");
@ -228,7 +248,8 @@ public class Network {
prop.put("table_list_"+conCount+"_dark", 2);
} else {
prop.put("table_list_"+conCount+"_dark", ((dark) ? 1 : 0) ); dark=!dark;
}
}
prop.put("table_list_"+conCount+"_updatedProfile", (((updatedProfile.contains(seed.hash))) ? 1 : 0) );
long links, words;
try {
links = Long.parseLong(seed.get("LCount", "0"));

@ -36,7 +36,29 @@
<p>
#(table)#
<p>
This is the news system (currently under testing).
This is the YaCyNews system (currently under testing).
</p>
<p>
The news service is controlled by several entry points:
<ul>
<li>A crawl start with remote indexing will automatically create a news entry.
Other peers may use this information to prevent double-crawls from the same start point.
A table of recently started crawls is shown on the Index Create page.</li>
<li>A change in the personal profile will create a news entry. You can see recent changes to
profile entries on the Network page, where such a change is marked with a '*' beside the 'P' (profile) selector (see the sketch after this list).</li>
</ul><br>
More news services will follow.</p>
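<p>
For illustration, an entry point publishes its news the same way the crawl-start code in this commit does: build an attribute map and hand a record to the news pool. A minimal sketch for the profile-change case, assuming "prfleupd" as the category (it is what Network.java scans for), a caller-supplied attribute map, and a defensively declared throws clause:
</p>

public class ProfileNewsSketch {
    public static void publishProfileUpdate(java.util.Map profileAttributes) throws java.io.IOException {
        // same pattern as the "crwlstrt" publication in IndexCreate_p.java;
        // the content of the attribute map is an assumption here
        de.anomic.yacy.yacyCore.newsPool.publishMyNews(
            new de.anomic.yacy.yacyNewsRecord("prfleupd", profileAttributes));
    }
}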
<p>
Above you can see four menus:
<ul>
<li><b>Incoming News</b>: the latest news that arrived at your peer.
Only these news entries are used to display the specific news services explained above.
You can process a news entry with a button on the page to remove its appearance from the IndexCreate and Network pages.</li>
<li><b>Processed News</b>: simply an archive of incoming news entries that you removed by processing.</li>
<li><b>Outgoing News</b>: news entries that you have created. These entries are currently being broadcast to other peers;
you can stop the broadcast if you want.</li>
<li><b>Published News</b>: your news entries that have been broadcast sufficiently or that you have removed from the broadcast list.</li>
</ul><br>
</p>
::
<p>

@ -296,7 +296,7 @@ public final class serverCodings {
public static void main(String[] s) {
serverCodings b64 = new serverCodings(true);
if (s.length == 0) {System.out.println("usage: -[ec|dc|es|ds] <arg>"); System.exit(0);}
if (s.length == 0) {System.out.println("usage: -[ec|dc|es|ds|s2m] <arg>"); System.exit(0);}
if (s[0].equals("-ec")) {
// generate a b64 encoding from a given cardinal
System.out.println(b64.encodeBase64Long(Long.parseLong(s[1]), 4));
@ -313,6 +313,10 @@ public final class serverCodings {
// generate a b64 decoding from a given string
System.out.println(b64.decodeBase64String(s[1]));
}
if (s[0].equals("-s2m")) {
// convert a property string into a map and print it
System.out.println(string2map(s[1]).toString());
}
}
}
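The string2map helper itself is not part of this diff. A minimal sketch of such a parser, assuming the input is a comma-separated list of key=value pairs (the actual format handled by serverCodings.string2map may differ):

import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class String2MapSketch {
    // split "k1=v1,k2=v2,..." into a Map; tokens without '=' are skipped
    public static Map string2map(String s) {
        Map map = new HashMap();
        if (s == null) return map;
        StringTokenizer st = new StringTokenizer(s, ",");
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            int pos = token.indexOf('=');
            if (pos > 0) map.put(token.substring(0, pos).trim(), token.substring(pos + 1).trim());
        }
        return map;
    }

    public static void main(String[] args) {
        // example: java String2MapSketch "cat=crwlstrt,intention=test"
        System.out.println(string2map(args[0]).toString());
    }
}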

@ -42,6 +42,7 @@
package de.anomic.yacy;
import java.io.IOException;
import de.anomic.server.serverCodings;
public class yacyNewsAction implements yacyPeerAction {
@ -58,8 +59,14 @@ public class yacyNewsAction implements yacyPeerAction {
String decodedString = de.anomic.tools.crypt.simpleDecode(recordString, "");
yacyNewsRecord record = new yacyNewsRecord(decodedString);
System.out.println("### news arrival from peer " + peer.getName() + ", decoded=" + decodedString + ", record=" + recordString + ", news=" + record.toString());
String cre1 = (String) serverCodings.string2map(decodedString).get("cre");
String cre2 = (String) serverCodings.string2map(record.toString()).get("cre");
if ((cre1 == null) || (cre2 == null) || (!(cre1.equals(cre2)))) {
System.out.println("### ERROR - cre are not equal: cre1=" + cre1 + ", cre2=" + cre2);
return;
}
try {
this.pool.enqueueIncomingNews(record);
synchronized (pool) {this.pool.enqueueIncomingNews(record);}
} catch (IOException e) {e.printStackTrace();}
}

@ -145,4 +145,8 @@ public class yacyNewsRecord {
public Map attributes() {
return attributes;
}
public static void main(String[] args) {
// self-test: parse a news record from its string form and print it back
System.out.println((new yacyNewsRecord(args[0])).toString());
}
}

@ -459,6 +459,17 @@ public class yacySeedDB {
return get(hash, seedPassiveDB);
}
public yacySeed getPotential(String hash) {
return get(hash, seedPotentialDB);
}
public yacySeed get(String hash) {
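// look up the hash in every seed database: connected first, then disconnected, then potential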
yacySeed seed = getConnected(hash);
if (seed == null) seed = getDisconnected(hash);
if (seed == null) seed = getPotential(hash);
return seed;
}
public yacySeed lookupByName(String peerName) {
// reads a seed by searching by name
