hold a date for each URL, recording when that URL was first seen. On
recrawl, this date overwrites the URL's modification date whenever the
first-seen date is earlier than the latest document date. This is
necessary because many content management systems stamp every document
with the current date. With the firstSeen database it is possible to
approximate the real creation date of a document, provided the crawler
visits the same domain frequently. As a result, search results ordered
by date are of much better quality, which makes YaCy more useful as a
search agent for the latest news.
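
The rule described above can be sketched in a few lines of Java. This is a minimal illustration, not the YaCy implementation: the class FirstSeenSketch, the in-memory map, and the method effectiveDate are hypothetical names chosen for this example, and keeping the earliest value in setFirstSeenTime is an assumption about the intended behaviour.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: an in-memory stand-in for a first-seen table.
final class FirstSeenSketch {

    // URL hash -> earliest known time in epoch milliseconds
    private final Map<String, Long> firstSeen = new ConcurrentHashMap<>();

    // Store a time for the URL hash; an earlier stored value is never overwritten (assumption).
    void setFirstSeenTime(String urlHash, long timeMillis) {
        firstSeen.merge(urlHash, timeMillis, Math::min);
    }

    // Date to use for sorting: the earlier of the first-seen date and the reported document date.
    long effectiveDate(String urlHash, long documentDateMillis) {
        Long seen = firstSeen.get(urlHash);
        return seen == null ? documentDateMillis : Math.min(seen, documentDateMillis);
    }

    public static void main(String[] args) {
        FirstSeenSketch table = new FirstSeenSketch();
        long firstCrawl = 1_000L; // time of the first crawl
        long recrawl = 9_000L;    // a CMS reports "now" as the document date on recrawl
        table.setFirstSeenTime("hypotheticalUrlHash", firstCrawl);
        table.setFirstSeenTime("hypotheticalUrlHash", recrawl);
        System.out.println(table.effectiveDate("hypotheticalUrlHash", recrawl));
    }
}
```

In this sketch the recrawl does not move the stored date forward, so the date printed for the recrawled document is the time of the first crawl (1000) rather than the date reported by the content management system.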
#(cleanuprwi)#::<input type="checkbox" name="deleteRWI" id="deleteRWI" onclick="x=document.getElementById('deleteRWI').checked;c='disabled';if(x){c='';};document.getElementById('deletecomplete').disabled=c;"/> Delete RWI Index (DHT transmission words)<br/>#(/cleanuprwi)#
#(cleanupcitation)#::<input type="checkbox" name="deleteCitation" id="deleteCitation" onclick="x=document.getElementById('deleteCitation').checked;c='disabled';if(x){c='';};document.getElementById('deletecomplete').disabled=c;"/> Delete Citation Index (linking between URLs)<br/>#(/cleanupcitation)#
<input type="checkbox" name="deleteFirstSeen" id="deleteFirstSeen" disabled="disabled"/> Delete First-Seen Date Table<br/>
<input type="checkbox" name="deleteCache" id="deleteCache" disabled="disabled"/> Delete HTTP &amp; FTP Cache<br/>
<input type="checkbox" name="deleteCrawlQueues" id="deleteCrawlQueues" disabled="disabled"/> Stop Crawler and delete Crawl Queues<br/>
<dt>Hash:</dt><dd><a href="solr/select?defType=edismax&amp;start=0&amp;rows=3&amp;core=collection1&amp;wt=html&amp;q=id:%22#[hash]#%22">#[hash]#</a> (click this for full metadata)</dd>
setFirstSeenTime(url.hash(), Math.min(document.getDate().getTime(), System.currentTimeMillis())); // should exist already in the index at this time, but just to make sure
// write the edges to the citation reference index