java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -incollection DATA/INDEX/freeworld/TEXT/RICOLLECTION used.dump
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -diffurlcol DATA/INDEX/freeworld/TEXT used.dump diffurlcol.dump
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -export DATA/INDEX/freeworld/TEXT xml urls.xml diffurlcol.dump
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -delete DATA/INDEX/freeworld/TEXT diffurlcol.dump
The export-feature is optional, the purpose of that function is to provide a back-up function for URLs to be deleted. The export function can also be used to create html files with embedded links and simple text-files. Simply replace the 'xml' word with 'html' or 'text'. The last argument in the cann, the diffurlcol.dump value, can also be omitted. This will cause that the complete URL database is exported. This is an alternative to the Web-Interface based export function.
The delete-feature is the only destructive method of the four presented here. Please use it with care. It is better to make a back-up of the url database files before starting the deletion.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5694 6c8d7289-2bf4-0310-a012-ef5d649a1542
to use this, you must user the -incollection command before (see SVN 5687) and you need a
used.dump file that has been produced with that process.
Now you can use that file, to do a URL-hash compare with the urls in the URL-DB. To do that, execute
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -diffurlcol DATA/INDEX/freeworld/TEXT used.dump diffurlcol.dump
or use different names for the dump files or more memory.
As a result, you get the file diffurlcol.dump which contains all the url hashes that occur in the URL database, but not in the collections.
The file has the format
{hash-12}*
that means: 12 byte long hashes are listed without any separation.
The next step could be to process this file and delete all these URLs with the computed hashes, or to export them before deletion.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5692 6c8d7289-2bf4-0310-a012-ef5d649a1542
- fix for problem in httpdFileHandler: mising close of open Files if tempate cache was disabled
- more memory for DHT selection required
- stub for URL reference hash statistics in index collections
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5682 6c8d7289-2bf4-0310-a012-ef5d649a1542
especially loading of favicons in search results. This is a fix that
affects only searches in intranet/repository configurations.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5670 6c8d7289-2bf4-0310-a012-ef5d649a1542
- check the list and kick out entries with lines that contain not valid urls
- normalize the urls
- remove doubles
- sort the list
- split the list in smaller chunks
This is all done in one process which can be called with a new -sort option
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5655 6c8d7289-2bf4-0310-a012-ef5d649a1542
url lists can also be compressed with gzip
If such a file is handed over to URLAnalysis, the output will also be written as .gz-file
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5652 6c8d7289-2bf4-0310-a012-ef5d649a1542
This can be used like
java -Xmx2000m -cp classes de.anomic.data.URLAnalysis -host DATA/EXPORT/20090224213823.txt
changed als the call method to generate statistics, please use now
java -Xmx2000m -cp classes de.anomic.data.URLAnalysis -stat DATA/EXPORT/20090224213823.txt
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5650 6c8d7289-2bf4-0310-a012-ef5d649a1542
without an order by the primary key. The result is a very fast enumeration of the Eco table data structure. Other table data types are not affected.
The new enumerator is used for the url export function that can be accessed from the online interface (Index Administration -> URL References -> Export). This export should now be much faster, if all url database files are from type Eco
The new enumeration is also used at other functions in YaCy, i.e. the initialization of the crawl balancer and the initialization of YaCy News.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5647 6c8d7289-2bf4-0310-a012-ef5d649a1542