- the file generation logic is slightly different: files may now have only a maximum size of one gigabyte and a maximum age of one month.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5773 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added automatic http client reset. this was necessary because excessive intranet crawling caused deadlocks. this hack solved the problem.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5768 6c8d7289-2bf4-0310-a012-ef5d649a1542
This is the start of a testing phase for IndexCell data structure which will replace
the collections and caching strategy. IndexCall creation and maintenance is fast, has
no caching overhead, very low IO load and is the basis for the next data structure,
index segments.
IndexCell files are stored at DATA/<network>/TEXT/RICELL
With this commit still the old data structures are used, until a flag in yacy.conf is set.
To switch to the new data structure, set
useCell = true
in yacy.conf. Then you will have no access any more to TEXT/RICACHE and TEXT/RICOLLECTION
This code is still bleeding-edge development. Please do not use the new data structure for
production now. Future versions may have changed data types, or other storage locations.
The next main release will have a migration feature for old data structures.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5724 6c8d7289-2bf4-0310-a012-ef5d649a1542
this option was never used and there is also no use to set other columns but the first as the primary key. as a result, access methods to the key do not need to compute key positions, and they work faster.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5711 6c8d7289-2bf4-0310-a012-ef5d649a1542
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -incollection DATA/INDEX/freeworld/TEXT/RICOLLECTION used.dump
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -diffurlcol DATA/INDEX/freeworld/TEXT used.dump diffurlcol.dump
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -export DATA/INDEX/freeworld/TEXT xml urls.xml diffurlcol.dump
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -delete DATA/INDEX/freeworld/TEXT diffurlcol.dump
The export-feature is optional, the purpose of that function is to provide a back-up function for URLs to be deleted. The export function can also be used to create html files with embedded links and simple text-files. Simply replace the 'xml' word with 'html' or 'text'. The last argument in the cann, the diffurlcol.dump value, can also be omitted. This will cause that the complete URL database is exported. This is an alternative to the Web-Interface based export function.
The delete-feature is the only destructive method of the four presented here. Please use it with care. It is better to make a back-up of the url database files before starting the deletion.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5694 6c8d7289-2bf4-0310-a012-ef5d649a1542
to use this, you must user the -incollection command before (see SVN 5687) and you need a
used.dump file that has been produced with that process.
Now you can use that file, to do a URL-hash compare with the urls in the URL-DB. To do that, execute
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -diffurlcol DATA/INDEX/freeworld/TEXT used.dump diffurlcol.dump
or use different names for the dump files or more memory.
As a result, you get the file diffurlcol.dump which contains all the url hashes that occur in the URL database, but not in the collections.
The file has the format
{hash-12}*
that means: 12 byte long hashes are listed without any separation.
The next step could be to process this file and delete all these URLs with the computed hashes, or to export them before deletion.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5692 6c8d7289-2bf4-0310-a012-ef5d649a1542
- fix for problem in httpdFileHandler: mising close of open Files if tempate cache was disabled
- more memory for DHT selection required
- stub for URL reference hash statistics in index collections
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5682 6c8d7289-2bf4-0310-a012-ef5d649a1542