- added javadoc for new concurrent intialization in kelondroBytesLongMap
- switched default value for commons storage to false
- version step
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5361 6c8d7289-2bf4-0310-a012-ef5d649a1542
- applied knowledge about concurrent files stream reading and index processing from the wikimedia reader
to the EcoTable initialization process: the file reader is now concurrent to the index generation
- changed also some initialization processes to avoid some pauses during initialization
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5354 6c8d7289-2bf4-0310-a012-ef5d649a1542
files exported from mediawiki using the xml schema according to
http://www.mediawiki.org/xml/export-0.3/
can be processed to be viewed in a YaCy servlet.
To acces such a file, place it into
DATA/HTCACHE/mediawiki/
i.e. the export from german wikipedia would be:
DATA/HTCACHE/mediawiki/wikipedia.de.xml
This file can then be accessed using the URL
http://localhost:8080/mediawiki_p.html?dump=wikipedia.de.xml&title=YaCy
if this is done the first time, an index file is created
(for this case: more than 4 million lines must be written, this takes about 15 minutes)
Then try the same url again.
- enhanced also the md5 computation speed
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5352 6c8d7289-2bf4-0310-a012-ef5d649a1542
- removed never-used secondary crawl depth
- added a must-not-match filter that can be used to exclude urls from a crawl
- added stub for crawl tags which will be used to identify search results that had been produced from specific crawls
please update the yacybar: replace property name 'crawlFilter' with 'mustmatch'.
Additionally, a new parameter named 'mustnotmatch' can be used, which should be by default the empty sring (match-never)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5342 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added hash to HTCACHE storage files which will make it possible to join separate caches by just copying files
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5329 6c8d7289-2bf4-0310-a012-ef5d649a1542
- do not display indexed html (solves xss issues)
the single words are analyzed for already marked parts. this is needed to avoid false encoding of the marker (<b>) tags.
- improved speed for existing routine
heavy used regex pattern are precompiled now
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5322 6c8d7289-2bf4-0310-a012-ef5d649a1542
- for redirector and remote crawling place crawling url on notice queue instead of direct enqueueing in crawler queue
- when a request to a remote crawl provider fails, remove the peer from the network to prevent that the url fetcher gets stuck another time again
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5320 6c8d7289-2bf4-0310-a012-ef5d649a1542
- implemented vertical DHT acceptance ("my own DHT") to accept new targets
- added new target computation for global search: addresses vertical targets also
- enhanced remote crawling: collection of remote crawl urls if queue has less than 100 entries (was: 0 entries)
- better performance value computations for PPM selection in network configuration
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5319 6c8d7289-2bf4-0310-a012-ef5d649a1542
- two different computations (but mathematical equivalent) of the DHT distance had been consolidated
- moved from 0.0 .. 1.0 double-range position computation to 0 .. Long.Max range for DHT targets
- added fast Long - to - hash computation
- high-precision target computation of gaps for new peers
- added new target computation for horizontal and vertical DHT targets (not yet in use)
- old horizontal-only DHT targets will be upwards compatible to new horizontal and vertical DHT positions
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5318 6c8d7289-2bf4-0310-a012-ef5d649a1542
with index.storeCommons=false all currently stored commons are deleted!
Default is now 'true', but in future full releases it will be switched to 'false'
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5315 6c8d7289-2bf4-0310-a012-ef5d649a1542
- returns high/medium/low disk space
- pauses crawling on medium disk space
- disables index receive on low disk space
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5310 6c8d7289-2bf4-0310-a012-ef5d649a1542
*) fixed display of values of information for which part of YaCy (crawler, proxy, ...) blacklist is activated for
*) replaced regular put() with putXML() in several cases
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5305 6c8d7289-2bf4-0310-a012-ef5d649a1542
*) a blacklist will only be created if no blacklist with same name exists (some refactoring has been necessary for this)
*) further minor fixes
*) to be continued...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5301 6c8d7289-2bf4-0310-a012-ef5d649a1542
The old process used a not really efficient way to detect html encoding strings in texts.
All calling methods had been adoped to call the new class in an enhanced way with less parameters.
Many classes in interfaces used a XML encoding only (instead of full html conversion from unicode to html); this behavior was not changed with this commit but should be controlled again since it points out possible XSS leaks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5295 6c8d7289-2bf4-0310-a012-ef5d649a1542