- more hacks to check that files are closed propertly and filehandles do not exist after files are closed.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5757 6c8d7289-2bf4-0310-a012-ef5d649a1542
because of the strongly enhanced indexing speed when using the new IndexCell RWI data structures (> 2000PPM on my notebook), it is now necessary to control the crawling speed depending on the response time of the target server (which is also YaCy in case of some intranet indexing use cases).
The latency factor in crawl delay times is derived from the time that a target hosts takes to answer on http requests. For internet domains, the crawl delay is a minimum of twice the response time, in intranet cases the delay time is now a halve of the response time.
- added API to monitor the latency times of the crawler:
a new api at /api/latency_p.xml returns the current response times of domains, the time when the domain was accessed by the crawler the last time and many more attributes.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5733 6c8d7289-2bf4-0310-a012-ef5d649a1542
This is the start of a testing phase for IndexCell data structure which will replace
the collections and caching strategy. IndexCall creation and maintenance is fast, has
no caching overhead, very low IO load and is the basis for the next data structure,
index segments.
IndexCell files are stored at DATA/<network>/TEXT/RICELL
With this commit still the old data structures are used, until a flag in yacy.conf is set.
To switch to the new data structure, set
useCell = true
in yacy.conf. Then you will have no access any more to TEXT/RICACHE and TEXT/RICOLLECTION
This code is still bleeding-edge development. Please do not use the new data structure for
production now. Future versions may have changed data types, or other storage locations.
The next main release will have a migration feature for old data structures.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5724 6c8d7289-2bf4-0310-a012-ef5d649a1542
this option was never used and there is also no use to set other columns but the first as the primary key. as a result, access methods to the key do not need to compute key positions, and they work faster.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5711 6c8d7289-2bf4-0310-a012-ef5d649a1542
The speed of the kelondro indexing class ObjectIndexCache can be compared with Javas standard TreeMap with the main method in IntegerHandleIndex. The result is, that the kelondro indexing needs only 1/5 of the memory that TreeMap uses! In exchange, the kelondro classes are slower than TreeMap, about four (!) times slower. However, this is not so bad because the better use of the memory is a strong advantage and makes it possible that YaCy can maintain such a large number of document (> 50 million) in one peer.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5705 6c8d7289-2bf4-0310-a012-ef5d649a1542
Java standard classes provide a Map Interface, that has a put() method that returns the object that was replaced by the object that was the argument of the put call. The kelondro ObjectIndex defined a put method in the same way, that means it also returned the previous value of the Entry object before the put call. However, this value was not used by the calling code in the most cases. Omitting a return of the previous value would cause some performance benefit. This change implements a put method that does not return the previous value to reflect the common use. Omitting the return of previous values will cause some benefit in performance. The functionality to get the previous value is still maintained, and provided with a new 'replace' method.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5700 6c8d7289-2bf4-0310-a012-ef5d649a1542
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -incollection DATA/INDEX/freeworld/TEXT/RICOLLECTION used.dump
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -diffurlcol DATA/INDEX/freeworld/TEXT used.dump diffurlcol.dump
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -export DATA/INDEX/freeworld/TEXT xml urls.xml diffurlcol.dump
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -delete DATA/INDEX/freeworld/TEXT diffurlcol.dump
The export-feature is optional, the purpose of that function is to provide a back-up function for URLs to be deleted. The export function can also be used to create html files with embedded links and simple text-files. Simply replace the 'xml' word with 'html' or 'text'. The last argument in the cann, the diffurlcol.dump value, can also be omitted. This will cause that the complete URL database is exported. This is an alternative to the Web-Interface based export function.
The delete-feature is the only destructive method of the four presented here. Please use it with care. It is better to make a back-up of the url database files before starting the deletion.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5694 6c8d7289-2bf4-0310-a012-ef5d649a1542
to use this, you must user the -incollection command before (see SVN 5687) and you need a
used.dump file that has been produced with that process.
Now you can use that file, to do a URL-hash compare with the urls in the URL-DB. To do that, execute
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -diffurlcol DATA/INDEX/freeworld/TEXT used.dump diffurlcol.dump
or use different names for the dump files or more memory.
As a result, you get the file diffurlcol.dump which contains all the url hashes that occur in the URL database, but not in the collections.
The file has the format
{hash-12}*
that means: 12 byte long hashes are listed without any separation.
The next step could be to process this file and delete all these URLs with the computed hashes, or to export them before deletion.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5692 6c8d7289-2bf4-0310-a012-ef5d649a1542
- fix for problem in httpdFileHandler: mising close of open Files if tempate cache was disabled
- more memory for DHT selection required
- stub for URL reference hash statistics in index collections
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5682 6c8d7289-2bf4-0310-a012-ef5d649a1542
especially loading of favicons in search results. This is a fix that
affects only searches in intranet/repository configurations.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5670 6c8d7289-2bf4-0310-a012-ef5d649a1542
- check the list and kick out entries with lines that contain not valid urls
- normalize the urls
- remove doubles
- sort the list
- split the list in smaller chunks
This is all done in one process which can be called with a new -sort option
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5655 6c8d7289-2bf4-0310-a012-ef5d649a1542
url lists can also be compressed with gzip
If such a file is handed over to URLAnalysis, the output will also be written as .gz-file
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5652 6c8d7289-2bf4-0310-a012-ef5d649a1542
This can be used like
java -Xmx2000m -cp classes de.anomic.data.URLAnalysis -host DATA/EXPORT/20090224213823.txt
changed als the call method to generate statistics, please use now
java -Xmx2000m -cp classes de.anomic.data.URLAnalysis -stat DATA/EXPORT/20090224213823.txt
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5650 6c8d7289-2bf4-0310-a012-ef5d649a1542
without an order by the primary key. The result is a very fast enumeration of the Eco table data structure. Other table data types are not affected.
The new enumerator is used for the url export function that can be accessed from the online interface (Index Administration -> URL References -> Export). This export should now be much faster, if all url database files are from type Eco
The new enumeration is also used at other functions in YaCy, i.e. the initialization of the crawl balancer and the initialization of YaCy News.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5647 6c8d7289-2bf4-0310-a012-ef5d649a1542
Prevent NPE:
I 2009/02/20 15:15:56 PLASMA check for Session_77.37.19.225:38812#0: 86515 ms alive, stopping thread
I 2009/02/20 15:15:56 PLASMA Closing main socket of thread 'Session_77.37.19.225:38812#0'
E 2009/02/20 15:15:56 SERVER receive interrupted - exception 2 = Socket closed
Exception in thread "Session_77.37.19.225:38812#0" java.lang.NullPointerException
at de.anomic.server.serverCore$Session.run(serverCore.java:623)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5624 6c8d7289-2bf4-0310-a012-ef5d649a1542
With soLinger=true the httpd looses connections
The effect can be seen when crawling the internal repository:
lost connections filled the client process queue until it was full
and no more connections were possible.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5620 6c8d7289-2bf4-0310-a012-ef5d649a1542
- some hacks for less memory usage:
-- less usage of buffer and cache memory in EcoFS
-- buffer allocation on-demand in BufferedIOChunks
-- removed largest ybr idx
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5595 6c8d7289-2bf4-0310-a012-ef5d649a1542
- after a index selection is made, the index is splitted into its vertical components
- from differrent index selctions the splitted components can be accumulated before they are placed into the transmission queue
- each splitted chunk gets its own transmission thread
- multiple transmission threads are started concurrently
- the process can be monitored with the blocking queue servlet
To implement that, a new package de.anomic.yacy.dht was created. Some old files have been removed.
The new index distribution model using a vertical DHT was implemented. An abstraction of this model
is implemented in the new dht package as interface. The freeworld network has now a configuration
of two vertial partitions; sixteen partitions are planned and will be configured if the process is bug-free.
This modification has three main targets:
- enhance the DHT transmission speed
- with a vertical DHT, a search will speed up. With two partitions, two times. With sixteen, sixteen times.
- the vertical DHT will apply a semi-dht for URLs, and peers will receive a fraction of the overall URLs they received before.
with two partitions, the fractions will be halve. With sixteen partitions, a 1/16 of the previous number of URLs.
BE CAREFULL, THIS IS A MAJOR CODE CHANGE, POSSIBLY FULL OF BUGS AND HARMFUL THINGS.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5586 6c8d7289-2bf4-0310-a012-ef5d649a1542
* added possibility to find maximum possible heap size
you can get it via getWin32MaxHeap.bat
this may cause high system load
moreover the found limit is no guarantee for stable startups since it depends on system configuration
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5583 6c8d7289-2bf4-0310-a012-ef5d649a1542
from the forum discussion in
http://forum.yacy-websuche.de/viewtopic.php?p=12612#p12612
The feature will provide two basic entities:
- you can integrate image links which point to your yacy installation anywhere in the web.
the image can be loaded with
<img src="http://<yourpeer>:<yourport>/cytag.png?icon=invisible&nick=<yournickname_or_community_id>&tag=<anything>">
This will place a invisible 1-pixel image. If you change the icon=invisible to icon=redpill, you will see a red pill
Use this, to track your activity in the web.
- you can view your tracks at
http://localhost:8080/Tracks.html
- There is a public api to your tracks at
http://localhost:8080/api/tracks_p.json
which needs authentication
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5581 6c8d7289-2bf4-0310-a012-ef5d649a1542
- fixes to httpd server response header generation
- fixes to a server date computation bug
- new Button in indexControl to view content of url in ViewFile
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5576 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added a configuration option to set the pop up page in Config Appearance
- added a minimized header option to yacyinteractive
- fixed a bug in yacysearch: default values when no query is done
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5569 6c8d7289-2bf4-0310-a012-ef5d649a1542
which can be used to map any path to http://localhost:8080/repository/
This can be used to do an intranet-indexing without the setting of
symbolic links - which does not work in Windows environment.
Now also Windows users can index their file system easily
using the intranet use case.
- fixed some problems with the identification of the alternative
path in DATA/HTDOCS in the httpd file server
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5538 6c8d7289-2bf4-0310-a012-ef5d649a1542
the file server contains a patch that temporary matches all xml paths to api,
that means all interfaces still work. Please adopt all your interfaces to the new path.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5497 6c8d7289-2bf4-0310-a012-ef5d649a1542
in the network configuration, you can configure a whiteliste and a blacklist
- blacklistet clients cannot search
- whitelistet client get never any search restrictions
- for all other clients: apply DoS search restrictions
Please see the example configuriation in yacy.network.freeworld.unit
by default, all clients from localhosts get whitlistet.
If you have your own YaCy network, please put all the IPs of your peers into the whitelist
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5475 6c8d7289-2bf4-0310-a012-ef5d649a1542
- ensuring that ordered indexes stay ordered during remove
- no unnecessary ordering checks
- better test logic in crawl stacker
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5457 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added dummy servlet class, because otherwise the template engine is not triggered.
thats so because the yacy httpd works much faster as normal file server without a scan
of the served pages. Therefore each page with templates must now have a class file associated to it.
- fixed json output format of yacysearch
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5449 6c8d7289-2bf4-0310-a012-ef5d649a1542
The cache is now flushed only for one second every ten seconds. During a crawl the cache
fills up completely, and is only flushed if space is needed for more documents.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5446 6c8d7289-2bf4-0310-a012-ef5d649a1542
this is a migration step to support a new method to store the web index, which will also based on the same data structure. made also a lot of refactoring for a better structuring of the BLOBHeap class.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5430 6c8d7289-2bf4-0310-a012-ef5d649a1542
*) set cgi.allow to true in yacy.conf to enable CGI (CGI is disabled by default)
*) edit cgi.suffixes in yacy.conf if necessary to use additional script types
ATTENTION: This is a rather experimental feature, not all environment variables are set yet.
Only enable CGI if you know what you are doing. Poorly implemented CGI scripts can put a system's integrity at risk!
Implementation of more environment variables and documentation due for the next days.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5428 6c8d7289-2bf4-0310-a012-ef5d649a1542
- introduced blocking queues in CrawlStacker to make it ready for concurrency
- added a second busy thread for the CrawlStacker
The CrawlStacker is multithreaded. It shall be transformed into a BlockingThread in another step.
The concurrency of the stacker will hopefully solve some problems with cases where DNS blocks.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5395 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added a write buffer to BLOBHeap
- modified the BLOBBuffer (is now only to buffer non-compressed content)
- added content compression to the HTCache
The new read cache will decrease the start/initialization time of BLOB files,
like the HTCache, RobotsTxt and other BLOBHeap structures.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5386 6c8d7289-2bf4-0310-a012-ef5d649a1542
- modified and activated write buffer
- increased cache flush factor
- fixed a problem with deadlocking of indexing process
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5382 6c8d7289-2bf4-0310-a012-ef5d649a1542
- during the user types search queries, the local database is searched
- results are presented interactively
This was implemented using a new JSON result format for search results in YaCy
- added JSON as file format for servlets
- refactoring of current search servlets (xml and html)
- added JSON output format for search results
- added AJAX-based search page, that uses the yacysearch.json selrvlet to print results as a query is typed
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5373 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added javadoc for new concurrent intialization in kelondroBytesLongMap
- switched default value for commons storage to false
- version step
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5361 6c8d7289-2bf4-0310-a012-ef5d649a1542
- applied knowledge about concurrent files stream reading and index processing from the wikimedia reader
to the EcoTable initialization process: the file reader is now concurrent to the index generation
- changed also some initialization processes to avoid some pauses during initialization
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5354 6c8d7289-2bf4-0310-a012-ef5d649a1542
files exported from mediawiki using the xml schema according to
http://www.mediawiki.org/xml/export-0.3/
can be processed to be viewed in a YaCy servlet.
To acces such a file, place it into
DATA/HTCACHE/mediawiki/
i.e. the export from german wikipedia would be:
DATA/HTCACHE/mediawiki/wikipedia.de.xml
This file can then be accessed using the URL
http://localhost:8080/mediawiki_p.html?dump=wikipedia.de.xml&title=YaCy
if this is done the first time, an index file is created
(for this case: more than 4 million lines must be written, this takes about 15 minutes)
Then try the same url again.
- enhanced also the md5 computation speed
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5352 6c8d7289-2bf4-0310-a012-ef5d649a1542
- removed never-used secondary crawl depth
- added a must-not-match filter that can be used to exclude urls from a crawl
- added stub for crawl tags which will be used to identify search results that had been produced from specific crawls
please update the yacybar: replace property name 'crawlFilter' with 'mustmatch'.
Additionally, a new parameter named 'mustnotmatch' can be used, which should be by default the empty sring (match-never)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5342 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added hash to HTCACHE storage files which will make it possible to join separate caches by just copying files
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5329 6c8d7289-2bf4-0310-a012-ef5d649a1542
- do not display indexed html (solves xss issues)
the single words are analyzed for already marked parts. this is needed to avoid false encoding of the marker (<b>) tags.
- improved speed for existing routine
heavy used regex pattern are precompiled now
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5322 6c8d7289-2bf4-0310-a012-ef5d649a1542
- for redirector and remote crawling place crawling url on notice queue instead of direct enqueueing in crawler queue
- when a request to a remote crawl provider fails, remove the peer from the network to prevent that the url fetcher gets stuck another time again
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5320 6c8d7289-2bf4-0310-a012-ef5d649a1542
- implemented vertical DHT acceptance ("my own DHT") to accept new targets
- added new target computation for global search: addresses vertical targets also
- enhanced remote crawling: collection of remote crawl urls if queue has less than 100 entries (was: 0 entries)
- better performance value computations for PPM selection in network configuration
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5319 6c8d7289-2bf4-0310-a012-ef5d649a1542
- two different computations (but mathematical equivalent) of the DHT distance had been consolidated
- moved from 0.0 .. 1.0 double-range position computation to 0 .. Long.Max range for DHT targets
- added fast Long - to - hash computation
- high-precision target computation of gaps for new peers
- added new target computation for horizontal and vertical DHT targets (not yet in use)
- old horizontal-only DHT targets will be upwards compatible to new horizontal and vertical DHT positions
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5318 6c8d7289-2bf4-0310-a012-ef5d649a1542
with index.storeCommons=false all currently stored commons are deleted!
Default is now 'true', but in future full releases it will be switched to 'false'
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5315 6c8d7289-2bf4-0310-a012-ef5d649a1542
- returns high/medium/low disk space
- pauses crawling on medium disk space
- disables index receive on low disk space
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5310 6c8d7289-2bf4-0310-a012-ef5d649a1542
*) fixed display of values of information for which part of YaCy (crawler, proxy, ...) blacklist is activated for
*) replaced regular put() with putXML() in several cases
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5305 6c8d7289-2bf4-0310-a012-ef5d649a1542
*) a blacklist will only be created if no blacklist with same name exists (some refactoring has been necessary for this)
*) further minor fixes
*) to be continued...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5301 6c8d7289-2bf4-0310-a012-ef5d649a1542
The old process used a not really efficient way to detect html encoding strings in texts.
All calling methods had been adoped to call the new class in an enhanced way with less parameters.
Many classes in interfaces used a XML encoding only (instead of full html conversion from unicode to html); this behavior was not changed with this commit but should be controlled again since it points out possible XSS leaks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5295 6c8d7289-2bf4-0310-a012-ef5d649a1542
- removed the now superfluous HT storage thread
- reduced number of file decompression by shifting the compression moment to the future
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5286 6c8d7289-2bf4-0310-a012-ef5d649a1542
- entries are buffered and written as stream with many entries at once (saves many IO accesses)
- entries are compressed with gzip: increases capacity of cache
- concurrency for stream-writing and compression: all writings to the cache are non-blocking
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5284 6c8d7289-2bf4-0310-a012-ef5d649a1542