two main changes must be implemented to enable mass remote crawls:
- shift control of robots.txt to the crawl queue (away from the stacker). This is necessary since remote
crawls can contain unchecked urls. Each peer must check the robots.txt itself so that it cannot be misused
as a crawl agent for unwanted file retrieval
- implement new index files that control double-check of remotely crawled urls
After removal of robots.txt checking from the stacker threads, multi-threading of this process is pointless,
so it has been removed. The thread pools for the crawl threads have also been removed, since
creation of these threads is not resource-consuming; for a detailed explanation see svn 4106
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4181 6c8d7289-2bf4-0310-a012-ef5d649a1542
and replaced old first hash computation by a new method that tries to find a gap in the current dht
to do this, it is necessary that the network bootstrapping is done before the own hash is computed
this made further redesigns in peer initialization order necessary
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4117 6c8d7289-2bf4-0310-a012-ef5d649a1542
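The idea of "finding a gap in the current dht" from the entry above can be pictured with a minimal sketch. It assumes that the positions of the known peers are already available as non-negative long values on a ring of size ringSize (which is why bootstrapping has to come first). The class name, method name and the midpoint rule are hypothetical and are not taken from the YaCy source.

```java
import java.util.Arrays;

/** Hypothetical sketch: choose a ring position in the middle of the widest gap. */
final class DhtGapFinder {

    /**
     * @param positions known peer positions, each assumed to be in [0, ringSize)
     * @param ringSize  size of the ring; positions wrap around at this value
     * @return the midpoint of the widest gap between neighbouring peers
     */
    static long midpointOfLargestGap(long[] positions, long ringSize) {
        if (positions.length == 0) {
            return 0L; // empty network: any position will do
        }
        long[] sorted = positions.clone();
        Arrays.sort(sorted);

        // start with the wrap-around gap from the last peer back to the first one
        long bestStart = sorted[sorted.length - 1];
        long bestWidth = sorted[0] + ringSize - sorted[sorted.length - 1];

        // compare it with every gap between direct neighbours
        for (int i = 0; i + 1 < sorted.length; i++) {
            long width = sorted[i + 1] - sorted[i];
            if (width > bestWidth) {
                bestWidth = width;
                bestStart = sorted[i];
            }
        }
        return (bestStart + bestWidth / 2) % ringSize;
    }
}
```

The sketch ignores overflow for very large ring sizes and treats all peers as equally important; a real implementation may weigh peers differently.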
search profiling showed that a major amount of time is wasted by computing url hashes. The computation does an intranet-check, which needs a DNS lookup. As a result, each urlhash computation needed 100-200 milliseconds, which delayed remote searches by at least 1 second more than necessary. The solution to this problem is to attach a URL hash to the URL data structure, because that means that the url hash value can be filled after retrieval of the URL from the database. The redesign of the url/urlhash management caused a major redesign of many parts of the software. Some parts that had already been given up were removed during this change to avoid unnecessary maintenance of unused code.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4074 6c8d7289-2bf4-0310-a012-ef5d649a1542
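A minimal sketch of the "hash attached to the URL" idea from the entry above: the hash is either supplied together with the URL (for example from a database row) or computed once on first use and then kept. The class HashedUrl and the hasher parameter are hypothetical illustrations, not the actual YaCy classes, and the sketch is not thread-safe.

```java
import java.util.function.Function;

/** Hypothetical sketch: a URL that carries its hash along. */
final class HashedUrl {
    private final String url;
    private String hash; // null until computed or supplied

    /** URL whose hash is not known yet. */
    HashedUrl(String url) {
        this(url, null);
    }

    /** URL whose hash was stored next to it, so no computation is needed later. */
    HashedUrl(String url, String knownHash) {
        this.url = url;
        this.hash = knownHash;
    }

    String url() {
        return url;
    }

    /** Runs the (possibly slow, DNS-dependent) hasher at most once per object. */
    String hash(Function<String, String> expensiveHasher) {
        if (hash == null) {
            hash = expensiveHasher.apply(url);
        }
        return hash;
    }
}
```

With this shape, code that loads URLs from storage can pass the stored hash to the two-argument constructor and never triggers the slow path.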
- refactoring of kelondroRecords; this class is now divided into
kelondroAbstractRecords, kelondroRecords, kelondroCachedRecords, kelondroHandle and kelondroNode
- better abstraction of kelondroNodes; such nodes may now be created by different classes
- a new Node defining class kelondroEcoRecords defines Nodes that do not need so much allocation and System.arraycopy
- there is less memory transfer on the bus, especially for collection index
- now half of memory needed for web index access
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4024 6c8d7289-2bf4-0310-a012-ef5d649a1542
- no more contact to yacy.net (no remote superseed any more)
- moved superseed file into new network unit definition
- fixed build; includes new network bootstrapping files now
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3922 6c8d7289-2bf4-0310-a012-ef5d649a1542
- cluster definitions can now contain an addition for local ip addresses
- cluster-cluster communication uses the local ip address instead of the global address, if one is given
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3624 6c8d7289-2bf4-0310-a012-ef5d649a1542
automatically acquire release information from download archives
web pages from latest.yacy-forum.net and yacy.net are retrieved and parsed,
the links within are analysed and sorted, and the most recent developer and main
releases are provided as direct download links on the status page if a
more recent version than the current version is available.
This process is done only once during run-time of a peer, to protect our
download archives from DoS by YaCy peers.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3606 6c8d7289-2bf4-0310-a012-ef5d649a1542
the field now contains a time in UTC (instead of a time relative to the local UTC offset)
this fixes a bug in peer selection, where an iteration over all seeds
ordered by lastseen did not work correctly.
Problems may occur because the new meaning of this field may mix with
the different meaning of that field in older peers
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3322 6c8d7289-2bf4-0310-a012-ef5d649a1542
*) added missing private IP-ranges for APIPA/Zeroconf and 172.16.0.0–172.31.255.255
*) Changed some seed-download-errors to warnings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3086 6c8d7289-2bf4-0310-a012-ef5d649a1542
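For reference, the two ranges named in the entry above can be checked as in the following sketch. APIPA/Zeroconf addresses are 169.254.0.0/16, and 172.16.0.0 to 172.31.255.255 is 172.16.0.0/12. The class and method names are hypothetical; this is not the check used in the YaCy source, and it only handles dotted-quad IPv4 strings.

```java
/** Hypothetical sketch: test an IPv4 address against the two newly added ranges. */
final class PrivateRanges {

    /** True for 169.254.0.0/16 (APIPA/Zeroconf) and for 172.16.0.0/12. */
    static boolean isLinkLocalOrPrivate172(String dottedQuad) {
        String[] parts = dottedQuad.split("\\.");
        if (parts.length != 4) {
            return false; // not an IPv4 dotted quad
        }
        int first;
        int second;
        try {
            first = Integer.parseInt(parts[0]);
            second = Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return false;
        }
        if (first == 169 && second == 254) {
            return true; // link-local, assigned without a DHCP server
        }
        // 172.16.x.x up to and including 172.31.x.x
        return first == 172 && second >= 16 && second <= 31;
    }
}
```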
- removed 'deleteComplete' flag; this was used especially for WORDS indexes
- shifted methods from plasmaSwitchboard to plasmaWordIndex
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3051 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added an assortment importer. the old database structures can
be imported with
java -classpath classes yacy -migrateassortments
- modified wordmigration. The indexes from WORDS are now imported
to the collection database. The call is
java -classpath classes yacy -migratewords
(as it was)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3044 6c8d7289-2bf4-0310-a012-ef5d649a1542
This shall be seen as an experiment to exclude all cases where
there could be a DNS lookup during URL comparison.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2290 6c8d7289-2bf4-0310-a012-ef5d649a1542
all yacy-to-yacy communication now sends the <peer-hexhash>.yacyh
virtual domain inside the http 'Host' property field.
This shall enable running a yacy peer on a virtual host.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2074 6c8d7289-2bf4-0310-a012-ef5d649a1542
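As an illustration of the entry above, the following sketch builds the head of an HTTP/1.1 request whose Host field carries the <peer-hexhash>.yacyh virtual domain. The class name, the request path and the example hash are placeholders chosen for this sketch; only the ".yacyh" suffix comes from the text.

```java
/** Hypothetical sketch: Host header value for yacy-to-yacy requests. */
final class YacyHostHeader {
    /** Suffix of the virtual domain named in the commit message. */
    static final String VIRTUAL_DOMAIN_SUFFIX = ".yacyh";

    static String hostValue(String peerHexHash) {
        return peerHexHash + VIRTUAL_DOMAIN_SUFFIX;
    }

    /** Builds the head of a minimal GET request that carries this Host value. */
    static String requestHead(String path, String peerHexHash) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + hostValue(peerHexHash) + "\r\n"
             + "\r\n";
    }
}
```

A web server in front of the peer can then route on the Host value, which is what makes running a peer on a virtual host possible.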
- null pointer exception during startup of a robinson-configured peer
- wrong time calculation of default value of re-crawl option
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2005 6c8d7289-2bf4-0310-a012-ef5d649a1542
-Renaming writeandzip to writeandgzip to avoid confusion about type of compression
-Adding new startup message to windows script
-The usual language "enhancements" ;-)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1953 6c8d7289-2bf4-0310-a012-ef5d649a1542
The location is computed from the userAgent string of connecting peers.
Therefore this information is not available right after start-up.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1241 6c8d7289-2bf4-0310-a012-ef5d649a1542
- solving problems with unknown certificates by implementing a dummy trust Manager
- adding https support to robots-parser
- Seed File can now be downloaded from https resources
- adapting plasmaHTCache.java to support https URLs properly
*) URL Normalization
- sub URLs are now normalized properly during indexing
- pointing urlNormalForm function of plasmaParser to htmlFilterContentScraper function
- normalizing URLs which were received by a crawlOrder request
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1024 6c8d7289-2bf4-0310-a012-ef5d649a1542
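What "normalizing" a URL can mean is sketched below with java.net.URI. Which rules the urlNormalForm function actually applies is not stated above, so the rules here (resolve "." and ".." segments, lower-case scheme and host, drop the fragment) are assumptions for illustration, and UrlNormalizer is a hypothetical class.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical sketch of URL normalization; the rules are assumptions. */
final class UrlNormalizer {

    static String normalize(String url) {
        // URI.normalize() removes "." segments and resolves ".." segments
        URI u = URI.create(url).normalize();
        String scheme = u.getScheme() == null ? null : u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost() == null ? null : u.getHost().toLowerCase(Locale.ROOT);
        try {
            // passing null as the last argument drops the fragment
            return new URI(scheme, u.getUserInfo(), host, u.getPort(),
                           u.getPath(), u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Two spellings of the same address then map to one string, so they also map to one index entry.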
to prevent yacy from indexing the response that belongs to the request where
X-YACY-Index-Contro is set to "no-index"
*) Bugfix for Seed-List download via Remote Proxy.
Now the pragma and cache-control http headers of the request are properly set to "no-cache"
See: http://www.yacy-forum.de/viewtopic.php?p=11639#11639
*) Bugfix for http-Proxy
yacy has ignored "no-cache" pragma and cache-control http headers that were sent in requests.
Now, these request headers are evaluated properly
TODO: Missing evaluation of "no-store" request headers
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@971 6c8d7289-2bf4-0310-a012-ef5d649a1542
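The proxy bugfix above comes down to a check like the following sketch: a request asks to bypass the cache if its Pragma or Cache-Control header contains "no-cache". NoCacheCheck is a hypothetical class and the header map is a simplification; like the entry above, it does not evaluate "no-store".

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: does a request forbid answering from the cache? */
final class NoCacheCheck {

    static boolean requestsNoCache(Map<String, String> requestHeaders) {
        for (Map.Entry<String, String> header : requestHeaders.entrySet()) {
            // header names and directive names are case-insensitive
            String name = header.getKey().toLowerCase(Locale.ROOT);
            String value = header.getValue().toLowerCase(Locale.ROOT);
            boolean relevant = name.equals("pragma") || name.equals("cache-control");
            if (relevant && value.contains("no-cache")) {
                return true;
            }
        }
        return false;
    }
}
```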
- remote proxy configuration can now be "really" changed on the fly and takes effect immediately
- adding possibility to disable remote proxy usage for yacy->yacy communication
- adding possibility to disable remote proxy usage for ssl
- restructuring proxy configuration so that it is stored in a single place now
*) Adding possibility to import a foreign word DB (or even more of them in parallel)
at runtime into the peers DB
- this can be done by calling IndexImport_p.html
- ATTENTION: please note that at the moment this thread must be aborted via gui
before a normal server shutdown is done.
- TODO: integrating IndexImport Thread into normal server shutdown
- TODO: Adding possibility to import crawl-queues, etc. from foreign peers
- TODO: removing old import function from yacy.java and calling the new routines instead
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@968 6c8d7289-2bf4-0310-a012-ef5d649a1542
various checks like the blacklist check or the robots.txt disallow check are now
done by a separate thread to unburden the indexer thread(s)
TODO: maybe we have to introduce a threadpool here if it turns out that this single
thread is a bottleneck because of the time consuming robots.txt downloads
*) improved index transfer
The index selection and transmission is done in parallel now to improve index
transfer performance.
TODO: maybe we could speed up performance by using multiple transmission threads in
parallel instead of only a single one.
*) gzip encoded post requests
it is now configurable whether a gzip encoded post request should be sent on
index transfer/distribution
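A minimal sketch of how a post body can be gzip encoded (and decoded again on the receiving side) with the standard java.util.zip classes. GzipBody is a hypothetical helper, not the YaCy implementation; a sender would normally also announce the encoding, for example with a "Content-Encoding: gzip" header.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: gzip a request body and unpack it again. */
final class GzipBody {

    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // closing the GZIPOutputStream writes the gzip trailer
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```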
*) storage Peer (very experimental and not optimized yet)
Now it's possible to send the result of the yacy indexer thread to a remote peer
instead of storing the indexed words locally.
This can be done by setting the property "storagePeerHash" in the yacy config file
- Please note that if the index transfer fails, the index is stored locally.
- TODO: currently this index transfer is done by the indexer thread.
To speed up the indexer
a) this transmission should be done in parallel and
b) multiple chunks should be bundled and transferred together
*) general performance improvements
- better memory cleanup after http request processing has finished
- replacing some string concatenations with stringBuffers
- replacing BufferedInputStreams with serverByteBuffer
- replacing vectors with arraylists wherever possible
- replacing hashtables with hashmaps wherever possible
This was done because function calls to vector or hashtable functions
take 3 times longer than calls to functions of arraylists or hashmaps.
TODO: we should take a look at the class serverObject which is inherited from hashmap
Do we really need a synchronization for this class?
TODO: replace arraylists with linkedLists if random access to the list elements is not needed
*) Robots Parser supports if-modified-since downloads now
If the downloaded robots.txt file is older than 7 days the robots parser tries to
download the robots.txt with the if-modified-since header to avoid unnecessary downloads
if the file was not changed. Additionally the ETag header is used to detect changes.
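The revalidation rule described above can be sketched as follows. The 7-day limit and the use of if-modified-since and ETag come from the text; the class RobotsRevalidation, its method names, and the choice of If-None-Match as the ETag request header are assumptions of this sketch.

```java
import java.net.HttpURLConnection;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: conditional re-download of a cached robots.txt. */
final class RobotsRevalidation {
    static final long MAX_AGE_MILLIS = TimeUnit.DAYS.toMillis(7);

    /** True if the cached copy is older than 7 days. */
    static boolean needsRevalidation(long downloadedAtMillis, long nowMillis) {
        return nowMillis - downloadedAtMillis > MAX_AGE_MILLIS;
    }

    /** Adds the conditional headers; must be called before the connection is opened. */
    static void makeConditional(HttpURLConnection conn, long downloadedAtMillis, String etag) {
        conn.setIfModifiedSince(downloadedAtMillis); // sent as If-Modified-Since
        if (etag != null) {
            conn.setRequestProperty("If-None-Match", etag);
        }
    }

    /** A 304 answer means the cached copy can be kept. */
    static boolean unchanged(int responseCode) {
        return responseCode == HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}
```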
*) Crawler: better handling of unsupported mimeTypes + FileExtension
*) Bugfix: plasmaWordIndexEntity was not closed correctly in
- query.java
- plasmaswitchboard.java
*) function minimizeUrlDB added to yacy.java
this function tests the current urlHashDB for unused urls
ATTENTION: please don't use this function at the moment because
it causes the wordIndexDB to flush all words into the
word directory!
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@853 6c8d7289-2bf4-0310-a012-ef5d649a1542
- Now it's possible to interrupt pending httpc-actions on server shutdown
- this is possible because of a newly introduced registration mechanism for
open sockets
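A registration mechanism of this kind can be sketched as below: every open socket is registered, and on shutdown all registered sockets are closed, which makes threads blocked on them fail with an exception instead of hanging. OpenSocketRegistry is a hypothetical class that works on java.io.Closeable (java.net.Socket implements it); it is not the code from httpc.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: registry of open sockets that can be closed on shutdown. */
final class OpenSocketRegistry {
    // concurrent set, because worker threads register while the shutdown thread iterates
    private final Set<Closeable> open = ConcurrentHashMap.newKeySet();

    void register(Closeable socket) {
        open.add(socket);
    }

    void unregister(Closeable socket) {
        open.remove(socket);
    }

    /** Closes everything that is still registered and returns how many closes succeeded. */
    int closeAll() {
        int closed = 0;
        for (Closeable socket : open) {
            try {
                socket.close();
                closed++;
            } catch (IOException ignored) {
                // shutdown continues even if one socket fails to close
            }
        }
        open.clear();
        return closed;
    }
}
```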
*) yacyCore.java
- blocking peerPing threads can now be interrupted on server shutdown
*) serverCore.java
- restructuring shutdown code
*) error.html
- port number is now set correctly if port forwarding was enabled
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@389 6c8d7289-2bf4-0310-a012-ef5d649a1542
See: http://www.yacy-forum.de/viewtopic.php?t=516
- removing NIO from server/serverCore.java because of massive problems
with socket close issues
*) Adding support for remote port forwarding via sch
@Orbiter: Please take a look into
- hello.java
- server/serverCore.java.publicIP()
- yacy/yacyClient.java.publishMySeed(...)
*) Making startup loading of additional content parsers more failsafe
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@281 6c8d7289-2bf4-0310-a012-ef5d649a1542
optional content parsers, thread pool configuration ...
Please help me test whether everything works correctly.
*) Migration of yacy seedUpload functionality
See: http://www.yacy-forum.de/viewtopic.php?t=256
- new uploaders can now be easily introduced because of a new modular uploader system
- default uploaders are: none, file, ftp
- adding optional uploader for scp
- each uploader provides its own configuration file that will be
included into the settings page using the new template include feature
- Each uploader can define its libx dependencies. If not all needed libs are
available, the uploader is deactivated automatically.
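One common way to implement such an automatic deactivation is to probe for a class from each required library, as in the sketch below. DependencyCheck and the way class names are passed in are assumptions of this sketch; how the uploaders actually declare their libx dependencies is not shown above.

```java
/** Hypothetical sketch: check whether all classes a module depends on can be loaded. */
final class DependencyCheck {

    static boolean allAvailable(String... requiredClassNames) {
        ClassLoader loader = DependencyCheck.class.getClassLoader();
        for (String name : requiredClassNames) {
            try {
                // initialize=false: only test for presence, do not run static initializers
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                return false; // library missing or unusable: deactivate the module
            }
        }
        return true;
    }
}
```

A module would then only be offered on the settings page if allAvailable returns true for its dependency list.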
*) Migration of optional parsers
See: http://www.yacy-forum.de/viewtopic.php?t=198
- Parsers can now also define their libx dependencies
- adding parser for bzip compressed content
- adding parser for gzip compressed content
- adding parser for zip files
- adding parser for tar files
- adding parser to detect the mime-type of a file
this is needed by the bzip/gzip Parser.java
- adding parser for rtf files
- removing extra configuration file yacy.parser
the list of enabled parsers is now stored in the main config file
*) Adding configuration option in the performance dialog to configure
See: http://www.yacy-forum.de/viewtopic.php?t=267
- maxActive / maxIdle / minIdle values for httpd-session-threadpool
- maxActive / maxIdle / minIdle values for crawler-threadpool
*) Changing Crawling Filter behaviour
See: http://www.yacy-forum.de/viewtopic.php?p=2631
*) Replacing some hardcoded strings with the proper constants of the httpHeader class
*) Adding new libs to libx directory. These libs are
- needed by new content parsers
- needed by new optional seed uploader
- needed by SOAP API (which will be committed later)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@126 6c8d7289-2bf4-0310-a012-ef5d649a1542