- added test migration method to migrate the old LURL to a new LURL
the new LURL will be splitted into different tables for each month
this solves several problems:
- the biggest table in YaCy is splitted in different parts and can
also be managed in filesystems that are limited to 2GB
- the oldest entries can easily be identified, used for re-crawl und
deleted
- The complete database can be limited to a specific size (as wanted many times)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2755 6c8d7289-2bf4-0310-a012-ef5d649a1542
- serverFileUtils.java:
-- adding methods to copy from stream to writer and readers to writers
-- moving httpc writeX methods into serverFileUtils class
- serverCharBuffer.java: removing inheritance from Writer class
- replacing htmlFilterOutputStream by htmlFilterWriter class which handles
content as char stream
- htmlFilterContentTransformer.java: deactivating getText mode
(still needs to be migrated to use char streams instead of byte streams)
- changes in several classes to use htmlFilterWriter instead of htmlFilterOutputStream
- changes in Scraper and Transformer classes to operate on chars instead of bytes
- httpdProxyHandler.java: bugfix. clientTimeout setting was missing in config file
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2617 6c8d7289-2bf4-0310-a012-ef5d649a1542
* replaced kelondroTree for NURLs by kelondroFlex
* replaced kelondroTree for EURLs by kelondroFlex
take care, may be very buggy
please finish crawls before updating. crawls will be lost.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2452 6c8d7289-2bf4-0310-a012-ef5d649a1542
* cleaned up code
* added unit test code
* migrated ranking RCI computation to kelondroFlex and kelondroCollectionIndex
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2414 6c8d7289-2bf4-0310-a012-ef5d649a1542
* adopted all code to use the declaration form of kelondroRow
* fixed a bug in kelondroRow which caused wrong parsing of encoding type
* the bug caused bad database behaviour in new indexCollection data structure.
because of this bug, all test databases are now already void. A new database is created
* the kelondroFlexTable and indexCollection data structures now store a declaration of the row definition
into a properties file along the database files.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2375 6c8d7289-2bf4-0310-a012-ef5d649a1542
This shall be seen as an experiment to exclude all cases where
there could be a DNS lookup during URL comparisment.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2290 6c8d7289-2bf4-0310-a012-ef5d649a1542
- implemented hand-over of new word index attributes during remote search
- implemented word-distance computation during search
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1382 6c8d7289-2bf4-0310-a012-ef5d649a1542
the kelondro database needs specific information about the order of
base64-encoded keys. Since no other package depends on base64
(only the httpd uses base64 for encryption, but does not need to encode these strings)
it is good to move base64 encoding to the new ordering classes in kelondro.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1284 6c8d7289-2bf4-0310-a012-ef5d649a1542
(no more NULL values are returned, instead, an IOException is thrown)
- removed ugly damagedURLS implementation from plasmaCrawlLURL.java
(this inserted a static value into the Object which is not really a good style)
- re-coded damagedURLS collection in yacy.java by catching an exception and evaluating the exception message
to do:
- the urldbcleanup feature must be re-tested
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1200 6c8d7289-2bf4-0310-a012-ef5d649a1542
various checks like the blacklist check or the robots.txt disallow check are now
done by a separate thread to unburden the indexer thread(s)
TODO: maybe we have to introduce a threadpool here if it turn out that this single
thread is a bottleneck because of the time consuming robots.txt downloads
*) improved index transfer
The index selection and transmission is done in parallel now to improve index
transfer performance.
TODO: maybe we could speed up performance by unsing multiple transmission threads in
parallel instead of only a single one.
*) gzip encoded post requests
it is now configureable if a gzip encoded post request should be send on
intex transfer/distribution
*) storage Peer (very experimentell and not optimized yet)
Now it's possible to send the result of the yacy indexer thread to a remote peer
istead of storing the indexed words locally.
This could be done by setting the property "storagePeerHash" in the yacy config file
- Please note that if the index transfer fails, the index ist stored locally.
- TODO: currently this index transfer is done by the indexer thread.
To seedup the indexer
a) this transmission should be done in parallel and
b) multiple chunks should be bundled and transfered together
*) general performance improvements
- better memory cleanup after http request processing has finished
- replacing some string concatenations with stringBuffers
- replacing BufferedInputStreams with serverByteBuffer
- replacing vectors with arraylists wherever possible
- replacing hashtables with hashmaps wherever possible
This was done because function calls to verctor or hashtable functions
take 3 time longer than calls to functions of arraylists or hashmaps.
TODO: we should take a look on the class serverObject which is inherited from hashmap
Do we realy need a synchronization for this class?
TODO: replace arraylists with linkedLists if random access to the list elements is not needed
*) Robots Parser supports if-modified-since downloads now
If the downloaded robots.txt file is older than 7 days the robots parser tries to
download the robots.txt with the if-modified-since header to avoid unnecessary downloads
if the file was not changed. Additionally the ETag header is used to detect changes.
*) Crawler: better handling of unsupported mimeTypes + FileExtension
*) Bugfix: plasmaWordIndexEntity was not closed correctly in
- query.java
- plasmaswitchboard.java
*) function minimizeUrlDB added to yacy.java
this function tests the current urlHashDB for unused urls
ATTENTION: please don't use this function at the moment because
it causes the wordIndexDB to flush all words into the
word directory!
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@853 6c8d7289-2bf4-0310-a012-ef5d649a1542
Crawler StartURLs will now also added to the errorURL-DB if an error occures on this url
*) kelondroStack.java, plasmaSwitchboardQueue.java
Adding method which returns a list of all entries in the queue. This list is used by IndexCreate_p.java
instead of an iterator to display the indexing-list.
Advantages: avoid concurrent modifications of the list while displaying it.
Speedup because now we have to access only one sync function instead of multiple ones
(one for each entry)
*) IndexCreateIndexingQueue_p.java
Using new list() function of plasmaSwitchboardQueue
*) httpdFileHandler.java
If a servelet returns the special value "LOCATION" the httpFileHandler does a Redirection of
the Browser to the URL specified by the servelet. This can e.g. be used when a http get request is
used insead of a post request, but a refresh should not be allowed.
*) IndexCreateWWWLocalQueue_p.html
Now it's possible to delete single entries of the local crawler queue
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@626 6c8d7289-2bf4-0310-a012-ef5d649a1542