Various checks, such as the blacklist check or the robots.txt disallow check, are now
done by a separate thread to unburden the indexer thread(s).
TODO: maybe we have to introduce a threadpool here if it turns out that this single
thread is a bottleneck because of the time-consuming robots.txt downloads
*) improved index transfer
The index selection and transmission is done in parallel now to improve index
transfer performance.
TODO: maybe we could improve performance by using multiple transmission threads in
parallel instead of only a single one.
*) gzip encoded post requests
it is now configurable whether a gzip encoded post request should be sent on
index transfer/distribution
*) storage Peer (very experimental and not optimized yet)
Now it's possible to send the result of the yacy indexer thread to a remote peer
instead of storing the indexed words locally.
This can be done by setting the property "storagePeerHash" in the yacy config file
- Please note that if the index transfer fails, the index is stored locally.
- TODO: currently this index transfer is done by the indexer thread.
To speed up the indexer
a) this transmission should be done in parallel and
b) multiple chunks should be bundled and transferred together
*) general performance improvements
- better memory cleanup after http request processing has finished
- replacing some string concatenations with stringBuffers
- replacing BufferedInputStreams with serverByteBuffer
- replacing vectors with arraylists wherever possible
- replacing hashtables with hashmaps wherever possible
This was done because function calls to vector or hashtable functions
take 3 times longer than calls to functions of arraylists or hashmaps.
TODO: we should take a look at the class serverObject which is inherited from hashmap
Do we really need a synchronization for this class?
TODO: replace arraylists with linkedLists if random access to the list elements is not needed
*) Robots Parser supports if-modified-since downloads now
If the downloaded robots.txt file is older than 7 days the robots parser tries to
download the robots.txt with the if-modified-since header to avoid unnecessary downloads
if the file was not changed. Additionally the ETag header is used to detect changes.
*) Crawler: better handling of unsupported mimeTypes + FileExtension
*) Bugfix: plasmaWordIndexEntity was not closed correctly in
- query.java
- plasmaswitchboard.java
*) function minimizeUrlDB added to yacy.java
this function tests the current urlHashDB for unused urls
ATTENTION: please don't use this function at the moment because
it causes the wordIndexDB to flush all words into the
word directory!
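
The robots.txt revalidation described in this entry can be illustrated with a small sketch. It is not the actual robots parser code: the class RobotsRevalidation and its method names are hypothetical, and sending the stored ETag back as If-None-Match is an assumption about how "the ETag header is used to detect changes". Only the 7-day limit and the if-modified-since header come from the text above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of YaCy): decides how to re-request a cached robots.txt. */
class RobotsRevalidation {
    /** Age limit of 7 days, as stated in the commit message. */
    static final long MAX_AGE_MS = 7L * 24 * 60 * 60 * 1000;

    /** True if the cached copy was downloaded more than 7 days ago. */
    static boolean isStale(long downloadedAtMs, long nowMs) {
        return nowMs - downloadedAtMs > MAX_AGE_MS;
    }

    /**
     * Builds the conditional request headers for a stale entry.
     * lastModified is an HTTP-date string, eTag the validator stored with
     * the cached copy; either may be null if the server never sent it.
     * Using If-None-Match for the ETag is an assumption of this sketch.
     */
    static Map<String, String> conditionalHeaders(String lastModified, String eTag) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (lastModified != null) headers.put("If-Modified-Since", lastModified);
        if (eTag != null) headers.put("If-None-Match", eTag);
        return headers;
    }

    /** A 304 Not Modified response means the cached rules can be kept. */
    static boolean canKeepCachedCopy(int statusCode) {
        return statusCode == 304;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long downloaded = now - 8L * 24 * 60 * 60 * 1000; // eight days ago
        if (isStale(downloaded, now)) {
            System.out.println(conditionalHeaders("Sat, 01 Jan 2005 00:00:00 GMT", "\"abc\""));
        }
    }
}
```

If the server answers such a request with 304, no body is transferred and the stored rules stay valid, which is the saving this entry is after.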
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@853 6c8d7289-2bf4-0310-a012-ef5d649a1542
*) Replacing PDFBox 0.7.1 lib with newer version 0.7.2
*) Refactoring of classes httpd/httpc/httpHeaders to
make many methods for httpHeader/Requestline parsing
reusable for new icap implementation
*) adding chunked input stream support
- needed by new icap implementation
- needed by future httpc HTTP/1.1 support
*) httpd.java
- moving all connection property constants to class httpHeader
- moving readHeader function to class httpHeader
- moving parseQuery function to class httpHeader
- moving handleTransparentProxy function to class httpHeader
*) httpHeader.java
- adding new function to parse the http response line
- adding new function to convert http headers to a string that
can be sent to the client
- adding a function that generates a proper url using all parsed
connection properties
*) ICAP Support
- yacy now supports handling of icap response modification requests
- this feature can be used by other icap enabled proxies to contact
yacy as icap server, and to hand over the downloaded content to yacy
for indexing
- functionality was successfully tested with squid 2.5Stable 10 + icap patch
- further icap services e.g. URL filtering based on yacy's blacklists are possible
*) plasmaSwitchboard.java
- htcache entries that are still needed for indexing are now properly registered
as in use after system restart
- extended logging: log message now shows parsing and indexing time for each sb. entry
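
To illustrate what the chunked input stream support in this entry has to do, here is a minimal decoder for the HTTP/1.1 chunked transfer coding. It is a sketch and not the YaCy implementation: the class name ChunkedDecoder and its methods are hypothetical, trailer fields are read and discarded, and error handling is reduced to the bare minimum.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a chunked transfer coding decoder. */
class ChunkedDecoder {
    /** Reads one line terminated by LF and returns it without CR/LF. */
    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\r') continue;
            if (c == '\n') return sb.toString();
            sb.append((char) c);
        }
        throw new EOFException("stream ended inside a line");
    }

    /** Decodes a complete chunked body and returns the payload bytes. */
    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readLine(in);
            int semi = sizeLine.indexOf(';');           // ignore chunk extensions
            if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
            int size = Integer.parseInt(sizeLine.trim(), 16);
            if (size == 0) break;                       // last chunk
            byte[] chunk = new byte[size];
            int off = 0;
            while (off < size) {                        // read() may return less than asked
                int n = in.read(chunk, off, size - off);
                if (n < 0) throw new EOFException("stream ended inside a chunk");
                off += n;
            }
            body.write(chunk, 0, size);
            readLine(in);                               // CRLF that follows the chunk data
        }
        while (!readLine(in).isEmpty()) { /* skip trailer fields */ }
        return body.toByteArray();
    }

    /** Convenience wrapper for tests and demos. */
    static String decodeToString(String wire) {
        try {
            byte[] raw = wire.getBytes(StandardCharsets.ISO_8859_1);
            return new String(decode(new ByteArrayInputStream(raw)), StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeToString("4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"));
    }
}
```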
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@757 6c8d7289-2bf4-0310-a012-ef5d649a1542
See: http://www.yacy-forum.de/viewtopic.php?t=1118&highlight=xforwardedfor
*) httpc.java: Bugfix for incorrect http response statuscode parsing
In some situations the status text was chopped
*) Adding a lot of fileheaders containing YaCy copyright and license
*) httpd.java: Adding additional debugging http header that should help to detect
the "binary data in browser window" bug.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@653 6c8d7289-2bf4-0310-a012-ef5d649a1542
looks very disordered? Inner classes and methods mixed together. Maybe the code
should be cleaned up a little bit?
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@503 6c8d7289-2bf4-0310-a012-ef5d649a1542
yacy-javadoc-documentation in doc/api. Just do ant create-doc and point your
favourite browser to doc/api/index.html. As most of the classes are not
documented right now this just gives a great overview of all classes.
Hopefully this helps to stimulate the creation of
javadoc-insource-documentation.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@502 6c8d7289-2bf4-0310-a012-ef5d649a1542
- Now it's possible to interrupt pending httpc-actions on server shutdown
- this is possible because of a newly introduced registration mechanism for
open sockets
*) yacyCore.java
- blocking peerPing threads can now be interrupted on server shutdown
*) serverCore.java
- restructuring shutdown code
*) error.html
- port number is now set correctly if port forwarding was enabled
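
The registration mechanism for open sockets mentioned in this entry could look roughly like the sketch below. The class OpenSocketRegistry and its methods are hypothetical and only show the idea: every socket is registered while it is in use, so that a shutdown routine can close all of them and thereby unblock threads that are stuck in a read.

```java
import java.io.IOException;
import java.net.Socket;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: keeps track of open sockets so they can be closed on shutdown. */
class OpenSocketRegistry {
    private final Set<Socket> open = Collections.synchronizedSet(new HashSet<Socket>());

    /** Called right after a socket has been created. */
    void register(Socket s) { open.add(s); }

    /** Called when a connection is finished normally. */
    void unregister(Socket s) { open.remove(s); }

    /**
     * Closes every registered socket. A thread blocked in a read on one of
     * them gets a SocketException and can terminate.
     * Returns the number of sockets that were closed.
     */
    int closeAll() {
        Socket[] snapshot;
        synchronized (open) {
            snapshot = open.toArray(new Socket[0]);
            open.clear();
        }
        int closed = 0;
        for (Socket s : snapshot) {
            try {
                s.close();
                closed++;
            } catch (IOException e) {
                // nothing sensible can be done during shutdown; try the next socket
            }
        }
        return closed;
    }

    public static void main(String[] args) {
        OpenSocketRegistry registry = new OpenSocketRegistry();
        registry.register(new Socket()); // unconnected socket, enough for a demo
        System.out.println("closed sockets: " + registry.closeAll());
    }
}
```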
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@389 6c8d7289-2bf4-0310-a012-ef5d649a1542
- Adding function to manually force peer ping to remote yacy peer
See:Network.html?page=4
- for debugging purpose only!
*) serverAbstractThread.java:
- Adding possibility to notify a server thread via a synchronization object
- this is needed e.g. by the port forwarding feature to send a notification
to the peerPing thread to redo peer-ping with the new ip/port Settings_p.html
*) Port Forwarding Feature (it should work now)
- adding a serverThread which is responsible to detect broken port forwarding
connections and to do reconnect if needed
- serverCore.java: moving port forwarding initialization into a separate function
- adding possibility to configure the ssh port
- moving configuration section on the gui into a separate fieldset
- hello.java: only trying to do a second connect to the clientIp address during
peer handshake if either remote port forwarding is not enabled locally or
the clientIP is not equal to any local ip
*) httpdFileHandler.java:
- printing a more verbose error message
*) httpc.java
- allowing to deactivate content encoding from outside
*) plasmaCrawlWorker.java
- the crawler worker now tries to refetch the content of a website without
gzip content encoding if a gzip error occurred
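
The refetch behaviour of the crawler worker can be sketched as follows. This is not the plasmaCrawlWorker code: the Transport interface, the class GzipFallbackFetch and the method names are hypothetical stand-ins for httpc, and the sketch treats any IOException during decompression as a "gzip error".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

/** Hypothetical sketch: retry a download without gzip if the gzip body is broken. */
class GzipFallbackFetch {
    /**
     * Hypothetical stand-in for the http client. Returns the raw body,
     * gzip-compressed if allowGzip is true and the server chose to compress.
     */
    interface Transport {
        byte[] fetch(String url, boolean allowGzip);
    }

    /** Decompresses a gzip body; throws IOException on corrupt or non-gzip data. */
    static byte[] gunzip(byte[] data) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            return out.toByteArray();
        }
    }

    /** First attempt with gzip, second attempt without it if decoding fails. */
    static byte[] load(Transport transport, String url) {
        byte[] raw = transport.fetch(url, true);
        try {
            return gunzip(raw);
        } catch (IOException gzipError) {
            // the compressed body was unusable, so ask again for plain content
            return transport.fetch(url, false);
        }
    }

    public static void main(String[] args) {
        // a server that sends garbage whenever gzip is allowed
        Transport broken = (url, allowGzip) ->
                allowGzip ? new byte[] {1, 2, 3} : "plain body".getBytes();
        System.out.println(new String(load(broken, "http://example.org/")));
    }
}
```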
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@368 6c8d7289-2bf4-0310-a012-ef5d649a1542
*) hello.java: reportedip may be empty at peer startup
*) httpc.java: adding method to determine if the connection was already closed or is broken
*) httpdProxyHandler.java: trying to do a better errorhandling
*) server/serverCore.java
- setting myseed ip-address and port correctly if port-forwarding is on
- doing a more failsafe close and adding some debugging output
*) yacyClient.java: adding some logging statements to allow a better detection of
"degraded to senior"-bug
*) yacyCore.java: restructuring publishMySeed
(@Orbiter: please take a look)
- to avoid busy waiting
- to allow a graceful shutdown on server shutdown
- new seed count was not calculated correctly in the previous version
*) yacySeedDB.java: host ip and port was not initialized correctly if port-forwarding
was activated
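
How busy waiting can be avoided while still allowing a prompt shutdown is sketched below. This is not the publishMySeed code: the class WakeableSleeper and its methods are hypothetical. The waiting thread blocks on a lock object with a timeout instead of polling, and another thread can wake it early, for example on server shutdown.

```java
/** Hypothetical sketch: a pause that can be ended early instead of a polling loop. */
class WakeableSleeper {
    private final Object lock = new Object();
    private boolean wakeRequested = false;

    /**
     * Blocks for at most timeoutMs milliseconds without consuming CPU.
     * Returns true if the pause was ended early by wakeUp() or by an
     * interrupt, false if the timeout elapsed.
     */
    boolean pause(long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (lock) {
            try {
                while (!wakeRequested) {
                    long remaining = deadline - System.currentTimeMillis();
                    if (remaining <= 0) return false;
                    lock.wait(remaining);   // releases the lock while waiting
                }
            } catch (InterruptedException e) {
                // keep the interrupt flag so the caller can shut down
                Thread.currentThread().interrupt();
            }
            wakeRequested = false;
            return true;
        }
    }

    /** Ends a running or the next pause, e.g. on server shutdown. */
    void wakeUp() {
        synchronized (lock) {
            wakeRequested = true;
            lock.notifyAll();
        }
    }

    public static void main(String[] args) {
        WakeableSleeper sleeper = new WakeableSleeper();
        new Thread(sleeper::wakeUp).start();
        System.out.println("woken early: " + sleeper.pause(10_000));
    }
}
```

The flag guards against a wake-up that arrives before the thread starts waiting, which a bare wait/notify pair would lose.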
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@318 6c8d7289-2bf4-0310-a012-ef5d649a1542
- httpc
- response
*) simplifying gzip encoding
*) remembering http version of contacted server
(needed for later support of keep alive by httpc)
*) moving function shallTransportZipped to httpd.java
because this function is used multiple times
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@242 6c8d7289-2bf4-0310-a012-ef5d649a1542
*) Making Seed-Upload configuration more verbose.
*) Some Changes in SOAP Search API (not finished yet).
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@158 6c8d7289-2bf4-0310-a012-ef5d649a1542
- introduction of a threadpool for crawling
- introduction of a job queue to avoid busy waiting for a free crawler slot
*) New classes added
- queue for receiving of crawler jobs
- semaphore class to do reader/writer synchronization (mutual exclusion)
- message object to hold all needed data about a crawler job
*) Trying to solve session-thread shutdown problem
- session thread stopped variable is now set from outside before interrupting the
session thread.
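
The combination of a job queue and crawler worker threads can be sketched like this. It is an illustration only: the original code used its own queue, semaphore and message classes, whereas this sketch swaps in java.util.concurrent.BlockingQueue, and the names CrawlJobQueueSketch, CrawlJob and runWorkers are hypothetical. The workers block in take() while the queue is empty, so nobody has to poll for a free crawler slot.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of crawler workers fed by a blocking job queue. */
class CrawlJobQueueSketch {
    /** Message object holding the data of one crawler job (fields are assumptions). */
    static final class CrawlJob {
        final String url;
        final int depth;
        CrawlJob(String url, int depth) { this.url = url; this.depth = depth; }
    }

    /** Marker job that tells a worker to terminate. */
    private static final CrawlJob STOP = new CrawlJob("", -1);

    /** Processes all urls with workerCount threads and returns the handled urls. */
    static List<String> runWorkers(List<String> urls, int workerCount) {
        final BlockingQueue<CrawlJob> queue = new LinkedBlockingQueue<CrawlJob>();
        final List<String> processed = Collections.synchronizedList(new ArrayList<String>());
        Thread[] workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        CrawlJob job = queue.take();  // blocks while the queue is empty
                        if (job == STOP) return;
                        processed.add(job.url);       // stand-in for the real download
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        for (String url : urls) queue.add(new CrawlJob(url, 0));
        for (int i = 0; i < workerCount; i++) queue.add(STOP); // one marker per worker
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://a.example/", "http://b.example/");
        System.out.println(runWorkers(urls, 2).size() + " jobs processed");
    }
}
```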
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@39 6c8d7289-2bf4-0310-a012-ef5d649a1542
can be used instead of a ByteArrayOutputStream
*) Using a serverByteBuffer for lineBuffering in class httpc
instead of a ByteArrayOutputStream
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@35 6c8d7289-2bf4-0310-a012-ef5d649a1542
- httpc: wrong error-message on 404
- httpc: error message was accidentally shown when object
was released from pool
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@31 6c8d7289-2bf4-0310-a012-ef5d649a1542
- many classes set to final
- implementation of a session-thread pool
- reusage of the server handler class (normally the httpd object)
within the session thread
- implementation of a httpc object pool
- introduction of a linebuffer in httpd which can be reused
- reusing the properties table in the httpc
- added two apache libs (commons-collections, commons-pool) which
are needed for the object/thread pool implementation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@26 6c8d7289-2bf4-0310-a012-ef5d649a1542