- all web page parsing operations will now increase a web structure file
- the file is computed in memory and dumped at shutdown-time to PLASMASB/webStructure.map in readable form (not a database)
- the file can be used externally to analyse the link structure of the crawled pages
- the web structure can also be retrieved using a xml-interface at http://localhost:8080/xml/webstructure.xml
- the short-term purpose is the computation of a link-graph image (before linuxtag!)
- a long-term purpose could be a decentralized computation of the citation rank
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3746 6c8d7289-2bf4-0310-a012-ef5d649a1542
redesign for better IO performance
enhanced database seek-time by avoiding write operations at distant
positions of a database file. until now, a USEDC counter was written
at the head-section of a kelondroRecords database file (which is the
basic data structure of all kelondro database files) to store the
actual number of records that are contained in the database. Now, this
value is computed from the database file size. This is either done
only once at start-time, or continuously when run in asserts enabled.
The counter is then updated only in RAM, and written at close of the
file. If the close fails, the correct number can be computed from the
file size, and if this is not equal to the stored number it is a strong
evidence that YaCY was not shut down properly.
To preserve consistency, the complete storage-routine had to be re-written.
Another change enhances read of nodes in some cases, where the data-tail
can be read together with the data-head. This saves another IO lookup during
each DB node fetch.
Includes also many small bugfixes.
IF ANYTHING GOES WRONG, ALL YOUR DATA IS LOST: PLEASE MAKE A BACK-UP
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3375 6c8d7289-2bf4-0310-a012-ef5d649a1542
add the following to DATA/LOG/yacy.logging:
---
# Properties for the LogalizerHandler
de.anomic.server.logging.LogalizerHandler.enabled = true
de.anomic.server.logging.LogalizerHandler.debug = false
de.anomic.server.logging.LogalizerHandler.parserPackage = de.anomic.server.logging.logParsers
---
and "de.anomic.server.logging.LogalizerHandler" to the list of global handlers
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3219 6c8d7289-2bf4-0310-a012-ef5d649a1542
- generalized object caching and added new object caching class
- added object caching wherever kelondroTree was used
- added object caching also to usage of kelondroFlex
- added object buffering (a write cache) to NURLs
- added many assert statements; fixed bugs here and there
- added missing close methods to latest added classes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2858 6c8d7289-2bf4-0310-a012-ef5d649a1542
- restructuring of mimeTypes based on the parsers
- displaying parser usage count
- displaying human readably parser names
- displaying parser version information
*) httpdFileHandler.java
- adding possibility to support "streaming" servlets
which are special servlets that can communicate with
the client via the connection streams autonomous
- the name of these new servlet types must end with the
file extension .stream
- this feature will be needed by the yacy ScreenSaver
class to fetch statistic data from the peer without the
need to reconnect to the server all the time
*) Adding human readable names and version information for
all supported parsers
*) plasmaParser.java
- adding new structure to store parser statistic data
*) Adding openDocument parser
- can be used to parse odt files
*) jmimemagic
- adding rules to detect openDocument formats properly
*) serverLog.java
- adding functions that can be used to query if a given
logging level is enabled or not.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1140 6c8d7289-2bf4-0310-a012-ef5d649a1542
- Bugfix: old dns cache did not handle case insensitive hostnames correctly.
- adding a possibility to set domain name patterns defining hostnames that should not be cached by the httpc dns cache
e.g. borg-300.dyndns.org
This can be done by setting the new httpc.nameCacheNoCachingPatterns property
- using httpc.dnsResolve wherever possible within the sourcecode
[httpd.java,plasmaCrawlStacker.java]
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1044 6c8d7289-2bf4-0310-a012-ef5d649a1542
*) Replacing PDFBox 0.7.1 lib with newer version 0.7.2
*) Refactoring of classes httpd/httpc/httpHeaders to
make many methods for httpHeader/Requestline parsing
reusable for new icap implementation
*) adding chunked input stream support
- needed by new icap implementation
- needed by future httpc HTTP/1.1 support
*) httpd.java
- moving all connection property contants to class httpHeader
- moving readHeader function to class httpHeader
- moving parseQuery function to class httpHeader
- moving handleTransparentProxy function to class httpHeader
*) httpHeader.java
- adding new fuction to parse the http response line
- adding new function to converte http headers to a string that
can be send to the client
- adding a function that generates a proper url using all parsed
connection properties
*) ICAP Support
- yacy now supports handling of icap response modification requests
- this feature can be used by other icap enabled proxies to contact
yacy as icap server, and to handover the downloaded content to yacy.logging
for indexing
- functionality was successfully tested with squid 2.5Stable 10 + icap patch
- further icap services e.g. URL filtering based on yacy's blacklists are possible
*) plasmaSwitchboard.java
- htcache entries that are still needed for indexing are now properly registered
as in use after system restart
- extended logging: log message now shows parsing and indexing time for each sb. entry
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@757 6c8d7289-2bf4-0310-a012-ef5d649a1542
See: http://www.yacy-forum.de/viewtopic.php?t=1118&highlight=xforwardedfor
*) httpc.java: Bugfix for incorrect http response statuscode parsing
In some situations the statustext whas chopped
*) Adding a lot of fileheaders containing YaCy copyright and license
*) httpd.java: Adding additional debugging http header that should help du detect
the "binary data in browser window" bug.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@653 6c8d7289-2bf4-0310-a012-ef5d649a1542