yacy_search_server

Commit Graph

Author	SHA1	Message	Date
orbiter	b79e06615d	- added new LURL.Entry class for next database migration - refactoring of affected classes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2802 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	a5dd0d41af	- refactoring of plasmaCrawlLURL.Entry to prepare new Entry format - added test migration method to migrate the old LURL to a new LURL; the new LURL will be split into different tables for each month. This solves several problems: - the biggest table in YaCy is split into different parts and can also be managed in filesystems that are limited to 2GB - the oldest entries can easily be identified, used for re-crawl and deleted - The complete database can be limited to a specific size (as wanted many times) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2755 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
theli	26dfbb7499	*) Bugfix for UTF-8: url names are now stored properly in stackcrawl, crawler, indexing queue and should be displayed correct on the gui git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2630 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
theli	97d2a08ef1	*) restructuring needed to support parsing of documents using various charsets - serverFileUtils.java: -- adding methods to copy from stream to writer and readers to writers -- moving httpc writeX methods into serverFileUtils class - serverCharBuffer.java: removing inheritance from Writer class - replacing htmlFilterOutputStream by htmlFilterWriter class which handles content as char stream - htmlFilterContentTransformer.java: deactivating getText mode (still needs to be migrated to use char streams instead of byte streams) - changes in several classes to use htmlFilterWriter instead of htmlFilterOutputStream - changes in Scraper and Transformer classes to operate on chars instead of bytes - httpdProxyHandler.java: bugfix. clientTimeout setting was missing in config file git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2617 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
theli	d0a5a53789	*) changes needed for multi-language support - parsers may need to know the charset of the byte stream git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2591 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	41e27b85b7	fix for crawler condition git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2583 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	9340dbb501	fixed all possible problems with nullpointer exception for LURLs git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2513 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	141f9e5bb4	fix for new plasmaCrawlLURL.load behavior git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2509 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	4866868c0e	added write cache for LURLs This was necessary to speed up the index receive process during global search git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2498 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
theli	dae763d8e3	git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2495 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
theli	7a35b8e237	*) direct access to responseheaders of sbQueue.Entry removed to make it more http independent git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2487 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	db1eae0227	* simplified initialization of database objects * replaced kelondroTree for NURLs by kelondroFlex * replaced kelondroTree for EURLs by kelondroFlex take care, may be very buggy please finish crawls before updating. crawls will be lost. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2452 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	3e9d509c39	some small fixes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2425 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	6ad471ef96	* applied many compiler warning recommendations * cleaned up code * added unit test code * migrated ranking RCI computation to kelondroFlex and kelondroCollectionIndex git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2414 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	26116cabde	added missing rowdef assignment git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2379 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	abf22f6e60	removed url normalform computation from htmlFilterContentScraper. This method was implemented in de.anomic.net.URL git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2377 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	740d49751d	* strict type and size check in kelondroRow handling * adopted all code to use the declaration form of kelondroRow * fixed a bug in kelondroRow which caused wrong parsing of encoding type * the bug caused bad database behaviour in new indexCollection data structure. because of this bug, all test databases are now already void. A new database is created * the kelondroFlexTable and indexCollection data structures now store a declaration of the row definition into a properties file along the database files. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2375 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	9f298083cd	*) adding more urls to the error url - old error strings were replaced with their corresponding constants See: http://www.yacy-forum.de/viewtopic.php?t=2638 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2360 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	3879a0ecd0	replaced java.net.URL usage by use of new class de.anomic.net.URL This shall be seen as an experiment to exclude all cases where there could be a DNS lookup during URL comparison. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2290 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	92f4cb4d73	added option to configure the start-up delay time for kelondro database files. the start-up delay is used to pre-load the database node cache git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2276 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	c36e9fc8d3	full integration of kelondroRow git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2167 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	09f780df27	more bugfixes for the new row/stack handling changes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2160 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	3c3c047d0a	integrated kelondroRow into kelondroStack git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2156 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	90d569d70f	refactoring of index management: url storage is part of index management; moved plasmaURL to indexURL git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2122 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	d6213f8a85	quickfix for http://www.yacy-forum.de/viewtopic.php?p=19482#19482 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2042 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
borg-0300	92110aea32	nullpointer fix for profile(); other minor change; git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2009 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	47843e69e2	auto-reset for switchboard queue stack bugfix for http://www.yacy-forum.de/viewtopic.php?p=15684#15684 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1414 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	f4ffa9aee5	- implemented more attributes to index entries - implemented hand-over of new word index attributes during remote search - implemented word-distance computation during search git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1382 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	9544c47684	added some UTF-8 handling. hope this will help somehow.. for sure not THE solution to our UTF-8 problem git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1308 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	9086261476	refactoring of base64 encoding: the kelondro database needs specific information about the order of base64-encoded keys. Since no other package depends on base64 (only the httpd uses base64 for encryption, but does not need to encode these strings) it is good to move base64 encoding to the new ordering classes in kelondro. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1284 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	8c594841a8	*) Bugfix for incorrectly indexing of URLs that were requested with Cookies in the Request header See: http://www.yacy-forum.de/viewtopic.php?p=14077 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1214 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	4500506735	fixed some bugs concerning url entry retrieval and indexControl interface git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1212 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	0c762daf4b	better startup failure handling git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1205 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	bb79fb5d91	- changed handling of error cases retrieving urls from database (no more NULL values are returned, instead, an IOException is thrown) - removed ugly damagedURLS implementation from plasmaCrawlLURL.java (this inserted a static value into the Object which is not really a good style) - re-coded damagedURLS collection in yacy.java by catching an exception and evaluating the exception message to do: - the urldbcleanup feature must be re-tested git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1200 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	ec2b39c1ce	code cleanup git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1175 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
borg-0300	00ab4d8723	cleaned, small change, Properties git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1026 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
hydrox	56b9f34411	*)removed unused imports git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1015 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	a2fa75e688	*) Asynchronous queuing of crawl job URLs (stackCrawl) various checks like the blacklist check or the robots.txt disallow check are now done by a separate thread to unburden the indexer thread(s) TODO: maybe we have to introduce a threadpool here if it turns out that this single thread is a bottleneck because of the time consuming robots.txt downloads *) improved index transfer The index selection and transmission is done in parallel now to improve index transfer performance. TODO: maybe we could speed up performance by using multiple transmission threads in parallel instead of only a single one. *) gzip encoded post requests it is now configurable if a gzip encoded post request should be sent on index transfer/distribution *) storage Peer (very experimental and not optimized yet) Now it's possible to send the result of the yacy indexer thread to a remote peer instead of storing the indexed words locally. This could be done by setting the property "storagePeerHash" in the yacy config file - Please note that if the index transfer fails, the index is stored locally. - TODO: currently this index transfer is done by the indexer thread. To speed up the indexer a) this transmission should be done in parallel and b) multiple chunks should be bundled and transferred together *) general performance improvements - better memory cleanup after http request processing has finished - replacing some string concatenations with stringBuffers - replacing BufferedInputStreams with serverByteBuffer - replacing vectors with arraylists wherever possible - replacing hashtables with hashmaps wherever possible This was done because function calls to vector or hashtable functions take 3 times longer than calls to functions of arraylists or hashmaps. TODO: we should take a look at the class serverObject which is inherited from hashmap Do we really need a synchronization for this class? TODO: replace arraylists with linkedLists if random access to the list elements is not needed *) Robots Parser supports if-modified-since downloads now If the downloaded robots.txt file is older than 7 days the robots parser tries to download the robots.txt with the if-modified-since header to avoid unnecessary downloads if the file was not changed. Additionally the ETag header is used to detect changes. *) Crawler: better handling of unsupported mimeTypes + FileExtension *) Bugfix: plasmaWordIndexEntity was not closed correctly in - query.java - plasmaswitchboard.java *) function minimizeUrlDB added to yacy.java this function tests the current urlHashDB for unused urls ATTENTION: please don't use this function at the moment because it causes the wordIndexDB to flush all words into the word directory! git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@853 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	7fc822a59b	changed handling of time-zones git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@801 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	cddd9aaa33	fixed SERIOUS bug with kelondroStack; affected all stack processing since 729 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@732 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	3587407039	*) Fixing problems of list operation if index and queue size are both 0. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@687 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	dc0a2d4c11	*) Bugfix for Loader Queue: Job count was not displayed correctly *) IndexingQueue: - now it's possible to delete single entries from the queue - now it's possible to clear the whole queue See: http://www.yacy-forum.de/viewtopic.php?t=995 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@641 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	bead8a32aa	*) IndexCreate_p.java: Crawler StartURLs will now also be added to the errorURL-DB if an error occurs on this url *) kelondroStack.java, plasmaSwitchboardQueue.java Adding method which returns a list of all entries in the queue. This list is used by IndexCreate_p.java instead of an iterator to display the indexing-list. Advantages: avoid concurrent modifications of the list while displaying it. Speedup because now we have to access only one sync function instead of multiple ones (one for each entry) *) IndexCreateIndexingQueue_p.java Using new list() function of plasmaSwitchboardQueue *) httpdFileHandler.java If a servlet returns the special value "LOCATION" the httpFileHandler does a Redirection of the Browser to the URL specified by the servlet. This can e.g. be used when a http get request is used instead of a post request, but a refresh should not be allowed. *) IndexCreateWWWLocalQueue_p.html Now it's possible to delete single entries of the local crawler queue git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@626 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	4fd5b95b1f	*) Renaming Logger function names to reflect the proper Java Logging API Loglevels - please use logFine instead of logDebug - please use logSevere instead of logFailure and logError See: http://www.yacy-forum.de/viewtopic.php?p=8726#8726 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@615 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	6adf8a4bde	*) Renaming Logger function names to reflect the proper Java Logging API Loglevels - please use logFine instead of logDebug - please use logFailure instead of logError See: http://www.yacy-forum.de/viewtopic.php?p=8726#8726 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@614 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	5f55dff297	*) Bugfix for "Binäre Nullen auf der page: Index Creation: Indexing Queue" See: http://www.yacy-forum.de/viewtopic.php?p=6877#6877 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@577 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
theli	ab894d26bc	*) Bugfix for "plasmaSwitchboard.deQueue: null" Bug (hopefully) See: http://www.yacy-forum.de/viewtopic.php?p=8135#8135 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@570 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	ba0a486328	moved printStackTrace() to logging git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@539 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	91163db52e	fix for more time-related problems in proxy git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@486 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	7e3e9ba0de	fix for http://www.yacy-forum.de/viewtopic.php?p=6563#6563 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@472 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago


53 Commits (6412c926bce37dca1e605d4f197685957fe89b31)