yacy_search_server

Commit Graph

Author	SHA1	Message	Date
orbiter	df1629b05a	- code cleanup - version 0.471 - moved surftipps to own web page git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2676 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
theli	f3ac4dbbb9	*) better handling of server shutdown See: e.g. http://www.yacy-forum.de/viewtopic.php?t=2584 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2468 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	cfbacbbf08	reverted change in robotsParser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2378 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	abf22f6e60	removed url normalform computation from htmlFilterContentScraper. This method was implemented in de.anomic.net.URL git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2377 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	3879a0ecd0	replaced java.net.URL usage by use of new class de.anomic.net.URL This shall be seen as an experiment to exclude all cases where there could be a DNS lookup during URL comparisment. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2290 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
allo	99110e6fd2	Fixed some of the copyright headers. Please add yourself, if you contributed to these files, and i forgot you. ;-) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2086 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	015d044c25	tried to fix some problems with latest changes to httpc very experimental! git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2078 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	34c075c1c7	testcommit with subversive git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1886 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	d3da7c9a08	*) Adding support for robots Allow directive git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1872 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	734d18f283	*) more correct robots.txt validation - isDisallowed now uses getFile instead of getPath git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1870 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	f0ad0d2b2b	*) better robots.txt support - previously rules for all crawlers and special rules for yacy where combined using AND. Now the general rule will be ignored if there is a special rule for yacy (according to rfc) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1867 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	915812f597	*) Undoing robots parser policy changes from svn rev. 1421 - crawling is not allowed if server returned a 403 statuscode (according to rfc) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1864 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	eeba8b055e	*) guessing, testing and suggesting alternative hostnames on "unknown host" error See: http://www.yacy-forum.de/viewtopic.php?t=1879 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1636 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	5c56b9ed59	*) catch exceptions that could occur during url decoding git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1451 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	754a35877f	*) Changing robots parser cxclusion policy - crawling is now allowed if server returned a 403 statuscode when trying to download the robots.txt See: http://www.yacy-forum.de/viewtopic.php?t=1612 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1421 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	7920e1547d	code cleanup git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1163 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	9649d08171	*) More tolerant robots parser - converting tabs to spaces - cutting of '*' in the disallow section git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1056 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	93cadb47b9	*) More tolerant robots parser for robots-files which missing empty lines between rule blocks See: http://www.yacy-forum.de/viewtopic.php?p=12471 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1048 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	f9fb284fb7	*) Better handling of robots.txt files with incorrect keywords See: http://www.yacy-forum.de/viewtopic.php?p=12292#12292 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1035 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	b8ceb1ffde	*) Adding better https support for crawler - solving problems with unkown certificates by implementing a dummy trust Manager - adding https support to robots-parser - Seed File can now be downloaded from https resources - adapting plasmaHTCache.java to support https URLs properly *) URL Normalization - sub URLs are now normalized properly during indexing - pointing urlNormalForm function of plasmaParser to htmlFilterContentScraper function - normalizing URLs which were received by a crawlOrder request git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1024 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	3b5d0eb053	*) Synchronizing robots.txt downloads to avoid parallel downloads of the same file by separate threads git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@998 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	6c48c3ce39	*) Bugfix for ArithmeticException during IndexTransfer See: http://www.yacy-forum.de/viewtopic.php?t=1362 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@974 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	02d9af1a70	*) Restructuring and extending of Remote Proxy Support - remote proxy configuration can now be "really" changed on the fly and takes effect immediately - adding possibility to disable remote proxy usage for yacy->yacy communication - adding possibility to disable remote proxy usage for ssl - restructuring proxy configuration so that it is stored in a single place now *) Adding possibility to import a foreign word DB (or even more of them in parallel) at runtime into the peers DB - this can be done by calling IndexImport_p.html - ATTENTION: please not that at the moment this thread must be aborted via gui before a normal server shutdown is done. - TODO: integrating IndexImport Thread into normal server shutdown - TODO: Adding posibility to import crawl-queues, etc. from foreign peers - TODO: removing old import function from yacy.java and calling the new routines instead git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@968 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	40777556c5	*) Connection Tracking - adding automatic refresh - accepts new parameter nameLookup which can be used to deactivate yacy-peer name lookup (because we have problems with this on large seed-dbs) *) ViewFile New page that can be used to view - original content - plain text content - parsed content - parsed sentences of a webpage specified by there url hash Mainly for debugging purpose at the moment *) Robots.txt Bugfix for if-modified-since usage TODO: synchronization of downloads to avoid loading the same robots-file multiple times in parallel by different threads *) Shutdown Better abortion of transferRWI and transferURL sessions on server shutdown *) Status Page Adding icon to start/stop crawling via status page git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@950 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	959eefbc4f	*) Robots.txt parser/ppt cutting of comments at the line end *) Adding Threadpool for stackCrawl Thread to speedup robots.txt download and double url checks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@882 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	a2fa75e688	*) Asynchronous queuing of crawl job URLs (stackCrawl) various checks like the blacklist check or the robots.txt disallow check are now done by a separate thread to unburden the indexer thread(s) TODO: maybe we have to introduce a threadpool here if it turn out that this single thread is a bottleneck because of the time consuming robots.txt downloads *) improved index transfer The index selection and transmission is done in parallel now to improve index transfer performance. TODO: maybe we could speed up performance by unsing multiple transmission threads in parallel instead of only a single one. *) gzip encoded post requests it is now configureable if a gzip encoded post request should be send on intex transfer/distribution *) storage Peer (very experimentell and not optimized yet) Now it's possible to send the result of the yacy indexer thread to a remote peer istead of storing the indexed words locally. This could be done by setting the property "storagePeerHash" in the yacy config file - Please note that if the index transfer fails, the index ist stored locally. - TODO: currently this index transfer is done by the indexer thread. To seedup the indexer a) this transmission should be done in parallel and b) multiple chunks should be bundled and transfered together *) general performance improvements - better memory cleanup after http request processing has finished - replacing some string concatenations with stringBuffers - replacing BufferedInputStreams with serverByteBuffer - replacing vectors with arraylists wherever possible - replacing hashtables with hashmaps wherever possible This was done because function calls to verctor or hashtable functions take 3 time longer than calls to functions of arraylists or hashmaps. TODO: we should take a look on the class serverObject which is inherited from hashmap Do we realy need a synchronization for this class? TODO: replace arraylists with linkedLists if random access to the list elements is not needed *) Robots Parser supports if-modified-since downloads now If the downloaded robots.txt file is older than 7 days the robots parser tries to download the robots.txt with the if-modified-since header to avoid unnecessary downloads if the file was not changed. Additionally the ETag header is used to detect changes. *) Crawler: better handling of unsupported mimeTypes + FileExtension *) Bugfix: plasmaWordIndexEntity was not closed correctly in - query.java - plasmaswitchboard.java *) function minimizeUrlDB added to yacy.java this function tests the current urlHashDB for unused urls ATTENTION: please don't use this function at the moment because it causes the wordIndexDB to flush all words into the word directory! git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@853 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	023be89586	*) Bugfix for "Robots.txt wird immer wieder geladen" See: http://www.yacy-forum.de/viewtopic.php?p=10241#10233 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@794 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	dc474aa22f	various bug-fixes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@792 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
rramthun	9dfbd93c7b	Updated german language file git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@748 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	2cd695f376	*) Bugfix path-entries of robots.txt were not decoded correctly git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@676 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	f8ad65eae1	*) First trial implementation of robots.txt support git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@674 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
allo	9300689dde	bugfix gr git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@662 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
allo	ebc39a7b9a	minor fixes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@659 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
allo	f90f699ab1	missing package line. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@655 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
allo	06a451768f	a simple robotsParser. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@652 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago

35 Commits (fa012789b2f6744392ef6d64687aa3d374c8c3bd)