yacy_search_server

Commit Graph

Author	SHA1	Message	Date
orbiter	ccbfb15b6b	enhancement to crawl stacker enqueue order git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4192 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	a31b9097a4	preparations for mass remote crawls: two main changes must be implemented to enable mass remote crawls: - shift control of robots.txt to crawl queue (away from stacker). This is necessary since remote crawls can contain unchecked urls. Each peer must check the robots to prevent that it is misused as crawl agent for unwanted file retrieval - implement new index files that control double-check of remotely crawled urls After removal of robots.txt checking from stacker threads, the multi-threading of this process is void. Multithreading has been removed. Also the thread pools for the crawl threads had been removed, since creation of these threads is not resource-consuming, for a detailed explanation see svn 4106 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4181 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
fuchsi	0e1738899f	* Complete number localization and provide a more reasonable interface to serverObjects: - put(key, value) methods are now used if a value added to the map should be kept as it is. Numbers are transformed (but not formatted) to an equivalent String representation. - putASIS(...) have been removed, now done with simple put(...) (see above). - putNum(...) can be used for number values which should be stored in a formatted way, either depending on the current locale setting for yacy (default) or in a "none" locale (see javadocs and setLocalize()). - putHTML(...) escapes special characters into corresponding HTML entities ('<' => '&lt;') which was done with put(...) before and so was called too often, because it is necessary only for very few cases. Additionally there is a "forXML" mode which only replaces < > & ". In short: Use put(...) for almost everything, use putXY(...) if you need some special transformation of the value. A few bugs have been fixed as well, and there should be a small performance improvement for complex pages with a lot of values. * added additional Sum/Avg rows to access tracker pages, see http://forum.yacy-websuche.de/viewtopic.php?f=5&t=456 * removed duplicate code (mostly related to the big changes above). TODO: - make sure, number formats work as expected _everywhere_, report overseen stuff http://forum.yacy-websuche.de/viewtopic.php?f=5&t=437 - probably a good idea to add special putDate() methods as they are used in many pages and create duplicated formatting code + maybe some centralized handling for memory value formatting. - further improve the speed of page creation for the WatchCrawler. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4178 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	b183bf6f42	- fixed opensearch bugs - added 'full domain' button to expert crawl start - removed non-working 'only one domain' button, the regex allowed crawling of other domains git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4125 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	98abe0804d	another enhancement to crawl starts with link files git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4123 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	1b42152a76	fixed and enhanced some details in crawl start with file git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4120 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	01e0669264	re-designed some parts of DHT position calculation (effect is the same as before) and replaced old first hash computation by new method that tries to find a gap in the current DHT. To do this, it is necessary that the network bootstrapping is done before the own hash is computed. This made further redesigns in peer initialization order necessary git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4117 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	842308ea97	- redesigned crawl start menu, integrated monitoring pages - removed web structure picture from indexing menu and grouped it together with htcache monitor - added a database for terminated crawls, when a crawl is finished it is automatically moved to the new database - extended crawl profile edit servlet, shows now also terminated crawls - option that was used to delete profiles is now redesigned to a function that moves the current crawl to the terminated crawls and removes all urls from the current queues! - fixed here and there problems with indexing queues - enhances indexing speed by changing cache flush sizes. - changed behaviour of crawl result servlet: the list of crawled urls is shown if there is one, otherwise the overview window is shown attention: the new profile databases are not compatible with the old one. current crawls will be lost! the web index is not touched. next steps: the database of terminated crawls can be used to start with them a new crawl. This is useful if one wants to re-crawl specific pages and wants to use an old crawl profile. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4113 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	e27aeb7fdc	patch for bad crawl filter at crawl start git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4086 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	daf0f74361	joined anomic.net.URL, plasmaURL and url hash computation: search profiling showed, that a major amount of time is wasted by computing url hashes. The computation does an intranet-check, which needs a DNS lookup. This caused that each urlhash computation needed 100-200 milliseconds, which caused remote searches to delay at least 1 second more than necessary. The solution to this problem is to attach a URL hash to the URL data structure, because that means that the url hash value can be filled after retrieval of the URL from the database. The redesign of the url/urlhash management caused a major redesign of many parts of the software. Since some parts had been decided to be given up they had been removed during this change to avoid unnecessary maintenance of unused code. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4074 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	34858be5ef	added option to simple crawl start: complete domain crawl git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4070 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	40b0547611	- documentation changes (removed old forum links) - different handling of link quotation - different handling of link normalization - enhanced html/unicode en/de-coding git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3993 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	3cacb3bc95	fix for http://forum.yacy-websuche.de/viewtopic.php?f=6&t=168#p861 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3981 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	a45216b479	fix to prevent bad-formed news messages git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3960 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	3b46f0460f	moved crawl profile table from watch crawler to profile editor git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3824 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	139c59ebbd	- fixed DHT selection problem: the seed tables used a wrong ordering - cleaned some code git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3693 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
low012	a50256aba2	*) removed surplus replacements of HTML git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3686 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
theli	6f46245a51	*) Bookmarks: Ajax icon is displayed while loading title *) First version of a sitemap parser added - currently only autodetection of sitemap files is supported *) DB-Import restructured - pause/resume should work again now git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3666 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	dd44a1394f	disabled automatic performance setting change - during crawl start - each indexing cycle - for delay values - for short memory cycles git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3634 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	e192f616a2	collection of small bugfixes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3600 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
(no author)	4f4d3d71dd	*) Faster appearance of ConfigBasic by bypassing UPNP-scan in case of existing external connects *) Marked two deprecated source-points *) Added possibility to dump words from indexing to file. Should not affect performance in the current form. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3592 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
karlchenofhell	c5c3ecc67e	- fixed display of last entered value at IndexCreate_p plus minor usability/HTML adjustments - removed double XML-escaping from CacheAdmin_p git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3588 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
theli	589cbd8cbf	*) replacing all yacy-news-category strings with corresponding constants Note: please use these constants from now on git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3495 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	861f41e67e	redesigned NURL-handling: - the general NURL-index for all crawl stack types was split into separate indexes for these stacks - the new NURL-index is managed by the crawl balancer - the crawl balancer does not need an internal index any more, it is replaced by the NURL-index - the NURL.Entry was generalized and is now a new class plasmaCrawlEntry - the new class plasmaCrawlEntry replaces also the preNURL.Entry class, and will also replace the switchboardEntry class in the future - the new class plasmaCrawlEntry is more accurate for date entries (holds milliseconds) and can contain larger 'name' entries (anchor tag names) - the EURL object was replaced by a new ZURL object, which is a container for the plasmaCrawlEntry and some tracking information - the EURL index is now filled with ZURL objects - a new index delegatedURL holds ZURL objects about plasmaCrawlEntry objects to track which url is handed over to other peers - redesigned handling of plasmaCrawlEntry - handover, because there is no need any more to convert one entry object into another - found and fixed numerous bugs in the context of crawl state handling - fixed a serious bug in kelondroCache which caused that entries could not be removed - fixed some bugs in online interface and adapted monitor output to new entry objects - adapted yacy protocol to handle new delegatedURL entries all old crawl queues will disappear after this update! git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3483 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	a5d668c0c6	added speed-buttons for easy performance setting appears in crawl start and on indexing monitor page git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3473 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
low012	ce360ef43e	*) no more HTML in plasmaCrawlProfile.java anymore *) <br> will not be displayed in items in Auto Filter Content on WatchCrawler_p.html anymore *) removed unnecessary replaceHTML() git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3425 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
karlchenofhell	bf7a69197d	- fix for possible NPE in queues_p - WatchCrawler_p: - display crawler traffic - pause/resume local- and global crawler git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3389 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
karlchenofhell	18c841b3c0	- fix for http://www.yacy-forum.de/viewtopic.php?t=3269 [don't put 2 template-expressions back-to-back => bug?] git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3120 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	61798f0ae6	added option to distinguish between text crawl and media crawl - for each crawl start, there is now a flag for text and media - the localCrawl flag is superfluous - added new crawl profiles - if an image search is done, only media links are crawled for the snippets git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3100 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	6866bcd0e0	added missing file for last commit git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3099 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago

80 Commits (0fd9540866186b2e2725ede9a1c4c8d1313c1270)