yacy_search_server

Commit Graph

Author	SHA1	Message	Date
orbiter	497428c8ec	refactoring git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2949 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	76fceb9997	refactoring git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2945 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	bb7d4b5d5e	refactoring to prepare new RWI entry object - moved all url and index(RWI) entries to index package - better naming to distinguish RWI entries and URL entries git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2937 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	918b59dc5e	- bugfix for snippet profile (no delete button) - bugfix for search process (avoided null pointer exception in case other peer does not respond) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2742 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	3ad0709b53	added a delete button to crawl profile list. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2682 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
theli	97d2a08ef1	*) restructuring needed to support parsing of documents using various charsets - serverFileUtils.java: -- adding methods to copy from stream to writer and readers to writers -- moving httpc writeX methods into serverFileUtils class - serverCharBuffer.java: removing inheritance from Writer class - replacing htmlFilterOutputStream by htmlFilterWriter class which handles content as char stream - htmlFilterContentTransformer.java: deactivating getText mode (still needs to be migrated to use char streams instead of byte streams) - changes in several classes to use htmlFilterWriter instead of htmlFilterOutputStream - changes in Scraper and Transformer classes to operate on chars instead of bytes - httpdProxyHandler.java: bugfix. clientTimeout setting was missing in config file git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2617 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
theli	5847492537	*) next step of restructuring for new crawlers - IndexCreate_p.java: correcting problems with ftp urls - URL.java does not cutout the userinfo anymore (needed to transport authentication info in ftp urls, e.g. ftp://username:pwd@ftp.irgendwas.de) - plasmaCrawlLoader.java: -- hack to re enable https urls -- adding function getSupportedProtocols git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2482 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
theli	34831d2d9f	*) Check validity of crawl filter reg.exp. before adding it into the crawler queue See: http://www.yacy-forum.de/viewtopic.php?p=24671 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2410 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	abf22f6e60	removed url normalform computation from htmlFilterContentScraper. This method was implemented in de.anomic.net.URL git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2377 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	5f72be2a95	some redesign of EURL storage * store() is now called explicitely * more urls are written to the EURL table * the EURL stack does not store the complete entry any more, now only the URL hash git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2323 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	3879a0ecd0	replaced java.net.URL usage by use of new class de.anomic.net.URL This shall be seen as an experiment to exclude all cases where there could be a DNS lookup during URL comparisment. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2290 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	90d569d70f	refactoring of index management: url storage is part of index management; moved plasmaURL to indexURL git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2122 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	00a5d435e2	- fixed some bugs with domain filter - added new ranking filter "prefermask": urls that match the filter are ranked better git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2022 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	bd283b8443	fixed bugs: - null pointer exception during startup of a robinson-configured peer - wrong time calculation of default value of re-crawl option git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2005 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	e566d1d8d6	some bugfixes regarding new crawling options git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1980 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	c7f1300300	-fixes for last commit -some more ranking attributes (comments only) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1979 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	860a7b545b	enhanced input options for crawl start git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1978 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	7a650d0023	several bugfixes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1971 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	59d52fb4a9	fixed some problems with crawl profiles git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1967 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	0c9b61820e	enhanced re-crawl settings git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1960 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	708cc6c8d9	fixed some bugs for auto-filter and added monitor in profile list git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1959 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	63f39ac7b5	added 3 new crawling steering options: - re-crawl by age of page (enter in minutes) - auto-domain-filter - maximum number of pages per domain NOT YET TESTED! git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1949 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	1fc3b34be6	some pre-work (without function yet) to implement: - re-crawl (by age of last crawl) - auto-crawl-filter by crawl depth (to be explained..) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1948 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	2336f0f013	*) allow pausing/resuming of crawlJob Threads separately - pausing/resuming localCrawls - pausing/resuming remoteTriggeredCrawls - pausing/resuming globalCrawlTrigger See: http://www.yacy-forum.de/viewtopic.php?t=1591 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1723 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	37f88b4017	code cleanup git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1176 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	548f0c6aff	first Try with Eclipse / cleaned sources git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1157 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	444a5a9368	*) Bugfix for Entries with null url in GlobalQueue See: http://www.yacy-forum.de/viewtopic.php?p=12675#12675 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1069 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	d2731418bf	added creation of global ranking files and changed url normal form usage git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1046 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
hydrox	cb69047b91	*)cleanup access static methods and fields git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1016 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
hydrox	56b9f34411	*)removed unused imports git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1015 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	a2fa75e688	*) Asynchronous queuing of crawl job URLs (stackCrawl) various checks like the blacklist check or the robots.txt disallow check are now done by a separate thread to unburden the indexer thread(s) TODO: maybe we have to introduce a threadpool here if it turn out that this single thread is a bottleneck because of the time consuming robots.txt downloads *) improved index transfer The index selection and transmission is done in parallel now to improve index transfer performance. TODO: maybe we could speed up performance by unsing multiple transmission threads in parallel instead of only a single one. *) gzip encoded post requests it is now configureable if a gzip encoded post request should be send on intex transfer/distribution *) storage Peer (very experimentell and not optimized yet) Now it's possible to send the result of the yacy indexer thread to a remote peer istead of storing the indexed words locally. This could be done by setting the property "storagePeerHash" in the yacy config file - Please note that if the index transfer fails, the index ist stored locally. - TODO: currently this index transfer is done by the indexer thread. To seedup the indexer a) this transmission should be done in parallel and b) multiple chunks should be bundled and transfered together *) general performance improvements - better memory cleanup after http request processing has finished - replacing some string concatenations with stringBuffers - replacing BufferedInputStreams with serverByteBuffer - replacing vectors with arraylists wherever possible - replacing hashtables with hashmaps wherever possible This was done because function calls to verctor or hashtable functions take 3 time longer than calls to functions of arraylists or hashmaps. TODO: we should take a look on the class serverObject which is inherited from hashmap Do we realy need a synchronization for this class? TODO: replace arraylists with linkedLists if random access to the list elements is not needed *) Robots Parser supports if-modified-since downloads now If the downloaded robots.txt file is older than 7 days the robots parser tries to download the robots.txt with the if-modified-since header to avoid unnecessary downloads if the file was not changed. Additionally the ETag header is used to detect changes. *) Crawler: better handling of unsupported mimeTypes + FileExtension *) Bugfix: plasmaWordIndexEntity was not closed correctly in - query.java - plasmaswitchboard.java *) function minimizeUrlDB added to yacy.java this function tests the current urlHashDB for unused urls ATTENTION: please don't use this function at the moment because it causes the wordIndexDB to flush all words into the word directory! git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@853 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
low012	4dbc871524	*) Trying to get rid of possibility of exploits in IndexCreate through HTML and JavaSkript in peernames, URLs, <title>-tags etc. (see http://www.yacy-forum.de/viewtopic.php?t=1181 ) I hope I got them all and did not overdo it. *) Just a tiny bit of cleanig up in News.java. (I messed it up myself some time ago.) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@749 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	e6338b4390	*) Bugfix for "Error with request: GET http://localpeer:80/IndexDelete_p.ht " See: http://www.yacy-forum.de/viewtopic.php?p=8906 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@678 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	bead8a32aa	*) IndexCreate_p.java: Crawler StartURLs will now also added to the errorURL-DB if an error occures on this url *) kelondroStack.java, plasmaSwitchboardQueue.java Adding method which returns a list of all entries in the queue. This list is used by IndexCreate_p.java instead of an iterator to display the indexing-list. Advantages: avoid concurrent modifications of the list while displaying it. Speedup because now we have to access only one sync function instead of multiple ones (one for each entry) *) IndexCreateIndexingQueue_p.java Using new list() function of plasmaSwitchboardQueue *) httpdFileHandler.java If a servelet returns the special value "LOCATION" the httpFileHandler does a Redirection of the Browser to the URL specified by the servelet. This can e.g. be used when a http get request is used insead of a post request, but a refresh should not be allowed. *) IndexCreateWWWLocalQueue_p.html Now it's possible to delete single entries of the local crawler queue git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@626 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	330eae7cf3	*) Normalizing CrawlerStartURL now before crawling is started *) CrawlWorker also does a URL normalization now before following the redirection URL *) CrawlWorker removes redirection URL correctly from noticeURL stack now git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@571 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	bb3e897baf	mor minor changes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@488 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	2d8557cb10	minor changes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@487 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	e84a177c49	many bigfixes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@475 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	9ee8a5ba6c	fixed big in yacynews git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@474 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	d34eb23e4e	fixed news; added news appearance on Network and IndexCreate page; added intention string to global crawl git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@466 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	1022fbeb65	many YaCyNews fixes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@461 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	13abd8b6e7	added news-creation at crawl start git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@460 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	81e564edb8	faster crawl profile list cleanup git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@442 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	3470a72d48	fixed div by zero, set default delays, fixed release number format and display git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@435 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	be1f324fca	performance setting for remote indexing configuration and latest changes for 0.39 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@424 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
theli	5c3822d5f4	*) adding experimental support for parsing of bookmarksfiles See: http://www.yacy-forum.de/viewtopic.php?t=177 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@388 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	858cd94299	replaced indexing ram-queue by file-based stack-queue git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@381 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	252c6e4869	added crawl queue monitor for global crawls git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@372 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	9a3f80403e	redesigned IndexCreate menu -- introduced submenues to enable more crawl queue control pages git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@370 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	a25b5b4986	fixed possible memory leak in htmlScraper: be aware that now links can get lost; further work necessary git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@288 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago


65 Commits (49a83f99d9990d720e45ddf5ee16285fc272e0fb)