yacy_search_server

Commit Graph

Author	SHA1	Message	Date
reger	1fdcc2d67b	change seedfile upload ip check to allow intranet ip in intranet mode - this allows to setup a principal peer in intranet environment	11 years ago
reger	e31b0e6d67	- update javadoc Seed.getIP - default mySeed.ip to hostip in SeedDB.initMySeed() if Intranetmode this allows to become senior status in intranet hosted search network with few peers, otherwise peer would stay junior because of default init with loopback ip as public (dna) ip.	11 years ago
reger	350c6b8250	in IntranetMode allow intranet hosted seedlist with Network_Domain "any" - so far intranet seedlist hosts are always denied but need to be allowed in intranet mode	11 years ago
orbiter	d68438c3d9	make sure that the postprocessing background thread never dies by any exception	11 years ago
orbiter	b4f2a1db6e	added an unlock icon for all protected pages that are unlocked because the administrator is logged in.	11 years ago
reger	ea6c9e9b07	reduce mem buffer overhead for gap files during r/w (they are typically small compared to idx allowing to use smaller buffersize -> set to 16k records)	11 years ago
reger	e88537522d	allow single quote " ' " in query see http://mantis.tokeek.de/view.php?id=379 -add QueryGoal test case for this	11 years ago
orbiter	487021fb0a	snippet computation update	11 years ago
orbiter	1c2f1f233a	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
reger	5a4995ded3	fill solr rss writer dc:subject tag with keyword content	11 years ago
orbiter	927aaa95a6	concurrency bugfix	11 years ago
orbiter	c9e593cf78	removed warnings	11 years ago
reger	7584352e7b	use more predefined Solr query parameter constants - use CommonParams and DisMaxParams constants - fix typo in get sort parameter - getDocumentCountByParams redundant implementation and risk of not optimized call (row parameter unspecified) -> as only used from getCountByQuery removed from interface	11 years ago
reger	f9db5dd6c5	reduce doublecontent check document (prevent out of memory) see http://mantis.tokeek.de/view.php?id=437 test result (concurrency=7) 2000 docs = eom always 1000 docs = eom always 100 docs = eom never chosen -> 200 docs (eom not encountered during test with 1GB mem setting)	11 years ago
reger	e9eae45b55	simplify rssreader and improve atom feed link extraction - type detection (rss/atom) - init type parameter overwritten during parse, parameter obsolete - detection by endtag changed to simpler first-tag evaluation - channel image not used, removed related extra parser handling - remove unused code (set/getImage) in rssfeed - atom link extraction to account for possible multiple link tags - spec limits link to one with rel="alternate" or one without rel attribute not accounting for the following type & hreflang exception yet: o atom:entry elements MUST NOT contain more than one atom:link element with a rel attribute value of "alternate" that has the same combination of type and hreflang attribute values.	11 years ago
reger	a8508417d1	catch NPE during crawl (OAI import) - condenseDocument mime=null (allowed) - collectionconfiguration responseheader = null (allowed)	11 years ago
reger	3dde94422f	center searchevent lines on network graph (PerformanceSearch_p.html)	11 years ago
Michael Peter Christen	3860711aef	fix for possible interruption of concurrent queries	11 years ago
Michael Peter Christen	6344718f8b	reducing the concurrent query stack size and reduced concurrency of postprocessing to avoid OOM situations	11 years ago
Michael Peter Christen	eca9380e3d	bugfix for crawler double-check: if an url is redirected, the redirect-target was not double-checked. This is now done by replacing the redirect-URL on the crawl queue again (where it is double-checked)	11 years ago
Michael Peter Christen	9ac0c93f17	fix for subpath crawl filter	11 years ago
Michael Peter Christen	66106bdaf0	fix for crawler attribute maxdompages	11 years ago
Michael Peter Christen	49d91b94c3	npe fix in crawler	11 years ago
Michael Peter Christen	b7183a7321	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
reger	ea2e627662	fix ConfigAccounts del user with uppercase letter in name (usernames are case sensitive, userdb.delete used toLower)	11 years ago
Michael Peter Christen	c465b791af	typo	11 years ago
Michael Peter Christen	191ec8c82a	added concurrency to postprocess rewrite process	11 years ago
Michael Peter Christen	a1e8bdd5e9	log ppm instead of docs/second	11 years ago
Michael Peter Christen	cc0ded7abd	set process type of web graph according to fields as defined in the schema	11 years ago
Michael Peter Christen	12fb9d7cd1	log postprocessing constraints in case that postprocessing is not performed	11 years ago
Michael Peter Christen	3c23b89823	less logging	11 years ago
Michael Peter Christen	a0c53174c5	better solr query logging to detect unnecessary sort requests for more performance profiling	11 years ago
Michael Peter Christen	338f574bdc	no sorting if http/www unique fields are not demanded (makes query faster) and some code restructuring	11 years ago
Michael Peter Christen	1609763be5	toString fix	11 years ago
Michael Peter Christen	b983e68254	more retries, less sleep	11 years ago
Michael Peter Christen	1503ba7794	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
reger	8f77719091	fix "Ljava.lang.String" in crawl queue anchor name (e.g. IndexCreateQueues_p.html?stack=LOCAL with images in queue)	11 years ago
Michael Peter Christen	0ceeceb35e	more logic on Solr queries; usage of the query terms in postprocessing, saving one query for double document detection now per document	11 years ago
orbiter	38864ae004	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	4099296b45	added new classes which shall reduce call overhead to Solr (stub)	11 years ago
reger	d0c02e1de7	adjust rss lat/lon to double (common format across other classes)	11 years ago
orbiter	3491ab4c38	removed unused images from webgraph edge computation	11 years ago
orbiter	2371d6b8db	target linktexts must be string to enable search facets on these fields	11 years ago
Michael Peter Christen	001e05bb80	do not store failure of loading of robots.txt into the index as a fail document	11 years ago
Michael Peter Christen	05d58e4df0	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	98f45c9032	fix for image alt attachment to AnchorURLs in html parser.	11 years ago
orbiter	22ce4fb4dd	better error handling for remote solr queries and exists-checks	11 years ago
Marc Nause	9df14fc126	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Marc Nause	477be17c51	Replaced old UPNP library with Weupnp. UPNP should work now, at least it does on my network. UPNP code in YaCy can still be improved though (see TODO comment: make port on gateway configurable or find free one). *) removed old code *) added new lib *) changed code to work with new lib	11 years ago
orbiter	738989aab7	reverted commit `f94c91315b` because the webgraph has not enough performance for that	11 years ago
orbiter	e9163e7e10	fix for malformed hostpath names in crawl balancer	11 years ago
Michael Peter Christen	c115f3869c	enhanced snippet computation and test method in ViewFile	11 years ago
reger	6c10b59f3e	move bootstrap peers test systems to its test class var assignment not needed elsewhere.	11 years ago
orbiter	1027f3d04a	fix for the usage of ready-prepared solr queries, some queries are formulated as edismax query but this was not set as query attribute. The defType=edismax property needs a qf-field, so this was added as well. Do not remove that field again! This fixes also a problem with title-unique computation.	11 years ago
Michael Peter Christen	f94c91315b	if the webgraph is used, then use it also for reference computation to avoid contradictions with references_i in the collection index.	11 years ago
Michael Peter Christen	6e1dc444c3	added a snippet test function in ViewFile: you can now search for a specific word on the document; the servlet returns the snippet in the same way as it would be shown in a search result.	11 years ago
orbiter	4b06adb751	fix for file urls	11 years ago
orbiter	08409ec680	no idea why the words max was an ordered one. This change increases speed during document processing a bit	11 years ago
reger	e5854a5cdb	fix localhost link to opensearchdescription.xml	11 years ago
Michael Peter Christen	b44626e55b	fixed target_alt_t in webgraph	11 years ago
Michael Peter Christen	504327b15c	fix for condition for writing the webgraph	11 years ago
Michael Peter Christen	542c20a597	changed handling of crawl profile field crawlingIfOlder: this should be filled with the date, when the url is recognized as to be outdated. That field was partly misinterpreted and the time interval was filled in. In case that all the urls which are in the index shall be treated as outdated, the field is filled now with Long.MAX_VALUE because then all crawl dates are before that date and therefore outdated.	11 years ago
Michael Peter Christen	4eec1a7452	refactoring (change Metadata name of load time data structure to avoid confusion with Node data which is also called metadata)	11 years ago
reger	c95ba52cf0	improve logexception info - log a message or class name instead of msgtxt "null"	11 years ago
orbiter	e441831a24	reverted toString() change in AnchorURL to prevent mistakenly used toString(). This fixes also the update link bug.	11 years ago
reger	47f201a6b8	Add Solr default query fields (&qf) to select servlet according to the ranking profiles boost fields defined by the peer (if df/qf is not specified in query). This allows for pretty simple queries ( q=word) without the need to know about the specific index configuration. Making sure all relevant fields (as determined by the index owner) are searched, still maintaining the option to query specific fields and does not rely on the duplication of text to text_t. - add author to reset-default boost fields (support results for author nav)	11 years ago
reger	f96cfdc84d	prevent array out of bound exception on getRankingProfile(x) on faulty &profileNr= query parameter	11 years ago
reger	5f5fb4ecdc	remove unused static (RSS)search from protocol	11 years ago
reger	7c1706d83a	use CRLF in generated bat command scripts for windows - for easier viewing with standard viewers	11 years ago
reger	a2cb366b25	Combine /heuristic search modifier with opensearch configured targets - with search modifier /heuristic a request is sent to all configured opensearch target systems (old /heuristic/blekko modifier no longer valid) - this allows to use opensearch heuristic on individual search request (in contrast to configuration HEURISTIC_OPENSEARCH=true which sends an osd request on all global searches) - the index.html searchoption text adjusted to be displayed only if option configured - add Archive-It to predefined systems	11 years ago
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method than the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls, just use toNormalform(false) instead.	11 years ago
Michael Peter Christen	bf1b6b93e7	do not write CR values to webgraph if no CR values are computed	11 years ago
Michael Peter Christen	e039e78210	small bugfixes	11 years ago
Michael Peter Christen	32a2ff925c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	d07cdd8c3b	added SolrCloud access mode and configuration	11 years ago
Michael Peter Christen	8514bffc22	enhanced postprocessing status report	11 years ago
reger	b24572f304	fix GSA filter query assignment - use more parameter constants	11 years ago
Michael Peter Christen	b5fc2b63ea	removed exist() retrieval functions from error cache and replaced it with metadata retrieval from connectors directly. This should cause better usage of the cache. Automatically increase the metadata cache if more memory is available.	11 years ago
Michael Peter Christen	62c72360ee	cleanup of checkAcceptanceInitially in CrawlStacker, should avoid double-calling of solr	11 years ago
Michael Peter Christen	dd5cdfe212	reverted filter query hack, it did not work	11 years ago
Michael Peter Christen	b5d78ba156	reduced number of solr queries during crawling	11 years ago
Michael Peter Christen	5326970d6c	enhanced solr queries for single document extraction	11 years ago
Michael Peter Christen	525575bd97	added debugging of filter queries in thread dump thread names	11 years ago
Michael Peter Christen	f319ef268f	testing filter queries instead of queries to retrieve documents by id	11 years ago
Michael Peter Christen	fd87fa1613	removed more unnecessary exist-checks in ErrorCache	11 years ago
Michael Peter Christen	f2b476e08b	don't do a double check to solr for failed documents if they are not written to solr	11 years ago
Michael Peter Christen	06ab72d1af	enhanced crawler host round-robin strategy	11 years ago
orbiter	dab9a0786a	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	51bf5c85b0	Renamed the transmission cloud to buffer in dispatcher since the name 'cloud' was a bad idea. Changed also the accumulation process for peer targets so that every dht chunk is not assigned the set of redundant targets but they are assigned to redundant targets individually. This enhances the granularity of the target accumulation and should enhance the efficiency of the process. Finally the dht protocol client was enriched with the ability to remove the 'accept remote index' flag from peers or remove peers completely if they do not answer at all.	11 years ago
Michael Peter Christen	a694b6a8fc	another fix for unique field computation	11 years ago
Michael Peter Christen	fb3dd56b02	fix for processing of noindex flag in http header	11 years ago
Michael Peter Christen	b0d941626f	fixed bugs in canonical, robots and title/description unique calculation	11 years ago
reger	d9472d043a	cleanup older unused classes	11 years ago
reger	665e12f88e	move startup time from old serverCore to switchboard (most used here) to make servercore eventually obsolete.	11 years ago
reger	336425912a	remove unused localSearchThread from SearchEvent	11 years ago
reger	32bd2a61c1	add local ip to AbstractRemoteHandler local hostname cache	11 years ago
Michael Peter Christen	f3a6b6e21e	fix for bad URL decoding	11 years ago
Michael Peter Christen	1092e798a5	fixed double content postprocessing	11 years ago
Michael Peter Christen	aee5b108e5	added linkScraperParser, a parser which ignores the text like the generic parser but extracts links like the htmlParser. This should be used for ASCII documents without known text format annotation like source code files or json documents. Probably also good for xml files without known schema.	11 years ago
reger	2b8cc5832c	fix seek error for 0 file size records file by add extra check for file size = 0 in cleanlast() - (http://mantis.tokeek.de/view.php?id=411)	11 years ago


7339 Commits (6491270b3a17a26834956d9aaf396b211e0d6b2b)