yacy_search_server

Commit Graph

Author	SHA1	Message	Date
luccioman	db3b9db9c2	Crawl from local file : faster task end when manually terminating crawl.	8 years ago
reger	4c67ed3f8d	catch rwi ranking div by zero exception during rwi search result processing worddistance calculation is effected by concurrent update (normalization) of min/max ranking parameter for wordpositions. On update of min/max the exception is raised in distance calc and now catched. This concurrent update and change of ranking results is needed for speed but should be further checked for optimization	8 years ago
luccioman	47af33a04c	Advanced Crawl from local file : better processing of large files. Applied strategy : when there is no restriction on domains or sub-path(s), stack anchor links once discovered by the content scraper instead of waiting the complete parsing of the file. This makes it possible to handle a crawling start file with thousands of links in a reasonable amount of time. Performance limitation : even if the crawl start faster with a large file, the content of the parsed file still is fully loaded in memory.	8 years ago
luccioman	ee92082a3b	Updated javadocs : warning about closing stream responsibility.	8 years ago
luccioman	6f49ece22f	Fixed redirected URLs processing as crawl start point. See mantis 699 (http://mantis.tokeek.de/view.php?id=699) for details.	8 years ago
reger	68217465fe	div by null in word distance calculation (again, description in http://mantis.tokeek.de/view.php?id=698) as root cause was not seen, added just workaround reducing in favour over a try catch (for easier followup).	8 years ago
luccioman	7263d17436	Removed mentions of deprecated LURL-db. Thanks to LA_FORGE asking about if on YaCy forum ( http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5895 )	8 years ago
reger	8b74a6bf57	fix min/max calculation of WordReferenceVars.distance() Issue was the calculation in AbstractReference with positions.clear() call, this made distance result always 0 (distance needs min 2 positions) and created concurrency issues. + unit test of changes	8 years ago
luccioman	da362628fb	Added fine log level for too long blacklist matching processing.	8 years ago
reger	aaae7c6462	adjust ConcurrentScoreMap internal value map to interface and use parameter Long -> Integer (saves some bytes)	8 years ago
reger	31d2a5645e	remove obsolete query variable leftover from `8fb370d9f8 (diff-1d4259005ebfddc11083387857a86175)` harmonize ranking shift parameter to 0xFF correct addresult weight parameter to long	8 years ago
luccioman	a588ed7628	Applied image headers customization to the new ViewFavicon servlet.	8 years ago
luccioman	7717a3d43d	Fixed license headers on files created to improve favicon management.	8 years ago
luccioman	6e1959f469	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git Conflicts: htroot/yacysearchitem.java source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java source/net/yacy/search/schema/CollectionConfiguration.java source/net/yacy/server/serverObjects.java	8 years ago
reger	685d8e86bf	Avoid frequent data type casting (float/long) for rwi score refactor to using long in URIMetadataNode too (and related call parameters) As remote rwi score's are not used (since v1.83) skip reading float-score , but keep in toString() for communication with older versions.	8 years ago
luccioman	3ccd89e274	Fixed MultiProtocolURL.resolveBackpath to handle remaining '..' segments	8 years ago
luccioman	4b699c469a	Blacklist refactoring : extracted a function for easier unit testing	8 years ago
luccioman	54cfcc3f56	CrawlCheck_p.html : also display info about disallowed URLs.	8 years ago
luccioman	8b341e9818	Robots : properly handle URLs including non ASCII characters This fixes GitHub issue 80 ( https://github.com/yacy/yacy_search_server/issues/80 ) reported by Lord-Protector.	8 years ago
reger	e68b00678e	prevent negative score on URIMetadataNode - in the special case were no solr score is supplied. + assert before use & test case	8 years ago
luccioman	242707f9b4	Fixed loadFromCache with strategy IFFRESH. This fixes mantis 695 ( http://mantis.tokeek.de/view.php?id=695 ) : crawl start with 'Link-List of URL' option on websites using cookies.	8 years ago
reger	b752bcfecb	adjust date in text detection to ignore some program version strings like "3.1.2.0102" see http://mantis.tokeek.de/view.php?id=650 + expand test case	8 years ago
reger	b017e97421	optimize condenser language detection a little. langdetect probabilities take letter case into account, add words from description and anchors etc. as is. + add it to javadoc	8 years ago
reger	ae3717d087	adjust Tokenizer sentence count to ignore repeated punktuation (like !!!! ) + remove unused sentenceword map (we use only the count) + upd test case for sentence count	8 years ago
reger	474f0476c6	adjust Tokenizer sentence count on trailing text after last recognized sentence + upd test case for rwi multi-word-query (leaving results known to fail untested)	8 years ago
reger	3861ac9293	upd maven dependency-check plugin to reflect changes of https://nvd.nist.gov + upd unknown ant script with current lib/jsch version	8 years ago
reger	681a61dafb	adjust rwi index result word position handling used for rwi ranking - correct WordReferenceVars.toRowEntry posintext parameter to set expected min posintext (the difference is on multi-word queries, while positions are ordered by search word order). - modified posofphrase/posinphrase join operation - to set min posofphrase - and keep posinphrase if not same posofphrase (was set to 0, no differentiation during ranking) + fix compiler msg (missing type declaration)	8 years ago
reger	14f7577231	add support for older Word versions (Word6/Word95) to docParser	8 years ago
reger	1a79c64495	generalize DateDetection with holiday date rules readily available in icu to make sure current dates are recognized (was fixed to 2014 - 2016) + adjust holiday date parser from pattern.match to pattern.find to deal with leading and trailing text + moved relative date recognition (morgen, tomorrow) to parseline (used by query parser only), as not working and problematic for indexing + add test case for parseline (used by query parser)	8 years ago
reger	6f68f08354	correct DateDetection Silvester date add Thanksgiving	8 years ago
reger	32a2e3a22a	have RSSFeed.getChannel return empty message on missing channel element, a) required b) prevent NPE in rss servlets + add test	8 years ago
luccioman	8d57b5b970	Added some javadocs.	8 years ago
luccioman	60df09fff9	Fixed some HTML validation errors : Illegal character in query Now encode space characters in URLs query part.	8 years ago
reger	862f28eaa6	display number of documents/rss-items for label "docs" in load_rss_p servlet (as replacement for the rarely used "docs" rss-tag for a url to the rss-specification)	8 years ago
luccioman	dcdea2d02f	Fixed shutdown for crawler.MaxActiveThreads value greater than 200 Shutdown was hanging in CrawlQueues.close() at this.workerQueue.put(POISON_REQUEST) when config value crawler.MaxActiveThreads was greater than 200. Revealed by "Collision" Threads dumps in mantis 689 (http://mantis.tokeek.de/view.php?id=689#c1312) Fixed consistency between this.worker.length and this.workerQueue capacity, and made the process more reliable using non-blocking offer() function.	8 years ago
luccioman	d286ba2c3e	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	8 years ago
luccioman	b8f6458152	Prevent yacy main thread from hanging on browser opening process. First fix for mantis 689 (http://mantis.tokeek.de/view.php?id=689). On Debian Linux, with a headless jre and no open browser, browser.openBrowserClassic() was called and waited forever the browser process end (p.waitFor()). YaCy shutdown was therefore not working until the browser was closed. Also modified browser opening command for Unix platform to open the default the browser (with xdg-open util) instead of Firefox. xdg-open also has the advantage to be asynchronous (not blocking).	8 years ago
reger	70e1eb30a5	prevent StringIndexOutOfBounds in getLocalFile() + tighten patching of DOS path w/o protocol to drive "LETTER":	8 years ago
luccioman	1bb0b135ac	Avoid duplication of various MS Windows file URLs flavors Fix for mantis 692 (http://mantis.tokeek.de/view.php?id=692)	8 years ago
luccioman	b9a8476f02	Removed unused import	8 years ago
reger	e73c1eea8c	remove unused rootpattern, leftover from commit `9a5ab4e2c1 (diff-d2b184283abed53ae260fc9eabdaef40)`	8 years ago
reger	6f8c3ccea4	improve url hash computation for file path with mixed java & windows file.separator to compute equal hashes (by normalizing path for computation) + expand test case for to check mixed java / windows file url notation like e.g. file:///c:/test/file.html vs. file:///c:\test/file.html - relates partially to http://mantis.tokeek.de/view.php?id=692	8 years ago
reger	efcb6a1e74	fix supported mime XML -> xml for rssParser (mime normalized to lower case for comparison) + add mime text/xml as in use for rss in the wild	8 years ago
luccioman	b3b75b0498	Accessibility : add a customizable alternative text to YaCy log Applied W3C recommendations : https://www.w3.org/TR/html51/semantics-embedded-content.html#a-link-or-button-containing-nothing-but-an-image and https://www.w3.org/TR/html51/semantics-embedded-content.html#logos-insignia-flags-or-emblems	8 years ago
luccioman	f2bc1b268d	Updated URL fragment validation rules according to current standards See RFC 3986 (https://tools.ietf.org/html/rfc3986) or URL living standard (https://url.spec.whatwg.org/)	8 years ago
luccioman	b1b8e69da8	Fixed NullPointerException cases	8 years ago
luccioman	3ee4f56c39	Improved ErrorCache behavior when switching networks Even after network switch, ErroCache was still holding a reference to the previous Solr cores, thus becoming useless until next YaCy restart. Initial error cache filling with recent errors from the index was also missing after the swtich.	8 years ago
luccioman	7d5ba2afa4	Added some JavaDoc and moved crawlStacker close at the right place.	8 years ago
luccioman	8edbcd8ad4	Log eventual Solr instances close errors. We do not want to block on this kind of error, but this should not silently fail as it may have later consequences.	8 years ago
reger	330768c8a2	fix for solr write.lock after mode change http://mantis.tokeek.de/view.php?id=686 The embedded core holds a lock on the index and must be closed. Earlier commit comment states that core should be closed with solr instance instead on close of connector. Adjusted the InstanceMirror.close() to take care of closing the embedded instance to release the lock. In 2 routines of fulltext this was already explicite implemented (disconnectLocalSolr). Now this disconnect is part of the InstanceMirror.close().	8 years ago

1 2 3 4 5 ...

3748 Commits (c0379c3cd359cdf7371ad793719104d8efb871f6)