yacy_search_server

Commit Graph

Author	SHA1	Message	Date
reger	083df255e4	fix html tag attribute parsing containing attribute w/o value e.g. itemscope or autofocus (in such case the next key was not properly recognized).	8 years ago
reger	cb95b7339a	include html5 <time> tag in content scraper, add "datetime" property of <time> tag to scrapers startdate list. Datetime is parsed as iso8601 (xml) date, html5 allows partial as well as duration (not handled by this)	8 years ago
luccioman	aa9ddf3c23	Added control over Robots.txt active threads maximum number. When starting a crawl from a file containing thousands of links, configuration setting "crawler.MaxActiveThreads" is effective to prevent saturating the system with too many outgoing HTTP connections threads launched by the crawler. But robots.txt was not affected by this setting and was indefinitely increasing the number of concurrently loading threads until most of the connections timed out. To improve performance control, added a pool of threads for Robots.txt, consistently used in its ensureExist() and massCrawlCheck() methods. The Robots.txt threads pool max size can now be configured in the /PerformanceQueus_p.html page, or with the new "robots.txt.MaxActiveThreads" setting, initialized with the same default value as the crawler.	8 years ago
reger	fdcf33f08f	fix Domain.stripToHostName for some IPv6 cases add unit test for it	8 years ago
reger	ac6e198bd1	add unit test for Domains.stripToPort, simplify ipv6 check	8 years ago
luccioman	a0dfbaca6a	FileUtils : added some JavaDocs and unit test cases	8 years ago
reger	395f2e8946	Make ServletRequest implement the standardized HttpServletRequest interface, to make all readily available information from the original ServletRequest available to YaCy servlets (without converting data to internal structures). The implementation of the common interface allows easier integration of YaCy servlets with the servlet standard (e.g. shared login service with the servlet container etc.)	8 years ago
luccioman	7296e3884f	Switched even more URLs to pure relative ones. Thus a YaCy peer can run behind a reverse proxy subfolder without need for the reverse proxy to rewrite HTML links (a CPU costly operation). Tested on Debian Jessie with an apache2 reverse proxy. See related mantis issues http://mantis.tokeek.de/view.php?id=106 and http://mantis.tokeek.de/view.php?id=701	8 years ago
luccioman	731684105a	Improved absolute URLs rendering in OpenSearch desc and RSS feeds. When the peer is behind a reverse proxy providing SSL/TLS encryption, the rendered absolute URLs should start with https when the user browser requested https : added limited support to the X-Forwarded-Proto HTTP header notably provided on Heroku platform. Also added some unit tests.	8 years ago
reger	c9e81d2fa0	fix Column parsing from celldefinition string, without cellwidth def. (outofbound exception)	8 years ago
reger	af39a76bf6	Reduce number of default max. search navigator lines (from 10000) to 100 + make it configurable	8 years ago
reger	20a1b29ed3	add simple test case for ReferenceContainer helpful for debugging calculated ranking parameter	8 years ago
reger	3c7220bc7b	Refactor rwi reference word position and word distance calculation used for rwi ranking. Main changes: - introduce a posintext() to access the stored value. This also reduces mem alloc of position array for WordReferenceRow (index access) - use the positions() array for joined references on multi-word queries if needed (otherwise allow positions() to be null) - adjust assignments and the min() max() and distance() calculation accordingly	8 years ago
luccioman	c3c4a52408	Added more examples in Blacklist JUnit test.	8 years ago
reger	8b74a6bf57	fix min/max calculation of WordReferenceVars.distance() Issue was the calculation in AbstractReference with positions.clear() call, this made distance result always 0 (distance needs min 2 positions) and created concurrency issues. + unit test of changes	8 years ago
luccioman	7717a3d43d	Fixed license headers on files created to improve favicon management.	8 years ago
luccioman	6e1959f469	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git Conflicts: htroot/yacysearchitem.java source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java source/net/yacy/search/schema/CollectionConfiguration.java source/net/yacy/server/serverObjects.java	8 years ago
luccioman	7136b1ad60	HTML validation : fixed URL encoding of Pictures link.	8 years ago
luccioman	3ccd89e274	Fixed MultiProtocolURL.resolveBackpath to handle remaining '..' segments	8 years ago
luccioman	f1f4459f88	Added some unit tests for Blacklist.isListed()	8 years ago
reger	e68b00678e	prevent negative score on URIMetadataNode - in the special case where no solr score is supplied. + assert before use & test case	8 years ago
reger	b752bcfecb	adjust date in text detection to ignore some program version strings like "3.1.2.0102" see http://mantis.tokeek.de/view.php?id=650 + expand test case	8 years ago
reger	b017e97421	optimize condenser language detection a little. langdetect probabilities take letter case into account, add words from description and anchors etc. as is. + add it to javadoc	8 years ago
reger	ae3717d087	adjust Tokenizer sentence count to ignore repeated punctuation (like !!!! ) + remove unused sentenceword map (we use only the count) + upd test case for sentence count	8 years ago
reger	474f0476c6	adjust Tokenizer sentence count on trailing text after last recognized sentence + upd test case for rwi multi-word-query (leaving results known to fail untested)	8 years ago
reger	1a79c64495	generalize DateDetection with holiday date rules readily available in icu to make sure current dates are recognized (was fixed to 2014 - 2016) + adjust holiday date parser from pattern.match to pattern.find to deal with leading and trailing text + moved relative date recognition (morgen, tomorrow) to parseline (used by query parser only), as not working and problematic for indexing + add test case for parseline (used by query parser)	8 years ago
reger	32a2e3a22a	have RSSFeed.getChannel return empty message on missing channel element, a) required b) prevent NPE in rss servlets + add test	8 years ago
luccioman	4585a60d7e	Made use of the constant corresponding to the hard-coded value.	8 years ago
luccioman	1bb0b135ac	Avoid duplication of various MS Windows file URLs flavors Fix for mantis 692 (http://mantis.tokeek.de/view.php?id=692)	8 years ago
reger	6f8c3ccea4	improve url hash computation for file path with mixed java & windows file.separator to compute equal hashes (by normalizing path for computation) + expand test case to check mixed java / windows file url notation like e.g. file:///c:/test/file.html vs. file:///c:\test/file.html - relates partially to http://mantis.tokeek.de/view.php?id=692	8 years ago
reger	330768c8a2	fix for solr write.lock after mode change http://mantis.tokeek.de/view.php?id=686 The embedded core holds a lock on the index and must be closed. Earlier commit comment states that core should be closed with solr instance instead of on close of connector. Adjusted the InstanceMirror.close() to take care of closing the embedded instance to release the lock. In 2 routines of fulltext this was already explicitly implemented (disconnectLocalSolr). Now this disconnect is part of the InstanceMirror.close().	8 years ago
reger	11786457b7	add test case for EmbeddedSolrConnector close() for issue http://mantis.tokeek.de/view.php?id=686 (without solving the issue here)	8 years ago
reger	585d2a6441	test case: for NewsPool to check the id modificator (for unique id) and observe the distribution order .. hands on. + add test/DATA to gitignore	8 years ago
reger	ff6589fc0f	test case: simulating multi word query for local rwi index Purpose of the test case is to be able to (controlled) analyse the rwi ranking for multi word searches (with focus on posintext and word-distance ranking)	8 years ago
reger	7f63fc50f3	prepare a IndexSegment test case for RWI index testing + prevent NPE in Segment.clear() on missing embedded solr instance.	8 years ago
reger	272cdd496a	reactivate sentence counter in WordTokenizer for phrasepos ranking, by counting punctuation (delivered as 1 char word) again.	8 years ago
Michael Peter Christen	5e165a8150	removed unused imports	8 years ago
reger	e310ec5f70	fix posInText ranking calculation to score 0 on no position info + fix Word posInText calc in Tokenizer to start with 1 + test case	8 years ago
reger	39dd244693	fix ConcurrentScoreMap.set() calculation of totalCount() + test case	8 years ago
reger	ebde21079a	refactor xlsParser to include Excel file attribute (like author) in parser result doc. Similar to ppt and doc parser, completing a TODO in xlsParser.	8 years ago
reger	5e335b32da	fix Blacklist.contains() matching path pattern to string similar to `5e9e871192` + add proof testcase	8 years ago
reger	f89d4eb51d	fix MultiProtocolURL init (assign of host) for urls with '/' in query part + add to test case	8 years ago
reger	87fcfc6d78	Adjusted hash computation and toNormalform for file:// protocol to deliver the same hash for the same file on Windows filesystem path with forward- and backslash in path. Background see http://mantis.tokeek.de/view.php?id=671 +Test case	8 years ago
reger	7b226afc33	fix HostQueueTest - changed open parameter	8 years ago
luccioman	893a40995a	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	8 years ago
reger	fcc29c36f0	test case for HostBalancer issue in intranet mode with file:// protocol, 2 hostqueues accessing same cache file concurrently http://mantis.tokeek.de/view.php?id=668 Reason seems to be diff. hosthash key of hostqueues on reopen. Internal queue key and external representation (directoryname currently hostname.port) must be adjusted to fix it (not done yet).	8 years ago
luccioman	6e96c7341a	Merge remote-tracking branch 'origin/master' Conflicts: htroot/Load_MediawikiWiki.java htroot/Load_PHPBB3.java htroot/ViewImage.java	8 years ago
reger	a476d06aec	wiki header code test string add "closing" tag	9 years ago
reger	d4da4805a8	internal wiki code, require header line to start with markup (to allow something like "one=two" as text) + incl. test case	9 years ago
reger	223071337b	Translator to take caution of word boundaries to identify text portion to be translated. To avoid key="TEST" sourcetext="this is a myTESTcase for it" translation of partial terms/words. Add check of word boundary before and after sourcetext (incl. take care of current practice for key to be delimited by > <) + add test case	9 years ago

58 Commits (bdaef80a551b3609b82a235ca99aca9bd6e56aab)