yacy_search_server

Commit Graph

Author	SHA1	Message	Date
luccioman	2a87b08cea	Removed temporary html parser test code	7 years ago
luccioman	90a7c1affa	HTML parser : removed unnecessary remaining recursive processing Recursive processing was removed in commit `67beef657f`, but one remained for anchors content(likely omitted from refactoring). It is no more necessary : other links such as images embedded in anchors are currently correctly detected by the parser. More annoying : that remaining recursive processing could lead to almost endless processing when encountering some (invalid) HTML structures involving nested anchors, as detected and reported by lucipher on YaCy forum ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6005 ).	7 years ago
luccioman	9b1bb2545e	Refactored plain-text URLs detection implementation. For faster processing (measured about 2 times faster on many real-world examples) and more advanced detection (previous algorithm detected only URLs separated from the rest of the text by a space character).	7 years ago
luccioman	8da3174867	Ensure lower case conversion consistency with any default locale. Especially for Turkish speaking users using "tr" as their system default locale : strings for technical stuff (URLs, tag names, constants...) must not be lower cased with the default locale, as 'I' doesn't becomes 'i' like in other locales such as "en", but becomes 'ı'.	7 years ago
luccioman	286f3018bd	Made mime type and extension normalization locale independent. Previously, upper cased mime type was incorrectly normalized when the default locale is Turkish.	7 years ago
luccioman	319231a458	Added a generic XML parser, able to parse elements text and URLs. This parser adds support for any XML based format other than already supported XML vocabularies such XHTML, RSS/Atom feeds... It will eventually be used as a fallback if one of these specific parsers fail, before falling back to the existing genericParser which extracts not that much useful information except URL tokens.	7 years ago
luccioman	64cec2790d	Improved character encoding detection from Content-Type header Also updated some related JavaDocs	8 years ago
luccioman	1acb7005d0	Added a basic JUnit test with test gz files for the gzip parser	8 years ago
luccioman	1e2fb76720	Properly close test files in htmlParser unit test	8 years ago
luccioman	9dd790087d	Added HT Cache basic statistics (hit rate)	8 years ago
luccioman	28b451a0b3	Made Cache compression level and lock timeout user configurable	8 years ago
luccioman	a7394b479b	Limit the synchronization blocking time on some Cache operations. Using a Reentrant lock instead of the intrinsic synchronization lock permits limiting the blocking time to acquire a lock. Useful on a very busy Cache concurrently accessed by many threads : when the time to acquire a lock is too high, getting/storing content on the cache becomes inefficient, and it is then better to fall back to loading remote resources. Illustrated by the CacheTest stress test and some traces reported in mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )	8 years ago
Michael Peter Christen	6fe735945d	migrated Solr 5.5 -> Solr 6.6 and from Java 1.7 -> 1.8 Also: now Version 1.921	8 years ago
luccioman	a04feac064	Ensure file input streams proper closing in both success and failures Also add when possible a warning level log message on input stream closing error instead of failing silently. This could help understanding some IO exceptions such as "too many files open".	8 years ago
luccioman	d98c04853d	Ensure proper closing of file input streams.	8 years ago
luccioman	c226ded799	Fix unescape of URLs having some '%' chars but not percent-encoded	8 years ago
reger	077d062be3	Adjust mergeDocuments to keep youngest last-modified date of document collection	8 years ago
luccioman	522a268305	Improved new blacklist entries URL scheme detection.	8 years ago
luccioman	31fff2c986	Extended WikiCode template inclusion syntax support. Wiki templates are not rendered but syntax support is improved, which greatly enhance snippets rendering on search results coming from a MediaWiki dump import. Tested on various dumps from Wikimedia at https://dumps.wikimedia.org/backup-index.html See also Wikipedia transclusion documentation at https://en.wikipedia.org/wiki/Wikipedia:Transclusion	8 years ago
reger	7a7da698d4	fix unit test MultiProtocolURL(file) assertion for Windows path with drive letter.	8 years ago
luccioman	23775e76e2	Fixed endless loop case in wikicode processing. Detected when importing recent MediaWiki dumps containing some pages with script content in plain text format (see Scribunto extension https://www.mediawiki.org/wiki/Extension:Scribunto ). Further improvement : modify the MediawikiImporter to prevent processing revisions whose <model> is not wikitext.	8 years ago
luccioman	0bc868a819	Improved support for non ASCII chars in local file system URLs Creating a MultiProtocolURL instance from a File object and then retrieving a File with getFSFile() was inconsistent with file paths containing space or non ASCII chars.	8 years ago
reger	777cb5b812	remove test case for Standard_MemoryControl which will always fail see https://github.com/yacy/yacy_search_server/pull/114	8 years ago
reger	1ccc44e681	fix default/httpd.mime Z file extension to lower case + test case	8 years ago
reger	18c7563dbe	Extend DCEntry.getLanguage convert to ISO639-1 codes for more languages by using icu.ULocale for languages not already covered (ICU normalizes to ISO639-1 2 char codes). Add test class Use DublinCore vocabulary declarations in DCEntry and SurrogateReader for easier usage debugging, Init SurrogateReader.inputSource on first use.	8 years ago
reger	275c0cddd1	Adjust DefaultServlet test case to recent change, depreciate unused CONNECTION_PROP_PROTOCOL (also as it might be misleading with getProtocol vs getScheme)	8 years ago
reger	41e2ee0eca	Fix call parameter for ConnectionInfo in MonitorHandler (expected scheme e.g. http, was protocol version). Depreceate obsolete custom X-...-Scheme header constant. Use existing FORMAT_ANSIC Dateformatter in HeaderFramework. Correct htmlParserTest (del one not intended println)	8 years ago
reger	f254fcfc67	fix htmlParser <script> text extraction on code containing expression recognized as tag like 1<a reported in https://github.com/yacy/yacy_search_server/issues/109 Script content is ignored by default, but the text is filtered for html tags. Modified scraper to skip tag filtering while within a <script> section (until a closing tag is detected </script>. Possible side effect, missing </script> end-tag will truncate trailing content text.	8 years ago
luccioman	2f191e0e1c	Improved MultiprocotolURL non ASCII characters support. After @sinkuu Pull Request #108 added JUnit tests, updated some JavaDoc and also improved URL tokenization to support non ASCII characters.	8 years ago
luccioman	5c8958bcea	Updated Javadoc and Junit tests for the WebStructureGraph class.	8 years ago
luccioman	d9766ca981	Fixed WatchWebStructure_p.html render to include https URLs. As described in mantis 721 (http://mantis.tokeek.de/view.php?id=721) WatchWebStructure_p.html failed to include in its structure view https and other protocols and ports than default http.	8 years ago
luccioman	ed3dd5e31a	Fixed webstructure.xml API used with a domain name 'about' parameter. As described in mantis 720 (http://mantis.tokeek.de/view.php?id=720), when requesting this API with a domain name instead of a complete URL only HTTP references on default port were listed.	8 years ago
luccioman	0da1e6ba16	Factored code re-implementing DigestURL.hosthash() method. This ensure consistent implementation of the url host hash generation and easier usage finding in source code. Also added a unit test for this function.	8 years ago
luccioman	86adfef30f	Added automated unit tests and perfs test for WebStructureGraph class. Fixed references count when multiple links target the same domain name in one document.	8 years ago
luccioman	c9889991b9	Fixed 2 failing JUNit tests.	8 years ago
reger	083df255e4	fix html tag attribute parsing containing attribute w/o value e.g. itemscope or autofocus (in such case the next key was not properly recognized).	8 years ago
reger	cb95b7339a	include html5 <time> tag in content scraper, add "datetime" property of <time> tag to scrapers startdate list. Datetime is parsed as iso8601 (xml) date, html5 allows partial as well as duration (not handled by this)	8 years ago
luccioman	aa9ddf3c23	Added control over Robots.txt active threads maximum number. When starting a crawl from a file containing thousands of links, configuration setting "crawler.MaxActiveThreads" is effective to prevent saturating the system with too many outgoing HTTP connections threads launched by the crawler. But robots.txt was not affected by this setting and was indefinitely increasing the number of concurrently loading threads until most ot the connections timed out. To improve performance control, added a pool of threads for Robots.txt, consistently used in its ensureExist() and massCrawlCheck() methods. The Robots.txt threads pool max size can now be configured in the /PerformanceQueus_p.html page, or with the new "robots.txt.MaxActiveThreads" setting, initialized with the same default value as the crawler.	8 years ago
reger	fdcf33f08f	fix Domain.stripToHostName for some IPv6 cases add unit test for it	8 years ago
reger	ac6e198bd1	add unit test for Domains.stripToPort, simplify ipv6 check	8 years ago
luccioman	a0dfbaca6a	FileUtils : added some JavaDocs and unit test cases	8 years ago
reger	395f2e8946	Make ServletRequest implement the standardized HttpServletRequest interface, to make all readily available information from the original ServletRequest available to YaCy servlets (without converting data to internal structures). The implementation of the common interface allows easier integration of YaCy servlets with the servlet standard (e.g. shared login service with the servlet container etc.)	8 years ago
luccioman	7296e3884f	Switched even more URLs to pure relative ones. Thus a YaCy peer can run behind a reverse proxy subfolder without need for the reverse proxy to rewrite HTML links (a CPU costly operation). Tested on Debian Jessie with an apache2 reverse proxy. See related mantis issues http://mantis.tokeek.de/view.php?id=106 and http://mantis.tokeek.de/view.php?id=701	8 years ago
luccioman	731684105a	Improved absolute URLs rendering in OpenSearch desc and RSS feeds. When the peer is behind a reverse proxy providing SSL/TLS encryption, the rendered absolute URLs should start with https when the user browser requested https : added limited support to the X-Forwarded-Proto HTTP header notably provided on Heroku platform. Also added some unit tests.	8 years ago
reger	c9e81d2fa0	fix Column parsing from celldefinition string, without cellwidth def. (outofbound exception)	8 years ago
reger	af39a76bf6	Reduce number of default max. search navigator lines (from 10000) to 100 + make it configurable	8 years ago
reger	20a1b29ed3	add simple test case for ReferenceContainer helpful for debugging calculated ranking parameter	8 years ago
reger	3c7220bc7b	Refacture rwi reference word position and word distance calculation used for rwi ranking. Main changes: - introduce a posintext() to access the stored value. This reduces also mem alloc of position array for WordReferenceRow (index access) - use the positions() array for joined references on multi-word queries if needed (otherwise allow positions() to be null - adjust assignments and the min() max() and distance() calculation accordingly	8 years ago
luccioman	c3c4a52408	Added more examples in Blacklist JUnit test.	8 years ago
reger	8b74a6bf57	fix min/max calculation of WordReferenceVars.distance() Issue was the calculation in AbstractReference with positions.clear() call, this made distance result always 0 (distance needs min 2 positions) and created concurrency issues. + unit test of changes	8 years ago

1 2

96 Commits (2a87b08cea67f8f2ae46e318c1c3945e8520ec53)