yacy_search_server

Commit Graph

Author	SHA1	Message	Date
luccioman	0a120787e3	Improved accuracy of URLs search filters : protocol, tld, host, file ext	7 years ago
luccioman	d1c7dfd852	Fixed URL parsing with fragment and empty path	7 years ago
luccioman	e2f6427a63	Added a basic JUnit test for the Visio parser (vsdParser)	7 years ago
luccioman	d41ad7af6f	Restore initial locale at the end of a JUnit test case that modifies it.	7 years ago
luccioman	7206f1ed71	Do locale neutral case conversions on domain names. Required to properly run on systems with default locale set to Turkish language, as with this locale the 'i' character has different upper and lower case flavors than with other locales.	7 years ago
luccioman	398c66f06c	Do locale neutral case conversions in MultiProtocolURL, for any relevant URL parts : host name, URL scheme, session ids or technical parts (see https://url.spec.whatwg.org/#url-writing and https://tools.ietf.org/html/rfc3986 for current standard references). Remaining locale sensitive conversion used for detection of URL word components in urlComps() makes sense, but using the detected language would be preferable to using the default system locale.	7 years ago
luccioman	9531b83598	Do locale neutral case conversions in Classification. Required for people using Turkish language as their default system locale, as with this locale the 'i' character has different upper and lower case flavors than with other locales.	7 years ago
luccioman	ac209cac2e	Updated the generic top-level known domains list. Using current IANA reference list at https://www.iana.org/domains/root/db The generated URL hashes on these domains stay the same but performance is greatly improved as a DNS resolve request is required on URL hash computation when the TLD part of the host name is unknown. Hash computation mean time measured on 1541 sample URLs (one on each TLD) and a computer with a DSL connection : about 230ms before change, then only 20ms.	7 years ago
luccioman	fcd57e2d0f	Improved some JUnit tests isolation and resources release. The modified tests were successful when run manually from an IDE such as Eclipse, but failed occasionally when run with maven as part of the overall test suite.	7 years ago
luccioman	e0eda84c24	Remove old hard-coded holiday dates from DateDetection class. Replaced with rules relative to the current year, as already done for a part of the supported dates.	7 years ago
luccioman	73977ec0fe	Added a html parser charset detection unit test	7 years ago
luccioman	285f0d6a39	Consistently encode snapshot image with format requested on the API. Previously, calling /api/snapshot.png rendered JPEG encoded images.	7 years ago
luccioman	7c319c841e	Fixed pdf2image conversion with imagemagick on PDFs having transparency. The target image format (jpeg) doesn't support transparency, so the Html2ImageTest produced unusable black images when run on a Linux machine having the imagemagick package installed.	7 years ago
luccioman	fe75f326d8	Fixed ProfilingGraph calculation integer overflows and added test class. Complementary to fix proposed in PR #128 by @otteresk.	7 years ago
luccioman	5bf76f058a	Adjusted ResponseHeaderTest to succeed on slow or highly loaded CPU	7 years ago
luccioman	32c9dfa768	Added partial bzip2 stream parsing support and bzipParser JUnit test	7 years ago
luccioman	dd9cb06d25	Fixed RWI distance calculation on multi words search queries. Distance was lost when storing/retrieving references to intermediate result container. Now all JUnit tests are again successfully passing!	7 years ago
luccioman	c6ae87168a	Added unit tests on the gzip parser.	8 years ago
luccioman	169ffdd1c7	Finer control on max links to parse in the html parser.	8 years ago
luccioman	4743a104b5	Added some unit tests on FileUtils.	8 years ago
luccioman	e41d046a9d	Improved parsing support for OOXML spreadsheets (.xlsx). As reported by edycop in mantis 765 ( http://mantis.tokeek.de/view.php?id=765 ), parsing of xlsx files was quite incomplete. Now properly support "Shared String Table" entry in Office Open XML spreadsheets, and also detect embedded URLs. Integrating the Apache poi-ooxml library could be an option for finer OOXML formats support, but their SAX style parsing example ( http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api ) tends to show that a custom SAX handler is still efficient for lightweight and low memory footprint processing.	8 years ago
luccioman	780173008e	Implemented partial stream parsing of tar archives. Also added JUnit tests for the tar parser and fixed unwanted use of the tar parser as a fallback on files included in a tar archive.	8 years ago
luccioman	acab6a6def	Also handle text content when parsing XML within limits.	8 years ago
reger	f38fb7f02c	Add junit test for AbstractOperations.addOperand()	8 years ago
luccioman	ed678186a8	Updated xml parser limited parsing test for use with the latest JDK.	8 years ago
luccioman	f369679d1c	Fixed read/copy on input streams reading sometimes less than expected.	8 years ago
luccioman	bf55f1d6e5	Started support of partial parsing on large streamed resources. This enables the getpageinfo_p API to return something in a reasonable amount of time on resources in the megabyte size range and above. Support added first with the generic XML parser, for other formats regular crawler limits apply as usual.	8 years ago
luccioman	2a87b08cea	Removed temporary html parser test code	8 years ago
luccioman	90a7c1affa	HTML parser : removed unnecessary remaining recursive processing. Recursive processing was removed in commit `67beef657f`, but one remained for anchors content (likely omitted from refactoring). It is no longer necessary : other links such as images embedded in anchors are currently correctly detected by the parser. More annoying : that remaining recursive processing could lead to almost endless processing when encountering some (invalid) HTML structures involving nested anchors, as detected and reported by lucipher on YaCy forum ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6005 ).	8 years ago
luccioman	9b1bb2545e	Refactored plain-text URLs detection implementation. For faster processing (measured about 2 times faster on many real-world examples) and more advanced detection (previous algorithm detected only URLs separated from the rest of the text by a space character).	8 years ago
luccioman	8da3174867	Ensure lower case conversion consistency with any default locale. Especially for Turkish speaking users using "tr" as their system default locale : strings for technical stuff (URLs, tag names, constants...) must not be lower cased with the default locale, as 'I' doesn't become 'i' like in other locales such as "en", but becomes 'ı'.	8 years ago
luccioman	286f3018bd	Made mime type and extension normalization locale independent. Previously, upper cased mime type was incorrectly normalized when the default locale is Turkish.	8 years ago
luccioman	319231a458	Added a generic XML parser, able to parse elements text and URLs. This parser adds support for any XML based format other than already supported XML vocabularies such as XHTML, RSS/Atom feeds... It will eventually be used as a fallback if one of these specific parsers fails, before falling back to the existing genericParser which extracts not that much useful information except URL tokens.	8 years ago
luccioman	64cec2790d	Improved character encoding detection from Content-Type header Also updated some related JavaDocs	8 years ago
luccioman	1acb7005d0	Added a basic JUnit test with test gz files for the gzip parser	8 years ago
luccioman	1e2fb76720	Properly close test files in htmlParser unit test	8 years ago
luccioman	9dd790087d	Added HT Cache basic statistics (hit rate)	8 years ago
luccioman	28b451a0b3	Made Cache compression level and lock timeout user configurable	8 years ago
luccioman	a7394b479b	Limit the synchronization blocking time on some Cache operations. Using a Reentrant lock instead of the intrinsic synchronization lock permits limiting the blocking time to acquire a lock. Useful on a very busy Cache concurrently accessed by many threads : when the time to acquire a lock is too high, getting/storing content on the cache becomes inefficient, and it is then better to fall back to loading remote resources. Illustrated by the CacheTest stress test and some traces reported in mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )	8 years ago
Michael Peter Christen	6fe735945d	migrated Solr 5.5 -> Solr 6.6 and from Java 1.7 -> 1.8 Also: now Version 1.921	8 years ago
luccioman	a04feac064	Ensure file input streams proper closing in both success and failures Also add when possible a warning level log message on input stream closing error instead of failing silently. This could help understanding some IO exceptions such as "too many files open".	8 years ago
luccioman	d98c04853d	Ensure proper closing of file input streams.	8 years ago
luccioman	c226ded799	Fix unescape of URLs having some '%' chars but not percent-encoded	8 years ago
reger	077d062be3	Adjust mergeDocuments to keep youngest last-modified date of document collection	8 years ago
luccioman	522a268305	Improved new blacklist entries URL scheme detection.	8 years ago
luccioman	31fff2c986	Extended WikiCode template inclusion syntax support. Wiki templates are not rendered but syntax support is improved, which greatly enhances snippets rendering on search results coming from a MediaWiki dump import. Tested on various dumps from Wikimedia at https://dumps.wikimedia.org/backup-index.html See also Wikipedia transclusion documentation at https://en.wikipedia.org/wiki/Wikipedia:Transclusion	8 years ago
reger	7a7da698d4	fix unit test MultiProtocolURL(file) assertion for Windows path with drive letter.	8 years ago
luccioman	23775e76e2	Fixed endless loop case in wikicode processing. Detected when importing recent MediaWiki dumps containing some pages with script content in plain text format (see Scribunto extension https://www.mediawiki.org/wiki/Extension:Scribunto ). Further improvement : modify the MediawikiImporter to prevent processing revisions whose <model> is not wikitext.	8 years ago
luccioman	0bc868a819	Improved support for non ASCII chars in local file system URLs Creating a MultiProtocolURL instance from a File object and then retrieving a File with getFSFile() was inconsistent with file paths containing space or non ASCII chars.	8 years ago
reger	777cb5b812	remove test case for Standard_MemoryControl which will always fail, see https://github.com/yacy/yacy_search_server/pull/114	8 years ago
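
Several commits listed above (7206f1ed71, 398c66f06c, 9531b83598, 8da3174867, 286f3018bd) replace locale-sensitive case conversions with locale-neutral ones. The following is a minimal Java sketch of the underlying issue, not code taken from the repository: the class `LocaleNeutralCase` and the helper `normalizeHost` are hypothetical names chosen for illustration.

```java
import java.util.Locale;

class LocaleNeutralCase {

    /**
     * Lower-cases a technical string (host name, URL scheme, MIME type...)
     * with Locale.ROOT, so the result does not depend on the default locale.
     * Hypothetical helper, for illustration only.
     */
    static String normalizeHost(String host) {
        return host.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        String host = "WIKIPEDIA.ORG";

        // With Turkish rules, 'I' is lower-cased to the dotless 'ı' (U+0131),
        // so the result no longer matches the expected ASCII host name.
        String turkishLower = host.toLowerCase(turkish);
        System.out.println(turkishLower.equals("wikipedia.org")); // false

        // Locale.ROOT applies locale-neutral rules: 'I' becomes 'i'.
        System.out.println(normalizeHost(host).equals("wikipedia.org")); // true
    }
}
```

Calling `toLowerCase()` without an argument uses the JVM default locale, which is why the problem only shows up on systems configured with a locale such as "tr".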

253 Commits (e5b4799838b8ed319974b6eb3f28689d0ba14670)
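
Commit a7394b479b describes using a reentrant lock instead of intrinsic synchronization so that the time spent waiting for the lock can be bounded. The sketch below illustrates that general technique with `ReentrantLock.tryLock(long, TimeUnit)`. It is not the YaCy Cache implementation: the class `TimedLockCache`, its in-memory map and its method names are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical cache whose operations give up when the lock is too busy. */
class TimedLockCache {
    private final ReentrantLock lock = new ReentrantLock();
    private final Map<String, String> entries = new HashMap<>();
    private final long lockTimeoutMs;

    TimedLockCache(long lockTimeoutMs) {
        this.lockTimeoutMs = lockTimeoutMs;
    }

    /** Tries to acquire the lock, waiting at most lockTimeoutMs milliseconds. */
    private boolean acquire() {
        try {
            return lock.tryLock(lockTimeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    /** Stores content; returns false if the lock could not be acquired in time. */
    boolean put(String url, String content) {
        if (!acquire()) {
            return false;
        }
        try {
            entries.put(url, content);
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Returns the cached content, or empty on a cache miss or lock timeout. */
    Optional<String> get(String url) {
        if (!acquire()) {
            return Optional.empty(); // caller can fall back to loading the remote resource
        }
        try {
            return Optional.ofNullable(entries.get(url));
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        TimedLockCache cache = new TimedLockCache(100);
        cache.put("http://example.org/", "cached content");
        System.out.println(cache.get("http://example.org/").isPresent());
    }
}
```

With a `synchronized` block a thread would wait indefinitely for the monitor. Here a caller that gets `false` or an empty result after the timeout can skip the cache and load the resource directly.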