yacy_search_server

Commit Graph

Author	SHA1	Message	Date
luccioman	8a29551c54	Upgraded the OpenGeoDB dump URL. The status of the library in the DictionaryLoader_p.html page now also informs the user that an upgrade can be applied when an older dump is already loaded. Upgrade applied as suggested by Niklas Andrus @fapth_gitlab on Gitter chat.	7 years ago
luccioman	bb51555830	Removed remaining unsafe accesses to SimpleDateFormat instances. SimpleDateFormat must not be used by concurrent threads without synchronization for parsing or formatting dates, as it is not thread-safe (it internally holds a calendar instance that is not synchronized). DateTimeFormatter is now preferred when possible, as it is thread-safe without a concurrent-access performance bottleneck (it does not use synchronization locks internally).	7 years ago
luccioman	e97580dfc7	Fixed unsafe concurrent access to generic SimpleDateFormat instances. SimpleDateFormat must not be used by concurrent threads without synchronization for parsing or formatting dates, as it is not thread-safe (it internally holds a calendar instance that is not synchronized). DateTimeFormatter is now preferred when possible, as it is thread-safe without a concurrent-access performance bottleneck (it does not use synchronization locks internally).	7 years ago
Michael Christen	e0dc632020	removed transformer; it was no longer used	7 years ago
luccioman	fa4399d5d2	Small perf improvement: initialize thread names early when possible. Initializing thread names using the Thread constructor parameter is faster, as the constructor already sets a thread name even if no customized one is given, while an additional call to the Thread.setName() function internally performs synchronized access, possibly runs an access check on the security manager, and performs a native call. Profiling a running YaCy server revealed that the total processing time spent on Thread.setName() for a typical p2p search was in the range of seconds.	7 years ago
luccioman	e357ade47d	Reduced memory footprint of text snippet extraction by no longer parsing and storing all sentences of a document up front, but parsing on the fly only the ones necessary to compute the snippet.	7 years ago
luccioman	e115e57cc7	Reduced text snippet extraction processing time. By not generating MD5 hashes on all words of indexed texts, processing time is reduced by 30 to 50% on indexed documents with more than 1 MByte of plain text.	7 years ago
luccioman	fb3032c530	Added a crawl filtering possibility on documents Media Type (MIME)	7 years ago
luccioman	cf62b571bd	Added RSS reader support for `enclosure` feed item sub element. Enclosure element (see http://www.rssboard.org/rss-specification#ltenclosuregtSubelementOfLtitemgt ) can be seen for example in podcasts feeds.	7 years ago
luccioman	3da2739bbd	Parse and index more common audio metadata text tag fields.	7 years ago
luccioman	846aba00fa	Added parsing of URLs possibly present in audio metadata tags	7 years ago
Michael Peter Christen	187075b878	added nav filter	7 years ago
luccioman	bcbd0ae1a4	Enabled partial parsing of audio resources.	7 years ago
luccioman	978e2be95b	Give other parsers a chance on audioTagParser error, as done in all other parsers, ultimately falling back to the genericParser, which creates a minimal index entry.	7 years ago
luccioman	9e5846a26e	Small fix on svg parser error message	7 years ago
luccioman	11611dbdcf	Reuse existing File copy function to handle audio parser tmp files	7 years ago
luccioman	f77f8f40f9	Factored audio parser tag processing	7 years ago
luccioman	9a7a353d0e	Removed some unnecessary intermediate list creation on array copy.	7 years ago
luccioman	fb6457f5bc	Fixed NPE case when an audio resource is parsed with a null tag	7 years ago
luccioman	c3ff50c17a	Updated the list of audio file formats supported by the audioTagParser Follows upgrade to Jaudiotagger dependency to version 2.2.5.	7 years ago
luccioman	eb20589e29	Fixed issue #158 : completed div CSS class ignore in crawl	7 years ago
luccioman	9412881230	Added basic support for autotagging microdata annotated item types. With the appropriate vocabulary settings in Vocabulary_p.html page, this can produce Vocabulary search facets displaying item types referenced in html documents by microdata annotation. Tested notably, but not limited to, vocabulary classes/types defined by Schema.org and Dublin Core.	7 years ago
luccioman	5a14d34a7d	Refactoring : documented and extracted autotagging processing functions.	7 years ago
luccioman	58b9834729	Added HTML microdata typed items parsing capability. This adds the possibility for the HTML parser to gather typed items URLs annotated in HTML tags with itemscope and itemtype attributes (see microdata specification https://www.w3.org/TR/microdata/ ), notably Types from the schema.org vocabulary, but also Types/Classes from any other vocabulary, such as the common ones listed in the RDFa core context ( https://www.w3.org/2011/rdfa-context/rdfa-1.1.html ).	7 years ago
luccioman	733cacdbb8	Revised the RDFaParser main launcher for minimal proper operation. This parser is still not enabled in the main text parsers list. More would have to be done to make it functional.	7 years ago
Michael Peter Christen	25573bd5ab	added a crawl filter based on <div> tag class names. When a crawl is started, a new field to exclude content from scraping is available. The content to exclude is identified by the class name of div tags. All text contained in a div tag whose class name matches the configured class name(s) is not indexed, while the rest of the page is indexed.	7 years ago
luccioman	e2f6427a63	Added a basic JUnit test for the Visio parser (vsdParser)	7 years ago
luccioman	1e9cdaabd4	Do locale neutral case conversion of HTML charset name. Required to properly run on systems with default locale set to Turkish language, as with this locale the 'i' character has different upper and lower case flavors than with other locales.	7 years ago
luccioman	e0eda84c24	Removed old hard-coded holiday dates from the DateDetection class. Replaced with rules relative to the current year, as already done for some of the supported dates.	7 years ago
luccioman	46f37e38dc	Customized Threads with generic name for easier monitoring.	7 years ago
luccioman	32c9dfa768	Added partial bzip2 stream parsing support and bzipParser JUnit test	8 years ago
luccioman	c6ae87168a	Added unit tests on the gzip parser.	8 years ago
luccioman	169ffdd1c7	Finer control on max links to parse in the html parser.	8 years ago
luccioman	e41d046a9d	Improved parsing support for OOXML spreadsheets (.xlsx). As reported by edycop in mantis 765 ( http://mantis.tokeek.de/view.php?id=765 ), parsing of xlsx files was quite incomplete. Now properly supports the "Shared String Table" entry in Office Open XML spreadsheets, and also detects embedded URLs. Integrating the Apache poi-ooxml library could be an option for finer OOXML formats support, but their SAX style parsing example ( http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api ) tends to show that a custom SAX handler is still efficient for lightweight and low memory footprint processing.	8 years ago
reger	51a4e03c93	Allow to stop currently running warc import (stop button)	8 years ago
luccioman	780173008e	Implemented partial stream parsing of tar archives. Also added JUnit tests for the tar parser and fixed unwanted use of the tar parser as a fallback on files included in a tar archive.	8 years ago
luccioman	acab6a6def	Also handle text content when parsing XML within limits.	8 years ago
luccioman	8a94fef9e0	Prevent unwanted cached bytes duplication on stream parsing.	8 years ago
luccioman	5a646540cc	Support parsing gzip files from servers with redundant headers. Some web servers provide both 'Content-Encoding : "gzip"' and 'Content-Type : "application/x-gzip"' HTTP headers on their ".gz" files. Failing on such resources was annoying, as they are not so uncommon even though non-conforming (see RFC 7231 section 3.1.2.2 for the "Content-Encoding" header specification https://tools.ietf.org/html/rfc7231#section-3.1.2.2 ).	8 years ago
luccioman	eda7b0aeb6	Merge branch 'master' of https://github.com/yacy/yacy_search_server	8 years ago
reger	3005be7349	Clean up unmaintained and unused AugmentParser trail.	8 years ago
luccioman	cb4f1358e1	Added gzip parser support for max content bytes limit	8 years ago
luccioman	5216c681a9	Added HTML parser support for maximum content bytes parsing limit	8 years ago
luccioman	651fad6da5	Added RSS parser support for maximum content bytes parsing limit	8 years ago
luccioman	452a17a8d5	Finer control on bounded input streams with custom stream implementation	8 years ago
luccioman	f8f1959ebb	Added parsing within bounds implementation to the generic parser.	8 years ago
luccioman	e0f400a0bd	Support trying multiple parsers even when streaming on large resources.	8 years ago
luccioman	bf55f1d6e5	Started support of partial parsing on large streamed resources. This enables the getpageinfo_p API to return something in a reasonable amount of time on resources in the megabyte size range and above. Support added first with the generic XML parser; for other formats regular crawler limits apply as usual.	8 years ago
luccioman	90a7c1affa	HTML parser: removed unnecessary remaining recursive processing. Recursive processing was removed in commit `67beef657f`, but one remained for anchors content (likely omitted from refactoring). It is no longer necessary: other links such as images embedded in anchors are currently correctly detected by the parser. More annoying: that remaining recursive processing could lead to almost endless processing when encountering some (invalid) HTML structures involving nested anchors, as detected and reported by lucipher on YaCy forum ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6005 ).	8 years ago
luccioman	9b1bb2545e	Refactored plain-text URLs detection implementation. For faster processing (measured about 2 times faster on many real-world examples) and more advanced detection (previous algorithm detected only URLs separated from the rest of the text by a space character).	8 years ago
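
The SimpleDateFormat commits above (bb51555830, e97580dfc7) describe replacing shared SimpleDateFormat instances with DateTimeFormatter. A minimal sketch of that idea follows. It is not taken from the YaCy sources: the class name SharedFormatterSketch, the helper methods and the date pattern are assumptions made for illustration only.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration class, not part of YaCy.
public class SharedFormatterSketch {

    // DateTimeFormatter is immutable and thread-safe, so a single shared
    // instance can be used by all threads without synchronization.
    static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("uuuu-MM-dd");

    static LocalDate parseDay(String text) {
        return LocalDate.parse(text, DAY);
    }

    // Parses the same date from several threads and counts the correct results.
    static int parseConcurrently(int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            Callable<LocalDate> job = () -> parseDay("2017-06-15");
            List<Future<LocalDate>> results = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                results.add(pool.submit(job));
            }
            int ok = 0;
            for (Future<LocalDate> result : results) {
                if (LocalDate.of(2017, 6, 15).equals(result.get())) {
                    ok++;
                }
            }
            return ok;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseConcurrently(100) + " of 100 dates parsed correctly");
    }
}
```

With a shared SimpleDateFormat in place of DAY, the same loop could intermittently return wrong dates or throw, which is the class of bug these commits address.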

732 Commits (1ca9cb6bd90bd303f8cb2078213f9d12f8ce8c7d)
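
Commit 1e9cdaabd4 in the list above mentions locale neutral case conversion of the HTML charset name. The sketch below shows why the default locale matters, assuming an upper case conversion. The class and method names are hypothetical, and the commit does not state which conversion direction YaCy actually uses.

```java
import java.util.Locale;

// Hypothetical illustration class, not part of YaCy.
public class LocaleNeutralCaseSketch {

    // Locale.ROOT gives the same result whatever the system default locale is.
    static String normalizeCharsetName(String name) {
        return name.toUpperCase(Locale.ROOT);
    }

    // Shows what a Turkish default locale would do to the same input:
    // 'i' is upper-cased to the dotted capital letter U+0130, not to 'I'.
    static String turkishUpperCase(String name) {
        return name.toUpperCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        System.out.println(normalizeCharsetName("iso-8859-1"));
        System.out.println(turkishUpperCase("iso-8859-1").equals("ISO-8859-1"));
    }
}
```

The second conversion no longer matches the ASCII charset name, so a later charset lookup on such a string could fail on systems whose default locale is Turkish.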