yacy_search_server

Commit Graph

Author	SHA1	Message	Date
reger	4c9be29a55	fix concurrency issue with htmlParser using not current scraper data resulting in incorrect data for some html index metadata. Details see http://mantis.tokeek.de/view.php?id=717	8 years ago
reger	b522d540b9	Include itemprop latitude/longitude (see schema.org) in attribute parsing for lat/lon. Harmonize number parsing for lat/lon to parseDouble. Fix endDate_dts value assignment.	8 years ago
reger	083df255e4	fix html tag attribute parsing containing attribute w/o value e.g. itemscope or autofocus (in such case the next key was not properly recognized).	8 years ago
reger	cb95b7339a	include html5 <time> tag in content scraper, add "datetime" property of <time> tag to scrapers startdate list. Datetime is parsed as iso8601 (xml) date, html5 allows partial as well as duration (not handled by this)	8 years ago
reger	c50e23c495	reduce creation of empty legacy RequestHeader() in situation where null is acceptable (less for garbage collection).	8 years ago
luccioman	d27adc2b92	Fixed language detector initialization and NullPointerException cases. NullPointerException occurred when using and Identificator instance which encountered and error in its constructor. This error could be caused by a missing "langdetect" folder in the current folder of the main process, or by simultaneous first calls to the constructor, initializing concurrently the DetectorFactory.langlist. Fixes the mantis 714 (http://mantis.tokeek.de/view.php?id=714)	8 years ago
luccioman	3f561c1635	Fixed a NullPointerException case. Could occur when a search request was performed just after peer startup, and the Switchboard Thread "LibraryProvider.initialize" had completed, thus requesting a ProbabilisticClassifier not completely initialized (and having a null contexts property).	8 years ago
luccioman	3092a8ced5	Fixed thread name consistency for improved monitoring. Some tasks were modifying the current thread name without restoring it once finished as it is effectively done elsewhere.	8 years ago
luccioman	eec5779889	Added a name prefix to pooled threads for easier monitoring. Using JVM monitoring tools, it is then easier to identify tasks running inside thread pool with a custom prefix rather than the generic one : "pool-".	8 years ago
reger	8fe28a83f2	harmonize used lastmodified date for rwi and fulltext in storeDocument	8 years ago
luccioman	f0639d810c	Customized name for Threads still using the default "Thread-n" pattern. This makes threads monitoring easier to read.	8 years ago
luccioman	47af33a04c	Advanced Crawl from local file : better processing of large files. Applied strategy : when there is no restriction on domains or sub-path(s), stack anchor links once discovered by the content scraper instead of waiting the complete parsing of the file. This makes it possible to handle a crawling start file with thousands of links in a reasonable amount of time. Performance limitation : even if the crawl start faster with a large file, the content of the parsed file still is fully loaded in memory.	8 years ago
luccioman	7717a3d43d	Fixed license headers on files created to improve favicon management.	8 years ago
luccioman	6e1959f469	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git Conflicts: htroot/yacysearchitem.java source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java source/net/yacy/search/schema/CollectionConfiguration.java source/net/yacy/server/serverObjects.java	8 years ago
reger	b752bcfecb	adjust date in text detection to ignore some program version strings like "3.1.2.0102" see http://mantis.tokeek.de/view.php?id=650 + expand test case	8 years ago
reger	b017e97421	optimize condenser language detection a little. langdetect probabilities take letter case into account, add words from description and anchors etc. as is. + add it to javadoc	8 years ago
reger	ae3717d087	adjust Tokenizer sentence count to ignore repeated punktuation (like !!!! ) + remove unused sentenceword map (we use only the count) + upd test case for sentence count	8 years ago
reger	474f0476c6	adjust Tokenizer sentence count on trailing text after last recognized sentence + upd test case for rwi multi-word-query (leaving results known to fail untested)	8 years ago
reger	14f7577231	add support for older Word versions (Word6/Word95) to docParser	8 years ago
reger	1a79c64495	generalize DateDetection with holiday date rules readily available in icu to make sure current dates are recognized (was fixed to 2014 - 2016) + adjust holiday date parser from pattern.match to pattern.find to deal with leading and trailing text + moved relative date recognition (morgen, tomorrow) to parseline (used by query parser only), as not working and problematic for indexing + add test case for parseline (used by query parser)	8 years ago
reger	6f68f08354	correct DateDetection Silvester date add Thanksgiving	8 years ago
reger	efcb6a1e74	fix supported mime XML -> xml for rssParser (mime normalized to lower case for comparison) + add mime text/xml as in use for rss in the wild	8 years ago
luccioman	b1b8e69da8	Fixed NullPointerException cases	8 years ago
reger	a4465c97d6	as requested, disable/remove old swf parser http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5861#p33098	8 years ago
reger	96467c5467	remove not needed counter in Tokeninzer (completing last changes) including a small change, word posintext counting. We remember/store 1st posintext. Previously following words got a handle (posintext) excluding found. Now it just counts and assigns true posintext as handle (posintext)	8 years ago
reger	272cdd496a	reactivate sentence counter in WordTokenizer for phrasepos ranking, by counting punktuation (delivered as 1 char word) again.	8 years ago
Michael Peter Christen	5e165a8150	removed unused imports	8 years ago
reger	e310ec5f70	fix posInText ranking calculation to score 0 on no position info + fix Word posInText calc in Tokenizer to start with 1 + test case	8 years ago
reger	4c7a77662a	eleminate dependency on file-extension in storeDocument but use supported mime-type to also support handling of urls w/o corresponding file-extension. For this refactor use of document.getParserObject() to alway return a Parser (for clean logic) and define/move the scraperObject as local var of AbstractParser. Adjust related calls to getParserObject (where actually a scraperObject is wanted). Addionally skip appending url token to parsed text for dht metadata entries (by default returned as result by rwi index).	8 years ago
reger	ebde21079a	refactor xlsParser to include Excel file attribute (like author) in parser result doc. Similar to ppt and doc parser, completing a TODO in xlsParser.	8 years ago
reger	27163af0e1	improve detection of referenced links by taking http and https link protocol into account + correct query start detection of commit `f89d4eb51d`	8 years ago
luccioman	6e96c7341a	Merge remote-tracking branch 'origin/master' Conflicts: htroot/Load_MediawikiWiki.java htroot/Load_PHPBB3.java htroot/ViewImage.java	8 years ago
reger	9e94989237	upd to PDFBox 2.0.1	9 years ago
reger	24b0fa2a38	extend snapshot Html2Image.pdf2image to use PDFBox image export capability if no external tool installed (and for Win) Resulting jpg are not always perfect (if graphic included) but imho sufficient.	9 years ago
reger	1d940e5a94	upd commons-compress 1.11	9 years ago
reger	764f5100f0	fix delete of temp file after odt % ooxml parser Close zipfile after parsing	9 years ago
reger	06d0e2aeb9	result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader methode. - Above brought up that parser start url parameter, declared as AnchorURL uses only methodes of parent object DigestURL (changed parameter declaration accordingly).	9 years ago
luc	9f712146df	Display icons in ViewFile "links" mode.	9 years ago
reger	6f0b073bf3	override detected language (statistic langdetect) only with TLD determided language if langdetect probability is not high. + additionally truncate zh-cn / zh-tw returned by langdetect to 2 char ISO639-1 zh used by YaCy	9 years ago
reger	b65e2b527d	include use of condenser's content text for language detection. Language identification may show poor performance on documents with short or no title but clear lang indication in text content. Using content text too improves lang detection. + remove double caching of text in Identificator	9 years ago
luc	3cc5619d93	Improved HTML icons indexing and rendering in search results. See http://mantis.tokeek.de/view.php?id=629	9 years ago
reger	2048b7e057	support scraping start-/enddate from html tag with property "datetime" This may be used in html5 <time> tag (which we don't explicite support yet for date in content scraping).	9 years ago
reger	900d4584ba	complet resource cleanup of lists in contentscraper's close()	9 years ago
reger	1f18653de0	pass parsed swf content trough htmlscraper Swf may contain subset of html tags which shoul'd appear as text. Especially <font> tag may totally screw up metadata servlet if not filtered out.	9 years ago
reger	18ecf57792	add support of compressed swf to swfParser from JavaSWF2 (source compatible to WebCat). Moved swf file signature check to parser Changed use of synced vector to list swf InStream	9 years ago
reger	ff27824964	fix swfParser reading file signature before passing to library (current version expects data w/o signature)	9 years ago
luc	571bc55937	Refactoring : use StandardCharsets constants instead of hard-coded charset names.	9 years ago
reger	46ac0867ff	fix poison mediawikiimporter output queue also after ExecutionException in worker thread. Writer of importer keeps needs a poison to close the file. On exception (e.g. OOM) add a poison marker in outer most try/catch to assure output queue will terminate in this condition too (and closes+renames the surrogate/in/xxx.prt file)	9 years ago
reger	a7591d3ed0	fix mediawikiimporter number format exception on coordinate parsing handle uncomplete metadata like "NS=43/50//N". For other {expr ... } type entries a try catch added	9 years ago
reger	e84d94f8ca	fix mime table for ms office / open office documents (causing wrong parser detect in intranet mode)	9 years ago

1 2 3 4 5 ...

651 Commits (86dc198698bcb11b366f6d7e7191ef2ef79de455)