yacy_search_server

Commit Graph

Author	SHA1	Message	Date
reger	6ca02ad577	upd httpclient-4.5.1, httpmime-4.5.1, httpcore-4.4.3, commons-compress-1.10	10 years ago
reger	c6495a5b62	add a log entry on parsing ajax crawling scheme snapshot (prev. commit `9252e36aeb`)	10 years ago
reger	9252e36aeb	implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content (see the freshly deprecated https://developers.google.com/webmasters/ajax-crawling/). Implementation improves parsing of the homepage (ajax page) which uses metatag "fragment" in the header, and parses the supplied html snapshot instead of the mostly empty ajax/scripted page. Implementation also supports hash-bang urls (url with anchor starting with ! like ...path#!hashfragment), but our crawler filters them (use of hash-bang is controversial and the proposal is deprecated, so it makes no sense to adjust the crawler, but as long as it is used by some sites the minor change/improvement in htmlparser is good for some time). Quick - how does it work: 1) if metatag fragment with content "!" is found, htmlparser tries to get the content of the html snapshot (using a different url); 2) htmlparser returns 2 documents (original url and snapshot content, but using the same original url); 3) after parsing, the result documents are joined and stored to the index, which then also contains content from the snapshot page (the original ajax page typically contains no parseable html content)	10 years ago
Michael Peter Christen	d1ae999ef9	replaced HashMap with LinkedHashMap to preserve the object order	10 years ago
Michael Peter Christen	7d075a1d76	added log lines	10 years ago
Michael Peter Christen	092dac086e	Merge branch 'master' of https://github.com/luccioman/yacy_search_server	10 years ago
Michael Peter Christen	a44cc774d0	Merge branch 'master' of github.com:yacy/yacy_search_server	10 years ago
reger	7a64bebb86	init Recrawl job chunk size to max crawl loader during job start, to use some system preferences and allow injection of recrawl urls before the queue is empty. During recrawl the balancer often hangs on the very last urls on hosts with huge delay time; by allowing injection earlier, progress is more balanced. Max number of crawl urls injected by the recrawl job is 2 * max loader.	10 years ago
luc	d6522fa4a2	Integrated haraldk/TwelveMonkeys library to first add TIF image format support.	10 years ago
luc	e093fb228d	Created a generic ViewImage performance render test.	10 years ago
Michael Peter Christen	9244694e64	Merge branch 'master' of git@github.com:yacy/yacy_search_server.git	10 years ago
Michael Peter Christen	151ccd50a9	fix for image size field values (must be multi-valued)	10 years ago
luc	3ad564e2e4	Created a ViewImage rendering performance measurement test.	10 years ago
luc	62e07a26a0	Refactoring: split into sub-functions to make understanding and performance measurement easier.	10 years ago
luc	b3f044072e	Updated table headers and SVG file url for case sensitive OS.	10 years ago
luc	ff963cbe23	Merge branch 'master' of https://github.com/yacy/yacy_search_server	10 years ago
reger	c9937973e3	unescape MultiProtocolURL getAttributes() return values. Use getAttributes() to get query parameters as clear text (w/o url encoding); use getSearchpartMap() to get them in internal format (url encoded). Fix for http://mantis.tokeek.de/view.php?id=606	10 years ago
reger	10b0eb106f	fix link target on iframe list in CrawlProfileEditor	10 years ago
reger	78e8c6f3e5	refactor special handling (static override) of SUPPORTED_EXTENSIONS/MIME_TYPES not used for genericImageParser	10 years ago
reger	d54c5d310a	do not automatically add links with an image extension to image links. With the widespread use of e.g. Wikimedia, links with an image file extension often point to html.	10 years ago
luc	f5746b5490	Added ico and bmp sample pictures	10 years ago
luc	baede48161	Added JPEG 2000 and FITS samples	10 years ago
luc	7c9d80c5d0	Added image formats and information for each format.	10 years ago
luc	073ef730af	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	10 years ago
reger	5744342fec	handle image preview for url with empty file extension (fix of commit `688f7b2a5c`)	10 years ago
luc	82dd004260	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	10 years ago
reger	851e8f6c8a	check jpeg file signature in genericImageParser to fail early without further object allocation if source is not a jpeg.	10 years ago
reger	fb75fea446	use recrawljob w/o sorting results by date. This is a workaround for an existing index (not fully reindexed) since the intro of the schema with docvalues, to prevent a solr exception causing recrawljob to fail with: org.apache.solr.core.SolrCore java.lang.IllegalStateException: unexpected docvalues type NONE for field 'load_date_dt' (expected=NUMERIC). Use UninvertingReader or index with docvalues.	10 years ago
Michael Peter Christen	3cbf86f295	Merge branch 'master' of git@github.com:yacy/yacy_search_server.git	10 years ago
Michael Peter Christen	23f6294a2d	removed unused import	10 years ago
reger	43c27aa550	upd to solr/lucene 5.3.1	10 years ago
reger	fd5a1dc297	upd to poi-3.13	10 years ago
luc	0ae9297ca5	Created a html test page to check ViewImage rendering with different file formats.	10 years ago
luc	136e8f6fbd	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	10 years ago
reger	688f7b2a5c	allow/display svg images in image results previews. svg is not supported by awt but by most browsers. Image content is delivered as received (without size adjustment)	10 years ago
reger	d5330391de	remove some unused var allocation in parser	10 years ago
Michael Peter Christen	3d7dd9d3aa	follow-up to latest commit: also flush the search cache if all crawls had been terminated.	10 years ago
Michael Peter Christen	225200194a	every time a crawl is started, the user expects a different search result behaviour. This requires that the search cache is flushed for each crawl start. TODO: this should also be done if a crawl is terminated.	10 years ago
Michael Peter Christen	c737ff235d	in case the include_string contains several entries including 1-char tokens and also more-than-1-char tokens, remove the 1-char tokens to prevent us from being too strict. This will make it possible to be a bit more fuzzy in the search where it is appropriate.	10 years ago
Michael Peter Christen	8e555d79a3	add also 1-character tokens to the token list because that could be also searched for. A full-string search for a filename may fail if those 1-char tokens are omitted	10 years ago
luc	eb7989b17b	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	10 years ago
reger	7c82cd4415	add an end condition to svgParser for wrong content (if parser chosen just by file extension)	10 years ago
luc	82f4f221e9	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	10 years ago
reger	b92d81b073	remove double caching of inputstream in ViewImage	10 years ago
reger	c7c5e2dff9	fix old/obsolete solr dependency to stax, delete obsolete jar	10 years ago
reger	beed1c417e	Add report profile with OWASP Dependency-Check to maven pom	10 years ago
reger	356d4d1301	remove rdfParser from init (current function identical with genericParser)	10 years ago
reger	c647d899e3	add svgParser to parse metadata from svg images. Reads document-level included title and description and skips the graphic content to save bandwidth. svg metadata element is not interpreted. Also remove rdfParser from init (current function identical with genericParser)	10 years ago
reger	bad34804fe	optimize parseInt for <img> tag attribute parsing. Performs better than using NumberFormat.parse or parseInt(substring())	10 years ago
Michael Peter Christen	3c31bf845f	fix for latest merge	10 years ago
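
Commit 9252e36aeb in the table above says the html parser fetches the snapshot "using a different url". The following is a minimal sketch of how that url can be derived under the (deprecated) ajax crawling scheme, where the hash-bang fragment is moved into an `_escaped_fragment_` query parameter. The class and method names are hypothetical and are not taken from the YaCy sources; the actual implementation in the commit may differ.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not YaCy code: maps a page url to its html snapshot url.
class AjaxSnapshotUrl {

    // Percent-escape the characters the scheme asks crawlers to escape in the
    // fragment value: control chars and space, '#', '%', '&', '+', and bytes >= 0x7F.
    static String escapeFragment(String fragment) {
        StringBuilder sb = new StringBuilder();
        for (byte b : fragment.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (c <= 0x20 || c == '#' || c == '%' || c == '&' || c == '+' || c >= 0x7F) {
                sb.append('%').append(String.format("%02X", c));
            } else {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // For "...#!fragment" urls the fragment becomes the parameter value.
    // For pages that only declare <meta name="fragment" content="!"> the
    // parameter is appended with an empty value.
    static String snapshotUrl(String url) {
        String base;
        String fragment;
        int bang = url.indexOf("#!");
        if (bang >= 0) {
            base = url.substring(0, bang);
            fragment = url.substring(bang + 2);
        } else {
            int hash = url.indexOf('#');
            base = hash >= 0 ? url.substring(0, hash) : url;
            fragment = "";
        }
        String separator = base.indexOf('?') >= 0 ? "&" : "?";
        return base + separator + "_escaped_fragment_=" + escapeFragment(fragment);
    }

    public static void main(String[] args) {
        System.out.println(snapshotUrl("http://example.com/"));
        System.out.println(snapshotUrl("http://example.com/path#!key=value"));
    }
}
```

The parser would then load the snapshot url, parse it as ordinary html, and attach the result to the original url, as the commit message describes.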
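
Commit 851e8f6c8a in the table above checks the jpeg file signature so the parser can fail early. A sketch of such a check, assuming the content is already available as a byte array: jpeg data starts with the start-of-image marker `FF D8`, followed by the `FF` that opens the next marker. The class and method names are hypothetical, and the check in genericImageParser may be written differently.

```java
// Hypothetical helper, not YaCy code: cheap signature test before real parsing.
class JpegSignature {

    // True if the data begins with the jpeg start-of-image marker (FF D8)
    // followed by the first byte of the next marker (FF).
    static boolean looksLikeJpeg(byte[] data) {
        return data != null
                && data.length >= 3
                && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xD8
                && (data[2] & 0xFF) == 0xFF;
    }

    public static void main(String[] args) {
        byte[] jpegHeader = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        byte[] pngHeader = {(byte) 0x89, 'P', 'N', 'G'};
        System.out.println(looksLikeJpeg(jpegHeader));
        System.out.println(looksLikeJpeg(pngHeader));
    }
}
```

Because only three bytes are compared, a source that was selected by file extension alone but is not a jpeg can be rejected before any metadata reader or image object is allocated.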

11984 Commits (6ca02ad57763e45491875f58ce67d5400a825c0a)