yacy_search_server

Commit Graph

Author	SHA1	Message	Date
reger	5d71fc70e3	fix tarParser early exit on looping content - adjust check of data available according to doc - return null on no recognized content (to not exit TextParser next parser try) - use commons.compress directly	9 years ago
reger	2fcf6f104c	fix bzipParser recognition - Bzip2Inputstream checks magic byte itself to identify bz2 (leave it in input) - try to supply fitting mime for parsing bz2 content	9 years ago
reger	a60b1fb6c2	differentiate api call getLocalPort() from getConfigInt()	9 years ago
reger	11f3666660	increase use of predefined CATCHALL_QUERY string	9 years ago
reger	a58ee49307	Optimize internal imagequery focus on using content_type to select images (in favor of url file extension)	9 years ago
reger	d223cf0ae4	adjust MediaWiki importer geo coordinate calculation - allow lat/long 0.xxx - south / west assignment include test class	9 years ago
reger	2b775d5be6	fix typo in WikiCode coordinate calculation	9 years ago
reger	bbe9df2bb3	fix MediawikiImporter for bz2 dump skip reading bz2 file magicbyte to identify bz2 format as inputstream reset would be required. Common compress reads and checks the magicbytes internally and throws ioexception if wrong, making preread obsolete.	9 years ago
reger	c6687dd560	fix a system.out to log.fine in bmpParser	9 years ago
reger	e53c6bbd51	fix init of peer flags (remove hiding of ssl flag)	9 years ago
Michael Peter Christen	ac034db8bc	Merge branch 'master' of https://github.com/luccioman/yacy_search_server # Conflicts: # htroot/js/highslide/highslide.js # source/net/yacy/document/ImageParser.java	9 years ago
reger	826f14f37f	fix unnecessary set null of peer flags, causing reread remove obsolete version flags	9 years ago
luc	5902ce032e	Corrected NullPointerException case when ImageIO reader is not found for image format.	9 years ago
reger	c6495a5b62	add a log entry on parsing ajax crawling scheme snapshot (prev. commit `9252e36aeb`)	9 years ago
reger	9252e36aeb	implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content see freshly deprecated https://developers.google.com/webmasters/ajax-crawling/ Implementation improves parsing of the homepage (ajax page) which uses metatag "fragment" in header and parses supplied html snapshot instead of mostly empty ajax/scripted page. Implementation supports also hash-bang urls (url with anchor starting with ! like ...path#!hashfragment) but our crawler filters it (use of hash-bang is controversially discussed and proposal is deprecated, makes no sense to adjust the crawler, but as long as it is used by some sites the minor change/improvement in htmlparser is good for some time). Quick - how does it work - if metatag fragment with content "!" is found - htmlparser tries to get content of html snapshot (using a different url) - htmlparser returns 2 documents (original url and snapshot content - but using same original url) - after parsing result documents are joined (and stored to index containing content also from snapshot page... as the original ajax page contains typically no parseable html content)	9 years ago
Michael Peter Christen	d1ae999ef9	replaced HashMap with LinkedHashMap to preserve the object order	9 years ago
Michael Peter Christen	7d075a1d76	added log lines	9 years ago
Michael Peter Christen	092dac086e	Merge branch 'master' of https://github.com/luccioman/yacy_search_server	9 years ago
reger	7a64bebb86	init Recrawl job chunk size to max crawl loader during job start, to use some system preferences and allow injection of recrawl urls before queue is empty During recrawl the balancer hangs on the very last urls often on hosts with huge delay time, by allowing injection earlier progress is more balanced. Max number of injected crawl urls by recrawl job is 2 * max loader.	9 years ago
luc	d6522fa4a2	Integrated haraldk/TwelveMonkeys library to first add TIF image format support.	9 years ago
Michael Peter Christen	9244694e64	Merge branch 'master' of git@github.com:yacy/yacy_search_server.git	9 years ago
Michael Peter Christen	151ccd50a9	fix for image size field values (must be multi-valued)	9 years ago
reger	c9937973e3	unescape MultiProtocolURL getAttributes() return values. use getAttributes() to get query parameters as clear text (w/o url encoding) use getSearchpartMap() to get in internal format (url encoded) fix for http://mantis.tokeek.de/view.php?id=606	9 years ago
reger	78e8c6f3e5	refactor special handling (static override) of SUPPORTED_EXTENSIONS/MIME_TYPES not used for genericImageParser	9 years ago
reger	d54c5d310a	add links with image extension not automatically to image links. With the widespread use e.g. of Wikimedia the url file extension of links with image extension often points to html.	9 years ago
reger	851e8f6c8a	check jpeg file signature in genericImageParser to fail early without further object allocation if source is not a jpeg.	9 years ago
reger	fb75fea446	use recrawljob w/o sort results by date This is a workaround for existing index (not fully reindexed) since intro of schema with docvalues to prevent solr exception causing recrawljob to fail with org.apache.solr.core.SolrCore java.lang.IllegalStateException: unexpected docvalues type NONE for field 'load_date_dt' (expected=NUMERIC). Use UninvertingReader or index with docvalues.	9 years ago
reger	43c27aa550	upd to solr/lucene 5.3.1	9 years ago
reger	688f7b2a5c	allow/display svg images in image results previews svg is not supported by awt but by most browsers. Image content is delivered as received (without size adjustment)	9 years ago
reger	d5330391de	remove some unused var allocation in parser	9 years ago
Michael Peter Christen	3d7dd9d3aa	follow-up to latest commit: also flush the search cache if all crawls had been terminated.	9 years ago
Michael Peter Christen	c737ff235d	in case that the include_string contains several entries including 1-char tokens and also more-than-1-char tokens, then remove the 1-char tokens to prevent that we are too strict. This will make it possible to be a bit more fuzzy in the search where it is appropriate.	9 years ago
Michael Peter Christen	8e555d79a3	add also 1-character tokens to the token list because that could be also searched for. A full-string search for a filename may fail if those 1-char tokens are omitted	9 years ago
reger	7c82cd4415	add an end condition to svgParser for wrong content (if parser chosen just by file extension)	9 years ago
reger	356d4d1301	remove rdfParser from init (current function identical with genericParser)	9 years ago
reger	c647d899e3	add svgParser to parse metadata from svg images Reads document level included title and description and skips the graphic content to save bandwidth. svg metadata element is not interpreted - remove rdfParser from init (current function identical with genericParser)	9 years ago
reger	bad34804fe	optimize parseInt for <img> tag attribute parsing Performance is better than using Numberformat.parse or parseInt(substring())	9 years ago
Michael Peter Christen	6ebc2451a9	Merge pull request #14 from luccioman/master Translator refactoring : no more regular expression processing	9 years ago
reger	2f51baff4f	check for loading error (includes unsupported formats) to prevent blank thumbnail display in image search because of not handled source which don't load on click. Now the cross icon indicates the problem (including not supported format)	9 years ago
luc	5578886f6f	Merge branch 'master' of https://github.com/luccioman/yacy_search_server.git	9 years ago
luc	c38d6c1f37	Correction for mantis 535: inurl: parameter doesn't work on URLs with upper-case letters	9 years ago
reger	52e3eb4ce8	harmonize/correct assignment to Ymarkmeta.mime replace use of deprecated	9 years ago
Michael Peter Christen	87f358058e	Fix for index entries which have id's not computed as hash from the url. This makes it possible to operate with outside-computed url hashes in enterprise environments not using the built-in crawler from YaCy.	9 years ago
reger	3f2b8ab5e5	optionally include mime in p2p url exchange string if doctype decodes to ambiguous mime and default conversion is not equal to original	9 years ago
reger	a3195d78ae	add Portuguese month names to date recognition	9 years ago
reger	d2cc11ea8f	fix html parser taking <style> content as text. Noticed some result description contain css content from style tag. Added <style> to tag list to scrape its content not as text + test case included	9 years ago
Michael Peter Christen	5f706797cb	patch for a bug inside of solr since solr 5.0 when using a boost function with a numeric date field: "unexpected docvalues type NUMERIC for field 'last_modified' (expected one of [SORTED, SORTED_SET]). Use UninvertingReader or index with docvalues." This is a well-known bug inside solr which prevents that now the 'sort by date' in the YaCy search interface can be used. Without this patch no results at all are displayed (since the exception prevents that). Now there is at least a result but it is not ordered properly.	9 years ago
reger	7889fc2389	Hack to prevent Solr issue on partial update on a document containing multivalued date field (regardless if these fields part of update). Switch partial update option off in postprocessing if schema contains *_dts (multivalued date field). see http://mantis.tokeek.de/view.php?id=601	9 years ago
reger	b4cbdea1e7	adapt SolrServerConnector.add to handle error on partial update input document. In case of error we deleted the original document and added the new doc to the index. This is not valid for partial update documents (which contain only a subset of the fields). Remove the "delete" error handling step.	9 years ago
reger	98ab655917	on reindex delete index document with invalid url if discovered	9 years ago

7843 Commits (5d71fc70e3f21b46a87f4f3edd708196e11f8caf)