yacy_search_server

Commit Graph

Author	SHA1	Message	Date
luccioman	9412881230	Added basic support for autotagging microdata annotated item types. With the appropriate vocabulary settings in Vocabulary_p.html page, this can produce Vocabulary search facets displaying item types referenced in html documents by microdata annotation. Tested notably, but not limited to, vocabulary classes/types defined by Schema.org and Dublin Core.	7 years ago
luccioman	929e0d6eae	Replaced improper ByteBuffer.equals() implementation by Arrays.equals(). Also renamed ByteBuffer.equals() to startsWith(), as this name matches the function's actual semantics.	7 years ago
luccioman	5db1c9155a	Do locale independent case conversion on hosts, schemes, and file exts. Required for proper operation when the default system locale is Turkish, as dotless and dotted i characters have specific case conversion rules in this language.	7 years ago
Michael Peter Christen	607b39b427	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git Conflicts: htroot/yacysearchitem.java	7 years ago
Michael Peter Christen	4355de0f3c	(more!) evaluation of XRealIP from nginx reverse proxy	7 years ago
luccioman	0a120787e3	Improved accuracy of URLs search filters : protocol, tld, host, file ext	7 years ago
luccioman	d1c7dfd852	Fixed URL parsing with fragment and empty path	7 years ago
reger	d5a75537e4	remove redundant setting of timeout for remoteinstance and replace deprecated updatesolrclient instantiation with recommended builder	7 years ago
luccioman	f01aac31fd	Made it possible to use https for remote search on peers with SSL enabled. Default is still http to prevent any regressions, but a new setting is available to choose https as the preferred protocol to perform remote searches. New configuration setting 'remotesearch.https.preferred' is manually editable in yacy.conf file or in Advanced Properties page (/ConfigProperties_p.html). Should be enabled as default in the future for improved privacy. Https could also eventually be used for other peer communications.	7 years ago
luccioman	7206f1ed71	Do locale neutral case conversions on domain names. Required to properly run on systems with default locale set to Turkish language, as with this locale the 'i' character has different upper and lower case flavors than with other locales.	7 years ago
luccioman	398c66f06c	Do locale neutral case conversions in MultiProtocolURL. Applies to any relevant URL parts : host name, URL scheme, session ids or technical parts (see https://url.spec.whatwg.org/#url-writing and https://tools.ietf.org/html/rfc3986 for current standard references). Remaining locale sensitive conversion used for detection of URL word components in urlComps() makes sense but using detected language would be preferable to using the default system locale.	7 years ago
luccioman	9531b83598	Do locale neutral case conversions in Classification. Required for people using Turkish language as their default system locale, as with this locale the 'i' character has different upper and lower case flavors than with other locales.	7 years ago
luccioman	d22fc0d0a2	Updated lists of known sponsored and country-code TLDs. Using current IANA reference list at https://www.iana.org/domains/root/db . As for previous update on known generic TLDs list, the generated URL hashes on these domains stay the same but it improves performance of URL hash computation for URLs on these domains.	7 years ago
luccioman	ac209cac2e	Updated the generic top-level known domains list. Using current IANA reference list at https://www.iana.org/domains/root/db . The generated URL hashes on these domains stay the same but performance is greatly improved as a DNS resolve request is required on URL hash computation when the TLD part of the host name is unknown. Hash computation mean time measured on 1541 sample URLs (one on each TLD) and a computer with a DSL connection : about 230ms before change, then only 20ms.	7 years ago
luccioman	a17a418e78	Fixed NullPointerException cases on snapshot images parsing.	7 years ago
luccioman	285f0d6a39	Consistently encode snapshot image with format requested on the API. Previously, calling /api/snapshot.png rendered JPEG encoded images.	7 years ago
luccioman	7c319c841e	Fixed pdf2image conversion with imagemagick on PDFs having transparency. The target image format (jpeg) doesn't support transparency, so the Html2ImageTest produced unusable black images when run on a Linux machine having the imagemagick package installed.	7 years ago
luccioman	8303e15419	Reduced number of search navigators refresh requests in JS resort mode. The SearchEvent listens to changes on each of its navigators, and the information about their overall state is sent with each fetched search item (as a "data-nav-generation" attribute). Then the browser can regularly fetch a fresh version of yacysearchtrailer.html only if necessary (when that nav-generation value changes).	7 years ago
reger	c31d94664a	Update deprecated SolrInputDocument.addField() with boost value. Remove unused SchemaConfiguration.getDate (as it is designed to return only past dates which might be unexpected for general configuration schema)	7 years ago
luccioman	7e271f9cf5	Updated travis config : install ghostscript, required for Html2Image	7 years ago
reger	ae1c675c85	fix array out of bounds in YJsonResponseWriter and OpensearchResponseWriter on recreation of image url. Set parameter of indexList2protocolList to required number of images (image_stubs). Situation e.g. image_stub(size=15) but images_protocol(size=12)	7 years ago
luccioman	57a33aefb0	Removed unnecessary max counts init on empty search navigators.	7 years ago
luccioman	ef8aea7f8d	Made the dates navigator max elements number user configurable. Also used object properties on QueryParams instances, rather than using mutable class (static) properties.	7 years ago
luccioman	5d3ceb31b7	Improved search navigators counters accuracy and consistency. - added some missing increments from RWI results - decrement relevant navigator counts when solr or RWI results are evicted because of duplicates detection or constraints checked belatedly - do not compute facets when unnecessary to avoid unwanted CPU load - do not increment from facets when already done - do not rely on facets on remote solr peers requests, as most of the time only a limited part of their total results is fetched (thus also preventing unnecessary load on remote peers) - use a concurrency friendly score map for the dates navigators to prevent unwanted ConcurrentModificationExceptions. This improves the situation for the most obvious inconsistencies in search navigators counts, but more has to be done for a true accuracy (notably when query modifiers constraints are applied belatedly - after the solr or RWI retrieval request - such as the content domain constraint)	8 years ago
Michael Peter Christen	2f71005a93	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	8 years ago
Michael Peter Christen	2314f8e358	try to fix problem with error description http://forum.yacy-websuche.de/viewtopic.php?f=5&t=6023&p=33889&sid=37bc7aa029422be571b9266cdef43c52#p33889	8 years ago
luccioman	a1a0515312	Added a button to manually refresh sorting of p2p search results. As a server-side oriented alternative to the JavaScript realtime resorting feature proposed in PR #104. The goal is the same as in this PR : having the possibility to compensate for the network latency of fetching results from various peers and to obtain, once possible, a consistently ranked result set.	8 years ago
luccioman	169ffdd1c7	Finer control on max links to parse in the html parser.	8 years ago
Michael Peter Christen	30d71c6359	added usage of X-Real-IP http header to identify request IPs which came through NGINX reverse proxy configurations	8 years ago
Michael Peter Christen	7f395ef937	added image link in search results. This should be a help to make a preview of search results. The image is computed from the list of embedded images, it is always the first image in that list. In rss-type results the image is presented like <media:content medium="image" url="https://abc.xyz/logo.png"/> as defined in http://www.rssboard.org/media-rss#media-content	8 years ago
reger	2a07799ad1	Correction of `d03e2c98ea`. Fix Conjunction.addOperator to do nothing if term is empty, to prevent a query string with a repeated logical operator like "field:term AND AND field:term", possibly causing out of mem in postprocessing_doublecontent	8 years ago
reger	d03e2c98ea	Fix Conjunction.addOperator to do nothing if term is empty, to prevent a query string with a repeated logical operator like "field:term AND AND field:term", possibly causing out of mem in postprocessing_doublecontent	8 years ago
luccioman	651fad6da5	Added RSS parser support for maximum content bytes parsing limit	8 years ago
luccioman	452a17a8d5	Finer control on bounded input streams with custom stream implementation	8 years ago
luccioman	e82eaee4b6	Apply consistent behavior on HTTP resource size exceeding limit. On content size known from HTTP headers, terminates connection faster and improves error reports quality by reporting relevant message "Content to download exceed maximum value..." rather than previously "no response (NULL) for url...".	8 years ago
luccioman	8da3174867	Ensure lower case conversion consistency with any default locale. Especially for Turkish speaking users using "tr" as their system default locale : strings for technical stuff (URLs, tag names, constants...) must not be lower cased with the default locale, as 'I' doesn't become 'i' like in other locales such as "en", but becomes 'ı'.	8 years ago
luccioman	64cec2790d	Improved character encoding detection from Content-Type header. Also updated some related JavaDocs	8 years ago
Michael Peter Christen	c94a8c76bd	re-added solr synchronization hack	8 years ago
Michael Peter Christen	6fe735945d	migrated Solr 5.5 -> Solr 6.6 and from Java 1.7 -> 1.8 Also: now Version 1.921	8 years ago
luccioman	8399275142	Properly close file output streams even in exception scenarios.	8 years ago
luccioman	4e4dc6c4e5	Removed unnecessary finalize implementation. On such private classes with limited scope but with frequent instance creations and removals within the application lifecycle, implementing the finalize method is particularly unwanted as it decreases the garbage collector performance. What's more, the Object.finalize() method is now deprecated in JDK 9 and will eventually disappear from future releases (see https://bugs.openjdk.java.net/browse/JDK-8177970)	8 years ago
luccioman	d98c04853d	Ensure proper closing of file input streams.	8 years ago
luccioman	c226ded799	Fix unescape of URLs having some '%' chars but not percent-encoded	8 years ago
luccioman	88c062639b	Added some JavaDoc	8 years ago
luccioman	527d494c1a	Fixed "Unchecked conversion" compilation warnings.	8 years ago
luccioman	f66438442e	Extended Mediawiki dump import to remote URLs. When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote file now is directly streamed and processed, allowing import of several GB dumps even with a low memory remote peer, and without need to manually download the dump file first.	8 years ago
luccioman	e5c3b16748	Improved http client close time on stream processing errors.	8 years ago
luccioman	0bc868a819	Improved support for non ASCII chars in local file system URLs. Creating a MultiProtocolURL instance from a File object and then retrieving a File with getFSFile() was inconsistent with file paths containing space or non ASCII chars.	8 years ago
Michael Peter Christen	1d81b8f102	Merge branch 'master' of git@github.com:yacy/yacy_search_server.git	8 years ago
Michael Peter Christen	69081bce00	added export to elasticsearch. The export dump can easily be imported to elasticsearch using the command curl -XPOST localhost:9200/collection1/yacy/_bulk --data-binary @yacy_dump_XXX.flatjson	8 years ago
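
Several commits listed above (5db1c9155a, 7206f1ed71, 398c66f06c, 9531b83598, 8da3174867) concern locale neutral case conversion. The following is a minimal sketch of the underlying problem, not code taken from the YaCy sources: the class name LocaleNeutralCase and its method names are hypothetical, and Locale.ROOT is used here as one standard way to get locale independent behavior.

```java
import java.util.Locale;

public class LocaleNeutralCase {

    /** Lower-cases a technical string (host, scheme, file ext) independently of the default locale. */
    public static String lowerNeutral(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    /** Lower-cases with Turkish rules, as a plain toLowerCase() would do on a "tr" system. */
    public static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        String host = "WIKI.EXAMPLE.ORG";
        // Locale neutral: 'I' becomes 'i'
        System.out.println(lowerNeutral(host));
        // Turkish rules: 'I' becomes dotless 'ı' (U+0131), which no longer matches the ASCII host name
        System.out.println(lowerTurkish(host));
    }
}
```

With Turkish rules the lower-cased host contains 'ı' instead of 'i', so comparisons against ASCII constants such as known TLDs or schemes would fail; passing an explicit locale avoids depending on the system setting.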


1257 Commits (117a85987989210f3b3295778e12bbaf2f5cd733)
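
Commit 929e0d6eae in the list above separates full equality from a prefix test on byte sequences. The sketch below illustrates that distinction under stated assumptions: ByteCompare and its methods are hypothetical names operating on plain byte arrays, and they do not reproduce the actual YaCy ByteBuffer class.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteCompare {

    /** True only when both arrays have the same length and the same content. */
    public static boolean sameContent(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    /** True when buffer begins with every byte of prefix: a prefix test, not equality. */
    public static boolean startsWith(byte[] buffer, byte[] prefix) {
        if (prefix.length > buffer.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (buffer[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] buffer = "yacy-search".getBytes(StandardCharsets.US_ASCII);
        byte[] prefix = "yacy".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWith(buffer, prefix));
        System.out.println(sameContent(buffer, prefix));
    }
}
```

A method named equals() that only checks a prefix would report two different sequences as equal, which is why naming it startsWith() and delegating true equality to Arrays.equals() is the less surprising design.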