yacy_search_server

Commit Graph

Author	SHA1	Message	Date
luccioman	c35d0568b6	Support for preferred https in peers communication on more operations	7 years ago
luccioman	e914d17aca	Updated call to function deprecated since commons-codec version 1.11	7 years ago
luccioman	a3ec7a7a5f	Added analysis optional setting to compute statistics on text snippets Thus producing some basic stats on processing times for snippets generation and counts on snippets per source type.	7 years ago
luccioman	1889d484de	Added Solr HTML writer support for responses from remote instances	7 years ago
luccioman	2af3bf79c7	Improve rendering of remote Solr admin URLs - properly handle IPv6 loopback address replacement - replace loopback address or host only when accessing peer remotely - replace loopback part with the peer hostname as requested rather than with its seed public IP as this works better for Intranet mode and when peer is behind a reverse proxy.	7 years ago
luccioman	bb74de7d59	Removed unnecessary "/admin" suffix from remote Solr instance admin URL For quite a long time now, the Solr /admin URL suffix indeed redirects to the Solr base context (see https://issues.apache.org/jira/browse/SOLR-3337)	7 years ago
luccioman	0d34034f17	Ensure an embedded Solr is available for Solr dump/restore operations Otherwise, these operations triggered NullPointerException when only an external Solr index is attached.	7 years ago
luccioman	d92b191942	Ensure no remote Solr is attached before "Shut Down and Re-Start Solr" Otherwise once this operation is applied, the remote Solr instance(s) are disconnected and the embedded Solr is connected even if disabled by setting "core.service.fulltext". Also use constants for related default setting values.	7 years ago
luccioman	26d8ad591c	Adjusted Solr select servlet output when using an external Solr only - Use the EnhancedXMLResponseWriter only when requested output is "exml" - Use the Standard Solr writers when possible, for example for json, xml or javabin output formats - Return an error when the requested format cannot be rendered with an external Solr server only Important : this modification is necessary for peers using exclusively an external Solr server to be reachable as robinson targets in p2p search, as the binary format ("javabin") is the default Solr exchange format for peers. Before this, when a peer requested a remote one attached only to an external Solr (no embedded one), it ended with an "Invalid type" error, as the remote peer answered with xml although binary format was requested.	7 years ago
luccioman	69690c13a0	Optionally allow external Solr server with self-signed certificate This is necessary when you want to attach to a dedicated external Solr server protected with basic http authentication and requested over https but having only a self-signed certificate.	7 years ago
luccioman	b882f85900	Fixed NPE case in Solr select servlet on external Solr only setup Regression introduced with commit `0d7625ecfb`	7 years ago
luccioman	2fd4d05e2f	Added a shared Java constant for setting key server.servlets.called	7 years ago
luccioman	ba9cd14516	Removed hard-coded patch for Solr 5.0 on ranking boost function The current default boost function (`recip(ms(NOW,last_modified),3.16e-11,1,1)`) for the Date ranking profile is indeed working fine. What can trigger the error `unexpected docvalues type NUMERIC for field 'last_modified'` is the previous default boost function (quite old now) or any custom one using the Solr `ord` or `rord` functions on the last_modified field. Then the problem was that the migration code in the Switchboard supposed to detect the old date boost function was incorrect (one trailing right parenthesis in excess), so the deprecated function remained. This fixes issue #169.	7 years ago
luccioman	fb3032c530	Added a crawl filtering possibility on documents Media Type (MIME)	7 years ago
luccioman	e45afedee4	Added support for enclosures (media links) to the RSS loader	7 years ago
luccioman	aaefd5219c	Reduce log verbosity of RSS loader on feed items with no link	7 years ago
luccioman	cf62b571bd	Added RSS reader support for `enclosure` feed item sub element. Enclosure element (see http://www.rssboard.org/rss-specification#ltenclosuregtSubelementOfLtitemgt ) can be seen for example in podcasts feeds.	7 years ago
luccioman	e5f5de0fc7	Added some JavaDoc to the RSSMessage class.	7 years ago
luccioman	0d7625ecfb	Handle Solr fields restrict and alias in YaCy html and exml writers Thus allowing for example to read more easily the local Solr index full metadata in HTML by restricting if desired to some fields of interest. See Solr documentation about the 'fl' (Field List) parameter at https://lucene.apache.org/solr/guide/6_6/common-query-parameters.html#CommonQueryParameters-Thefl_FieldList_Parameter	7 years ago
luccioman	3da2739bbd	Parse and index more common audio metadata text tag fields.	7 years ago
luccioman	846aba00fa	Added parsing of URLs possibly present in audio metadata tags	7 years ago
Michael Peter Christen	187075b878	added nav filter	7 years ago
luccioman	bcbd0ae1a4	Enabled partial parsing of audio resources.	7 years ago
luccioman	fda0189613	Updated audio file extensions with ones recently added to audioTagParser	7 years ago
luccioman	978e2be95b	Give other parsers a chance on audioTagParser error As done in all other parsers, eventually falling back in the end to the genericParser which creates a minimal index entry.	7 years ago
luccioman	9e5846a26e	Small fix on svg parser error message	7 years ago
luccioman	11611dbdcf	Reuse existing File copy function to handle audio parser tmp files	7 years ago
luccioman	f77f8f40f9	Factored audio parser tag processing	7 years ago
luccioman	9a7a353d0e	Removed some unnecessary intermediate list creation on array copy.	7 years ago
luccioman	fb6457f5bc	Fixed NPE case when an audio resource is parsed with a null tag	7 years ago
luccioman	c3ff50c17a	Updated the list of audio file formats supported by the audioTagParser Follows upgrade to Jaudiotagger dependency to version 2.2.5.	7 years ago
luccioman	1b90479a76	Added missing vocabulary navigator increment on results from RWI	7 years ago
luccioman	46c9da6428	Allow creation of vocabularies from remote CSV file URLs.	7 years ago
luccioman	17c7a85f18	Make StreamResponse usable in Java try-with-resources statements	7 years ago
luccioman	b67742336e	Provide user interface messages on vocabulary creation read/write errors	7 years ago
luccioman	3e8dd90211	Use https rather than http in links and queries to openstreetmap.org	7 years ago
luccioman	3a973dbb23	Removed unused import	7 years ago
luccioman	e9527cd0e5	Reuse the same Pattern instance when matching multiple key/values	7 years ago
luccioman	dbf4c1cd76	Improved blacklist entries editing operations : - Fixes issue #160 : handle properly syntax exceptions with a user friendly message - Fixes loss of information on multiple blacklist entries editions - Fixes loss of entries when moving entries from one list to another	7 years ago
reger	87077b8fb6	Adjust and move Language Navigator to be member of the navigator plugin list.	7 years ago
luccioman	eb20589e29	Fixed issue #158 : completed div CSS class ignore in crawl	7 years ago
luccioman	0cdee4e26a	Fixed loss of "meanCount" search param when using facets or page buttons Then on new search queries, no suggestions at all could be displayed.	7 years ago
luccioman	117a859879	Do not clear all search modifiers when unselecting one modifier. Previously, when clicking a selected facet in the search results page to unselect it, all other selected modifiers/facets were also removed.	7 years ago
luccioman	33593c22e9	Fixed loss of other modifiers on keywords/tags search navigation links	7 years ago
luccioman	a9dc0874c0	Remove old query terms from search results suggestions links. Especially when old terms were misspelled, suggestions links then provided most of the time empty results.	7 years ago
luccioman	9412881230	Added basic support for autotagging microdata annotated item types. With the appropriate vocabulary settings in Vocabulary_p.html page, this can produce Vocabulary search facets displaying item types referenced in html documents by microdata annotation. Tested notably, but not limited to, vocabulary classes/types defined by Schema.org and Dublin Core.	7 years ago
luccioman	5a14d34a7d	Refactoring : documented and extracted autotagging processing functions.	7 years ago
luccioman	58b9834729	Added HTML microdata typed items parsing capability. This adds the possibility for the HTML parser to gather typed items URLs annotated in HTML tags with itemscope and itemtype attributes (see microdata specification https://www.w3.org/TR/microdata/ ), notably Types from the schema.org vocabulary, but also Types/Classes from any other vocabulary, such as the common ones listed in the RDFa core context ( https://www.w3.org/2011/rdfa-context/rdfa-1.1.html ).	7 years ago
luccioman	80fb1026d0	Create recrawl requests with the relevant crawl profile. Recrawl default profile was previously effectively used for crawl stacker acceptance check, but request entries were indeed still created with the "snippetGlobalText" profile.	7 years ago
luccioman	539925a275	Added a utility to generate/update XLIFF master file from lng files.	7 years ago
luccioman	fa6d030b0b	Moved dbtest to the test source folder.	7 years ago
luccioman	6cd3847d0a	Fixed NullPointerException case on Table init with relative file path. Can occur for example when running dbtest with relative test table file name (without explicit parent folder).	7 years ago
luccioman	28883d8a71	Shutdown daemon threads at the end of dbtest	7 years ago
luccioman	929e0d6eae	Replaced improper ByteBuffer.equals() implementation by Arrays.equals() Renamed also ByteBuffer.equals() to startsWith() as this is the appropriate function implementation semantics.	7 years ago
luccioman	46b5249c20	Removed time condition on HostBalancer initialization in JUnit test. Its initialization in main application usage remains asynchronous.	7 years ago
luccioman	8b572b7337	Commit Solr index before simulating or starting recrawl job. This ensures up-to-date simulation query results, and recrawl processing.	7 years ago
luccioman	733cacdbb8	Revised the RDFaParser main launcher for minimal proper operation. This parser is still not enabled in the main text parsers list. More would have to be done to make it functional.	7 years ago
luccioman	7baa99f26f	Fixed stored URL in web cache when redirection(s) occurs. Associate cached content to the last redirection location, instead of the first URL of a redirection(s) chain : - for proper base URL processing in parsers (fixes mantis 636 - http://mantis.tokeek.de/view.php?id=636) - to prevent duplicated content in Solr index when recrawling a redirected URL	7 years ago
luccioman	9ddf92d143	Removed unnecessary reflection usage for workflow tasks. This improves code readability and maintainability (call hierarchies are easier to read) and possibly performance.	7 years ago
luccioman	897d3d30cc	Added new recrawl job profile to the list of default crawl profiles	7 years ago
luccioman	9624516bf8	Refresh recrawl job profile threshold date like other default profiles	7 years ago
luccioman	b712a0671e	Added a specific default crawl profile for the recrawl job. - with only light constraint on known indexed documents load date, as it can already be controlled by the selection query, and the goal of the job is indeed to recrawl selected documents now - using the iffresh cache strategy	7 years ago
luccioman	adf3fa493d	Added comments about crawl profiles recrawl cycles	7 years ago
luccioman	3638e16c2e	More comprehensive log on rejected recrawls caused by date constraint	7 years ago
luccioman	d47afe6fab	Use a constant for crawler reject reason prefix with specific processing	7 years ago
luccioman	4e03335625	Added more details to the recrawl job report	7 years ago
luccioman	6425963cee	Fixed internal tables exact value match iterator	7 years ago
luccioman	0c9e0b3566	Record recrawl calls to make them schedulable	7 years ago
luccioman	433e241e4f	Added a report info box about the last terminated recrawl job, if any For easier monitoring of recrawls.	7 years ago
luccioman	b2af25b14f	Added a stop condition to the Recrawl busy thread	7 years ago
luccioman	421728d25a	Made possible to customize selection query before launching a recrawl	7 years ago
luccioman	36e9b1c5b3	Fixed SegmentTest test case time-dependent occasional failures As highlighted by latest automated Travis builds.	7 years ago
luccioman	8a4ea1c11e	Added UI switch to control content domain constraint per search request	7 years ago
reger	f8071ac8ae	Make TokenizedStringNavigator (used for keyword search facet) active check case insensitive. As keywords are compared lower case, make sure user input keyword:Key or keyword:key will be shown as active in facet entry key.	7 years ago
luccioman	e6907fdab3	Added optional search parameter/setting to control content domain filter Thus allowing to choose at configuration or per search request, whether extending or not results beyond strict content domain filter (image, video, audio or application). Related graphical controls to be added to user interface.	7 years ago
luccioman	f52217c939	Enable full size images preview for users with extended search rights	7 years ago
luccioman	09c4ee56a7	Added optional https support for remote crawl and profile operations	7 years ago
luccioman	5db1c9155a	Do locale independent case conversion on hosts, schemes, and file exts. Required for proper operation when the default system locale is Turkish, as dotless and dotted i characters have specific case conversion rules in this language.	7 years ago
luccioman	1c4803e40a	Enable optional https support for /yacy/transferURL API calls. Also updated some Javadoc and consistently use Switchboard instance as a constructor parameter where relevant.	7 years ago
luccioman	c6e1befbca	Restored peer URL host name stripping removed from previous commit. Still useful for peers with IPv6 addresses.	7 years ago
luccioman	17e004599d	Started implementing optional https preference for protocol operations Introduced through the new configurable setting network.unit.protocol.https.preferred, defaulting to false for now. Lets you choose to prefer using https when available on remote peers to perform YaCy protocol operations including notably hello or transferRWI. Not yet implemented for every YaCy protocol operation.	7 years ago
Michael Peter Christen	b907819cb4	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	7 years ago
Michael Peter Christen	25573bd5ab	added a crawl filter based on <div> tag class names When a crawl is started, a new field to exclude content from scraping is available. The field can be identified with the class name of div tags. All text contained in such a div tag where the configured class name(s) match are not indexed, while the remaining page is indexed.	7 years ago
luccioman	d95b288f19	Removed use of deprecated Jetty IPAccessHandler for client filtering. Upgraded to InetAccessHandler. Added InetPathAccessHandler extension to InetAccessHandler to maintain path patterns capability previously available in IPAccessHandler but lost in InetAccessHandler. Filtering on IPv6 addresses is now supported. Support for deprecated pattern formats such as "192.168." and "192.168.1.1/path" has been removed, but startup automated migration should convert such patterns possibly present in serverClient.	7 years ago
reger	cc7a93e6b6	remove deprecated jetty continuation class from urlproxyservlet (was a long time carry over, while not supporting async requests)	7 years ago
Michael Peter Christen	607b39b427	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git Conflicts: htroot/yacysearchitem.java	7 years ago
Michael Peter Christen	4355de0f3c	(more!) evaluation of XRealIP from nginx reverse proxy	7 years ago
luccioman	a4494d6e01	Improved support for internationalized domain names on "site:" modifier Allow typing directly internationalized domain names including non ASCII characters in the search field. Search is done using the ASCII Compatible Encoding (ACE) representation.	7 years ago
luccioman	d07006bac4	Do locale independent case conversion on "filetype:" query modifier.	7 years ago
luccioman	8fbf25d1ed	Made "site:" query modifier case insensitive.	7 years ago
luccioman	867388e05b	Refactored 'site:' query modifier parsing into a dedicated function.	7 years ago
luccioman	c9d80b5b77	Prefer fine URL match over approximate URL mask regex on final filtering Also prevent adding a redundant and CPU costly Solr url mask filter query when possible	7 years ago
luccioman	0a120787e3	Improved accuracy of URLs search filters : protocol, tld, host, file ext	7 years ago
luccioman	d1c7dfd852	Fixed URL parsing with fragment and empty path	7 years ago
luccioman	e07ef1b610	Apply tld query modifier on Solr host_s mandatory field. The filter has thus much more chances to be effective than when applied on the optional field host_dnc_s.	7 years ago
luccioman	478e92deff	Fixed url mask filter generated when protocol modifier is not null	7 years ago
luccioman	29de4a65d7	Refactored url mask filter build from query modifiers For better readability and easier unit testing.	7 years ago
reger	d5a75537e4	remove redundant setting of timeout for remoteinstance and replace deprecated updatesolrclient instantiation with recommended builder	7 years ago
luccioman	f01aac31fd	Made possible to use https for remote search on peers with SSL enabled. Default is still http to prevent any regressions, but a new setting is available to choose https as the preferred protocol to perform remote searches. New configuration setting 'remotesearch.https.preferred' is manually editable in yacy.conf file or in Advanced Properties page (/ConfigProperties_p.html). Should be enabled as default in the future for improved privacy. Https could also eventually be used for other peers communications.	7 years ago
luccioman	e2f6427a63	Added a basic JUnit test for the Visio parser (vsdParser)	7 years ago
luccioman	1e9cdaabd4	Do locale neutral case conversion of HTML charset name. Required to properly run on systems with default locale set to Turkish language, as with this locale the 'i' character has different upper and lower case flavors than with other locales.	7 years ago
luccioman	7206f1ed71	Do locale neutral case conversions on domain names. Required to properly run on systems with default locale set to Turkish language, as with this locale the 'i' character has different upper and lower case flavors than with other locales.	7 years ago
luccioman	398c66f06c	Do locale neutral case conversions in MultiProtocolURL For any relevant URL parts : host name, URL scheme, session ids or technical parts (see https://url.spec.whatwg.org/#url-writing and https://tools.ietf.org/html/rfc3986 for current standard references). Remaining locale sensitive conversion used for detection of URL word components in urlComps() makes sense but using detected language would be preferable to using the default system locale.	7 years ago
luccioman	9531b83598	Do locale neutral case conversions in Classification Required for people using Turkish language as their default system locale, as with this locale the 'i' character has different upper and lower case flavors than with other locales.	7 years ago
luccioman	d22fc0d0a2	Updated lists of known sponsored and country-code TLDs. Using current IANA reference list at https://www.iana.org/domains/root/db . As for previous update on known generic TLDs list, the generated URL hashes on these domains stay the same but it improves performance of URL hash computation for URLs on these domains.	7 years ago
luccioman	ac209cac2e	Updated the generic top-level known domains list. Using current IANA reference list at https://www.iana.org/domains/root/db The generated URL hashes on these domains stay the same but performance is greatly improved as a DNS resolve request is required on URL hash computation when the TLD part of the host name is unknown. Hash computation mean time measured on 1541 sample URLs (one on each TLD) and a computer with a DSL connection : about 230ms before change, then only 20ms.	7 years ago
luccioman	938d8a9731	Added some JavaDoc	7 years ago
luccioman	e0eda84c24	Remove old hard-coded holiday dates from DateDetection class. Replaced with rules relative to the current year, as already done for a part of the supported dates.	7 years ago
luccioman	cb10daba92	Renamed Chinese & Greek lng files using ISO639-1 codes. Previously named with their ISO 3166-1 country code : this way, when setting language to "Browser" in ConfigBasic.html, it didn't work properly when browser preferred language was Chinese or Greek as their respective language codes are "zh" and "el" (not "cn" and "gr" which are their country codes)	7 years ago
luccioman	46f37e38dc	Customized Threads with generic name for easier monitoring.	7 years ago
luccioman	046be566e1	Updated a license header typo.	7 years ago
Apply55gx	3c905a2a5c	fix typo	7 years ago
luccioman	8e732d437c	Enable HTTP Digest authentication for non admin users. Also ensure authentication is not lost by Digest timeout when navigating between index.html and search results page. This way, running searches with extended features on a remote peer or a password protected peer works with a regular user (with "Extended search" rights). When authenticating on the search page with a user without "Extended search" rights, it appears as authenticated, but has just its usual access to the public search features.	7 years ago
luccioman	d8eaf621cc	Fixed blacklist returned location URL on empty parameters	7 years ago
luccioman	af198b990b	Added an optional login link/status to the search public top nav bar. Thus allowing a more convenient way (without the need to go to the admin section) to login when searching on your remote or password protected peer and benefit from extended search features such as Heuristics, Bookmarking or JavaScript resorting. Can be disabled using the ConfigSearchPage_p.html.	7 years ago
luccioman	1de86cf1bf	Fixed JPEG snapshot resizing when running on OpenJDK. Resizing JPEG snapshot images through /api/snapshot.jpg failed when running on OpenJDK, but rendered successfully with an Oracle JDK. Details in mantis 772 ( http://mantis.tokeek.de/view.php?id=772 ). Removing any alpha component (useless in snapshot images) from the rendered resized image solves the issue.	7 years ago
luccioman	a17a418e78	Fixed NullPointerException cases on snapshot images parsing.	7 years ago
luccioman	285f0d6a39	Consistently encode snapshot image with format requested on the API. Previously, calling /api/snapshot.png rendered JPEG encoded images.	7 years ago
luccioman	34ca73d61b	Fixed a NullPointerException case on images encoding errors.	7 years ago
luccioman	7c319c841e	Fixed pdf2image conversion with imagemagick on PDFs having transparency The target image format (jpeg) doesn't support transparency, so the Html2ImageTest produced unusable black images when run on a Linux machine having the imagemagick package installed.	7 years ago
luccioman	6e497241f7	Properly close resources (even on error) on OS and ThreadDump classes. Also updated some JavaDoc and main() function usage message on the same ones.	7 years ago
luccioman	fe75f326d8	Fixed ProfilingGraph calculation integer overflows and added test class. Complementary to fix proposed in PR #128 by @otteresk.	7 years ago
luccioman	5d1ef8fdfc	Merge branch 'master' of https://github.com/otteresk/yacy_search_server	7 years ago
luccioman	8303e15419	Reduced number of search navigators refresh requests in JS resort mode The SearchEvent listens to changes on each of its navigators, and the information about their overall state is sent with each fetched search item (as a "data-nav-generation" attribute). Then the browser can regularly fetch a fresh version of yacysearchtrailer.html only if necessary (when that nav-generation value changes).	7 years ago
luccioman	dbff7b14fc	Add a configurable limit to tags initially displayed in search results When the limit is reached, a button allows expanding/collapsing remaining tags. When this feature is activated without a limit to the number of displayed tags, when encountering search results with a very large number of keywords, the results page can become almost unusable (very long vertical scrollbar)	7 years ago
Andreas	0c4db9eef0	Merge pull request #3 from yacy/master Fork update	7 years ago
reger	c31d94664a	Update deprecated SolrInputDocument.addField() with boost value remove unused SchemaConfiguration.getDate (as it is designed to return only past dates which might be unexpected for general configuration schema)	7 years ago
luccioman	7e271f9cf5	Updated travis config : install ghostscript, required for Html2Image	7 years ago
luccioman	32c9dfa768	Added partial bzip2 stream parsing support and bzipParser Junit test	7 years ago
luccioman	dd9cb06d25	Fixed RWI distance calculation on multi words search queries. Distance was lost when storing/retrieving references to intermediate result container. Now all JUnit tests are again successfully passing!	7 years ago
luccioman	6b11bf3a12	Fixed NullPointerException case on 'Browser' lang selection Occurred when English was the only active language, then making the ConfigBasic.html page unusable until manually modifying the locale.language setting.	7 years ago
reger	ae1c675c85	fix array out of bounds in YJsonResponseWriter and OpensearchResponseWriter on recreation of image url. Set parameter of indexList2protocolList to required number of images (image_stubs) Situation e.g. image_stub(size=15) but images_protocol(size=12)	7 years ago
otter	73d1d577fd	prevent integer overflow in chartDot for nodes with a big index	7 years ago
otter	4e2ccdfcac	prevent integer overflow in chartLine	7 years ago
luccioman	27ab733685	Ensure private search features are not lost on Digest auth timeout This is a fix for mantis 766 ( http://mantis.tokeek.de/view.php?id=766 ) Since the upgrade to Digest authentication, access to protected search features was indeed disabled once the Digest nonce timed out. After Digest auth timeout the browser no longer sent authentication information and as the search results page is not private, protected features were simply hidden without asking the browser again for authentication. Adding a supplementary parameter when accessing the search results as authenticated fixes this.	7 years ago
reger	ba60f65040	Adjust filetype: query modifier parameter to lower case to prevent mismatch on user input with mixed case Internally file extension are always compared lowercase.	7 years ago
luccioman	57a33aefb0	Removed unnecessary max counts init on empty search navigators.	7 years ago
luccioman	ef8aea7f8d	Made the dates navigator max elements number user configurable. Also used object properties on QueryParams instances, rather than using mutable class (static) properties.	7 years ago
luccioman	9e86d183b8	Disable manual search results resorting when resorting is done with JS Also added a constant for the js resorting setting key.	7 years ago
luccioman	66cb9c4ff9	Added Solr filter queries for audio, video and application domains Inspired from the existing one used on image search, and consistent with post filtering on content domain applied in SearchEvent.addNodes(). These filters are quite simplistic but at least audio, video or application search now return results. Previously, when filtering on these content domains, many results pages (and often even the first page) were empty while the total results count suggested that results should be available. This was because filtering on domain was only applied AFTER requesting Solr indexes.	7 years ago
luccioman	5d3ceb31b7	Improved search navigators counters accuracy and consistency. - added some missing increments from RWI results - decrement relevant navigator counts when solr or RWI results are evicted because of duplicates detection or constraints checked belatedly - do not compute facets when unnecessary to avoid unwanted CPU load - do not increment from facets when already done - do not rely on facets on remote solr peers requests, as most of the time only a limited part of their total results is fetched (thus also preventing unnecessary load on remote peers) - use a concurrency friendly score map for the dates navigators to prevent unwanted ConcurrentModificationExceptions This improves the situation for the most obvious inconsistencies in search navigators counts, but more has to be done for a true accuracy (notably when query modifiers constraints are applied belatedly - after the solr or RWI retrieval request - such as the content domain constraint)	7 years ago
luccioman	8e4f31bdc7	Updated internal ISO 639-1 language codes with latest standards. Includes 54 language code additions, some name modifications, and marking a few deprecated.	7 years ago
luccioman	a28428047a	Fixed count of filtered results from local solr. Was inadequately modified in my previous related commits (making next pages buttons unavailable in Search portal mode), as SearchEvent.local_solr_available did not count the total filtered results but only the ones within the currently fetched result page(s).	7 years ago
Michael Peter Christen	2f71005a93	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	7 years ago
Michael Peter Christen	2314f8e358	try to fix problem with error description http://forum.yacy-websuche.de/viewtopic.php?f=5&t=6023&p=33889&sid=37bc7aa029422be571b9266cdef43c52#p33889	7 years ago
luccioman	3c9df6e0ce	Use local solr filtered results in total search results count. This modification has indeed low incidence as any query modifiers are already applied when requesting the local solr index. It mainly impacts duplicates detected with results from remote peers. Also updated javadocs for clarification.	7 years ago
luccioman	a1a0515312	Added a button to manually refresh sorting of p2p search results. As a server-side oriented alternative to the JavaScript realtime resorting feature proposed in PR #104. The goal is the same as in this PR : having the possibility to compensate for the network latency of various peers results fetching and obtain, once possible, a consistently ranked result set.	7 years ago
luccioman	4eba88f2ff	Removed some unnecessary uses of java.lang.reflect api. This improves code browsing and readability, making search by references or call hierarchy IDE features more accurate.	7 years ago
luccioman	da3dbf9ea1	Use Javadoc style comments on SearchEvent properties. For better code readability and understanding.	7 years ago
luccioman	c6ae87168a	Added unit tests on the gzip parser.	7 years ago
luccioman	169ffdd1c7	Finer control on max links to parse in the html parser.	7 years ago
luccioman	e41d046a9d	Improved parsing support for OOXML spreadsheets (.xlsx) As reported by edycop in mantis 765 ( http://mantis.tokeek.de/view.php?id=765 ), parsing of xlsx files was quite incomplete. Now properly support "Shared String Table" entry in Office Open XML spreadsheets, and also detect embedded URLs. Integrating the Apache poi-ooxml library could be an option for finer OOXML formats support, but their SAX style parsing example ( http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api ) tends to show that a custom SAX handler is still efficient for lightweight and low memory footprint processing.	7 years ago
reger	51a4e03c93	Allow to stop currently running warc import (stop button)	7 years ago
luccioman	6cec2cdcb5	Use unredirected robots.txt URL when adding an entry to the table.	7 years ago
luccioman	3f0446f14b	Ensure proper synchronous robots entry retrieval on first check. Previously, when checking for the first time the robots.txt policy on an unknown host (not cached in the robots table), result was always empty in the /getpageinfo_p.xml api and in the /CrawlCheck_p.html page. Next calls returned however the correct information.	7 years ago
luccioman	b23a563065	Prevent search result failure on incomplete images information. Complements the recent modification related to images in commit `7f395ef`. Unfortunately many documents metadata fetched from the freeworld p2p network have only partial information about embedded images. Without proper error handling, this caused many searches in p2p mode to fail completely.	7 years ago
Michael Peter Christen	30d71c6359	added usage of X-Real-IP http header to identify request IPs which came through NGINX reverse proxy configurations	7 years ago
Michael Peter Christen	f45378c11c	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	7 years ago
Michael Peter Christen	7f395ef937	added image link in search results This should be a help to make a preview of search results. The image is computed from the list of embedded images, it is always the first image in that list. In rss-type results the image is presented like <media:content medium="image" url="https://abc.xyz/logo.png"/> as defined in http://www.rssboard.org/media-rss#media-content	7 years ago
luccioman	780173008e	Implemented partial stream parsing of tar archives. Also added JUnit tests for the tar parser and fixed unwanted use of the tar parser as a fallback on files included in a tar archive.	7 years ago
luccioman	acab6a6def	Also handle text content when parsing XML within limits.	7 years ago
reger	2a07799ad1	Correction of `d03e2c98ea` Fix Conjunction.addOperator to do nothing if the term is empty. This prevents query strings with a repeated logical operator like "field:term AND AND field:term", possibly causing out of memory in postprocessing_doublecontent	7 years ago
reger	d03e2c98ea	Fix Conjunction.addOperator to do nothing if the term is empty. This prevents query strings with a repeated logical operator like "field:term AND AND field:term", possibly causing out of memory in postprocessing_doublecontent	7 years ago
reger	b6a41df4f7	Remove deprecated YaCyProxyServlet was replaced by UrlProxyServlet	7 years ago
luccioman	8a94fef9e0	Prevent unwanted cached bytes duplication on stream parsing.	7 years ago
reger	4979439e87	Skip public post of jre version. Added to determine switch to java8 `596b5dfa59`	7 years ago
reger	e918ec199e	Replace deprecated ConcurrentHashSet with recommended Java8 ConcurrentHashMap.newKeySet() in postprocessDocuments()	7 years ago
reger	fb71994342	Harmonizing use of xml reader / sax parser in XMLBlacklistImporter eliminating the need for lib/xercesImpl.jar	7 years ago
reger	275d65fffe	Patch last_modified date with internal FirstSeenTime() if no date is provided, to make sure updated documents are indexed with their last-modified date as provided in the current crawl. (Always patching moddate with firstseen might bear the risk of missing actual updates.)	7 years ago
reger	d1b23afed6	Remove obsolete Protocol parameter ttl (time to live) not interpreted in target yacy/query.html also Protocol.querySeed() not used and parameter not interpreted in target servlet yacy/query.html	7 years ago
reger	15d78b1064	Replace deprecated getIP with getIPs in Protocol transferURL() and getProfile(). Remember used ip for error handling and departInterface	7 years ago
reger	ed36b47bec	Replace one more deprecated peerDeparture in Protocol.transferIndex() by moving/using interfaceDeparture() in transferRWI()	7 years ago
luccioman	0ee8c030c4	Log an error when Solr folder migration fails for some reason.	7 years ago
luccioman	5a646540cc	Support parsing gzip files from servers with redundant headers. Some web servers provide both 'Content-Encoding : "gzip"' and 'Content-Type : "application/x-gzip"' HTTP headers on their ".gz" files. Failing on such resources was annoying, as they are not so uncommon, although non-conforming (see RFC 7231 section 3.1.2.2 for "Content-Encoding" header specification https://tools.ietf.org/html/rfc7231#section-3.1.2.2)	7 years ago
luccioman	11a7f923d4	Distinguish response parsing failures from unexpected exceptions.	7 years ago
luccioman	eda7b0aeb6	Merge branch 'master' of https://github.com/yacy/yacy_search_server	7 years ago
reger	3005be7349	Clean up unmaintained and unused AugmentParser trail.	7 years ago
luccioman	cb4f1358e1	Added gzip parser support for max content bytes limit	7 years ago
luccioman	5216c681a9	Added HTML parser support for maximum content bytes parsing limit	7 years ago
luccioman	4aafebc014	Merge pull request #122 from Scarfmonster/patch-1 I also reproduced the issue, and the fix is working fine. Thanks @Scarfmonster	7 years ago
luccioman	651fad6da5	Added RSS parser support for maximum content bytes parsing limit	7 years ago
luccioman	452a17a8d5	Finer control on bounded input streams with custom stream implementation	7 years ago
luccioman	f8f1959ebb	Added parsing within bounds implementation to the generic parser.	7 years ago
luccioman	e0f400a0bd	Support trying multiple parsers even when streaming on large resources.	7 years ago
luccioman	1e84956721	Support loading local files with a per request specified maximum size. Consistently with the HTTP loader implementation.	7 years ago
luccioman	f369679d1c	Fixed read/copy on input streams reading sometimes less than expected.	7 years ago
luccioman	bf55f1d6e5	Started support of partial parsing on large streamed resources. Thus enable getpageinfo_p API to return something in a reasonable amount of time on resources over MegaBytes size range. Support added first with the generic XML parser, for other formats regular crawler limits apply as usual.	7 years ago
luccioman	90a7c1affa	HTML parser : removed unnecessary remaining recursive processing Recursive processing was removed in commit `67beef657f`, but one remained for anchors content (likely omitted from refactoring). It is no longer necessary : other links such as images embedded in anchors are currently correctly detected by the parser. More annoying : that remaining recursive processing could lead to almost endless processing when encountering some (invalid) HTML structures involving nested anchors, as detected and reported by lucipher on YaCy forum ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6005 ).	7 years ago
reger	e6e20dab52	upd to Jetty 9.4.6.v20170531 Modify loginservice to the changes in Jetty, partially based on pull request #101 https://github.com/yacy/yacy_search_server/pull/101 by @automenta	7 years ago
luccioman	dcc56318bb	Made remote search max system load limits configurable from UI. As reported by davide on YaCy forums ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6004 ) when the system is on high load, unless reading carefully YaCy configuration file, it could be difficult to understand why remote search results are not fetched.	7 years ago
reger	ddd13b776d	Add keyword constraint to rwi query result filter To discard rwi results not matching query keyword: parameter	7 years ago
luccioman	e82eaee4b6	Apply consistent behavior on HTTP resource size exceeding limit. On content size known from HTTP headers, terminates connection faster and improves error reports quality by reporting relevant message "Content to download exceed maximum value..." rather than previously "no response (NULL) for url...".	7 years ago
luccioman	0b75e92ac2	Do not wrap unnecessarily loader IOExceptions in IOExceptions	7 years ago
luccioman	433bdb7c0d	Respect maxFileSize limit also when streaming HTTP and when relevant. Constraint applied consistently with HTTP content full load in byte array.	7 years ago
luccioman	9b1bb2545e	Refactored plain-text URLs detection implementation. For faster processing (measured about 2 times faster on many real-world examples) and more advanced detection (previous algorithm detected only URLs separated from the rest of the text by a space character).	7 years ago
luccioman	8da3174867	Ensure lower case conversion consistency with any default locale. Especially for Turkish speaking users using "tr" as their system default locale : strings for technical stuff (URLs, tag names, constants...) must not be lower cased with the default locale, as 'I' doesn't become 'i' like in other locales such as "en", but becomes 'ı'.	7 years ago
luccioman	286f3018bd	Made mime type and extension normalization locale independent. Previously, upper cased mime type was incorrectly normalized when the default locale is Turkish.	7 years ago
luccioman	319231a458	Added a generic XML parser, able to parse elements text and URLs. This parser adds support for any XML based format other than already supported XML vocabularies such as XHTML, RSS/Atom feeds... It will eventually be used as a fallback if one of these specific parsers fails, before falling back to the existing genericParser which extracts not that much useful information except URL tokens.	7 years ago
Ryszard Goń	3cedbbd4ed	Wrong password was removed after the SSL certificate import Removing the keystore password will prevent ssl from working after the next restart. The certificate password should be removed instead. Fixes http://mantis.tokeek.de/view.php?id=687	8 years ago
luccioman	64cec2790d	Improved character encoding detection from Content-Type header Also updated some related JavaDocs	8 years ago
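
The locale issue described in commits `8da3174867` and `286f3018bd` can be illustrated with a minimal Java sketch. The class and method names below (`LocaleSafeLowerCase`, `normalize`) are hypothetical and are not taken from the YaCy code base; only `String.toLowerCase(Locale)` and `Locale.ROOT` from the JDK are relied upon.

```java
import java.util.Locale;

public class LocaleSafeLowerCase {

    /**
     * Hypothetical helper (not YaCy's actual API): lower-cases a technical
     * token such as a tag name, URL scheme or mime type independently of
     * the JVM default locale.
     */
    public static String normalize(final String technicalToken) {
        // Locale.ROOT applies language-neutral case mapping rules
        return technicalToken.toLowerCase(Locale.ROOT);
    }

    public static void main(final String[] args) {
        final Locale turkish = Locale.forLanguageTag("tr");

        // With Turkish rules, 'I' maps to the dotless 'ı' (U+0131)
        final String turkishLower = "TITLE".toLowerCase(turkish);
        System.out.println("title".equals(turkishLower));       // false

        // With Locale.ROOT the result no longer depends on the locale
        System.out.println("title".equals(normalize("TITLE"))); // true
    }
}
```

Calling `toLowerCase()` without an argument uses the JVM default locale, which is why the behavior only showed up on systems configured with "tr".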

8767 Commits (13e42c2dd27894043892a2600679cbecfba05339)