- Added a new method to check that mandatory fields are activated on
Collection Configuration commit, consistent with the checks previously
performed at Switchboard startup and with the mandatory fields of the
default schema (a sketch of such a check follows this list).
- Reorganized the default schema and the CollectionConfiguration
enumeration: moved fields that are no longer mandatory to a dedicated
section, and moved fields enabled at startup to the mandatory section.
- Marked mandatory fields as required and rendered them with a bolder
font on the IndexSchema_p.html page.
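A minimal sketch of the shape of such a commit-time check, using a plain
Set and illustrative field names rather than the actual
CollectionConfiguration API:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Set;

    /** Illustrative commit-time validation: refuse to apply a collection
     *  configuration that leaves a mandatory Solr field disabled. */
    public class MandatoryFieldCheck {

        // Illustrative field names, not the complete mandatory list.
        private static final List<String> MANDATORY =
                Arrays.asList("id", "sku", "load_date_dt");

        public static void checkMandatoryFields(final Set<String> enabledFields) {
            for (final String field : MANDATORY) {
                if (!enabledFields.contains(field)) {
                    throw new IllegalStateException(
                            "Mandatory schema field is not enabled: " + field);
                }
            }
        }
    }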
sq=solr-querystring) to allow filtering simple text queries for
processing; remove toString for the counter parameter
Use more predefined constants in SolrServlet
As reported by Palulukas in the YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=18&t=5944&sid=dcef5b899ab4aa9b40e3a3d158c13aed#p33454)
the Index Export operation can fail, notably when the Solr index
contains one or more documents with an empty (despite required)
"load_date_dt" field.
This fixes the export failure when that situation occurs, but more
should be done to harden the checks on the minimum required fields.
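A hedged sketch of the defensive-read idea with SolrJ's SolrDocument;
the helper name and the epoch fallback are illustrative, not the actual
YaCy code:

    import java.util.Date;

    import org.apache.solr.common.SolrDocument;

    public final class ExportDateGuard {

        /** Read the required date field, falling back to a default instead of
         *  failing the whole export when the field is unexpectedly empty. */
        public static Date loadDateOrFallback(final SolrDocument doc) {
            final Object value = doc.getFieldValue("load_date_dt");
            if (value instanceof Date) {
                return (Date) value;
            }
            // Document violates the schema (required field is empty):
            // use the epoch as a neutral fallback so the export can continue.
            return new Date(0L);
        }
    }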
to avoid repeating the tokenized URL as description,
continuation of 7e09bff4a11409cabe8b
Add some Javadoc, and remove a no longer needed removal of omitted
fields in postprocessing.
As noticed by @reger24, abusive use of OpenSearch systems should be
prevented, especially when parsing and reusing their HTML results is
allowed.
The robots.txt file is now checked before requesting an external
OpenSearch system, to respect host exclusions and any crawl-delay value.
The check is also performed when trying to add a new OpenSearch URL
template through the /ConfigHeuristics_p.html admin page.
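A stand-alone, heavily simplified sketch of such a pre-check (YaCy has
its own robots.txt handling; this one only reads the "User-agent: *"
group, ignores crawl-delay, and assumes the OpenSearch URL is already
expanded to a concrete query URL):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public final class OpenSearchRobotsCheck {

        /** Returns false when a "Disallow" rule of the wildcard group matches
         *  the path of the given search URL. */
        public static boolean isAllowed(final String searchUrl) throws Exception {
            final URI target = new URI(searchUrl);
            final URL robots = new URI(target.getScheme(), target.getHost(),
                    "/robots.txt", null).toURL();
            boolean inWildcardGroup = false;
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    robots.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    final String l = line.trim().toLowerCase();
                    if (l.startsWith("user-agent:")) {
                        inWildcardGroup = l.endsWith("*");
                    } else if (inWildcardGroup && l.startsWith("disallow:")) {
                        final String path = l.substring("disallow:".length()).trim();
                        if (!path.isEmpty() && target.getPath().startsWith(path)) {
                            return false;
                        }
                    }
                }
            }
            return true;
        }
    }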
Many OpenSearch systems do not provide results as standard RSS/Atom
feeds but only as HTML.
This modification adds support for custom OpenSearch HTML results
through the use of mapping files (as already done for federated Solr
search) relying on CSS-like selectors to retrieve information from HTML
content.
An example mapping file is provided to map results from the
www.npmjs.com OpenSearch URL.
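The mapping files are read by YaCy's own parser; the following
Jsoup-based snippet, with made-up selectors and HTML, only illustrates
the CSS-like selector idea behind them:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public final class HtmlResultSelectorDemo {

        public static void main(String[] args) {
            // Made-up HTML result snippet and selectors, for illustration only.
            final String html = "<ul><li class='result'>"
                    + "<a class='title' href='https://example.org/pkg'>example-pkg</a>"
                    + "<p class='description'>An example package</p></li></ul>";
            final Document doc = Jsoup.parse(html);
            for (final Element result : doc.select("li.result")) {
                final String title = result.select("a.title").text();
                final String link = result.select("a.title").attr("href");
                final String description = result.select("p.description").text();
                System.out.println(title + " | " + link + " | " + description);
            }
        }
    }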
As discussed in PR #93 with @JeremyRand and @reger24, this new advanced
settings page includes:
- a new setting to control remote Solr responses encoding
- some existing debug settings which could not be set through the admin
user interface
On timeout, closing remote Solr requests is more appropriate than
simply using Thread.interrupt(), which is not effective in most cases.
Closing does not request a commit on the remote Solr, but it releases
HTTP connection resources and is more likely to end threads that could
otherwise wait indefinitely (a sketch of this follows after the list
below).
Other related improvements included:
- a remote peer is no longer marked as not available when the remote
search is interrupted before timeout by the cleanup job.
- added a short, fine log level trace of failing remote Solr requests
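A sketch of the close-instead-of-interrupt idea with a plain SolrJ
HttpSolrClient and a placeholder URL (YaCy's own remote instance
wrappers differ):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public final class RemoteSolrTimeoutDemo {

        public static void main(String[] args) throws Exception {
            final HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/collection1").build(); // placeholder URL
            final ExecutorService pool = Executors.newSingleThreadExecutor();
            final Future<QueryResponse> response =
                    pool.submit(() -> client.query(new SolrQuery("*:*")));
            try {
                System.out.println(
                        response.get(10, TimeUnit.SECONDS).getResults().getNumFound());
            } catch (final TimeoutException e) {
                // Closing releases the underlying HTTP connections and does not
                // send a commit; the waiting thread can then terminate.
                client.close();
            } finally {
                pool.shutdownNow();
                client.close();
            }
        }
    }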
To assemble the original link URL for out-/inbound links, icons and
pictures, both *_protocol_sxt and *_urlstub_sxt are needed (due to the
data-reduced storage method used). Auto-enable *_protocol_sxt in the
index schema if *_urlstub_sxt is enabled, to be able to correctly
assemble the original link URL.
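A simplified illustration of why the protocol list is needed; the plain
parallel lists below stand in for the more compact data-reduced storage
actually used:

    import java.util.Arrays;
    import java.util.List;

    public final class LinkReassemblyDemo {

        /** Rebuild a link from its stored stub; when no protocol entry is
         *  present, "http" is assumed as the default. */
        static String assemble(final List<String> protocols,
                final List<String> urlstubs, final int i) {
            final String protocol = i < protocols.size() && protocols.get(i) != null
                    ? protocols.get(i) : "http";
            return protocol + "://" + urlstubs.get(i);
        }

        public static void main(String[] args) {
            final List<String> protocols = Arrays.asList(null, "https");
            final List<String> urlstubs = Arrays.asList("example.org/a", "example.org/b");
            for (int i = 0; i < urlstubs.size(); i++) {
                System.out.println(assemble(protocols, urlstubs, i));
            }
        }
    }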
As mentioned in issue #103, control settings over YaCy disk usage
already existed but lacked a user-friendly way to set them.
I added them to the Performance_p.html administration page, with a
little refactoring of the "Resource Observer" fieldset for improved
accessibility and better compliance with HTML standards.
Also added the possibility to enable/disable the auto-regulation
function from this page.
As reported by @tglman on issue #90, when searching images on the local
index only, result pages after the first were always empty. This was a
regression introduced by commit c25e48e969.
This new "documentStructure" parameter can be set to false to only get
hosts accumulated references on a resource and thus prevent scraping the
specified URL and getting citations references.
Also set WebStructureGraph constants as final and updated the Javadoc
with example api call URLs.
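For example, assuming the default local port and the webstructure API
endpoint, a call restricted to host references might look like:
    http://localhost:8090/api/webstructure.xml?about=example.org&documentStructure=false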
As described in mantis 721 (http://mantis.tokeek.de/view.php?id=721),
WatchWebStructure_p.html failed to include in its structure view HTTPS
and other protocols, as well as ports other than the default HTTP one.
As described in mantis 720 (http://mantis.tokeek.de/view.php?id=720),
when requesting this API with a domain name instead of a complete URL,
only HTTP references on the default port were listed.
This ensures a consistent implementation of the URL host hash
generation and makes its usages easier to find in the source code.
Also added a unit test for this function.
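The added test is along these lines; hostHash() below is only a
placeholder for the real implementation, the point being that equivalent
hosts must always map to the same hash:

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class HostHashConsistencyTest {

        // Placeholder standing in for the actual host hash implementation.
        private String hostHash(final String host) {
            return Integer.toHexString(host.toLowerCase().hashCode());
        }

        @Test
        public void sameHostYieldsSameHash() {
            assertEquals(hostHash("example.org"), hostHash("EXAMPLE.ORG"));
        }
    }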
Measurement sample: import from a local blacklist file containing about
15000 entries
- before refactoring: several minutes
- after refactoring: a few seconds!
Use Solr setField instead of addField for host_id_s, to prevent:
java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
at net.yacy.kelondro.data.meta.URIMetadataNode.hosthash(URIMetadataNode.java:247)
at net.yacy.search.query.SearchEvent.addNodes(SearchEvent.java:966)
at net.yacy.peers.Protocol.solrQuery(Protocol.java:1242)
at net.yacy.peers.RemoteSearch$2.run(RemoteSearch.java:349)
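A small SolrJ illustration of the difference (the field value is a
placeholder): addField appends and turns the field multi-valued, while
setField replaces the value:

    import org.apache.solr.common.SolrInputDocument;

    public final class SetFieldVsAddFieldDemo {

        public static void main(String[] args) {
            final SolrInputDocument doc = new SolrInputDocument();
            doc.addField("host_id_s", "abc123def456"); // placeholder value
            doc.addField("host_id_s", "abc123def456"); // second add -> multi-valued
            System.out.println(doc.getFieldValues("host_id_s").size()); // 2

            doc.setField("host_id_s", "abc123def456"); // replaces instead of appending
            System.out.println(doc.getFieldValues("host_id_s").size()); // 1
        }
    }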
to use javax.servlet.http.Cookie parameters.
Deprecate the now obsolete getHeaderCookies.
Adjust the setting of MaxAge to the spec: apply it if >= 0, otherwise
keep the default.
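A sketch of that MaxAge rule with the standard javax.servlet.http API
(the helper name is illustrative):

    import javax.servlet.http.Cookie;
    import javax.servlet.http.HttpServletResponse;

    public final class CookieMaxAgeHelper {

        public static void addCookie(final HttpServletResponse response,
                final String name, final String value, final int maxAgeSeconds) {
            final Cookie cookie = new Cookie(name, value);
            if (maxAgeSeconds >= 0) {
                // Per spec: 0 deletes the cookie, a positive value sets its lifetime.
                cookie.setMaxAge(maxAgeSeconds);
            } // negative: keep the default, i.e. a session cookie
            response.addCookie(cookie);
        }
    }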
The main shutdown hook thread was not properly waiting for the main
thread to terminate, so the main thread could not properly close its
resources and threads. After terminating a running YaCy peer this way
(Ctrl+C in a console, or kill <pid> for example), you could still see
the leftover DATA/yacy.running file (a sketch of the underlying idea
follows the test notes below).
Tested with:
- Debian Jessie, OpenJDK 7 and 8: regular shutdown, Ctrl+C, kill
command, system restart while YaCy is running
- Windows 10, Oracle JDK 7 and 8: no regression on regular shutdown
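A minimal, self-contained illustration of the idea (not the actual YaCy
shutdown code): the shutdown hook joins the main thread, with a timeout,
so cleanup can complete before the JVM exits:

    public final class ShutdownJoinDemo {

        public static void main(String[] args) throws InterruptedException {
            final Thread mainThread = Thread.currentThread();
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                try {
                    // Wait (bounded) for the main thread to finish its cleanup,
                    // e.g. deleting a "running" marker file.
                    mainThread.join(30000);
                } catch (final InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
            // The application main loop and its cleanup would run here.
            Thread.sleep(1000);
        }
    }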
add "datetime" property of <time> tag to scrapers startdate list.
Datetime is parsed as iso8601 (xml) date, html5 allows partial as well
as duration (not handled by this)
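A sketch of such parsing with java.time (YaCy's own date parsing
differs); full ISO 8601 datetimes and plain dates are accepted, partial
dates and durations are not:

    import java.time.LocalDate;
    import java.time.OffsetDateTime;
    import java.time.format.DateTimeParseException;

    public final class TimeTagParser {

        public static LocalDate parseStartDate(final String datetime) {
            try {
                return OffsetDateTime.parse(datetime).toLocalDate(); // e.g. 2016-08-01T10:00:00Z
            } catch (final DateTimeParseException e) {
                return LocalDate.parse(datetime); // e.g. 2016-08-01
            }
        }

        public static void main(String[] args) {
            System.out.println(parseStartDate("2016-08-01T10:00:00Z"));
            System.out.println(parseStartDate("2016-08-01"));
        }
    }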
Applied rules (a sketch follows this list):
- when the FTP URL denotes a file resource, stack it as any other start
URL: any embedded links can then be followed, applying the usual depth
rules
- when the FTP URL denotes a directory, list the files under this
directory and stack them for crawling, then repeat the process on
subfolders until the crawl depth is reached
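A stand-alone sketch of the directory rule using Apache Commons Net
(YaCy uses its own FTP client); the host, credentials and depth limit
are placeholders:

    import org.apache.commons.net.ftp.FTPClient;
    import org.apache.commons.net.ftp.FTPFile;

    public final class FtpCrawlStackingDemo {

        /** List a directory, stack files for crawling and recurse into
         *  subdirectories until the depth limit is reached. */
        static void stackDirectory(final FTPClient ftp, final String path,
                final int depth, final int maxDepth) throws Exception {
            if (depth > maxDepth) return;
            for (final FTPFile entry : ftp.listFiles(path)) {
                final String name = entry.getName();
                if (".".equals(name) || "..".equals(name)) continue;
                final String entryPath = path + "/" + name;
                if (entry.isDirectory()) {
                    stackDirectory(ftp, entryPath, depth + 1, maxDepth);
                } else {
                    System.out.println("stack for crawl: ftp://"
                            + ftp.getRemoteAddress().getHostName() + entryPath);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            final FTPClient ftp = new FTPClient();
            ftp.connect("ftp.example.org"); // placeholder host
            ftp.login("anonymous", "anonymous@example.org");
            stackDirectory(ftp, "", 0, 2);
            ftp.logout();
            ftp.disconnect();
        }
    }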