yacy_search_server

Commit Graph

Author	SHA1	Message	Date
sgaebel	c69c462a15	replaces an expensive getLoadTimeURL() by exists() refactors urlExists to getHarvestProcess as that is what it does	4 years ago
sgaebel	26223dc25a	replaces getLoadTime() by exists() with a simpler query since solr-8.8.1 getLoadTime() causes a high cpu usage	4 years ago
Michael Peter Christen	8b4394a6c5	fixes for solr 8.8.1 migration - replace new guava 30 with older 25 because that is the correct dependency for solr 8.8.1. The newer one did actually not work! - index will be created in a DATA/INDEX/freeworld/SEGMENTS/solr_8_8_1 subfolder. The older solr_6_6 index is not touched but also not migrated. The index starts with fresh (empty) content. - Older indexes must be migrated by hand (export/import) so far until a better solution is found. - Large schema adaptations for lucene 8.8.1	4 years ago
Al Sutton	69014a701e	Update API Usage	4 years ago
Michael Peter Christen	13a2e6dc6e	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	4 years ago
Michael Peter Christen	0ae8ccf657	Make it possible to set an empty password disabling the authentication protocol completely If you now set an empty password, then the http server will not ask to authenticate. This is required for environments where we attach an outside authentication service like keycloak or similar using authentication in an ingress proxy. This change is part of the approach to run YaCy inside of a kubernetes cluster where we do not want individual authentication of peers and want to apply an ingress authentication.	4 years ago
Michael Peter Christen	96592a10cf	added option to set yacy configuration values using environment variables To use that feature, set an environment variable with prefix "yacy." and suffix identical to the yacy configuration attribute name. Additionally we implemented a way to set a peer name using the setting "network.unit.agent". This can therefore now be used to set a peer name with the java call parameter -Dyacy.network.unit.agent=anonymous The purpose for this feature is the ability to set peer names in mass-deployed kubernetes clusters to the same name to avoid flooding peer name statistics with auto-deployment-generated names.	4 years ago
Michael Peter Christen	198826c362	added network scanner process to discover all YaCy peers in the intranet this will be used to wire YaCy peers in a kubernetes cluster	4 years ago
Michael Peter Christen	907f121d0c	do not overwrite PW with random PW	4 years ago
Michael Peter Christen	3e6a1e0a49	fixed surrogate process counter	4 years ago
Michael Peter Christen	baad56d83d	beautified default peer names	4 years ago
Michael Peter Christen	43a9f4f574	updated solr 6.6.6 -> 7.7.3 dropped GSA support (GSA API is still in YaCy Grid) The 6.6.6 solr index works without migration also with 7.7.3	4 years ago
Michael Peter Christen	c0d9a3e9a7	turned HostBrowser into a admin-only page, now called IndexBrowser This was required because spiders and bots crawled through this page and created load on the peer without use for the user or the YaCy network.	4 years ago
Michael Peter Christen	6271e9122c	javadoc fix	4 years ago
Michael Peter Christen	52228cb6be	added a gc to cleanup process (once every 10 minutes)	4 years ago
Michael Peter Christen	22841ffbf1	creating a threaddump during every cleanup process to be able to find out what a peer did (not) last time before a crash	4 years ago
sgaebel	3431f91db9	removes unused 'unused' tokens	5 years ago
sgaebel	fc03c4b4fe	removes some warning and unused objects	5 years ago
sgaebel	4a495df63a	removes some deprecation-warnings	5 years ago
sgaebel	dd9d4b1188	replace org.junit.Assert.assertThat by org.hamcrest.MatcherAssert.assertThat from hamcrest 2.2 to avoid deprecation-warning	5 years ago
Michael Peter Christen	e0ad8ca9da	replaced json library from JSON.org with libandroid-json-java This fixes https://github.com/yacy/yacy_search_server/issues/347	5 years ago
Michael Christen	cfa27d2fd5	fixed links	5 years ago
luccioman	6b45cd5799	New optional crawl filter on the URL a doc must match to crawl its links For finer control over which parsed documents can trigger an addition of their links to the crawl stack, complementary to the existing crawl depth parameter.	6 years ago
luccioman	a5771b1f14	Made SNI extension user configurable without the need for server restart TLS Server Name Indication (SNI) extension activation can now be configured with the new Settings_p.html?page=httpClient administration page. SNI extension is also now enabled by default, as in 2019 the unrecognized_name(112) alert is more properly handled by major web servers TLS implementations, following the RFC 6066 standard. Related YaCy issues : #153 #189 and #272 JDK 1.7 bug : https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7127374 Apache httpd issue : https://bz.apache.org/bugzilla/show_bug.cgi?id=56241 RFC 6066 : https://tools.ietf.org/html/rfc6066#section-3	6 years ago
luccioman	e90405b6f0	Support parsing audio URLs without file extension Added also a Junit for the audio tag parser	6 years ago
luccioman	a8316c79da	Allow JS resorting of search results by unauthenticated users Access rate limitations to this search mode by unauthenticated users are set low by default to prevent unwanted server overload but can be customized through the SearchAccessRate_p.html configuration page Fixes #291	6 years ago
luccioman	0ab2b49c31	Made /yacysearch access rate limitations user configurable With a new admin page at /SearchAccessRate_p.html in menu Network Access > Local Search > Access Rate Limitations	6 years ago
luccioman	9782a98a9c	Added the possibility to customize facets sort type and direction Previously search navigators/facets elements were sorted only by counts. Now from the ConfigSearchPage_p.html admin page, sort direction (ascending/descending) and type (on counts or labels) can be customized independently for each navigator.	6 years ago
sgaebel	c2398fd890	remove warnings: 'Statement unnecessarily nested within else clause'	6 years ago
luccioman	08ea0b0397	Added a configurable timeout to wkhtmltopdf calls for pdf snapshots Necessary to prevent blocking the indexing workflow when some wkhtmltopdf renderings fail without terminating	6 years ago
luccioman	e85f231bdf	Fixed termination of Host browser and link structure Solr query threads On some conditions (especially when reaching timeout), concurrent Solr query tasks used by the /HostBrowser.html and /api/linkstructure.json never terminated, thus leaking resources, as reported by @Vort in issue #246	6 years ago
luccioman	fcf6b16db4	Added new crawler attribute for finer control over Media Type detection New "Media Type detection" section in the advanced crawl start page allows choosing between : - not loading URLs with unknown or unsupported file extension without checking the actual Media Type (relying on the Content-Type header for now). This was the old default behavior, faster, but not really accurate. - always cross check URL file extension against the actual Media Type. This lets properly parse URLs ending with an apparently odd file extension, but which have actually a supported Media Type such as text/html. Sample URLs with misleading file extensions added as documentation in the crawl start page. fixes issue #244	6 years ago
luccioman	54fbe166ba	Updated pdf cache clear steps consistently with current pdfbox version - Removed calls to no more existing clearResources functions (on PDFont class and its children) since upgrade to pdfbox 2.n.n - Removed hacky usage of protected internal ClassLoader function. This removes the warnings displayed when running with JDK9 or JDK10 : [java] WARNING: Illegal reflective access by net.yacy.document.parser.pdfParser$ResourceCleaner (file:<path>) to method java.lang.ClassLoader.findLoadedClass(java.lang.String) [java] WARNING: Please consider reporting this to the maintainers of net.yacy.document.parser.pdfParser$ResourceCleaner [java] WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations [java] WARNING: All illegal access operations will be denied in a future release Crawling thousands of pdf documents from various sources after modifications applied, revealed no new memory leak related to pdfbox (measurements done with JVisualVM).	7 years ago
luccioman	bdafb14336	Removed redundant synchronization lock on network switch function Was useless as done in an already synchronized block, and the lock object was assigned a new value in that same block, and nowhere else a lock is requested on that same object.	7 years ago
luccioman	dcad393fe5	Fixed exceeding max size of failreason_s Solr field on large link list When using the 'From Link-List of URL' as a crawl start, with lists in the order of one or more thousands of links, the failreason_s Solr field maximum size (32kb) was exceeded by the string representation of the URL must-match filter when a crawl URL was rejected because not matching.	7 years ago
luccioman	f467601561	Properly lock solrInstances for reboot and restoration of embedded Solr Putting a synchronization lock directly on the solrInstances property was ineffective as it is assigned a new (unlocked) instance in these operations.	7 years ago
luccioman	2bdd71de60	Added server side columns sorting on the Process Scheduler table For easier usage of large tables in the Table_API_p.html page.	7 years ago
luccioman	e97580dfc7	Fixed unsafe concurrent access to generic SimpleDateFormat instances SimpleDateFormat must not be used by concurrent threads without synchronization for parsing or formatting dates as it is not thread-safe (internally holds a calendar instance that is not synchronized). Prefer now DateTimeFormatter when possible as it is thread-safe without concurrent access performance bottleneck (does not internally use synchronization locks).	7 years ago
luccioman	cced94298a	Added a new crawler document filter type using Solr syntax This makes it possible to set up much more advanced document crawl filters, by filtering on one or more document indexed fields before inserting in the index.	7 years ago
luccioman	40e8c7b89b	Use the heavy ConcurrentUpdateSolrClient only when necessary Prefer the lightweight HttpSolrClient when no updates are performed on the remote Solr instance, as recommended by Solr documentation itself.	7 years ago
luccioman	b5dc1f376f	Made outgoing pools max total connections user configurable For a finer control over the maximum simultaneously active outgoing connections.	7 years ago
luccioman	387d646c0e	Added gzip compression of responses returned to user-agents accepting it Enabled as default, but can be disabled using the "Server Access Settings" admin page.	7 years ago
luccioman	ee6670fb8f	Use a common pooled http connection manager for remote solr instances For a better control on the maximum simultaneous outgoing http connections, as already done for any other http connections (crawls, rwi search, p2p protocol) using the net.yacy.cora.protocol.http.HTTPClient	7 years ago
luccioman	35826a3091	Added a search page customization setting to display or not favicons If not interested in displaying this on your search results and notably on a peer with limited resources this can help saving some CPU and outgoing network connections.	7 years ago
luccioman	fa4399d5d2	Small perf improvement : initialize thread names early when possible Initializing Thread names using the Thread constructor parameter is faster as it already sets a thread name even if no customized one is given, while an additional call to the Thread.setName() function internally does synchronized access, eventually runs access check on the security manager and performs a native call. Profiling a running YaCy server revealed that the total processing time spent on Thread.setName() for a typical p2p search was in the range of seconds.	7 years ago
luccioman	f511e16d50	Prevent duplication of Solr query highlight fields parameters That was caused by concurrent modifications (with addHighlightField() function) to the same SolrQuery instance when requesting Solr on remote peers in p2p search.	7 years ago
luccioman	e357ade47d	Reduced memory footprint of text snippet extraction By not parsing and storing at first all sentences of a document, but only on the fly the ones necessary to compute the snippet.	7 years ago
luccioman	e115e57cc7	Reduced text snippet extraction processing time. By not generating MD5 hashes on all words of indexed texts, processing time is reduced by 30 to 50% on indexed documents with more than 1Mbytes of plain text.	7 years ago
sgaebel	4b79851e12	corrected icons_sizes_sxt to SolrType.string	7 years ago
luccioman	3b89c232db	Easier tracking of longest text snippets initializations When text snippets statistics are enabled and FINE log level is enabled on the TextSnippetStatistics class.	7 years ago

1447 Commits (f16cd154f7d39e42e2669e6a46cbbd4b589a561f)