yacy_search_server

Commit Graph

Author	SHA1	Message	Date
luccioman	f467601561	Properly lock solrInstances for reboot and restoration of embedded Solr. Putting a synchronization lock directly on the solrInstances property was ineffective as it is assigned a new (unlocked) instance in these operations.	7 years ago
luccioman	e97580dfc7	Fixed unsafe concurrent access to generic SimpleDateFormat instances. SimpleDateFormat must not be used by concurrent threads without synchronization for parsing or formatting dates as it is not thread-safe (internally holds a calendar instance that is not synchronized). Prefer now DateTimeFormatter when possible as it is thread-safe without concurrent access performance bottleneck (does not internally use synchronization locks).	7 years ago
luccioman	cced94298a	Added a new crawler document filter type using Solr syntax. This makes it possible to set up much more advanced document crawl filters, by filtering on one or more document indexed fields before inserting in the index.	7 years ago
luccioman	0d34034f17	Ensure an embedded Solr is available for Solr dump/restore operations. Otherwise, these operations triggered NullPointerException when only an external Solr index is attached.	7 years ago
luccioman	d92b191942	Ensure no remote Solr is attached before "Shut Down and Re-Start Solr". Otherwise once this operation is applied, the remote Solr instance(s) are disconnected and the embedded Solr is connected even if disabled by setting "core.service.fulltext". Also use constants for related default setting values.	7 years ago
luccioman	fb3032c530	Added a crawl filtering possibility on documents Media Type (MIME)	7 years ago
luccioman	929e0d6eae	Replaced improper ByteBuffer.equals() implementation by Arrays.equals(). Renamed also ByteBuffer.equals() to startsWith() as this is the appropriate function implementation semantics.	7 years ago
Michael Peter Christen	25573bd5ab	added a crawl filter based on <div> tag class names. When a crawl is started, a new field to exclude content from scraping is available. The field can be identified with the class name of div tags. All text contained in such a div tag where the configured class name(s) match is not indexed, while the remaining page is indexed.	7 years ago
luccioman	0ee8c030c4	Log an error when Solr folder migration fails for some reason.	8 years ago
luccioman	8da3174867	Ensure lower case conversion consistency with any default locale. Especially for Turkish speaking users using "tr" as their system default locale: strings for technical stuff (URLs, tag names, constants...) must not be lower cased with the default locale, as 'I' doesn't become 'i' like in other locales such as "en", but becomes 'ı'.	8 years ago
Michael Peter Christen	6fe735945d	migrated Solr 5.5 -> Solr 6.6 and from Java 1.7 -> 1.8 Also: now Version 1.921	8 years ago
luccioman	8399275142	Properly close file output streams even on exceptions scenarios.	8 years ago
luccioman	02ec0ed13c	Quoted param value in Solr query to avoid unwanted traces in logs. When Webgraph Solr core is enabled, crawling and removing from index an URL whose hash starts with the '-' character (example URL : https://cs.wikipedia.org/ whose hash is "-2-HuTEndn4x") produced a full ParseException stack trace in YaCy logs. This was not blocking because the Solr query parser is able to escape itself the query and run it successfully, but filled uselessly YaCy logs.	8 years ago
sgaebel	ff6392215e	added closing of lst-Tag in solr-Export	8 years ago
reger	9ad4d16829	Add a responseHeader to the solr index export with a format identifier and export parameter (in accordance with response xml format) for easier format detection on import.	8 years ago
luccioman	9697209ef6	Fixed Index Export feature for compatibility with old indexed documents. This is a fix for mantis 682 (http://mantis.tokeek.de/view.php?id=682) and issue #116	8 years ago
luccioman	f66438442e	Extended Mediawiki dump import to remote URLs. When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote file now is directly streamed and processed, allowing import of several GB dumps even with a low memory remote peer, and without need to manually download the dump file first.	8 years ago
Michael Peter Christen	69081bce00	added export to elasticsearch. The export dump can easily be imported to elasticsearch using the command curl -XPOST localhost:9200/collection1/yacy/_bulk --data-binary @yacy_dump_XXX.flatjson	8 years ago
reger	86534a56f7	fixed ReindexSolrBusyThread new and unexpected repeat of same query with low number of found documents - by adding additional end condition to remove processed query with number of found docs <= process-chunk-size. Noticed on query h4_txt:[* TO *], found 21, process 21, call of commit happened but on next cycle same query again 21 docs found (while h4_txt was removed from schema and committed inputdocuments).	8 years ago
luccioman	e5858bc8c8	Fixed a NullPointerException case possible on Index Export. As reported by Palulukas in YaCy forum (http://forum.yacy-websuche.de/viewtopic.php?f=18&t=5944&sid=dcef5b899ab4aa9b40e3a3d158c13aed#p33454) the Index Export operation can fail, notably when the Solr index contains one or more documents with empty (despite required) "load_date_dt" field. This fixes the export failure when the situation finally occurs, but more should be done to harden verifications on minimum required fields.	8 years ago
luccioman	1857651988	Added a new Debug/Analysis advanced settings subsection. As discussed in PR #93 with @JeremyRand and @reger24 this new advanced settings page includes: - a new setting to control remote Solr responses encoding - some existing debug settings which could not be set through the admin user interface	8 years ago
reger	95d2a28599	adjust the Field-Reindex Thread to verify and update the document id in case hash (ID) doesn't match document url (sku field).	8 years ago
luccioman	6a4d51d8f9	Cleaned up some Javadoc warnings.	8 years ago
reger	a1e5f7dbca	fix of fulltext.remove() by id of webgraph document webgraph has document hash in source_id_s	8 years ago
reger	8fe28a83f2	harmonize used lastmodified date for rwi and fulltext in storeDocument	8 years ago
luccioman	f0639d810c	Customized name for Threads still using the default "Thread-n" pattern. This makes threads monitoring easier to read.	8 years ago
luccioman	7263d17436	Removed mentions of deprecated LURL-db. Thanks to LA_FORGE asking about it on YaCy forum ( http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5895 )	8 years ago
reger	685d8e86bf	Avoid frequent data type casting (float/long) for rwi score. Refactor to using long in URIMetadataNode too (and related call parameters). As remote rwi scores are not used (since v1.83), skip reading float-score, but keep in toString() for communication with older versions.	8 years ago
luccioman	3ee4f56c39	Improved ErrorCache behavior when switching networks. Even after network switch, ErrorCache was still holding a reference to the previous Solr cores, thus becoming useless until next YaCy restart. Initial error cache filling with recent errors from the index was also missing after the switch.	8 years ago
luccioman	8edbcd8ad4	Log eventual Solr instances close errors. We do not want to block on this kind of error, but this should not silently fail as it may have later consequences.	8 years ago
reger	330768c8a2	fix for solr write.lock after mode change http://mantis.tokeek.de/view.php?id=686 The embedded core holds a lock on the index and must be closed. Earlier commit comment states that core should be closed with solr instance instead of on close of connector. Adjusted the InstanceMirror.close() to take care of closing the embedded instance to release the lock. In 2 routines of fulltext this was already explicitly implemented (disconnectLocalSolr). Now this disconnect is part of the InstanceMirror.close().	8 years ago
reger	7f63fc50f3	prepare a IndexSegment test case for RWI index testing + prevent NPE in Segment.clear() on missing embedded solr instance.	9 years ago
reger	4c7a77662a	eliminate dependency on file-extension in storeDocument but use supported mime-type to also support handling of urls w/o corresponding file-extension. For this refactor use of document.getParserObject() to always return a Parser (for clean logic) and define/move the scraperObject as local var of AbstractParser. Adjust related calls to getParserObject (where actually a scraperObject is wanted). Additionally skip appending url token to parsed text for dht metadata entries (by default returned as result by rwi index).	9 years ago
reger	35a7d57260	update lucenematchversion to current (5.2.0 -> 5.5.0) there should be no need for reindex by the update	9 years ago
Michael Peter Christen	b89465d952	0N - basic dump upload servlet infrastructure, to share index dumps within an experimental new sharing model	9 years ago
Michael Peter Christen	a6bf0b1649	0N - added option to generate index export files for a specific number of minutes in the past and reverted latest change. The export file dump will now contain four data elements: f - first date of index entry write date, l - last date of index write date, n - now-date of index dump time, c - count of numbers inside the dump. '0N' denotes a series of changes which will lead to the opportunity to exchange index data dumps in a way that is needed to integrate ZeroNet index data. This will be based on index dump sharing; that causes this commit.	9 years ago
reger	6f0b073bf3	override detected language (statistic langdetect) only with TLD determined language if langdetect probability is not high. + additionally truncate zh-cn / zh-tw returned by langdetect to 2 char ISO639-1 zh used by YaCy	9 years ago
reger	535d4bf75f	respect hidden attribute for file and smb directory listing (hidden directories are not listed, affects crawling of local file system)	9 years ago
reger	a58d34a4e8	check error URL cache before adding errorDoc to index - del obsolete related switchboardconstant	9 years ago
Michael Peter Christen	ef8cd80593	fix for npe	9 years ago
reger	ca3d26a401	harmonize wordsintitle & CollectionSchema.title_words_val calculation, remove obsolete partial init of wordreference from urimetadata	9 years ago
reger	11f3666660	increase use of predefined CATCHALL_QUERY string	9 years ago
reger	802ccaead6	fix init of error cache, use latest faildates => load_date_dt	10 years ago
Michael Peter Christen	de8cfbe1d7	added export option to export the fulltext of the search index text only	10 years ago
Michael Peter Christen	694b22f165	migration to Solr 5.2: huge benefits - this is a lot faster! This is a very complex migration: many classes had been renamed or removed, dependencies changed and the solr index type is now aligned to be a solr cloud repository. Together with the Solr 5.2 library update, one other dependent library had been updated as well: httpclient 4.4->4.4.1 Older indexes are migrated from 4_10 to 5_2. However, the new index structure is more efficient and we recommend to re-index everything. Please use the index export before you do the update to a large surrogate xml file. After the update, start with an empty index and then initialize this with your dump.	10 years ago
Michael Peter Christen	34de1e8cbc	gzip compression will perform more efficiently and with better compression level	10 years ago
Michael Peter Christen	98be59ce9c	full solr xml exports will now be automatically compressed during export. That makes it possible to export a solr xml dump even if disc space is low.	10 years ago
Michael Peter Christen	b43811d38c	added surrogate import process for exported solr dumps. Just throw your solr dump file into DATA/SURROGATES/in/ and it will be imported!	10 years ago
Michael Peter Christen	c7576d6028	added a full solr export to the IndexControlURLs_p.html servlet. The export function is also now the default export option. The export file format for a full solr export is very similar to a solr search result xml, only the <lst name="responseHeader"> tag is missing. The exported xml has a special line termination feature: all documents will be exported into a single line without any CR in between. That means that every document is completely inside a single line. While this is not readable at all for humans, it is very useful for linux line processing scripts, like grep. Using grep it will be easy to select single documents which match for a given pattern. Such dumps shall be importable with the DATA/SURROGATES/in import function, but that import is not yet adapted to the new file format.	10 years ago
reger	d882991bc5	Implement sharing of ioDispatcher for term & citation index as proposed in ioDispatcher description	10 years ago


439 Commits (0eb52f8c72a1f040412378f69f7537f5c3f16f3c)