yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Christen	4304e07e6f	crawl profile adoption to new tag valency attribute	2 years ago
reger24	a7e93d9328	Add option to add host to default blacklist from search result - added authorized icon/button to blacklist a host - host is added to default blacklist - inspired by https://github.com/yacy/yacy_search_server/issues/213#issuecomment-412485190	3 years ago
reger24	027e284ef9	Enhance notability of current blacklist by diff color in header in servlet Blacklist_p.html bugfix for `18dddb74c9`	3 years ago
reger24	18dddb74c9	Harmonize loading/reading blacklist between init and servlet to use the same procedures -added BlacklistHelper.blacklistToSortedArray to simplify use in servlet	3 years ago
reger24	f28d705cd0	update IndexBroser_p add to blacklist button add feedback to user on success	3 years ago
reger24	6a5f0b3684	Servlet IndexBroser_p add button "Add to blacklist" allows adding the displayed host to the default blacklist	3 years ago
Daleth Darko	3ced06c731	Various javadoc fixes	3 years ago
Michael Peter Christen	bd3f2483a1	replaced url and date retrieval by only url retrieval This should prevent that the search index is used for freshness of the index entry.	3 years ago
Michael Peter Christen	9be36800a4	increased redirect depth by one this makes sense if one redirect replaces http with https and another replaces www subdomain by without (and vice versa)	4 years ago
luccioman	61c337f29a	Decode blacklist entries for easier edition of non ascii chars Not using the JDK URLDecoder.decode() function, as it strips '+' characters when they occur after '?' (both characters having regular expression semantics when used in blacklist path patterns)	6 years ago
luccioman	ed93221fa1	Improved normalization of blacklist path patterns having non ascii chars Normalize blacklist path patterns using percent-encoding, at pattern edition in web interface and at loading from configuration files. Fixes issue #237	6 years ago
luccioman	dbf4c1cd76	Improved blacklist entries editing operations : - Fixes issue #160 : handle properly syntax exceptions with a user friendly message - Fixes loss of information on multiple blacklist entries editions - Fixes loss of entries when moving entries from one list to another	7 years ago
luccioman	7baa99f26f	Fixed stored URL in web cache when redirection(s) occurs. Associate cached content to the last redirection location, instead of the first URL of a redirection(s) chain : - for proper base URL processing in parsers (fixes mantis 636 - http://mantis.tokeek.de/view.php?id=636) - to prevent duplicated content in Solr index when recrawling a redirected URL	7 years ago
luccioman	5db1c9155a	Do locale independent case conversion on hosts, schemes, and file exts. Required for proper operation when the default system locale is Turkish, as dotless and dotted i characters have specific case conversion rules in this language.	7 years ago
Michael Peter Christen	25573bd5ab	added a crawl filter based on <div> tag class names When a crawl is started, a new field to exclude content from scraping is available. The field can be identified with the class name of div tags. All text contained in such a div tag where the configured class name(s) match are not indexed, while the remaining page is indexed.	7 years ago
luccioman	d8eaf621cc	Fixed blacklist returned location URL on empty parameters	7 years ago
luccioman	1e84956721	Support loading local files with a per request specified maximum size. Consistently with the HTTP loader implementation.	7 years ago
luccioman	bf55f1d6e5	Started support of partial parsing on large streamed resources. Thus enable getpageinfo_p API to return something in a reasonable amount of time on resources over MegaBytes size range. Support added first with the generic XML parser, for other formats regular crawler limits apply as usual.	7 years ago
luccioman	0b75e92ac2	Do not wrap unnecessarily loader IOExceptions in IOExceptions	7 years ago
luccioman	433bdb7c0d	Respect maxFileSize limit also when streaming HTTP and when relevant. Constraint applied consistently with HTTP content full load in byte array.	7 years ago
luccioman	8399275142	Properly close file output streams even on exceptions scenarios.	8 years ago
luccioman	a04feac064	Ensure file input streams proper closing in both success and failures Also add when possible a warning level log message on input stream closing error instead of failing silently. This could help understanding some IO exceptions such as "too many files open".	8 years ago
luccioman	a9cb083fa1	Improved consistency between loader openInputStream and load functions	8 years ago
luccioman	522a268305	Improved new blacklist entries URL scheme detection.	8 years ago
luccioman	58d23047dd	Handle '?' and '+' chars as valid wild cards when adding to blacklist. An entry such as "domain.com/[a-z]+" is a valid regular expression and does not need additional "../.*" wildcards.	8 years ago
luccioman	f66438442e	Extended Mediawiki dump import to remote URLs. When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote file now is directly streamed and processed, allowing import of several GB dumps even with a low memory remote peer, and without need to manually download the dump file first.	8 years ago
luccioman	54405577aa	Replaced absolute redirection locations by relative ones when possible. This makes integration of YaCy behind a reverse proxy subfolder easier.	8 years ago
luccioman	339f005ced	Blacklist import and update performance improvements. Measurement sample : import from blacklist local file containing about 15000 entries - before refactoring : several minutes - after refactoring : a few seconds!	8 years ago
reger	395f2e8946	Make ServletRequest implement the standardized HttpServletRequest interface, to make all readily available information from the original ServletRequest available to YaCy servlets (without converting data to internal structures). The implementation of the common interface allows easier integration of YaCy servlets with the servlet standard (e.g. shared login service with the servlet container etc.)	8 years ago
luccioman	f0639d810c	Customized name for Threads still using the default "Thread-n" pattern. This makes threads monitoring easier to read.	8 years ago
luccioman	da362628fb	Added fine log level for too long blacklist matching processing.	8 years ago
luccioman	4b699c469a	Blacklist refactoring : extracted a function for easier unit testing	8 years ago
luccioman	242707f9b4	Fixed loadFromCache with strategy IFFRESH. This fixes mantis 695 ( http://mantis.tokeek.de/view.php?id=695 ) : crawl start with 'Link-List of URL' option on websites using cookies.	8 years ago
Michael Peter Christen	5e165a8150	removed unused imports	8 years ago
reger	5e335b32da	fix Blacklist.contains() matching path pattern to string similar to `5e9e871192` + add proof testcase	8 years ago
reger	5e9e871192	fix Blacklist.remove by using pattern.toString to find pattern to remove, parameter String path did never equal Pattern. + delete unused removeAll, as it does not persist changes after restart	8 years ago
reger	1843ea7e69	on Blacklist.add pattern to source file also update internal entry maps as in Blacklist.add(blacklistType) to make entry effective w/o restart fix for http://mantis.tokeek.de/view.php?id=676	8 years ago
reger	efb9f1a8b7	save resource for unused blacklistFiles map	9 years ago
reger	06d0e2aeb9	result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method. - Above brought up that parser start url parameter, declared as AnchorURL uses only methods of parent object DigestURL (changed parameter declaration accordingly).	9 years ago
reger	b7e8358645	make use of header.getContentType where possible (mime is normalized afterwards) otherwise use header.mime() differentiated in prev. commit.	9 years ago
luc	f01d49c37a	Process large or local file images dealing directly with content InputStream.	9 years ago
Michael Peter Christen	fed26f33a8	enhanced timezone management for indexed data: to support the new time parser and search functions in YaCy a high precision detection of date and time on the day is necessary. That requires that the time zone of the document content and the time zone of the user, doing a search, is detected. The time zone of the search request is done automatically using the browser's time zone offset which is delivered to the search request automatically and invisible to the user. The time zone for the content of web pages cannot be detected automatically and must be an attribute of crawl starts. The advanced crawl start now provides an input field to set the time zone in minutes as an offset number. All parsers must get a time zone offset passed, so this required the change of the parser java api. A lot of other changes had been made which corrects the wrong handling of dates in YaCy which was to add a correction based on the time zone of the server. Now no correction is added and all dates in YaCy are UTC/GMT time zone, a normalized time zone for all peers.	10 years ago
Michael Peter Christen	b5ac29c9a5	added a html field scraper which reads text from html entities of a given css class and extends a given vocabulary with a term consisting with the text content of the html class tag. Additionally, the term is included into the semantic facet of the document. This allows the creation of faceted search to documents without the pre-creation of vocabularies; instead, the vocabulary is created on-the-fly, possibly for use in other crawls. If any of the term scraping for a specific vocabulary is successful on a document, this vocabulary is excluded for auto-annotation on the page. To use this feature, do the following: - create a vocabulary on /Vocabulary_p.html (if not existent) - in /CrawlStartExpert.html you will now see the vocabularies as column in a table. The second column provides text fields where you can name the class of html entities where the literal of the corresponding vocabulary shall be scraped out - when doing a search, you will see the content of the scraped fields in a navigation facet for the given vocabulary	10 years ago
Michael Peter Christen	69eacdf4eb	applying precompiled CommonPattern.COMMA.split to all places where split(",") was used	10 years ago
Michael Peter Christen	8df8ffbb6d	enhanced the snapshot functionality: - snapshots can now also be xml files which are extracted from the solr index and stored as individual xml files in the snapshot directory along the pdf and jpg images - a transaction layer was placed above of the snapshot directory to distinguish snapshots into 'inventory' and 'archive'. This may be used to do transactions of index fragments using archived solr search results between peers. This is currently unfinished, we need a protocol to move snapshots from inventory to archive - the SNAPSHOT directory was renamed to snapshot and contains now two snapshot subdirectories: inventory and archive - snapshots may now be generated by everyone, not only such peers running on a server with xkhtml2pdf installed. The expert crawl starts provides the option for snapshots to everyone. PDF snapshots are now optional and the option is only shown if xkhtml2pdf is installed. - the snapshot api now provides the request for historised xml files, i.e. call: http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ The result of such xml files is identical with solr search results with only one hit. The pdf generation has been moved from the http loading process to the solr document storage process. This may slow down the process a lot and a different version of the process may be needed.	10 years ago
reger	ff18129def	ViewFile servlet: update index if newer, so viewed text and metadata (stored) info is similar - to archive it, use request with profile to allow indexing (defaultglobaltext) and update index (the resource is loaded, parsed anyway, so it's not an expensive operation) Request: remove 2 unused init parameters - number of anchors of the parent - forkfactor sum of anchors of all ancestors	10 years ago
Michael Peter Christen	e586e423aa	in case that loading from the cache fails, load from wkhtmltopdf without cache using the user agent string given in the crawl profile	10 years ago
Michael Peter Christen	d5bac64421	recognize more html file types for snapshots	10 years ago
Michael Peter Christen	25a64c51b3	moved snapshot generation out of the html handler to prevent that existing cache entries cause that the handler is not executed	10 years ago
reger	48aed15c48	skip loader wait cycle on concurrent access in nocache configuration. In nocache config resource is loaded online, leaving no benefit to wait for a faster cache hit.	10 years ago

189 Commits (0fb77994aadacd11e48b6e5a9375c3583a287016)