yacy_search_server

Commit Graph

Author	SHA1	Message	Date
reger	b6a41df4f7	Remove deprecated YaCyProxyServlet was replaced by UrlProxyServlet	7 years ago
luccioman	8a94fef9e0	Prevent unwanted cached bytes duplication on stream parsing.	7 years ago
reger	4979439e87	Skip public post of jre version. Added to determine switch to java8 `596b5dfa59`	7 years ago
reger	e918ec199e	Replace deprecated ConcurrentHashSet with recommended Java8 ConcurrentHashMap.newKeySet() in postprocessDocuments()	7 years ago
reger	fb71994342	Harmonizing use of xml reader / sax parser in XMLBlacklistImporter eliminating the need for lib/xercesImpl.jar	7 years ago
reger	275d65fffe	Patch last_modified date with internal FirstSeenTime() if no date provided to make sure updated documents are indexed with their last-modified date as provided in current crawl. (Always patching moddate with firstseen might bear the risk of missing actual updates.)	7 years ago
reger	d1b23afed6	Remove obsolete Protocol parameter ttl (time to live) not interpreted in target yacy/query.html also Protocol.querySeed() not used and parameter not interpreted in target servlet yacy/query.html	7 years ago
reger	15d78b1064	Replace deprecated getIP with getIPs in Protocol transferURL() and getProfile(). Remember used ip for error handling and departInterface	7 years ago
reger	ed36b47bec	Replace one more deprecated peerDeparture in Protocol.transferIndex() by moving/using interfaceDeparture() in transferRWI()	7 years ago
luccioman	0ee8c030c4	Log an error when Solr folder migration fails for some reason.	7 years ago
luccioman	5a646540cc	Support parsing gzip files from servers with redundant headers. Some web servers provide both 'Content-Encoding : "gzip"' and 'Content-Type : "application/x-gzip"' HTTP headers on their ".gz" files. This was annoying to fail on such resources which are not so uncommon, while non conforming (see RFC 7231 section 3.1.2.2 for "Content-Encoding" header specification https://tools.ietf.org/html/rfc7231#section-3.1.2.2)	7 years ago
luccioman	11a7f923d4	Distinguish response parsing failures from unexpected exceptions.	7 years ago
luccioman	eda7b0aeb6	Merge branch 'master' of https://github.com/yacy/yacy_search_server	7 years ago
reger	3005be7349	Clean up unmaintained and unused AugmentParser trail.	7 years ago
luccioman	cb4f1358e1	Added gzip parser support for max content bytes limit	7 years ago
luccioman	5216c681a9	Added HTML parser support for maximum content bytes parsing limit	7 years ago
luccioman	4aafebc014	Merge pull request #122 from Scarfmonster/patch-1 I also reproduced the issue, and the fix is working fine. Thanks @Scarfmonster	7 years ago
luccioman	651fad6da5	Added RSS parser support for maximum content bytes parsing limit	7 years ago
luccioman	452a17a8d5	Finer control on bounded input streams with custom stream implementation	7 years ago
luccioman	f8f1959ebb	Added parsing within bounds implementation to the generic parser.	7 years ago
luccioman	e0f400a0bd	Support trying multiple parsers even when streaming on large resources.	7 years ago
luccioman	1e84956721	Support loading local files with a per request specified maximum size. Consistently with the HTTP loader implementation.	7 years ago
luccioman	f369679d1c	Fixed read/copy on input streams reading sometimes less than expected.	7 years ago
luccioman	bf55f1d6e5	Started support of partial parsing on large streamed resources. Thus enable getpageinfo_p API to return something in a reasonable amount of time on resources over MegaBytes size range. Support added first with the generic XML parser, for other formats regular crawler limits apply as usual.	7 years ago
luccioman	90a7c1affa	HTML parser : removed unnecessary remaining recursive processing Recursive processing was removed in commit `67beef657f`, but one remained for anchors content(likely omitted from refactoring). It is no more necessary : other links such as images embedded in anchors are currently correctly detected by the parser. More annoying : that remaining recursive processing could lead to almost endless processing when encountering some (invalid) HTML structures involving nested anchors, as detected and reported by lucipher on YaCy forum ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6005 ).	7 years ago
reger	e6e20dab52	upd to Jetty 9.4.6.v20170531 Modify loginservice to the changes in Jetty, partially based on pull request #101 https://github.com/yacy/yacy_search_server/pull/101 by @automenta	7 years ago
luccioman	dcc56318bb	Made remote search max system load limits configurable from UI. As reported by davide on YaCy forums ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6004 ) when the system is on high load, unless reading carefully YaCy configuration file, it could be difficult to understand why remote search results are not fetched.	7 years ago
reger	ddd13b776d	Add keyword constraint to rwi query result filter To discard rwi results not matching query keyword: parameter	7 years ago
luccioman	e82eaee4b6	Apply consistent behavior on HTTP resource size exceeding limit. On content size known from HTTP headers, terminates connection faster and improves error reports quality by reporting relevant message "Content to download exceed maximum value..." rather than previously "no response (NULL) for url...".	7 years ago
luccioman	0b75e92ac2	Do not wrap unnecessarily loader IOExceptions in IOExceptions	7 years ago
luccioman	433bdb7c0d	Respect maxFileSize limit also when streaming HTTP and when relevant. Constraint applied consistently with HTTP content full load in byte array.	7 years ago
luccioman	9b1bb2545e	Refactored plain-text URLs detection implementation. For faster processing (measured about 2 times faster on many real-world examples) and more advanced detection (previous algorithm detected only URLs separated from the rest of the text by a space character).	7 years ago
luccioman	8da3174867	Ensure lower case conversion consistency with any default locale. Especially for Turkish speaking users using "tr" as their system default locale : strings for technical stuff (URLs, tag names, constants...) must not be lower cased with the default locale, as 'I' doesn't become 'i' like in other locales such as "en", but becomes 'ı'.	7 years ago
luccioman	286f3018bd	Made mime type and extension normalization locale independent. Previously, upper cased mime type was incorrectly normalized when the default locale is Turkish.	7 years ago
luccioman	319231a458	Added a generic XML parser, able to parse elements text and URLs. This parser adds support for any XML based format other than already supported XML vocabularies such XHTML, RSS/Atom feeds... It will eventually be used as a fallback if one of these specific parsers fail, before falling back to the existing genericParser which extracts not that much useful information except URL tokens.	7 years ago
Ryszard Goń	3cedbbd4ed	Wrong password was removed after the SSL certificate import Removing the keystore password will prevent ssl from working after the next restart. The certificate password should be removed instead. Fixes http://mantis.tokeek.de/view.php?id=687	8 years ago
luccioman	64cec2790d	Improved character encoding detection from Content-Type header Also updated some related JavaDocs	8 years ago
luccioman	0487336ec3	Prevent integer overflow in table statistics and use strong typing	8 years ago
luccioman	d2a4a27f52	Improved stream-oriented parsing entering conditions.	8 years ago
luccioman	9dd790087d	Added HT Cache basic statistics (hit rate)	8 years ago
luccioman	5fdd5d16b1	Use volatile to ensure concurrent threads use up to date property value	8 years ago
luccioman	28b451a0b3	Made Cache compression level and lock timeout user configurable	8 years ago
luccioman	a7394b479b	Limit the synchronization blocking time on some Cache operations. Using a Reentrant lock instead of the intrinsic synchronization lock permits limiting the blocking time to acquire a lock. Useful on a very busy Cache concurrently accessed by many threads : when the time to acquire a lock is too high, getting/storing content on the cache becomes inefficient, and it is then better to fall back to loading remote resources. Illustrated by the CacheTest stress test and some traces reported in mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )	8 years ago
Michael Peter Christen	c94a8c76bd	re-added solr synchronization hack	8 years ago
Michael Peter Christen	6fe735945d	migrated Solr 5.5 -> Solr 6.6 and from Java 1.7 -> 1.8 Also: now Version 1.921	8 years ago
luccioman	ce89492319	Ensure system resource release by closing document stream.	8 years ago
luccioman	8399275142	Properly close file output streams even on exceptions scenarios.	8 years ago
luccioman	4e4dc6c4e5	Removed unnecessary finalize implementation. On such private classes with limited scope but with frequent instance creations and removals within the application lifecycle, implementing the finalize method is particularly unwanted as it decreases the garbage collector performance. What's more the Object.finalize() method is now deprecated in the JDK 9 and will eventually disappear from future releases (see https://bugs.openjdk.java.net/browse/JDK-8177970)	8 years ago
luccioman	a04feac064	Ensure file input streams proper closing in both success and failures Also add when possible a warning level log message on input stream closing error instead of failing silently. This could help understanding some IO exceptions such as "too many files open".	8 years ago
luccioman	d98c04853d	Ensure proper closing of file input streams.	8 years ago
luccioman	c53c58fa85	Ensure closing ChunkIterator stream in every possible use case. Also trace in logs the eventual close failures instead of failing silently. This should help prevent holding too many unreleased system file handlers, as in the case reported by eros on YaCy forum (http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5988&sid=b00e7486c1bf7e48a0d63eb328ccca02 )	8 years ago
luccioman	29e52bda39	Merge branch 'master' of https://github.com/yacy/yacy_search_server	8 years ago
luccioman	a9cb083fa1	Improved consistency between loader openInputStream and load functions	8 years ago
reger	a814f3d885	Introduce keyword query parameter This enables keyword navigator to filter on keywords. Added search page output and layout config for keywords, allowing e.g. in Intranet use to display the keywords. No styling or links applied to the keyword text (but is desirable possibly in combination with bootstrap-tagsinput for future/intranet).	8 years ago
luccioman	c226ded799	Fix unescape of URLs having some '%' chars but not percent-encoded	8 years ago
luccioman	306a82dd71	Fixed scraper NullPointerException cases on malformed URLs.	8 years ago
luccioman	aa55d71cf5	Fixed a NullPointerException case on Digest authentication. Could occur when upgrading from a Debian package configured with Basic authentication (as in release 1.92.9000) to a more recent one with Digest authentication, without having re-encoded the admin password (for example with dpkg-reconfigure). As reported by eros on YaCy forum (http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5988#p33686).	8 years ago
luccioman	02ec0ed13c	Quoted param value in Solr query to avoid unwanted traces in logs When Webgraph Solr core is enabled, crawling and removing from index an URL whose hash starts with the '-' character (example URL : https://cs.wikipedia.org/ whose hash is "-2-HuTEndn4x") produced a full ParseException stack trace in YaCy logs. This was not blocking because the Solr query parser is able to escape itself the query and run it successfully, but filled uselessly YaCy logs.	8 years ago
reger	1737af37cf	Set request originator to own peer in warc importer in addition to change in `039162fbf0`	8 years ago
reger	039162fbf0	Change warc importer to use defaultsurrogate-crawl profile, as reported by LA_FORGE http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5990 and analysed by @luccioman (see comment `510f11d374`) it creates conflict using another crawlprofile without setting originator.	8 years ago
Michael Peter Christen	3b1d640a3c	enhanced debugging	8 years ago
Michael Peter Christen	7de7879f13	added a cache to prevent too many seed enumerations	8 years ago
luccioman	bd7411a53a	Enable p2p and cluster communication when "Protection of all pages" on As reported by paul89 on YaCy forum (http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5958 ), when setting the "Protection of all pages" to "On" in the "ConfigAccounts_p.html" page, the peer became completely unreachable by others, which is not the purpose of this feature. But the restriction still makes sense as a security enforcement and is maintained in private "Robinson mode" where by the way any peer-to-peer or cluster communication would be rejected.	8 years ago
luccioman	31ad043bb9	Added user interface feedback on results feeding termination status. Added as an additional icon with title in the search progress bar, to inform about background search feeder threads terminated or still running. While giving a bit more information to users about the p2p search process, this can help choosing whether or not wait a little bit more time before going to the next page, in order to get results from various sources sorted as best as possible (see #91 for a discussion about sorting accuracy and network latency). Other related modifications included : - regular updates to statistics in the progress bar until the background feeders are completely terminated. - removed some uses of unsecure and discouraged JavaScript elements	8 years ago
sgaebel	ff6392215e	added closing of lst-Tag in solr-Export	8 years ago
luccioman	d90b001e1b	Improved previous merge "Show ranking in HTML UI". - added the new setting as configurable in the "Debug/Analysis" settings page. Debug/analysis is its main purpose for now as there is currently no nice and "understandable" ranking score info servlet (see forum discussion http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5884 ) - render in the "Search Page Layout" page preview when enabled - added constants	8 years ago
luccioman	0f0f42b509	Added some JavaDoc	8 years ago
reger	077d062be3	Adjust mergeDocuments to keep youngest last-modified date of document collection	8 years ago
luccioman	654801523e	Fixed StringIndexOutOfBoundsException case. Revealed by commit `c77e43a` : the exception was then thrown when indexing pages containing mailto: scheme URL links with the Solr Webgraph core enabled. Fixed the error case and restored filtering on mailto links in Document.resortLinks() as these URLs still should not appear in Document.hyperlinks.	8 years ago
luccioman	522a268305	Improved new blacklist entries URL scheme detection.	8 years ago
luccioman	532981b363	Updated putHTML() JavaDoc	8 years ago
luccioman	58d23047dd	Handle '?' and '+' chars as valid wild cards when adding to blacklist. An entry such as "domain.com/[a-z]+" is a valid regular expression and does not need additional "../.*" wildcards.	8 years ago
luccioman	a87281b498	Added MediaWiki dump import scheduling feature. Checking the last modified date by default to prevent unnecessary long running operations.	8 years ago
luccioman	edd7ccac40	Added some JavaDoc	8 years ago
luccioman	79fdf14b0a	Fixed regression introduced by commit `9ad4d16` On MediaWiki dump imports, the SurrogateReader was trying to unread too many bytes, then failing with the following exception : "java.io.IOException: Push back buffer is full".	8 years ago
Michael Peter Christen	7678fd67e3	copied fix from yacy_grid_parser for wrong array type	8 years ago
Michael Peter Christen	200b100fb8	added patch to rewrite altered yacy grid schema into yacy schema This generates the stub and protocol parts of an url for inboundlinks, outboundlinks and images	8 years ago
reger	9ad4d16829	Add a responsHeader to the solr index export with a format identifier and export parameter (in accordance with response xml format) for easier format detection on import.	8 years ago
luccioman	9697209ef6	Fixed Index Export feature for compatibility with old indexed documents. This is a fix for mantis 682 (http://mantis.tokeek.de/view.php?id=682) and issue #116	8 years ago
luccioman	88c062639b	Added some JavaDoc	8 years ago
luccioman	31fff2c986	Extended WikiCode template inclusion syntax support. Wiki templates are not rendered but syntax support is improved, which greatly enhance snippets rendering on search results coming from a MediaWiki dump import. Tested on various dumps from Wikimedia at https://dumps.wikimedia.org/backup-index.html See also Wikipedia transclusion documentation at https://en.wikipedia.org/wiki/Wikipedia:Transclusion	8 years ago
Michael Peter Christen	973d74712f	added yacy grid flatjson surrogate parser	8 years ago
luccioman	b1da92648e	Fixed surrogates import monitoring page (/CrawlResults.html?process=7) This page was always empty, as described in mantis 740 (http://mantis.tokeek.de/view.php?id=740)	8 years ago
luccioman	527d494c1a	Fixed "Unchecked conversion" compilation warnings.	8 years ago
reger	c77e43a391	Take out mailto collect in internal parsed document As earlier plans to make use of mailto as separate webgraph entity didn't materialize (see http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5726&p=32493&hilit=mailto#p32493) free the unused handling and resources.	8 years ago
Michael Peter Christen	335868edba	Merge branch 'master' of git@github.com:yacy/yacy_search_server.git	8 years ago
reger	bec34d3546	Add url input field as source for WarcImporter allowing to import warc from url without prior download.	8 years ago
luccioman	f66438442e	Extended Mediawiki dump import to remote URLs. When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote file now is directly streamed and processed, allowing import of several GB dumps even with a low memory remote peer, and without need to manually download the dump file first.	8 years ago
luccioman	e5c3b16748	Improved http client close time on stream processing errors.	8 years ago
luccioman	23775e76e2	Fixed endless loop case in wikicode processing. Detected when importing recent MediaWiki dumps containing some pages with script content in plain text format (see Scribunto extension https://www.mediawiki.org/wiki/Extension:Scribunto ). Further improvement : modify the MediawikiImporter to prevent processing revisions whose <model> is not wikitext.	8 years ago
luccioman	0bc868a819	Improved support for non ASCII chars in local file system URLs Creating a MultiProtocolURL instance from a File object and then retrieving a File with getFSFile() was inconsistent with file paths containing space or non ASCII chars.	8 years ago
reger	7b80189bda	Activate hosts navigator plugin. This includes rwi results in the navigator count. This might be tangentially related to http://mantis.tokeek.de/view.php?id=736 as the example includes a local index search, while rwi results are not counted.	8 years ago
Michael Peter Christen	f5ad29edb1	Merge branch 'master' of git@github.com:yacy/yacy_search_server.git	8 years ago
Michael Peter Christen	76e9135526	added flatjson parser (stub, unfinished)	8 years ago
reger	b7417ac329	Introduce a Keyword search navigator using the index field keywords. The keywords field string is split into words as navigator entries. A keyword navigator facet is essential for search appliance usage where documents and metadata often use specialized keyword vocabularies to filter search results. This navi can be used without custom index schema. As we have not yet defined a search query command to filter "keywords", the filtering is limited to adding the keyword to the search query.	8 years ago
luccioman	09e72eb0a4	Set Config Portal as a private administration page. Consistently with its required action from submission credentials, and because external unauthenticated users do not need to access these settings.	8 years ago
reger	ba339a2a45	Add servlet to import warc file from filesystem IndexImportWarc_p.html. Apply Importer interface to WarcImporter	8 years ago
Michael Peter Christen	1d81b8f102	Merge branch 'master' of git@github.com:yacy/yacy_search_server.git	8 years ago
Michael Peter Christen	69081bce00	added export to elasticsearch. The export dump can easily be imported to elasticsearch using the command curl -XPOST localhost:9200/collection1/yacy/_bulk --data-binary @yacy_dump_XXX.flatjson	8 years ago
reger	510f11d374	Implement surrogate import from Warc archives (as first option handle warc = Web ARChive File Format). Warc files with extension .warc or compressed warc.gz can be placed in the DATA/surrogate/in and contained responses are imported to the index. The used library is stream based so we can easily extend it later to use and load warc's from the net.	8 years ago
luccioman	4b649b0a11	Fixed NPE case and API URL link on Solr HTML output for webgraph core.	8 years ago
luccioman	af28a07780	Updated API calls recording/replay with recent changes. - enabled HTTP POST calls with Digest HTTP authentication - made API calls compatible with API newly restricted to HTTP POST only with transaction token validation - ensured backward compatibility with older entries recorded as HTTP GET	8 years ago
reger	81670c3484	One more use of SwitchboardConstants.SERVER_PORT constant, apply standard servlet design pattern initialization of solrselectservlet	8 years ago
luccioman	cde237b687	Enforced access controls on some administrative actions. - ensure use of HTTP POST method : HTTP GET should only be used for information retrieval and not to perform server side effect operations (see HTTP standard https://tools.ietf.org/html/rfc7231#section-4.2.1) - a transaction token is now required for these administrative form submissions to ensure the request can not be included in an external site and performed silently/by mistake by the user browser	8 years ago
luccioman	df5970df6d	Extended Apache HTTP Digest Auth. for use of YaCy encoded password When programmatically requesting the local peer with Apache http client, authentication credentials must be passed as clear-text values. This extension to the apache org.apache.http.impl.auth.DigestScheme permits use of the YaCy encoded password stored in the adminAccountBase64MD5 configuration property.	8 years ago
reger	f05976c017	Display the local search word statistic in alphabetic order	8 years ago
reger	3dd23c178b	Introduce the option to configure a shutdown port. A port value of -1 will disable this option. If set to a value greater than 0, YaCy listens on this port on the local loopback address (127.0.0.1) for a shutdown or restart signal. E.g. connect to http://localhost:8005/shutdown will stop the YaCy server. http://localhost:8005/restart will restart it. This option allows to stop YaCy locally, independently from the web frontend (which might be configured for password protected remote access).	8 years ago
reger	a2afb4bae0	add switchboardconstants for server ports config keys	8 years ago
reger	56d0a87a83	remove double occurrence of geo:lat in rss tokens	8 years ago
reger	b4fa1141b8	implement RequestHeader getRequestURI, getRequestURL for legacy request	8 years ago
reger	209a7374bd	remove unused import pdfParser	8 years ago
reger	de1c1c16db	Improve pdf text extraction resource handling. For short pdf <= 3 pages use already extracted content, only for long pdf > 3 pages reassign content and close internal writer (to direct free buffers)	8 years ago
reger	9b6d1abd9e	eliminate some compiler unchecked and deprecation warnings in nav plugins by explicit type declaration and replacing date.getYear with Calendar.get	8 years ago
reger	18c7563dbe	Extend DCEntry.getLanguage convert to ISO639-1 codes for more languages by using icu.ULocale for languages not already covered (ICU normalizes to ISO639-1 2 char codes). Add test class Use DublinCore vocabulary declarations in DCEntry and SurrogateReader for easier usage debugging, Init SurrogateReader.inputSource on first use.	8 years ago
reger	ce87025462	further avoid to set connect info properties as header value following comment "use of properties as header values is discouraged" in case where (proxy)HTTPClient overwrites values with supplied url. Use defined request.referer procedure in response class.	8 years ago
reger	cd4d891ea4	use pre-defined "Connection" header key, replace deprecated	8 years ago
luccioman	0173b0bc32	Added an advanced settings page for referrer policy settings. Feedback will be welcome, notably on the descriptive content of this page.	8 years ago
reger	81963a89fe	fix proxyservlet response url to respect http scheme if a relative Location header is returned.	8 years ago
luccioman	cdcd923375	Privacy enhancement : added settings to control referrer policy. HTTP "Referer" header sent by the browser when using YaCy can now be controlled either with the referrer meta tag as a global policy, or only for search result links by adding the attribute rel="noreferrer". To improve privacy with the less possible regressions, the default is set as meta tag with value "origin-when-cross-origin" : internal YaCy links behavior is not affected, but when visiting external websites referrer url is not empty but stripped from query parameters and path. Older browsers, Safari, MS IE and Edge do not support the referrer meta tag, so the standard but less flexible noreferrer link type can also be enabled as an alternative. User-friendly settings page to be implemented.	8 years ago
reger	86534a56f7	fixed ReindexSolrBusyThread new and unexpected repeat of same query with low number of found documents - by adding additional end condition to remove processed query with number of found docs <= process-chunk-size. Noticed on query h4_txt:[* TO *], found 21, process 21, call of commit happened but on next cycle same query again 21 docs found (while h4_txt was removed from schema and committed inputdocuments).	8 years ago
reger	275c0cddd1	Adjust DefaultServlet test case to recent change, deprecate unused CONNECTION_PROP_PROTOCOL (also as it might be misleading with getProtocol vs getScheme)	8 years ago
reger	41e2ee0eca	Fix call parameter for ConnectionInfo in MonitorHandler (expected scheme e.g. http, was protocol version). Deprecate obsolete custom X-...-Scheme header constant. Use existing FORMAT_ANSIC Dateformatter in HeaderFramework. Correct htmlParserTest (del one not intended println)	8 years ago
luccioman	ac766327d3	Switched a few more Solr fields from strictly mandatory to optional	8 years ago
reger	f254fcfc67	fix htmlParser <script> text extraction on code containing expression recognized as tag like 1<a reported in https://github.com/yacy/yacy_search_server/issues/109 Script content is ignored by default, but the text is filtered for html tags. Modified scraper to skip tag filtering while within a <script> section (until a closing tag </script> is detected). Possible side effect, missing </script> end-tag will truncate trailing content text.	8 years ago
luccioman	2f191e0e1c	Improved MultiProtocolURL non ASCII characters support. After @sinkuu Pull Request #108 added JUnit tests, updated some JavaDoc and also improved URL tokenization to support non ASCII characters.	8 years ago
luccioman	18e8b3a220	Merge branch 'escape' of https://github.com/sinkuu/yacy_search_server	8 years ago
reger	7419989de3	Correct dublincore title property text to lowercase in htmlresponsewriter, remove unused (carry over) local variable Do the same for other responsewriter.	8 years ago
Burkhard	4fdc11cae8	Update SearchEvent.java Fix NPE on disabled local SolrIndex, occurring on search moving to the 2nd result page. The debug purpose only setting for disabling local SolrIndex (System Admin -> Debug Settings) should long term probably be removed from production code.	8 years ago
luccioman	cdc7f3e431	Switched some Solr fields from mandatory to optional These fields are default enabled but with no doubt not strictly mandatory with the current code base. As reported by @reger24, splitting between essential mandatory and optional fields is still to be improved to reflect the current YaCy needs.	8 years ago
luccioman	3475d8c1a9	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	8 years ago
luccioman	c68a8be2d9	Refactored and enforced Solr mandatory fields for proper operation - Added a new method to check activation of mandatory fields on Collection Configuration commit, consistently with checks previously performed in Switchboard startup and with mandatory fields in the default schema. - Reorganized default schema and CollectionConfiguration enumeration : moved no more mandatory fields in a specific section, and moved fields enabled at startup to the mandatory section. - Marked mandatory fields as required and with stronger font in the IndexSchema_p.html page	8 years ago
reger	334c70c37a	correct fromDate init value on missing param in api/timeline_p servlet revert test modification from last commit in AccessTracker.main	8 years ago
reger	cc770512d5	add hint of query syntax in AccessTracker log (qs=normal querystring, sq=solr-querystring) to allow to filter simple text queries for processing, remove toString for counter parameter use more predefined constants in solrservlet	8 years ago
luccioman	e5858bc8c8	Fixed a NullPointerException case possible on Index Export As reported by Palulukas in YaCy forum (http://forum.yacy-websuche.de/viewtopic.php?f=18&t=5944&sid=dcef5b899ab4aa9b40e3a3d158c13aed#p33454) the Index Export operation can fail, notably when the Solr index contains one or more documents with empty (despite required) "load_date_dt" field. This fixes the export failure when the situation finally occurs, but more should be done to harden verifications on minimum required fields.	8 years ago
reger	7e53860fc7	fix NPE in HTMLResponseWriter on missing document title	8 years ago
reger	5e8879beb7	Reduce self generated content for text_t (visible text index field) to avoid repeat of tokenized url as description, continuation of `7e09bff4a1` `1409cabe8b` Add some javadoc, and not needed remove of omitted fields in postprocessing.	8 years ago
luccioman	6e89d125f2	Added robots.txt support for heuristics federated search. As noticed by @reger24, abusive use of OpenSearch systems should be prevented, especially if allowing to parse and reuse HTML results. robots.txt file is now checked before requesting an external OpenSearch system to respect the host exclusions and eventual crawl-delay value. The check is also performed when trying to add a new OpenSearch URL template through the /ConfigHeuristics_p.html admin page.	8 years ago
sinkuu	a46b232bf1	Use java.net.URLDecoder	8 years ago
luccioman	bf16de29c1	Added support for HTML OpenSearch results. Many OpenSearch systems do not provide results as standard RSS/Atom feeds but only as HTML. This modification adds some support for custom OpenSearch HTML results through the use of mapping files (as already done for federated Solr search) relying on CSS-like selectors to retrieve information from HTML content. An example mapping file is provided to map results from the www.npmjs.com OpenSearch URL.	8 years ago
luccioman	54405577aa	Replaced absolute redirection locations by relative ones when possible. This makes integration of YaCy behind a reverse proxy subfolder easier.	8 years ago
luccioman	1857651988	Added a new Debug/Analysis advanced settings subsection. As discussed in PR #93 with @JeremyRand and @reger24 this new advanced settings page includes: - a new setting to control remote Solr responses encoding - some existing debug settings which could not be set through the admin user interface	8 years ago
luccioman	526f2d6a8b	Fixed NPE case occurring when local solr index is disabled in search.	8 years ago
luccioman	def55ec166	Improved termination of timed out remote solr requests to peers. On timeout, closing remote Solr requests is more appropriate than simply using Thread.interrupt(), which is not effective in most cases. Closing does not ask for a commit on the remote solr, but releases http connection resources and is more likely to end those threads that could otherwise wait indefinitely. Other related improvements included : - no more marking remote peer as not available when remote search is interrupted before timeout by the cleanup job. - added a short fine log level trace of failing remote solr requests	8 years ago
luccioman	08de58b6d3	Named a Thread without name for easier monitoring	8 years ago
luccioman	9a5a124bf2	Distinguished solr connectors thread names for easier monitoring.	8 years ago
reger	1f497ccad5	Add consistency check for related index fields upon load and save of index schema. To assemble the original link url for out-/inboundlinks, icons and pictures the _protocol_sxt and _urlstub_sxt is needed (due to the used data-reduced storage method). Auto-enable _protocol_sxt if _urlstub_sxt is enabled, to be able to correctly assemble the original link url.	8 years ago
luccioman	68afe900d0	Added user-friendly controls over disk usage configuration settings. As mentioned in issue #103, control settings over YaCy disk usage already existed but lacked a user-friendly way to set them. I added it to the Performance_p.html administration page with a little refactoring on the "Resource Observer" fieldset for improved accessibility and HTML standards respect. Also added the possibility to enable/disable the autoregulation function from this page.	8 years ago
reger	95d2a28599	adjust the Field-Reindex Thread to verify and update the document id in case hash (ID) doesn't match document url (sku field).	8 years ago
luccioman	fc01b69eca	Fixed local image search pagination regression. As reported by @tglman on issue #90, when searching images on the local index only, pages next to the first were always empty. This was a regression from commit `c25e48e969`.	8 years ago
Michael Peter Christen	02d0b3172c	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	8 years ago
Michael Peter Christen	d4f45cf05e	added dc.date.modified and dc.date.created to date parser	8 years ago
reger	f9180fabc4	assure that RWI Index.Segment IODispatcher is not blocking on shutdown waiting on a semaphore permit. see desc. http://mantis.tokeek.de/view.php?id=723	8 years ago
reger	e61ee180a7	Group all proxy settings on System Administration by adding settings of UrlProxyAccess page (moved from deleted AugmentedBrowsing_p), adjust submenu (remove Augmented Browsing) and translation files.	8 years ago
luccioman	39e081ef38	Fixed display of crawler pending URLs counts in HostBrowser.html page. As described in mantis 722 (http://mantis.tokeek.de/view.php?id=722) Also updated some Javadoc.	8 years ago
reger	df80c57842	add ukr and pol to DCEntry.getLanguage ISO639-2 3-char language code conversion to deliver uk, pl 2-char code and use if else to return on match	8 years ago
luccioman	e048e74072	Added an optional parameter to webstructure.xml api. This new "documentStructure" parameter can be set to false to only get hosts accumulated references on a resource and thus prevent scraping the specified URL and getting citations references. Also set WebStructureGraph constants as final and updated the Javadoc with example api call URLs.	8 years ago
reger	581b00cc20	remove obsolete lastmodified calculation in WebgraphConfig	8 years ago
luccioman	5c8958bcea	Updated Javadoc and Junit tests for the WebStructureGraph class.	8 years ago
luccioman	d9766ca981	Fixed WatchWebStructure_p.html render to include https URLs. As described in mantis 721 (http://mantis.tokeek.de/view.php?id=721) WatchWebStructure_p.html failed to include in its structure view https and other protocols and ports than default http.	8 years ago
luccioman	ed3dd5e31a	Fixed webstructure.xml API used with a domain name 'about' parameter. As described in mantis 720 (http://mantis.tokeek.de/view.php?id=720), when requesting this API with a domain name instead of a complete URL only HTTP references on default port were listed.	8 years ago
luccioman	0da1e6ba16	Factored code re-implementing DigestURL.hosthash() method. This ensures consistent implementation of the url host hash generation and easier usage finding in source code. Also added a unit test for this function.	8 years ago
luccioman	86adfef30f	Added automated unit tests and perfs test for WebStructureGraph class. Fixed references count when multiple links target the same domain name in one document.	8 years ago
luccioman	9cea7cbb10	Detailed some Javadoc related to /api/webstructure.xml usage.	8 years ago
luccioman	6a4d51d8f9	Cleaned up some Javadoc warnings.	8 years ago
luccioman	86dc198698	Fixed some JavaDocs broken links.	8 years ago
reger	16beb551ea	fix DC.Elements namespace in DublinCore vocabulary class delete redundant (unused) DCElements.	8 years ago
luccioman	339f005ced	Blacklist import and update performance improvements. Measurement sample : import from blacklist local file containing about 15000 entries - before refactoring : several minutes - after refactoring : a few seconds!	8 years ago
luccioman	e3892b0957	Added some JavaDoc.	8 years ago
reger	4c9be29a55	fix concurrency issue with htmlParser using not current scraper data resulting in incorrect data for some html index metadata. Details see http://mantis.tokeek.de/view.php?id=717	8 years ago
reger	eedee6eabb	fix exception on URIMetadataNode instantiation with corrected id hash on host_id_s. Use Solr setField instead of addField to prevent java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String at net.yacy.kelondro.data.meta.URIMetadataNode.hosthash(URIMetadataNode.java:247) at net.yacy.search.query.SearchEvent.addNodes(SearchEvent.java:966) at net.yacy.peers.Protocol.solrQuery(Protocol.java:1242) at net.yacy.peers.RemoteSearch$2.run(RemoteSearch.java:349)	8 years ago
luccioman	c1401d821e	Adjusted crawl depth control for FTP crawl start URLs.	8 years ago
reger	68d4dc5cc5	Complete harmonization RequestHeader getCookie with std ServletRequest to use javax.servlet.http.Cookie parameters. Deprecate now obsolete getHeaderCookies. Adjust setting of MaxAge to spec if >= 0 otherwise keep default.	8 years ago
reger	a1e5f7dbca	fix of fulltext.remove() by id of webgraph document webgraph has document hash in source_id_s	8 years ago
luccioman	1df558a6c6	Fixed YaCy proper shutdown triggered by SIGTERM signal. The main shutdown hook thread was not properly waiting for the main thread termination which consequently could not properly close resources and threads. After terminating a running YaCy peer this way (Ctrl+C in console, or kill <pid> for example), you could see the still existing DATA/yacy.running file. Tested with : - Debian Jessie openjdk 7 and 8 : regular shutdown, Ctrl+C, kill command, system restart while yacy is running - Windows 10 Oracle JDK 7 and 8 : non regression on regular shutdown	8 years ago
reger	b522d540b9	Include itemprop latitude/longitude (see schema.org) in attribute parsing for lat/lon. Harmonize number parsing for lat/lon to parseDouble. Fix endDate_dts value assignment.	8 years ago
reger	083df255e4	fix html tag attribute parsing containing attribute w/o value e.g. itemscope or autofocus (in such case the next key was not properly recognized).	8 years ago
reger	cb95b7339a	include html5 <time> tag in content scraper, add "datetime" property of <time> tag to scrapers startdate list. Datetime is parsed as iso8601 (xml) date, html5 allows partial as well as duration (not handled by this)	8 years ago
reger	7bf2bcf504	fix and prevent exception on missing required cookie name skip cookie creation if name is empty.	8 years ago
luccioman	3ca695390c	FTP crawl start URLs : applied crawl profile depth control Applied rules : - when the FTP URL denotes a file resource, stack it as any start URL : eventually embedded links can be followed applying the usual depth rules - when the FTP URL denotes a directory, list files under this directory and stack them for crawl, and repeat the process on sub folders until crawl depth is reached	8 years ago
luccioman	128c8ef8d4	Fixed title rendering having non ASCII chars in QuickCrawlLink_p.html.	8 years ago
reger	8eb6fba59c	activate filetype navigator plugin and restrict config (append) of navs to not already actives. Dht results are now included in count this might over shoot on redundant dht and solr, while the previous solr facet based was always low.	8 years ago
luccioman	c25e48e969	Enabled displaying results after 14th page for local search queries. Fixes issue #90 for local queries only: Stealth mode, Portal mode or Intranet mode. For P2p mode, the issue would probably be difficult to solve with reasonable performance. This is still to dig. Also switched some InterruptedException catch log messages to warn level as this is normal behavior when shutting down a peer. Fixed yacysearch buttons navbar behavior to deal correctly with total results count or offset over 1000. Also improved the buttons navbar to be able to navigate over 10th page for local queries.	8 years ago
luccioman	a3886c6adb	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	8 years ago
luccioman	feaa87005e	Improved indentation for easier debugging steps.	8 years ago
reger	bab4804d11	add FileTypeNavigator plugin	8 years ago
reger	d35c47090c	remove obsolete put of HttpServletRequest attributes to YaCy servlet parameters on SSI (server side includes). Query parameters are already merged by dispatcher.include, making copy of parameter (RequestDispatcher.INCLUDE_QUERY_STRING) obsolete. All other parameter are not used as YaCy servlet arguments.	8 years ago
reger	0959038624	correct DefaultServlet resource pathinContext calculation exclude servletPath option as resources are always relative to htroot or htdocs, the change reflects this. Theoretically it and the recent adjustments regarding relative urls allow the instance to be configured in a path other than root (/)	8 years ago
reger	c50e23c495	reduce creation of empty legacy RequestHeader() in situation where null is acceptable (less for garbage collection).	8 years ago
reger	87f6631a2a	adjust Cache getHeader to prev. changes/commit	8 years ago
reger	6be7339b1d	remove the overhead of unused reverseMappingCache of HeaderFramework / RequestHeader	8 years ago
reger	c702eb6786	del dead menu link to /repository (directory not created in current distribution -> old)	8 years ago
reger	baa5d9b9e3	adjust DomainHandler working on resolved .yacy domain (remove obsolete check for path on hostname)	8 years ago
luccioman	1ba705c23d	Use loaderDispatcher instead of HTTPClient to download releases. The default redirection strategy when using directly HTTPClient is incorrect when redirection is cross host (the original Host header is still sent when requesting the redirected location). YaCy LoaderDispatcher handles redirections properly, thus release archive files using redirected URLs (such as the URLs on a GitHub Release page) are successfully downloaded.	8 years ago
luccioman	467650c042	Hardened system update checks. When a downloaded archive release is corrupted, empty, or can not be opened for any reason, the update script must not be launched because it erases the existing lib/*.jar libraries.	8 years ago
luccioman	b5711b8fe1	Added some Javadocs.	8 years ago
reger	0d2964cf2b	expanded error message on rejected crawl url due to failed dns lookup (closes http://mantis.tokeek.de/view.php?id=678)	8 years ago
luccioman	00e81fcc15	Check HTTP status when downloading a release, and report eventual error.	8 years ago
reger	0758c868c9	add HostNavigator plugin	8 years ago
reger	60160877f5	bundle initialization of search navigation plugins in separate handler class to allow to use navigator map in config servlets (without need to create a search event)	8 years ago
reger	3151cda3a5	catch ip-format exception on wrong server access setting ip filter as reported in http://mantis.tokeek.de/view.php?id=713 to prevent abort of initialization. This jetty/whitelist ipaccesshandler accepts currently only ipv4	8 years ago

8604 Commits (7496df93c38ee63c032bf6791c65623faf4e76f8)