yacy_search_server

Commit Graph

Author	SHA1	Message	Date
luccioman	da3dbf9ea1	Use Javadoc style comments on SearchEvent properties. For better code readability and understanding.	8 years ago
luccioman	c6ae87168a	Added unit tests on the gzip parser.	8 years ago
luccioman	169ffdd1c7	Finer control on max links to parse in the html parser.	8 years ago
luccioman	4743a104b5	Added some unit tests on FileUtils.	8 years ago
luccioman	e41d046a9d	Improved parsing support for OOXML spreadsheets (.xlsx). As reported by edycop in mantis 765 ( http://mantis.tokeek.de/view.php?id=765 ), parsing of xlsx files was quite incomplete. Now properly support "Shared String Table" entry in Office Open XML spreadsheets, and also detect embedded URLs. Integrating the Apache poi-ooxml library could be an option for finer OOXML formats support, but their SAX style parsing example ( http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api ) tends to show that a custom SAX handler is still efficient for lightweight and low memory footprint processing.	8 years ago
reger	51a4e03c93	Allow to stop currently running warc import (stop button)	8 years ago
luccioman	6cec2cdcb5	Use unredirected robots.txt URL when adding an entry to the table.	8 years ago
luccioman	3f0446f14b	Ensure proper synchronous robots entry retrieval on first check. Previously, when checking for the first time the robots.txt policy on an unknown host (not cached in the robots table), the result was always empty in the /getpageinfo_p.xml api and in the /CrawlCheck_p.html page. Subsequent calls, however, returned the correct information.	8 years ago
luccioman	9da75ac76d	Upgraded Docker base image from deprecated java to openjdk.	8 years ago
luccioman	b23a563065	Prevent search result failure on incomplete images information. Complements the recent modification related to images in commit `7f395ef`. Unfortunately, the metadata of many documents fetched from the freeworld p2p network has only partial information about embedded images. Without proper error handling, this made many searches in p2p mode fail completely.	8 years ago
Michael Peter Christen	30d71c6359	added usage of X-Real-IP http header to identify request IPs which came through NGINX reverse proxy configurations	8 years ago
Michael Peter Christen	f45378c11c	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	8 years ago
Michael Peter Christen	7f395ef937	added image link in search results This should be a help to make a preview of search results. The image is computed from the list of embedded images, it is always the first image in that list. In rss-type results the image is presented like <media:content medium="image" url="https://abc.xyz/logo.png"/> as defined in http://www.rssboard.org/media-rss#media-content	8 years ago
luccioman	780173008e	Implemented partial stream parsing of tar archives. Also added JUnit tests for the tar parser and fixed unwanted use of the tar parser as a fallback on files included in a tar archive.	8 years ago
luccioman	acab6a6def	Also handle text content when parsing XML within limits.	8 years ago
reger	f38fb7f02c	Add junit test for AbstractOperations.addOperand()	8 years ago
reger	2a07799ad1	Correction of `d03e2c98ea` Fix Conjunction.addOperator to do nothing if term is empty, to prevent a query string with a repeated logical operator like "field:term AND AND field:term", possibly causing out of mem in postprocessing_doublecontent	8 years ago
reger	d03e2c98ea	Fix Conjunction.addOperator to do nothing if term is empty, to prevent a query string with a repeated logical operator like "field:term AND AND field:term", possibly causing out of mem in postprocessing_doublecontent	8 years ago
reger	b6a41df4f7	Remove deprecated YaCyProxyServlet was replaced by UrlProxyServlet	8 years ago
luccioman	8a94fef9e0	Prevent unwanted cached bytes duplication on stream parsing.	8 years ago
luccioman	ed678186a8	Updated xml parser limited parsing test for use with the latest JDK.	8 years ago
luccioman	366ceae35a	Fixed missing transitive dependency to commons-collections4-4.1 Dependency required by poi-3.16. Dependency was not provided in YaCy but already defined on previous poi versions. This only became problematic since upgrade from poi-3.15 to poi-3.16 (commit `dedc6552d3`). Indeed in this new poi release, a poi component used in some YaCy parsers code paths now explicitly needs a class from the commons-collections4 library : org.apache.poi.hpsf.Section now uses org.apache.commons.collections4.bidimap.TreeBidiMap. Impacted YaCy parsers : xlsParser, pptParser, docParser. Issue detected by the following JUnit tests failing : ParserTest.testpptParsers(), ParserTest.testdocParsers(), xlsParserTest.testParse()	8 years ago
luccioman	bf72cbffa3	Updated debian package configuration to match new Java 1.8 target Following migration from Java 1.7 to Java 1.8 in commit `6fe735945d`	8 years ago
reger	119b65389d	upd to icu4j-59_1.jar	8 years ago
reger	4979439e87	Skip public post of jre version. Added to determine switch to java8 `596b5dfa59`	8 years ago
reger	e918ec199e	Replace deprecated ConcurrentHashSet with recommended Java8 ConcurrentHashMap.newKeySet() in postprocessDocuments()	8 years ago
reger	fb71994342	Harmonizing use of xml reader / sax parser in XMLBlacklistImporter eliminating the need for lib/xercesImpl.jar	8 years ago
reger	275d65fffe	Patch last_modified date with internal FirstSeenTime() if no date provided to make sure updated documents are indexed with their last-modified date as provided in current crawl. (Always patching moddate with firstseen might bear the risk of missing actual updates.)	8 years ago
reger	d1b23afed6	Remove obsolete Protocol parameter ttl (time to live) not interpreted in target yacy/query.html also Protocol.querySeed() not used and parameter not interpreted in target servlet yacy/query.html	8 years ago
reger	dedc6552d3	upd to poi-3.16.jar	8 years ago
reger	15d78b1064	Replace deprecated getIP with getIPs in Protocol transferURL() and getProfile(). Remember used ip for error handling and departInterface	8 years ago
reger	ed36b47bec	Replace one more deprecated peerDeparture in Protocol.transferIndex() by moving/using interfaceDeparture() in transferRWI()	8 years ago
reger	37f44941fb	upd to pdfbox-2.0.7.jar	8 years ago
reger	41616de0b8	Add SolrConfig ClassicIndexSchemaFactory to prevent Solr startup warning. This overrides Solr default to use managed schema. As we don't use programmatic schema changes this directs Solr to use schema.xml, eliminating the warning.	8 years ago
luccioman	0ee8c030c4	Log an error when Solr folder migration fails for some reason.	8 years ago
reger	44d455dfed	upd to jwat-warc-1.1.0.jar	8 years ago
reger	588c6e96fb	upd version for typeahead.jquery.js in jslicense.html	8 years ago
luccioman	5a646540cc	Support parsing gzip files from servers with redundant headers. Some web servers provide both 'Content-Encoding : "gzip"' and 'Content-Type : "application/x-gzip"' HTTP headers on their ".gz" files. It was annoying to fail on such resources, which are not so uncommon although non conforming (see RFC 7231 section 3.1.2.2 for "Content-Encoding" header specification https://tools.ietf.org/html/rfc7231#section-3.1.2.2)	8 years ago
luccioman	11a7f923d4	Distinguish response parsing failures from unexpected exceptions.	8 years ago
luccioman	8100c033a2	URL Viewer : apply crawler size limits when adding to local index. This allows large files parsing and preview, while preventing unwanted OutOfMemory errors which are likely to occur when adding to the Solr Index resources larger than configured crawler limits.	8 years ago
luccioman	eda7b0aeb6	Merge branch 'master' of https://github.com/yacy/yacy_search_server	8 years ago
reger	3005be7349	Clean up unmaintained and unused AugmentParser trail.	8 years ago
reger	e5cff062b5	Clean up redundant but obsolete jquery.rdfquery-core-1.0.js script lib	8 years ago
luccioman	cb4f1358e1	Added gzip parser support for max content bytes limit	8 years ago
luccioman	5216c681a9	Added HTML parser support for maximum content bytes parsing limit	8 years ago
luccioman	4aafebc014	Merge pull request #122 from Scarfmonster/patch-1 I also reproduced the issue, and the fix is working fine. Thanks @Scarfmonster	8 years ago
luccioman	651fad6da5	Added RSS parser support for maximum content bytes parsing limit	8 years ago
luccioman	452a17a8d5	Finer control on bounded input streams with custom stream implementation	8 years ago
luccioman	f8f1959ebb	Added parsing within bounds implementation to the generic parser.	8 years ago
luccioman	e0f400a0bd	Support trying multiple parsers even when streaming on large resources.	8 years ago
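
Commits `d03e2c98ea` and `2a07799ad1` above describe skipping empty terms so that a query string never contains a repeated operator such as "field:term AND AND field:term". The following is only a minimal sketch of that idea, not YaCy's actual code: the class `ConjunctionSketch` and its method `addTerm` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a conjunction builder; not YaCy's Conjunction class. */
class ConjunctionSketch {
    private final List<String> terms = new ArrayList<>();

    /** Adds a term, silently ignoring null or blank input. */
    void addTerm(String term) {
        if (term == null || term.trim().isEmpty()) {
            return; // an empty term would otherwise yield "a AND AND b"
        }
        terms.add(term.trim());
    }

    /** Joins the kept terms with a single AND between each pair. */
    @Override
    public String toString() {
        return String.join(" AND ", terms);
    }

    public static void main(String[] args) {
        ConjunctionSketch c = new ConjunctionSketch();
        c.addTerm("field:term");
        c.addTerm("");            // ignored
        c.addTerm("field:other");
        System.out.println(c);    // prints: field:term AND field:other
    }
}
```

Guarding at insertion time keeps the join step trivial: the operator is only ever placed between two non-empty terms.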

13305 Commits (da3dbf9ea1666b033c0a59661398986bbd574939)