fixed doku link

pull/419/head
Michael Peter Christen 3 years ago
parent c4659f0fb0
commit 3959d43a5c

@ -217,7 +217,7 @@
#%env/templates/submenuIndexCreate.template%#
<div id="api">
-<a href="http://www.yacy-websearch.net/wiki/index.php/Dev:APICrawler" id="apilink" target="_blank"><img src="env/grafics/api.png" width="60" height="40" alt="API"/></a>
+<a href="https://wiki.yacy.net/index.php/Dev:APICrawler" id="apilink" target="_blank"><img src="env/grafics/api.png" width="60" height="40" alt="API"/></a>
<span>Click on this API button to see a documentation of the POST request parameter for crawl starts.</span>
</div>
@ -228,7 +228,7 @@
You can define URLs as start points for Web page crawling and start crawling here.
"Crawling" means that YaCy will download the given website, extract all links in it and then download the content behind these links.
This is repeated as long as specified under "Crawling Depth".
-A crawl can also be started using wget and the <a href="http://www.yacy-websearch.net/wiki/index.php/Dev:APICrawler" target="_blank">post arguments</a> for this web page.
+A crawl can also be started using wget and the <a href="https://wiki.yacy.net/index.php/Dev:APICrawler" target="_blank">post arguments</a> for this web page.
</p>
<form id="Crawler" action="Crawler_p.html" method="post" enctype="multipart/form-data" accept-charset="UTF-8">
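
The help text in this hunk mentions starting a crawl with wget. A minimal sketch of what such a call might look like follows. The parameter names (`crawlingstart`, `crawlingMode`, `crawlingURL`, `crawlingDepth`) and the helper `build_crawl_cmd` are assumptions for illustration and are not taken from this commit; check them against the Dev:APICrawler page linked above. Port 8090 is the default the start script falls back to. The function only prints the command, so nothing is sent until the output is run by hand.

```shell
#!/bin/sh
# Hypothetical helper (not part of YaCy): prints a wget command that would
# POST a crawl start to Crawler_p.html. All parameter names are assumptions.
build_crawl_cmd() {
    host="$1"       # peer address, e.g. localhost:8090
    start_url="$2"  # start point, must already be URL-encoded
    depth="$3"      # value for the "Crawling Depth" setting
    post="crawlingstart=1&crawlingMode=url&crawlingURL=$start_url&crawlingDepth=$depth"
    printf '%s\n' "wget -q -O /dev/null --post-data '$post' 'http://$host/Crawler_p.html'"
}

# Print (do not execute) a crawl start for https://yacy.net/ with depth 2
build_crawl_cmd localhost:8090 'https%3A%2F%2Fyacy.net%2F' 2
```

A protected page such as this one will probably also need admin credentials, which wget accepts through its `--user` and `--password` options.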

@ -34,7 +34,7 @@
If you switch off this index, a remote Solr must be activated.</dd>
<dt>Use remote Solr server(s)&nbsp;<input type="checkbox" name="solr.indexing.solrremote" id="solr_indexing_solrremote" #(solr.indexing.solrremote.checked)#:: checked="checked"#(/solr.indexing.solrremote.checked)# onclick="if(!document.getElementById('config').solr_indexing_solrremote.checked) {document.getElementById('config').core_service_fulltext.checked = true;}"/></dt>
-<dd>It's easy to <a href="http://www.yacy-websearch.net/wiki/index.php/Dev:Solr" target="_blank">attach an external Solr to YaCy</a>.
+<dd>It's easy to <a href="https://wiki.yacy.net/index.php/Dev:Solr" target="_blank">attach an external Solr to YaCy</a>.
This external Solr can be used instead the internal Solr. It can also be used additionally to the internal Solr, then both Solr indexes are mirrored.
</dd>

@ -4935,7 +4935,7 @@
<source>Index Size</source>
</trans-unit>
<trans-unit id="e587b435" xml:space="preserve" approved="no" translate="yes">
-<source>It's easy to &lt;a href="http://www.yacy-websearch.net/wiki/index.php/Dev:Solr" target="_blank"&gt;attach an external Solr to YaCy&lt;/a&gt;.</source>
+<source>It's easy to &lt;a href="https://wiki.yacy.net/index.php/Dev:Solr" target="_blank"&gt;attach an external Solr to YaCy&lt;/a&gt;.</source>
</trans-unit>
<trans-unit id="35e7810" xml:space="preserve" approved="no" translate="yes">
<source>This external Solr can be used instead the internal Solr. It can also be used additionally to the internal Solr, then both Solr indexes are mirrored.</source>

@ -578,7 +578,7 @@ Use remote Solr server(s)==Использовать удалённую базу
Solr Hosts==Хосты Solr
Solr Host Administration Interface==Интерфейс управления Solr
Index Size==Документов в индексе
-It's easy to <a href="http://www.yacy-websearch.net/wiki/index.php/Dev:Solr" target="_blank">attach an external Solr to YaCy</a>.==Присоединить внешнюю базу Solr <a href="http://www.yacy-websearch.net/wiki/index.php/Dev:Solr" target="_blank">просто</a>.
+It's easy to <a href="https://wiki.yacy.net/index.php/Dev:Solr" target="_blank">attach an external Solr to YaCy</a>.==Присоединить внешнюю базу Solr <a href="https://wiki.yacy.net/index.php/Dev:Solr" target="_blank">просто</a>.
This external Solr can be used instead the internal Solr. It can also be used additionally to the internal Solr, then both Solr indexes are mirrored.==Внешняя база данных Solr будет использоваться вместо встроенной. Вы также можете использовать дополнительно встроенную базу, но тогда индексы будут сохраняться в обе базы.
Solr URL(s)==Ссылки на базу Solr
You can set one or more Solr targets here which are accessed as a shard. For several targets, list them using a ',' (comma) as separator.==Вы можете установить одну или более баз Solr, которые будут доступны распределённо. Адреса нескольких баз указывайте через запятую.

@ -22,26 +22,26 @@ fi
if [ ! -x "$JAVA" ]
then
echo "The java command is not executable."
echo "Either you have not installed java or it is not in your PATH"
#Cron supports setting the path in
#echo "Has this script been invoked by CRON?"
#echo "if so, please set PATH in the crontab, or set the correct path in the variable in this script."
exit 1
echo "The java command is not executable."
echo "Either you have not installed java or it is not in your PATH"
#Cron supports setting the path in
#echo "Has this script been invoked by CRON?"
#echo "if so, please set PATH in the crontab, or set the correct path in the variable in this script."
exit 1
fi
usage() {
cat - <<USAGE
cat - <<USAGE
startscript for YaCy on UNIX-like systems
Options
-h, --help show this help
-t, --tail-log show the output of "tail -f DATA/LOG/yacy00.log" after starting YaCy
-l, --logging save the output of YaCy to yacy.log
-d, --debug show the output of YaCy on the console and enable remote monitoring with JMX
-f, --foreground run as a foreground process, showing the output of YaCy on the console
-p, --print-out only print the command, which would be executed to start YaCy
-h, --help show this help
-t, --tail-log show the output of "tail -f DATA/LOG/yacy00.log" after starting YaCy
-l, --logging save the output of YaCy to yacy.log
-d, --debug show the output of YaCy on the console and enable remote monitoring with JMX
-f, --foreground run as a foreground process, showing the output of YaCy on the console
-p, --print-out only print the command, which would be executed to start YaCy
-s, --startup [data-path] start YaCy using the specified data folder path, relative to the current user home
-g, --gui start a gui for YaCy
-g, --gui start a gui for YaCy
USAGE
}
@ -50,12 +50,12 @@ YACY_PARENT_DATA_PATH="`dirname $0`"
cd "$YACY_PARENT_DATA_PATH"
case "$OS" in
*"BSD"|"Darwin")
if [ $(echo $@ | grep -o "\-\-" | wc -l) -ne 0 ]
then
echo "WARNING: Unfortunately this script does not support long options in $OS."
fi
*"BSD"|"Darwin")
if [ $(echo $@ | grep -o "\-\-" | wc -l) -ne 0 ]
then
echo "WARNING: Unfortunately this script does not support long options in $OS."
fi
options="`getopt hdlptsg: $*`"
;;
*)
@ -65,7 +65,7 @@ esac
if [ $? -ne 0 ];then
exit 1;
exit 1;
fi
isparameter=0; #options or parameter part of getopts?
@ -79,76 +79,76 @@ TAILLOG=0
STARTUP=0
GUI=0
for option in $options;do
if [ $isparameter -ne 1 ];then #option
case $option in
-h|--help)
usage
exit 3
;;
-l|--logging)
LOGGING=1
if [ $DEBUG -eq 1 ];then
echo "can not combine -l and -d"
exit 1;
fi
if [ $FOREGROUND -eq 1 ];then
echo "can not combine -l and -f"
exit 1;
fi
;;
-d|--debug)
DEBUG=1
if [ $isparameter -ne 1 ];then #option
case $option in
-h|--help)
usage
exit 3
;;
-l|--logging)
LOGGING=1
if [ $DEBUG -eq 1 ];then
echo "can not combine -l and -d"
exit 1;
fi
if [ $FOREGROUND -eq 1 ];then
echo "can not combine -l and -f"
exit 1;
fi
;;
-d|--debug)
DEBUG=1
# enable asserts
JAVA_ARGS="$JAVA_ARGS -ea -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
if [ $LOGGING -eq 1 ];then
echo "can not combine -l and -d"
exit 1;
fi
;;
-f|--foreground)
FOREGROUND=1
if [ $LOGGING -eq 1 ];then
echo "can not combine -l and -f"
exit 1;
fi
;;
-p|--print-out)
PRINTONLY=1
;;
-t|--tail-log)
TAILLOG=1
;;
-s|--startup)
STARTUP=1
isparameter=1
;;
-g|--gui)
GUI=1
isparameter=1
;;
esac #case option
else #parameter
if [ $option = "--" ];then #option / parameter separator
isparameter=1;
continue
else
if [ $parameter ];then
parameter="$parameter $option"
else
parameter="$option"
fi
fi
fi #parameter or option?
if [ $LOGGING -eq 1 ];then
echo "can not combine -l and -d"
exit 1;
fi
;;
-f|--foreground)
FOREGROUND=1
if [ $LOGGING -eq 1 ];then
echo "can not combine -l and -f"
exit 1;
fi
;;
-p|--print-out)
PRINTONLY=1
;;
-t|--tail-log)
TAILLOG=1
;;
-s|--startup)
STARTUP=1
isparameter=1
;;
-g|--gui)
GUI=1
isparameter=1
;;
esac #case option
else #parameter
if [ $option = "--" ];then #option / parameter separator
isparameter=1;
continue
else
if [ $parameter ];then
parameter="$parameter $option"
else
parameter="$option"
fi
fi
fi #parameter or option?
done
if [ ! -z "$parameter" ] && [ "$STARTUP" -eq 1 -o "$GUI" -eq 1 ]; then
# The data path is explicitely provided with startup or gui option
YACY_PARENT_DATA_PATH="`echo $parameter | cut -d' ' -f1`"
if [ ! "`echo $YACY_PARENT_DATA_PATH | cut -c1`" = "/" ]; then
# Parent DATA path is relative to the user home
YACY_PARENT_DATA_PATH="$HOME/$YACY_PARENT_DATA_PATH"
fi
CONFIGFILE="$YACY_PARENT_DATA_PATH/DATA/SETTINGS/yacy.conf"
# The data path is explicitely provided with startup or gui option
YACY_PARENT_DATA_PATH="`echo $parameter | cut -d' ' -f1`"
if [ ! "`echo $YACY_PARENT_DATA_PATH | cut -c1`" = "/" ]; then
# Parent DATA path is relative to the user home
YACY_PARENT_DATA_PATH="$HOME/$YACY_PARENT_DATA_PATH"
fi
CONFIGFILE="$YACY_PARENT_DATA_PATH/DATA/SETTINGS/yacy.conf"
fi
#echo $options;exit 0 #DEBUG for getopts
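
The data path handling in the hunk above can be sketched in isolation. This is an illustration only: `resolve_data_path` and `config_file_for` are hypothetical names, and a `case` pattern stands in for the script's `cut -c1` test of the first character. The behaviour is meant to match the hunk: the first word of the parameter is taken, and a path that does not start with `/` is treated as relative to the user home.

```shell
#!/bin/sh
# Hypothetical helper: returns the parent data path for a -s / -g parameter.
resolve_data_path() {
    # keep only the first space-separated word, as the script does
    first="$(printf '%s\n' "$1" | cut -d' ' -f1)"
    case "$first" in
        /*) printf '%s\n' "$first" ;;        # absolute path: use as-is
        *)  printf '%s\n' "$HOME/$first" ;;  # relative path: prefix with $HOME
    esac
}

# Hypothetical helper: location of yacy.conf below a parent data path.
config_file_for() {
    printf '%s\n' "$1/DATA/SETTINGS/yacy.conf"
}

config_file_for "$(resolve_data_path "yacy_data")"  # relative to $HOME
config_file_for "$(resolve_data_path "/srv/yacy")"  # absolute, used unchanged
```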
@ -169,8 +169,8 @@ then
# JAVA_ARGS="$JAVA_ARGS -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC"
elif [ $OS = "SunOS" ]
then
# the UseConcMarkSweepGC option caused a full CPU usage - bug on Darwin.
# It was reported that the same option causes good performance on solaris.
# the UseConcMarkSweepGC option caused a full CPU usage - bug on Darwin.
# It was reported that the same option causes good performance on solaris.
JAVA_ARGS="$JAVA_ARGS -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
ENABLEHUGEPAGES=1
fi
@ -184,29 +184,33 @@ fi
#turn on MMap for Solr if OS is a 64bit OS
if [ -n "`uname -m | grep 64`" ]; then JAVA_ARGS="$JAVA_ARGS -Dsolr.directoryFactory=solr.MMapDirectoryFactory"; fi
if [ ! -f $CONFIGFILE -a -f "$YACY_PARENT_DATA_PATH/DATA/SETTINGS/httpProxy.conf" ]
then
# old config if new does not exist
CONFIGFILE="$YACY_PARENT_DATA_PATH/DATA/SETTINGS/httpProxy.conf"
fi
if [ -f $CONFIGFILE ]
then
# startup memory
j="`grep javastart_Xmx $CONFIGFILE | sed 's/^[^=]*=//'`";
if [ -n "$j" ]; then JAVA_ARGS="-$j $JAVA_ARGS"; fi;
# startup memory
# Priority
j="`grep javastart_priority $CONFIGFILE | sed 's/^[^=]*=//'`";
if [ -z "$YACY_JAVASTART_XMX" ]
then
# When YACY_JAVASTART_XMX is not set or empty:
# Read from $CONFIGFILE
j="`grep javastart_Xmx $CONFIGFILE | sed 's/^[^=]*=//'`";
if [ -n "$j" ]; then JAVA_ARGS="-$j $JAVA_ARGS"; fi;
else
# use the YACY_JAVASTART_XMX variable
JAVA_ARGS="-$YACY_JAVASTART_XMX $JAVA_ARGS"
fi
# Priority
j="`grep javastart_priority $CONFIGFILE | sed 's/^[^=]*=//'`";
if [ -n "$j" ]; then JAVA="nice -n $j $JAVA"; fi;
if [ -n "$j" ]; then JAVA="nice -n $j $JAVA"; fi;
PORT="`grep ^port= $CONFIGFILE | sed 's/^[^=]*=//'`";
if [ -z "$PORT" ]; then PORT="8090"; fi;
# for i in `grep javastart $CONFIGFILE`;do
# i="${i#javastart_*=}";
# JAVA_ARGS="-$i $JAVA_ARGS";
# done
# for i in `grep javastart $CONFIGFILE`;do
# i="${i#javastart_*=}";
# JAVA_ARGS="-$i $JAVA_ARGS";
# done
else
JAVA_ARGS="-Xmx600m $JAVA_ARGS";
PORT="8090"
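
The new precedence for the heap setting introduced in this hunk can be sketched as a standalone function. The names `resolve_xmx` and `read_port` are hypothetical; the logic is intended to mirror the hunk: when the config file exists, a non-empty `YACY_JAVASTART_XMX` wins over the `javastart_Xmx` entry, and when the file is missing the script falls back to `-Xmx600m` and port 8090. The sketch assumes the stored values carry no leading dash (for example `Xmx600m`), since the script prepends one.

```shell
#!/bin/sh
# Hypothetical helper: prints the memory argument for a given config file.
resolve_xmx() {
    conf="$1"
    if [ -f "$conf" ]; then
        if [ -z "${YACY_JAVASTART_XMX:-}" ]; then
            # variable unset or empty: read the value from the config file
            j="$(grep javastart_Xmx "$conf" | sed 's/^[^=]*=//')"
            if [ -n "$j" ]; then printf '%s\n' "-$j"; fi
        else
            # environment variable takes precedence over the config file
            printf '%s\n' "-$YACY_JAVASTART_XMX"
        fi
    else
        # no config file yet: built-in default
        printf '%s\n' "-Xmx600m"
    fi
}

# Hypothetical helper: prints the configured port, or 8090 as the default.
read_port() {
    port=""
    if [ -f "$1" ]; then port="$(grep '^port=' "$1" | sed 's/^[^=]*=//')"; fi
    if [ -z "$port" ]; then port="8090"; fi
    printf '%s\n' "$port"
}

resolve_xmx "DATA/SETTINGS/yacy.conf"
read_port "DATA/SETTINGS/yacy.conf"
```

Note that, as in the hunk, the environment variable is only consulted when the config file exists.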
@ -224,43 +228,43 @@ cmdline="$JAVA $JAVA_ARGS -classpath $CLASSPATH net.yacy.yacy";
if [ $STARTUP -eq 1 ] #startup
then
cmdline="$cmdline -startup $parameter"
cmdline="$cmdline -startup $parameter"
elif [ $GUI -eq 1 ];then #gui
cmdline="$cmdline -gui $parameter"
cmdline="$cmdline -gui $parameter"
fi
if [ $DEBUG -eq 1 ] #debug
then
cmdline=$cmdline
cmdline=$cmdline
elif [ $FOREGROUND -eq 1 ];then # foreground process without remote JMX monitoring
cmdline=$cmdline
cmdline=$cmdline
elif [ $LOGGING -eq 1 ];then #logging
cmdline="$cmdline >> yacy.log & echo \$! > $PIDFILE"
cmdline="$cmdline >> yacy.log & echo \$! > $PIDFILE"
else
cmdline="$cmdline >/dev/null 2>/dev/null &"
cmdline="$cmdline >/dev/null 2>/dev/null &"
fi
if [ $PRINTONLY -eq 1 ];then
echo $cmdline
echo $cmdline
else
echo "****************** YaCy Web Crawler/Indexer & Search Engine *******************"
echo "**** (C) by Michael Peter Christen, usage granted under the GPL Version 2 ****"
echo "**** USE AT YOUR OWN RISK! Project home and releases: http://yacy.net/ ****"
echo "** LOG of YaCy: DATA/LOG/yacy00.log (and yacy<xx>.log) **"
echo "** STOP YaCy: execute stopYACY.sh and wait some seconds **"
echo "****************** YaCy Web Crawler/Indexer & Search Engine *******************"
echo "**** (C) by Michael Peter Christen, usage granted under the GPL Version 2 ****"
echo "**** USE AT YOUR OWN RISK! Project home and releases: http://yacy.net/ ****"
echo "** LOG of YaCy: DATA/LOG/yacy00.log (and yacy<xx>.log) **"
echo "** STOP YaCy: execute stopYACY.sh and wait some seconds **"
echo "** GET HELP for YaCy: join our community at https://searchlab.eu **"
echo "*******************************************************************************"
if [ $DEBUG -eq 1 ] #debug
then
# with exec the java process become the main process and will receive signals such as SIGTERM
exec $cmdline
elif [ $FOREGROUND -eq 1 ];then # foreground process without remote JMX monitoring
# with exec the java process become the main process and will receive signals such as SIGTERM
exec $cmdline
else
echo " >> YaCy started as daemon process. Administration at http://localhost:$PORT << "
eval $cmdline
if [ "$TAILLOG" -eq "1" -a ! "$DEBUG" -eq "1" ];then
sleep 1
tail -f DATA/LOG/yacy00.log
fi
fi
echo "*******************************************************************************"
if [ $DEBUG -eq 1 ] #debug
then
# with exec the java process become the main process and will receive signals such as SIGTERM
exec $cmdline
elif [ $FOREGROUND -eq 1 ];then # foreground process without remote JMX monitoring
# with exec the java process become the main process and will receive signals such as SIGTERM
exec $cmdline
else
echo " >> YaCy started as daemon process. Administration at http://localhost:$PORT << "
eval $cmdline
if [ "$TAILLOG" -eq "1" -a ! "$DEBUG" -eq "1" ];then
sleep 1
tail -f DATA/LOG/yacy00.log
fi
fi
fi
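
The way the script chooses between its launch modes can be sketched as a small function. `launch_command` is a hypothetical name; the branches follow the if/elif chain above. In debug and foreground mode the command is left untouched because it is run through `exec`, while the other two modes append redirections and `&` to the string, which is why they are run through `eval`.

```shell
#!/bin/sh
# Hypothetical helper: prints the final command string for a given mode.
# Arguments: base command, DEBUG, FOREGROUND, LOGGING (0 or 1 each), PID file.
launch_command() {
    base="$1"; debug="$2"; foreground="$3"; logging="$4"; pidfile="$5"
    if [ "$debug" -eq 1 ] || [ "$foreground" -eq 1 ]; then
        # run via exec: no redirection, output stays on the console
        printf '%s\n' "$base"
    elif [ "$logging" -eq 1 ]; then
        # background process, output appended to yacy.log, PID stored
        printf '%s\n' "$base >> yacy.log & echo \$! > $pidfile"
    else
        # default daemon mode: background process, output discarded
        printf '%s\n' "$base >/dev/null 2>/dev/null &"
    fi
}

launch_command "java net.yacy.yacy" 0 0 1 yacy.pid
```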

@ -23,9 +23,9 @@
package net.yacy.document.parser;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertNotNull;
import static org.junit.Assert.assertTrue;
import static org.junit.Assert.assertFalse;
import java.io.ByteArrayInputStream;
import java.io.File;
@ -45,7 +45,7 @@ import net.yacy.document.VocabularyScraper;
/**
* Unit tests for the {@link GenericXMLParser} class
*
*
* @author luccioman
*
*/
@ -58,13 +58,13 @@ public class GenericXMLParserTest {
@Before
public void setUp() {
this.parser = new GenericXMLParser();
parser = new GenericXMLParser();
}
/**
* Unit test for the GenericXMLParser.parse() function with some small XML
* test files.
*
*
* @throws Exception
* when an unexpected error occurred
*/
@ -77,7 +77,7 @@ public class GenericXMLParserTest {
FileInputStream inStream = new FileInputStream(new File(folder, fileName));
DigestURL location = new DigestURL("http://localhost/" + fileName);
try {
Document[] documents = this.parser.parse(location, "text/xml", null, new VocabularyScraper(), 0,
Document[] documents = parser.parse(location, "text/xml", null, new VocabularyScraper(), 0,
inStream);
assertNotNull("Parser result must not be null for file " + fileName, documents);
assertNotNull("Parsed text must not be empty for file " + fileName, documents[0].getTextString());
@ -90,7 +90,7 @@ public class GenericXMLParserTest {
}
/**
*
*
* @param parser
* generic xml parser instance. Must not be null.
* @param encodedXML
@ -123,10 +123,10 @@ public class GenericXMLParserTest {
/**
* Test UTF-8 charset detection
*
*
* @see RFC 7303 "UTF-8 Charset" example
* (https://tools.ietf.org/html/rfc7303#section-8.1)
*
*
* @throws Exception
* when an unexpected error occurred
*/
@ -138,7 +138,7 @@ public class GenericXMLParserTest {
*/
byte[] encodedXML = ("<?xml version=\"1.0\" encoding=\"utf-8\"?>" + UMLAUT_TEXT_TAG)
.getBytes(StandardCharsets.UTF_8);
testCharsetDetection(this.parser, encodedXML, "application/xml; charset=utf-8", StandardCharsets.UTF_8.name(),
testCharsetDetection(parser, encodedXML, "application/xml; charset=utf-8", StandardCharsets.UTF_8.name(),
"Maßkrügen");
/*
@ -146,18 +146,18 @@ public class GenericXMLParserTest {
* declaration
*/
encodedXML = ("<?xml version=\"1.0\"?>" + UMLAUT_TEXT_TAG).getBytes(StandardCharsets.UTF_8);
testCharsetDetection(this.parser, encodedXML, "application/xml; charset=utf-8", StandardCharsets.UTF_8.name(),
testCharsetDetection(parser, encodedXML, "application/xml; charset=utf-8", StandardCharsets.UTF_8.name(),
"Maßkrügen");
}
/**
* Test UTF-16 charset detection
*
*
* @see RFC 7303 "UTF-16 Charset" and
* "Omitted Charset and 16-Bit MIME Entity" examples
* (https://tools.ietf.org/html/rfc7303#section-8.2 and
* https://tools.ietf.org/html/rfc7303#section-8.4)
*
*
* @throws Exception
* when an unexpected error occurred
*/
@ -169,7 +169,7 @@ public class GenericXMLParserTest {
*/
byte[] encodedXML = ("<?xml version=\"1.0\" encoding=\"utf-16\"?>" + UMLAUT_TEXT_TAG)
.getBytes(StandardCharsets.UTF_16);
testCharsetDetection(this.parser, encodedXML, "application/xml; charset=utf-16", StandardCharsets.UTF_16.name(),
testCharsetDetection(parser, encodedXML, "application/xml; charset=utf-16", StandardCharsets.UTF_16.name(),
"Maßkrügen");
/*
@ -177,7 +177,7 @@ public class GenericXMLParserTest {
* XML declaration having only BOM (Byte Order Mark)
*/
encodedXML = ("<?xml version=\"1.0\"?>" + UMLAUT_TEXT_TAG).getBytes(StandardCharsets.UTF_16);
testCharsetDetection(this.parser, encodedXML, "application/xml; charset=utf-16",
testCharsetDetection(parser, encodedXML, "application/xml; charset=utf-16",
StandardCharsets.UTF_16BE.name(), "Maßkrügen");
/*
@ -186,22 +186,22 @@ public class GenericXMLParserTest {
*/
encodedXML = ("<?xml version=\"1.0\" encoding=\"utf-16\"?>" + UMLAUT_TEXT_TAG)
.getBytes(StandardCharsets.UTF_16);
testCharsetDetection(this.parser, encodedXML, "application/xml", StandardCharsets.UTF_16.name(), "Maßkrügen");
testCharsetDetection(parser, encodedXML, "application/xml", StandardCharsets.UTF_16.name(), "Maßkrügen");
/*
* Charset is omitted in both Content-Type HTTP header and XML
* declaration with BOM (Byte Order Mark)
*/
encodedXML = ("<?xml version=\"1.0\"?>" + UMLAUT_TEXT_TAG).getBytes(StandardCharsets.UTF_16);
testCharsetDetection(this.parser, encodedXML, "application/xml", StandardCharsets.UTF_16BE.name(), "Maßkrügen");
testCharsetDetection(parser, encodedXML, "application/xml", StandardCharsets.UTF_16BE.name(), "Maßkrügen");
}
/**
* Test ISO-8859-1 charset detection
*
*
* @see RFC 7303 "Omitted Charset and 8-Bit MIME Entity" example
* (https://tools.ietf.org/html/rfc7303#section-8.3)
*
*
* @throws Exception
* when an unexpected error occurred
*/
@ -213,7 +213,7 @@ public class GenericXMLParserTest {
*/
byte[] encodedXML = ("<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>" + UMLAUT_TEXT_TAG)
.getBytes(StandardCharsets.ISO_8859_1);
testCharsetDetection(this.parser, encodedXML, "application/xml", StandardCharsets.ISO_8859_1.name(),
testCharsetDetection(parser, encodedXML, "application/xml", StandardCharsets.ISO_8859_1.name(),
"Maßkrügen");
}
@ -221,10 +221,10 @@ public class GenericXMLParserTest {
* Test charset detection when the character encoding is omitted in
* Content-Type header, and content has a XML declaration with no encoding
* declaration
*
*
* @see RFC 7303 "Omitted Charset, No Internal Encoding Declaration" example
* (https://tools.ietf.org/html/rfc7303#section-8.5)
*
*
* @throws Exception
* when an unexpected error occurred
*/
@ -242,15 +242,15 @@ public class GenericXMLParserTest {
encodedXML = ("<?xml version=\"1.0\"?>"
+ "<text>In M&#x000FC;nchen steht ein Hofbr&#x000E4;uhaus, dort gibt es Bier in Ma&#x000DF;kr&#x000FC;gen</text>")
.getBytes(StandardCharsets.US_ASCII);
testCharsetDetection(this.parser, encodedXML, "application/xml", StandardCharsets.UTF_8.name(), "Maßkrügen");
testCharsetDetection(parser, encodedXML, "application/xml", StandardCharsets.UTF_8.name(), "Maßkrügen");
}
/**
* Test UTF-16BE charset detection
*
*
* @see RFC 7303 "UTF-16BE Charset" example
* (https://tools.ietf.org/html/rfc7303#section-8.6)
*
*
* @throws Exception
* when an unexpected error occurred
*/
@ -262,13 +262,13 @@ public class GenericXMLParserTest {
*/
byte[] encodedXML = ("<?xml version='1.0' encoding='utf-16be'?>" + UMLAUT_TEXT_TAG)
.getBytes(StandardCharsets.UTF_16BE);
testCharsetDetection(this.parser, encodedXML, "application/xml; charset=utf-16be",
testCharsetDetection(parser, encodedXML, "application/xml; charset=utf-16be",
StandardCharsets.UTF_16BE.name(), "Maßkrügen");
}
/**
* Test absolute URLs detection in XML elements attributes.
*
*
* @throws Exception
* when an unexpected error occurred
*/
@ -288,7 +288,7 @@ public class GenericXMLParserTest {
String charsetFromHttpHeader = HeaderFramework.getCharacterEncoding(contentTypeHeader);
DigestURL location = new DigestURL("http://localhost/testfile.xml");
try {
Document[] documents = this.parser.parse(location, contentTypeHeader, charsetFromHttpHeader,
Document[] documents = parser.parse(location, contentTypeHeader, charsetFromHttpHeader,
new VocabularyScraper(), 0, inStream);
assertEquals(1, documents.length);
Collection<AnchorURL> detectedAnchors = documents[0].getAnchors();
@ -304,7 +304,7 @@ public class GenericXMLParserTest {
/**
* Test absolute URLs detection in XML elements text.
*
*
* @throws Exception
* when an unexpected error occurred
*/
@ -324,7 +324,7 @@ public class GenericXMLParserTest {
String charsetFromHttpHeader = HeaderFramework.getCharacterEncoding(contentTypeHeader);
DigestURL location = new DigestURL("http://localhost/testfile.xml");
try {
Document[] documents = this.parser.parse(location, contentTypeHeader, charsetFromHttpHeader,
Document[] documents = parser.parse(location, contentTypeHeader, charsetFromHttpHeader,
new VocabularyScraper(), 0, inStream);
assertEquals(1, documents.length);
Collection<AnchorURL> detectedAnchors = documents[0].getAnchors();
@ -337,7 +337,7 @@ public class GenericXMLParserTest {
inStream.close();
}
}
/**
* Test parsing well-formed XML fragment (no XML declaration, no DTD or schema)
* @throws Exception when an unexpected error occurred
@ -351,18 +351,18 @@ public class GenericXMLParserTest {
String charsetFromHttpHeader = HeaderFramework.getCharacterEncoding(contentTypeHeader);
DigestURL location = new DigestURL("http://localhost/testfile.xml");
try {
Document[] documents = this.parser.parse(location, contentTypeHeader, charsetFromHttpHeader,
Document[] documents = parser.parse(location, contentTypeHeader, charsetFromHttpHeader,
new VocabularyScraper(), 0, inStream);
assertEquals(1, documents.length);
assertEquals("Node content1 Node content2", documents[0].getTextString());
} finally {
inStream.close();
}
}
}
/**
* Test URLs detection when applying limits.
*
*
* @throws Exception
* when an unexpected error occurred
*/
@ -376,7 +376,7 @@ public class GenericXMLParserTest {
+ "Home page : http://yacy.net - International Forum : "
+ "https://searchlab.eu "
+ "and this is a mention to a relative URL : /document.html</p>"
-    + "<p>Here are YaCy<a href=\"http://mantis.tokeek.de\">bug tracker</a> and <a href=\"http://www.yacy-websearch.net/wiki/\">Wiki</a>."
+    + "<p>Here are YaCy<a href=\"http://mantis.tokeek.de\">bug tracker</a> and <a href=\"https://wiki.yacy.net/index.php/\">Wiki</a>."
+ "And this is a relative link to another <a href=\"/document2.html\">sub document</a></p>"
+ "</body>" + "</html>";
@ -386,12 +386,12 @@ public class GenericXMLParserTest {
String charsetFromHttpHeader = HeaderFramework.getCharacterEncoding(contentTypeHeader);
DigestURL location = new DigestURL("http://localhost/testfile.xml");
try {
Document[] documents = this.parser.parseWithLimits(location, contentTypeHeader, charsetFromHttpHeader, new VocabularyScraper(), 0, inStream, Integer.MAX_VALUE, Long.MAX_VALUE);
Document[] documents = parser.parseWithLimits(location, contentTypeHeader, charsetFromHttpHeader, new VocabularyScraper(), 0, inStream, Integer.MAX_VALUE, Long.MAX_VALUE);
assertEquals(1, documents.length);
assertFalse(documents[0].isPartiallyParsed());
assertTrue(documents[0].getTextString().contains("And this is a relative link"));
Collection<AnchorURL> detectedAnchors = documents[0].getAnchors();
assertNotNull(detectedAnchors);
assertEquals(5, detectedAnchors.size());
@ -399,22 +399,22 @@ public class GenericXMLParserTest {
assertTrue(detectedAnchors.contains(new AnchorURL("http://yacy.net")));
assertTrue(detectedAnchors.contains(new AnchorURL("https://searchlab.eu")));
assertTrue(detectedAnchors.contains(new AnchorURL("http://mantis.tokeek.de")));
-assertTrue(detectedAnchors.contains(new AnchorURL("http://www.yacy-websearch.net/wiki/")));
+assertTrue(detectedAnchors.contains(new AnchorURL("https://wiki.yacy.net/index.php/")));
} finally {
inStream.close();
}
/* Links limit exceeded */
inStream = new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8.name()));
try {
Document[] documents = this.parser.parseWithLimits(location, contentTypeHeader, charsetFromHttpHeader,
Document[] documents = parser.parseWithLimits(location, contentTypeHeader, charsetFromHttpHeader,
new VocabularyScraper(), 0, inStream, 2, Long.MAX_VALUE);
assertEquals(1, documents.length);
assertTrue(documents[0].isPartiallyParsed());
assertTrue(documents[0].getTextString().contains("Home page"));
assertFalse(documents[0].getTextString().contains("And this is a relative link"));
Collection<AnchorURL> detectedAnchors = documents[0].getAnchors();
assertNotNull(detectedAnchors);
assertEquals(2, detectedAnchors.size());
@ -423,7 +423,7 @@ public class GenericXMLParserTest {
} finally {
inStream.close();
}
/* Bytes limit exceeded */
StringBuilder xhtmlBuilder = new StringBuilder("<?xml version=\"1.0\" encoding=\"UTF-8\" ?>")
.append("<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">")
@ -436,25 +436,25 @@ public class GenericXMLParserTest {
.append("Home page : http://yacy.net - International Forum : ")
.append("https://searchlab.eu ")
.append("and this is a mention to a relative URL : /document.html</p>");
/* Add some filler text to reach a total size beyond SAX parser internal input stream buffers */
while(xhtmlBuilder.length() < 1024 * 20) {
xhtmlBuilder.append("<p>Some text to parse</p>");
}
int firstBytes = xhtmlBuilder.toString().getBytes(StandardCharsets.UTF_8.name()).length;
-xhtmlBuilder.append("<p>Here are YaCy<a href=\"http://mantis.tokeek.de\">bug tracker</a> and <a href=\"http://www.yacy-websearch.net/wiki/\">Wiki</a>.")
+xhtmlBuilder.append("<p>Here are YaCy<a href=\"http://mantis.tokeek.de\">bug tracker</a> and <a href=\"https://wiki.yacy.net/index.php/\">Wiki</a>.")
.append("And this is a relative link to another <a href=\"/document2.html\">sub document</a></p>")
.append("</body></html>");
inStream = new ByteArrayInputStream(xhtmlBuilder.toString().getBytes(StandardCharsets.UTF_8.name()));
try {
Document[] documents = this.parser.parseWithLimits(location, contentTypeHeader, charsetFromHttpHeader, new VocabularyScraper(), 0, inStream, Integer.MAX_VALUE, firstBytes);
Document[] documents = parser.parseWithLimits(location, contentTypeHeader, charsetFromHttpHeader, new VocabularyScraper(), 0, inStream, Integer.MAX_VALUE, firstBytes);
assertEquals(1, documents.length);
assertTrue(documents[0].isPartiallyParsed());
assertTrue(documents[0].getTextString().contains("and this is a mention to a relative URL"));
assertFalse(documents[0].getTextString().contains("And this is a relative link to another"));
Collection<AnchorURL> detectedAnchors = documents[0].getAnchors();
assertNotNull(detectedAnchors);
assertEquals(3, detectedAnchors.size());
