Background: some user report problem with connecting/crawling some sites via https which require SNI support (by default switched off in YaCy). On the other hand systems not demanding SNI support are sometimes not properly configured and due to a bug/feature in java 1.7 connection is aborted. The later is more often the case, so the default is still fine. With the java start parameter expert user can no alter the startparameter to -Djsse.enableSNIExtension=true (java default) if they crawl more hosts requiring SNI support.
The alternative to let YaCy try both during https handshake (deep inside the httpclient) is not pursut at this time.
*) will try other ports if YaCy standard ports are not available
*) distinguish between internal and external port (not sure if this
works 100%)
Still to add: propery in config to enter own external port (in case of
manually configured NAT)
tested with IE11 and Firefox 32 (change worked for both to show 2nd line without cutting off height)
+fix charset parameter in metadataImageParser
+update start errMsgTxt to "java 1.7"
removed preferred IPv4 in start options and added a new field IP6 in
peer seeds which will contain one or more IPv6 addresses. Now every peer
has one or more IP addresses assigned, even several IPv6 addresses are
possible. The peer-ping process must check all given and possible IP
addresses for a backping and return the one IP which was successful when
pinging the peer. The ping-ing peer must be able to recognize which of
the given IPs are available for outside access of the peer and store
this accordingly. If only one IPv6 address is available and no IPv4,
then the IPv6 is stored in the old IP field of the seed DNA.
Many methods in Seed.java are now marked as @deprecated because they had
been used for a single IP only. There is still a large construction site
left in YaCy now where all these deprecated methods must be replaced
with new method calls. The 'extra'-IPs, used by cluster assignment had
been removed since that can be replaced with IPv6 usage in p2p clusters.
All clusters must now use IPv6 if they want an intranet-routing.
found a situation after crash (reboot) with existing running semaphore but YaCy not running.
Ping generated exception which finally deleted the conf file (during pre-read procedure)
- change to ping (catch exception solved it)
- additionally removed delete yacy.conf file (if needed we need to make a backup)
besides adjustments in code it makes the servlet settings in web.xml significant.
This applies to solr, gsa and proxy servlet. There is no longer a default setup in code during init (as jetty 9 checks for double definition).
- move htroot exist check from old httpdfilehandler to startup, remove from filehandler and legacy proxyhandler
- use SwitchboardConstant.htroot where appropriate
- useful for remote installation to set any config file property
- multipe parameter can be set at once, on Windows enclose parameter in doublequotes
- special handling "adminAccount=adminuser:adminpwd" sets adminusername and md5 encoded admin-pwd
- adjusted windows startbatch to allow command line parameter handling
- remove not needed classpath calculation from startYACY_debug.bat
introduced, it was also used for search facets. The generic search
facets are now deduced from generic solr fields which makes jena as tool
for facet semantics superfluous.
request into a separate thread and ignores the furthure result of a
request if that does not answer within the requested time-out. This is a
try to solve a problem with the peer-ping, which hangs whenever a peer
appears to be dead or blocked.
- the admin user name can be configured, in apiExec calls the default "admin" username is used.
TODO: the bin/apicall.sh script should likely take that into account.
via Jetty IPAccessHandler to allow only configured IP's to access.
Handler is only loaded if a restriction is configured.
Since IPAcessHandler (Jetty 8) does not support IPv6 system property java.net.preferIPv4Stack=true
Testing showed system.setProperty seems to be sensitive to point of calling (earliest possible time seems to be best = early in yacy.main).
Moved the "isrunning..." just open browser check also to the new routine to preread the yacy.config only once.
- userDB is not sync'ed with Jetty credentials as of now only the std. admin account can login
switched initial browser open with ssl active back to std. http port
!!! attention !!! to make sure YaCy can start, https will be disabled if port 8443 is used
- added ping test for above to migration
- as of now port for https is hardcoded to default 8443
- if not urgend required I'd leave it this way (it's standard) to use different ports for http and https
- post https port on ConfigBasic.html (if active)
- introduce a YaCyHttp interface to modulize/separate http server
- adjust the Jetty version specific implementation part (in package net.yacy.http)
- putting the version specific code in classes starting with Jetty8xxxx
- moved existing Jetty9xxx implementation into a test class (to keep the code)
- adjust build to the changed jars
- make use of the introduced YaCyHttpServer interface in related htroot servlets
- adjust other test cases/classes
ServerSideIncludes and servlet return values need further work (for working jetty integration)
- TODO: added nasty quickfix to allow SSI - needs further work
- TODO: YaCy servlet return values/parameters are not handled
in intranets and the internet can now choose to appear as Googlebot.
This is an essential necessity to be able to compete in the field of
commercial search appliances, since most web pages are these days
optimized only for Google and no other search platform any more. All
commercial search engine providers have a built-in fake-Google User
Agent to be able to get the same search index as Google can do. Without
the resistance against obeying to robots.txt in this case, no
competition is possible any more. YaCy will always obey the robots.txt
when it is used for crawling the web in a peer-to-peer network, but to
establish a Search Appliance (like a Google Search Appliance, GSA) it is
necessary to be able to behave exactly like a Google crawler.
With this change, you will be able to switch the user agent when portal
or intranet mode is selected on per-crawl-start basis. Every crawl start
can have a different user agent.
jdk-based logger tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a main performance issue. To overcome
this problem, this is a add-on to jdk logging to put log entries on a
concurrent message queue and log the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
- move setting of system property solr.directoryFactory=solr.MMapDirectoryFactory to solrcore.properties
- add check of os.arch for 64bit system, if it fails use default/solrcore.x86.properties (if exists) as solrcore.properties
reason: on 32bit MMapDirectoryFactory may fail with.....
Caused by: java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:849)
at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)
- to prevent following log if YaCy was previously not properly shutdown
E ... STARTUP WARNING: the file C:\src\git\yacy-rc1\DATA\yacy.running exists, this usually means that a YaCy instance is still running
E ... STARTUP FATAL ERROR: java.util.concurrent.TimeoutException
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException
at net.yacy.cora.protocol.TimeoutRequest.call(TimeoutRequest.java:91)
at net.yacy.cora.protocol.TimeoutRequest.ping(TimeoutRequest.java:112)
at net.yacy.yacy.startup(yacy.java:200)
at net.yacy.yacy.main(yacy.java:638)
Caused by: java.util.concurrent.TimeoutException
- adjust Netbeans path (to solr4.1.jars)
YaCy; YaCy is still running and the user additionally expect that
another doubleclick on the YaCy icon simply opens the search windows
(again) I decided to add a function that complies to the expectation to
the user: simply open the browser pop-up page again if the user starts
YaCy while YaCy is still running.