The reproducible-builds project invests a lot of work
to make builds reproducible. This is a security property. It allows
comparing the binaries built on different builder machines.
If they are identical, it means that either the builds have not
been manipulated or an attacker managed to attack all builder
machines in exactly the same way.
One problem that the reproducible-builds project often sees is
that projects include the build time in their binaries. This
makes builds unreproducible for apparently no reason. The build
date should not be of interest since binaries built on different
dates but from the same source code should not be different.
Thus I decided to remove the build date instead of re-implementing
the functionality without the GitRev task. Anyway, the reported
date was not the build date but the date of the last git commit,
which is even less informative. The git commit ID would have
informational value but should only be relevant for "nightly builds".
PKGMANAGER is always false, thus the Java code wrapped in
if statements for this property is dead code and can also
be removed.
The Debian packaging removed in c4659f0fb0
did set the PKGMANAGER property to true. When we do distro
packages again, we can revisit this commit and redo it with
property files instead.
RESTARTCMD is only used inside that dead code.
DESTDIR is never used, not even in build.xml.
to be able to use/reuse Ant targets where the task has not been implemented in the Gradle build.
- use the import to include the compile of htroot as first important task
! it is possible that the first build fails on compile of GitRevTask.jar !
! solution/workaround -> use "ant all" once to compile GitRevTask.jar !
- adjusted build.xml a little
- split compile-core into compile-core and compile-htroot to have a target for htroot compilation only
- set build-path to reuse Gradle's build directory
- (fix javadoc failure)
- changed the filtered-copy of to ! the build path :-(
as the current approach (copy, delete, exclude) is complicated and not worth migrating,
used a simple/straightforward approach (using a file as copy source)
To make seed upload (in /Settings_p.html?page=seed page) with SCP easier
when the user specifies a remote target directory path.
See report by @vikulin in issue #227
Especially for Turkish-speaking users using "tr" as their system default
locale: strings for technical stuff (URLs, tag names, constants...)
must not be lower-cased with the default locale, as 'I' does not become
'i' like in other locales such as "en", but becomes 'ı'.
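A small sketch of the problem and the usual remedy, which is to pass a
fixed locale such as Locale.ROOT when lower-casing technical strings.
LocaleSafeCase and lowerTechnical are hypothetical names chosen for
this example only.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of the YaCy code base.
final class LocaleSafeCase {

    // Lower-cases technical strings (URLs, tag names, constants)
    // independently of the system default locale.
    static String lowerTechnical(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Shows what happens under a Turkish locale: 'I' becomes dotless 'ı'.
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }
}
```

With these helpers, lowerTechnical("TITLE") yields "title", while
lowerTurkish("TITLE") yields "tıtle", which no longer matches a tag
name compared against "title".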
The default redirection strategy when using HTTPClient directly is
incorrect when the redirection is cross-host (the original Host header is
still sent when requesting the redirected location).
YaCy LoaderDispatcher handles redirections properly, thus release
archive files using redirected URLs (such as the URLs on a GitHub
Release page) are successfully downloaded.
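A minimal sketch of the cross-host case, assuming the redirect target
arrives in a Location header. RedirectTarget and its methods are
hypothetical names, not YaCy or HTTPClient API, and the example.org
hosts are placeholders. The point is only that the Host header has to
be derived from the redirected URI, not from the original one.

```java
import java.net.URI;

// Hypothetical helper for illustration; not YaCy or HTTPClient API.
final class RedirectTarget {

    // Resolves a Location header value (absolute or relative)
    // against the URI that was originally requested.
    static URI resolve(URI requested, String locationHeader) {
        return requested.resolve(locationHeader);
    }

    // Builds the Host header value for the URI that is actually requested.
    static String hostHeader(URI target) {
        int port = target.getPort();
        return port == -1 ? target.getHost() : target.getHost() + ":" + port;
    }
}
```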
When a downloaded archive release is corrupted, empty, or cannot be
opened for any reason, the update script must not be launched because it
erases the existing lib/*.jar libraries.
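Such a guard could look roughly like the following sketch. It assumes
the release is a gzip-compressed archive and only checks that the gzip
stream is non-empty and readable to the end; it does not inspect the
tar entries. ReleaseArchiveCheck is a hypothetical name, not the class
used by YaCy.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not part of the YaCy code base.
final class ReleaseArchiveCheck {

    // Returns true only if the stream is a gzip stream that can be read
    // completely and contains at least one byte of payload.
    static boolean isReadableGzip(InputStream raw) {
        try (GZIPInputStream in = new GZIPInputStream(raw)) {
            byte[] buffer = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
            return total > 0;
        } catch (IOException e) {
            // Empty, truncated or non-gzip input ends up here.
            return false;
        }
    }
}
```

The update script would then be launched only when the check returns
true, so a broken download leaves the existing libraries in place.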
- removed the default version blacklist regex from init (for future versions)
!!! left existing update blacklist setting untouched !!!
(existing installations wanting autoupdate for 1.71 need to change the blacklist in ConfigUpdate_p.html)
- moved old blacklist patch to
request into a separate thread and ignores the further result of a
request if that does not answer within the requested time-out. This is an
attempt to solve a problem with the peer-ping, which hangs whenever a peer
appears to be dead or blocked.
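The general technique can be sketched with an executor and a bounded
Future.get. This is an assumption about the shape of the solution, not
the actual YaCy code; TimedRequest and callWithTimeout are hypothetical
names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of the YaCy code base.
final class TimedRequest {

    // Daemon threads, so a hanging request cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "timed-request");
        t.setDaemon(true);
        return t;
    });

    // Runs the request in a separate thread and returns the fallback
    // if no answer arrives within the time-out or the request fails.
    static <T> T callWithTimeout(Callable<T> request, long timeoutMillis, T fallback) {
        Future<T> future = POOL.submit(request);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // any later result is ignored
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback;
        }
    }
}
```

The caller returns after at most the time-out, even if the remote peer
never answers.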
- the admin user name can be configured, in apiExec calls the default "admin" username is used.
TODO: the bin/ script should likely take that into account.
all unique links! This made it necessary to adapt a large portion of the
parser and link-processing classes so that they carry a different
type of link collection, one that carries property attributes which are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other URL classes, DigestURI and MultiProtocolURI, have been renamed
and refactored to fit into a new document package schema,
- cleanup of net.yacy.cora.document package and refactoring
in intranets and the internet can now choose to appear as Googlebot.
This is essential to be able to compete in the field of
commercial search appliances, since most web pages are these days
optimized only for Google and no longer for any other search platform.
All commercial search engine providers have a built-in fake-Google User
Agent to be able to get the same search index as Google can. Without
the option of not obeying robots.txt in this case, no
competition is possible any more. YaCy will always obey the robots.txt
when it is used for crawling the web in a peer-to-peer network, but to
establish a Search Appliance (like a Google Search Appliance, GSA) it is
necessary to be able to behave exactly like a Google crawler.
With this change, you will be able to switch the user agent when portal
or intranet mode is selected on a per-crawl-start basis. Every crawl start
can have a different user agent.
JDK-based loggers tend to block
at java.util.logging.Logger.log( in concurrent
environments. This makes logging a major performance issue. To overcome
this problem, this is an add-on to JDK logging that puts log entries on a
concurrent message queue and logs the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
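The queueing idea can be sketched as follows, assuming the "separate
process" is a single worker thread inside the same JVM. QueuedLogger is
a hypothetical name and a simplification, not the class added by this
commit.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical, simplified sketch; not the class added by this commit.
final class QueuedLogger {

    // Sentinel record that tells the worker to stop.
    private static final LogRecord POISON = new LogRecord(Level.OFF, "");

    private final BlockingQueue<LogRecord> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    QueuedLogger(Logger target) {
        // Only this thread ever calls Logger.log, one record at a time.
        worker = new Thread(() -> {
            try {
                LogRecord r;
                while ((r = queue.take()) != POISON) {
                    target.log(r);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "queued-logger");
        worker.setDaemon(true);
        worker.start();
    }

    // Called from any thread; only enqueues and returns immediately.
    void log(Level level, String message) {
        queue.add(new LogRecord(level, message));
    }

    // Flushes all queued records and stops the worker.
    void close() {
        queue.add(POISON);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Producers never contend on the JDK logger itself; they only touch the
queue, and the records are written in the order they were enqueued.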
The default schema uses only some of them and the resulting search index
now has the following properties:
- webgraph size will have about 40 times as many entries as the default
- the complete index size will increase and may be about double
the current size
As testing showed, not much indexing performance is lost. The default
index will be smaller (moved fields out of it); thus searching
can be faster.
The new index allows some old parts of YaCy to be removed,
i.e. specialized webgraph data and the noload crawler. The new index
will make it possible to:
- search within link texts of linked but not indexed documents (about 20
times the size of the document index!)
- get a very detailed link graph
- enhance ranking using a complete link graph
To get full access to the new index, the API to Solr now has two
access points: one with attribute core=collection1 for the default
search index and one with core=webgraph for the new webgraph search
index. This is also available for p2p operation but client access is not yet