java.lang.NullPointerException
at net.yacy.kelondro.index.RowCollection.<init>(RowCollection.java:97)
at net.yacy.kelondro.index.RowSet.<init>(RowSet.java:48)
at net.yacy.kelondro.rwi.ReferenceContainer.<init>(ReferenceContainer.java:58)
at net.yacy.kelondro.rwi.ReferenceIterator.next(ReferenceIterator.java:69)
at net.yacy.kelondro.rwi.ReferenceIterator.next(ReferenceIterator.java:43)
at net.yacy.kelondro.blob.ArrayStack.merge(ArrayStack.java:1023)
at net.yacy.kelondro.blob.ArrayStack.mergeWorker(ArrayStack.java:922)
at net.yacy.kelondro.blob.ArrayStack.mergeMount(ArrayStack.java:869)
at net.yacy.kelondro.rwi.IODispatcher$MergeJob.merge(IODispatcher.java:267)
at net.yacy.kelondro.rwi.IODispatcher$MergeJob.access$300(IODispatcher.java:239)
at net.yacy.kelondro.rwi.IODispatcher.run(IODispatcher.java:180)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7822 6c8d7289-2bf4-0310-a012-ef5d649a1542
it parses text files with the following syntax:
- all lines beginning with '##' are comments
- all non-empty lines not beginning with '#' are keyword lines
- all lines beginning with '#' and where the second character is not '#' are commented-out keyword lines
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7808 6c8d7289-2bf4-0310-a012-ef5d649a1542
for example search for "passwd /ftp". This can also be done with /http /https and /smb
- fixed some search throttling processes that should protect your peer against search DoS or strong search load
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7794 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added a memory limitation in the zip parser and the pdf parser
- added a search throttling: if there are too many search queries are still to be computed, then new requests are not accepted for some time. if after a one second still no space is there to perform another search, the search terminates with no results. this case should only happen in case of DoS-like situations and in case of strong load on a peer like if it is integrated in metager.
- added a search cache deletion process that removes search requests in case that throttling happens
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7766 6c8d7289-2bf4-0310-a012-ef5d649a1542
- forced a possible short memory status when a search is started to flush caches that may cause search-heaps with resource contention effects
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7747 6c8d7289-2bf4-0310-a012-ef5d649a1542
used a ASCII String <-> byte[] conversion wherever possible. Many Strings in YaCy are hashes which are pure ASCII (base64 hashes).
The new ASCII String <-> byte[] conversion method have less computation overhead than the UTF8 conversion.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7746 6c8d7289-2bf4-0310-a012-ef5d649a1542
- many speed/performance hacks
- added solr charding and new charding web interface
- added option to switch off the yacy index when using solr
- added new fail-url categories which are used to make a distinction which fail-urls to be sent to solr
- refactoring/renaming of some method names to distinguish host/url hashes better
- a large number of bug/npe fixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7738 6c8d7289-2bf4-0310-a012-ef5d649a1542
- less threads for blocking threads
- disable all threads for DHT transmission for networks with zero peers
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7737 6c8d7289-2bf4-0310-a012-ef5d649a1542
This time it works like this:
- each peer provides its ranking information using the yacy/idx.json servlet
- peers with more than 1 GB ram will load this information from all other peers, combine that into one ranking table and store it locally. This happens during the start-up of the peer concurrently. The new generated file with the ranking information is at DATA/INDEX/<network>/QUEUES/hostIndex.blob
- this index is then computed to generate a new fresh ranking table. Peers which can calculate their own ranking table will do that every start-up to get latest feature updates until the feature is stable
- I computed new ranking tables as part of the distribition and commit it here also
- the YBR feature must be enabled manually by setting the YBR value in the ranking servlet to level 15. A default configuration for that is also in the commit but it does not affect your current installation only fresh peers
- a recursive block rank refinement is implemented but disabled at this point. it needs more testing
Please play around with the ranking settings and see if this helped to make search results better.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7729 6c8d7289-2bf4-0310-a012-ef5d649a1542
This servlet currently only serves for indexes to the web structure hosts. It can be tested by calling
http://localhost:8090/yacy/idx.json?object=host
This yacy protocol servlet is the first one that returns JSON code and that also shows index entries in a readable format. This will make the development of API applications much easier. This is also an example implementation for possible json versions of the other existing YaCy protocol interfaces.
The main purpose of this new feature is to provide a distributed block rank collection feature. Creating a block rank is very difficult if the forward-link data is first collected and then one peer must create a backward-link index. This interface provides already a partial backward index and therefore a collection of all these indexes needs only to be joined which is very easy. The result should be the computation of new block rank tables that all peers can perform.
To reduce load from peers this servlet buffers all data and refreshes it only once in 12 hours. This very slow update cycle is needed because the interface will be called round-robin from all peers once after start-up.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7724 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added transport of string-search pattern to remote search protocol
- fixed a problem parsing snippets with a '-' inside
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7700 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added option to crawler to send error-URLs to solr
- changed solr scheme slightly (no multi-value fields where no multi values are)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7693 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added more properties to solr index
- refactoring
- more constants in switchboard
- fix for some NPEs
- recognition of more images
- removed synchronization in HandleMap (obviously not necessary?)
- added a nolocal configuration to remove excessive dns lookup (works only on allip - default off). Indexes produced with this setting are all flagged with 'local' and are (on purpose) not usable for freeworld because they will be rejected as beeing local.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7672 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added deletion of solr index in IndexControlRWIs
- added asynchronous adding of large url lists (happens when crawls are startet with file)
- fixed npe in Image display
- replaced language warning with fine logging
- added a domain name cache in Domains that helps to speed up the isLocal property (less DNS lookups)
- added a new storage class for this new cache: KeyList. The domain key list is stored in DATA/WORK/globalhosts.list
- added concurrent solr updates and chunked transfers (50 documents until a commit is done) for high-speed feeding (> 40000 ppm)
- fixed a bug in content scraper that chopped off large parts of crawl lists (using crawl start from file)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7666 6c8d7289-2bf4-0310-a012-ef5d649a1542
search is now done using verify=false (instead of verify=cacheonly) which will cause that much more targets can be found.
This showed a bug where no location information was used from the metadata (and other metadata information) if cache=false is requested. The bug was fixed.
Added also location parsing from wikimedia dumps. A wikipedia dump can now also be a source for a location search.
Fixed many smaller bugs in connection with location search.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7657 6c8d7289-2bf4-0310-a012-ef5d649a1542
the httpclient is required by solrj but no class from solrj is used which references to httpclient-3.1
Instead the YaCy http client library based on the apache http client 4.1 is used using a wrapper class
which is in net.yacy.cora.services.federated.solr.SolrHTTPClient
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7655 6c8d7289-2bf4-0310-a012-ef5d649a1542
YaCy supports now the storage to remote solr indexes.
More federated storage (and search) methods may follow.
The remote index scheme is the same as produced by the SolrCell; see
http://wiki.apache.org/solr/ExtractingRequestHandler
Because this default scheme is used, the default example scheme can be used as solr configuration
This is also the same scheme that solr uses if documents are imported with apache tika.
federated solr storage is switched off by default.
To use this, do the following:
- set federated.service.solr.indexing.enabled = true
- download solr from http://www.apache.org/dyn/closer.cgi/lucene/solr/
- extract the solr (3.1) package, 'cd example' and start solr with 'java -jar start.jar'
- start yacy and then start a crawler. The crawler will fill both, YaCy and solr indexes.
- to check whats in solr after indexing, open http://localhost:8983/solr/admin/
Until now it is not possible to use the solr index to search with YaCy in that solr index.
This functionality is now available for two reasons:
1) to compare the functionality of Solr and YaCy and to compare the search speed
2) to use YaCy as a search appliance for people who need a crawler or other source harvesting methods
that YaCy provides (like dublin core reading, wikimedia dump reading, rss feed reader etc) if people still
want to use solr instead of YaCy.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7654 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added parser for in-text appearing geo-locations
- added geo-locations to rss search result
- added evaluation of metadata-attached geo-locations in yacysearch_location to show search results within a map
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7631 6c8d7289-2bf4-0310-a012-ef5d649a1542
- extended metadata information in index with geolocalisation
- added display of location in yacydoc and ViewFile
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7629 6c8d7289-2bf4-0310-a012-ef5d649a1542
* log requires poison to finish, so Base64Order main-function doesn't finish, when called from debian configure script
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7616 6c8d7289-2bf4-0310-a012-ef5d649a1542
"THREADS WITH STATES: LOCK FOR OTHERS"
will show only such threads that lock other threads. This is the 'opposite part' of the blocked threads.
Because that this uses a thread dump that is produced with a kill -3 on the PID of the process and such thread dumps are written by the Java core outside of System.out and Sytem.err it is necessary to read the dump from a log in the file system. Such a log is only written if YaCy is started with startYACY.sh on a linux system. That means:
this feature is only available on linux and Mac OS X if YaCy is started with ./startYACY.sh -l
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7595 6c8d7289-2bf4-0310-a012-ef5d649a1542
- new concurrent score map using atom operation from java concurrency classes
- redesigned difference beween StaticScore and Dynamic Score into ScoreMap and ReversibleScoreMap allowed that many classes can now use simple ScoreMap Objects which can be used better in concurrent environments using the ConcurrentScoreMap
- switched from DynamicScore to ConcurrentScoreMap usage wherever possible
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7586 6c8d7289-2bf4-0310-a012-ef5d649a1542
1) if the file is changed for a re-crawl this is not reflected in the steering because it would take the previously uploaded crawl start file
2) browsers do not submit the full path of the selected file even if this path is shown in the input field because of security reasons. There is no work-around or hack to make the submission of the full path possible
- fixed deletion of crawl start point urls in crawl stack and balancer double-check
- fixed a problem with steering self-call (no resolving of localhost)
- added more logging for the crawler to supervise why crawl urls are not taken by the loader
- added a javascript onload-function to select domain restriction in all cases where a crawl is started from a file or from a url
- fixed the restrict-to-domain pattern computation, added a 'www.'-prefix and added this functionality also to a crawl start from file
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7574 6c8d7289-2bf4-0310-a012-ef5d649a1542
STARTUP: Trying to load logging configuration from file /Data/workspace1/yacy/DATA/LOG/yacy.logging
Can't load log handler "net.yacy.kelondro.logging.ConsoleOutErrHandler"
java.lang.ClassNotFoundException: net.yacy.kelondro.logging.ConsoleOutErrHandler
java.lang.ClassNotFoundException: net.yacy.kelondro.logging.ConsoleOutErrHandler
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
at java.util.logging.LogManager$3.run(LogManager.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at java.util.logging.LogManager.loadLoggerHandlers(LogManager.java:346)
at java.util.logging.LogManager.initializeGlobalHandlers(LogManager.java:898)
at java.util.logging.LogManager.access$900(LogManager.java:130)
at java.util.logging.LogManager$RootLogger.getHandlers(LogManager.java:979)
at java.util.logging.Logger.log(Logger.java:454)
at java.util.logging.Logger.doLog(Logger.java:480)
at java.util.logging.Logger.log(Logger.java:503)
at net.yacy.kelondro.logging.Log$logRunner.run(Log.java:332)
Can't load log handler "net.yacy.kelondro.logging.LogalizerHandler"
java.lang.ClassNotFoundException: net.yacy.kelondro.logging.LogalizerHandler
java.lang.ClassNotFoundException: net.yacy.kelondro.logging.LogalizerHandler
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
at java.util.logging.LogManager$3.run(LogManager.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at java.util.logging.LogManager.loadLoggerHandlers(LogManager.java:346)
at java.util.logging.LogManager.initializeGlobalHandlers(LogManager.java:898)
at java.util.logging.LogManager.access$900(LogManager.java:130)
at java.util.logging.LogManager$RootLogger.getHandlers(LogManager.java:979)
at java.util.logging.Logger.log(Logger.java:454)
at java.util.logging.Logger.doLog(Logger.java:480)
at java.util.logging.Logger.log(Logger.java:503)
at net.yacy.kelondro.logging.Log$logRunner.run(Log.java:332)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7559 6c8d7289-2bf4-0310-a012-ef5d649a1542
- background thread for bookmark initialization: this uses a DNS lookup which may cause long waiting times during startup
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7554 6c8d7289-2bf4-0310-a012-ef5d649a1542
Reason: if this is not used in StringBody-Class initialization, a default charset name is parsed.
This is a synchronized process and all classes using default charsets synchronize at that point
Synchronization is omitted if this class is used
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7530 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added more instances of formatter objects to different classes to make them independent in case of lockings that may applay during synchronization of the date formatter object (date formatting is not thread-safe and must be synchronized therefore)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7528 6c8d7289-2bf4-0310-a012-ef5d649a1542
- less synchronizations
- better speed
..at most important and commonly used classes: http headers, url parsing and html parsing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7526 6c8d7289-2bf4-0310-a012-ef5d649a1542
- filtering out too old peers when reading seed lists (limit is now 240 minutes)
- added concurrent host names resolving in front of the http client because the http client uses the java built-in DNS resolve which is not multithreading-safe (i have seen deadlocks in thread dumps showing that this bug in jdk is still there)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7515 6c8d7289-2bf4-0310-a012-ef5d649a1542
- YaCy runs on Windows
- YaCy is started with the -gui option
in all other cases YaCy runs in headless mode
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7481 6c8d7289-2bf4-0310-a012-ef5d649a1542
- if YaCy is started with the option -gui, it is not in headless mode. Then the java 1.6 browse method is used if all other methods fail
- in linux, the path /etc/alternatives/www-browser is used if no firefox is installed
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7480 6c8d7289-2bf4-0310-a012-ef5d649a1542
Please see new coments in yacy.network.freeworld.unit for details of the new DHT selection methods.
The number of maximum peers is now not fixed to a specific number but may increase with
- the partition exponent
- the number of redundant peers
- the robinson burst percentage
- the multiword burst percentage
The maximum can then be the number of senior peers (all visible peers).
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7479 6c8d7289-2bf4-0310-a012-ef5d649a1542
- some restructuring of the document counting and logging structures was necessary
- better abstraction of CrawlProfiles
- added deletion of logs to the index deletion option (if the index is deleted using the servlets) which is necessary to reset the domain counters for the page limitation
- more refactoring to get the LibraryProvider more clean
- some refactoring of the Condenser class
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7478 6c8d7289-2bf4-0310-a012-ef5d649a1542
* yacy crawls local domains also, if no password is set (the interface is already protected)
* it's not required anymore, to set a password in intranet mode
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7436 6c8d7289-2bf4-0310-a012-ef5d649a1542
- cleaned up (removed special code and documentation for 27c3)
- added remote search functions to be used within cora
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7420 6c8d7289-2bf4-0310-a012-ef5d649a1542
- better document names
- fixed problem with ftp crawling
- added automatic removal of search results from services that are not online according to the latest network scan: this does not delete the index but just does not show them. after the next network scan when the server is available again, the results are again showed.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7385 6c8d7289-2bf4-0310-a012-ef5d649a1542
- multiple hosts for environment scans can be given (comma-separated)
- each service (ftp, smb, http, https) for the scan can be selected
- the scan result can be accumulated or refreshed each time a network scan is made
- a scheduler was added to repeat a scan and add all found urls to the indexer automatically
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7378 6c8d7289-2bf4-0310-a012-ef5d649a1542
- integrated new parser into loader processes: enrich document parser
- fixed a concurrent modification exception in kelondro iterator
- hand-over of document size from crawler to indexer
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7374 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added information about granted access
- enhanced servlet design
- added submit-feedback (because it is a long-running task)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7372 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added a new queue 'noload' which can be filled with urls where it is already known that the content cannot be loaded. This may be because there is no parser available or the file is too big
- the noload queue is emptied with the parser process which indexes the file names only
- the 'start from file' functionality now also reads from ftp crawler
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7368 6c8d7289-2bf4-0310-a012-ef5d649a1542
- when a site-crawl for ftp sites is now started, then a special directory-tree harvester gets the complete directory structure of a ftp server at once
- the harvester runs concurrently and feeds into the normal crawl queue
also in this:
- fixed the 'start from file' crawl function
- added a link detector for the html parser. The html parser can now also extract links that are not included in <a> tags.
- this causes that a crawl start is now also possible from clear text link files
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7367 6c8d7289-2bf4-0310-a012-ef5d649a1542
when a search fails for a single url because the snippet cannot be generated, then the url reference is deleted from the index. This mechanism was redesign and enhanced. The process now also writes into the work tables into the table searchfl to prepare a re-indexing mechanism.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7364 6c8d7289-2bf4-0310-a012-ef5d649a1542
this causes that the search result view switches from list format to image preview format when a search is restricted to png, gif or jpg documents
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7358 6c8d7289-2bf4-0310-a012-ef5d649a1542
- enhanced the pdf and torrent parser: better documents titles
- enhanced the ftp client: more time-out time
- fixed bugs in json for search results
- enhanced yacyinteractive.html: added a file type navigator and a download-script generator for search result files
Please have a look at yacyinteractive.html: this will become the hacker-download tool for 27c3!
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7355 6c8d7289-2bf4-0310-a012-ef5d649a1542
- renamed YaCys search result modifications keywords for RECENT, NEAR and language: to the blekko slashtag naming scheme. YaCy now supports the following blekko-like slash built-in slashtags:
/date
- for search results ordered by date (most recent up)
/near
- for search results where search words appear near to each other (closest up)
/language/<lang>
- for a sorting by language where the wanted language gets up. Example: /language/de
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7350 6c8d7289-2bf4-0310-a012-ef5d649a1542
this was never used and extended in the last years. The resulting YBR ranking criteria
is still a good idea and will be used in the future. Possible generation methods for YBR
ranking are:
- "trust-rank" using the link structure as can be discovered in a single crawl (idea from FSCONS)
- "block-rank" calculated from the local link structure
- a distributed "block-rank" using the xml API to the link structure from other peers
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7349 6c8d7289-2bf4-0310-a012-ef5d649a1542