- move the header framework classes to cora
- move the ARC caching classes to cora
- refactoring of code to call these classes from cora
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7068 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added new network definition 'allip' which can be used in networks where both intranet and internet addresses shall be indexed
- added an auto-switch-off for global search if there are no global peers
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7030 6c8d7289-2bf4-0310-a012-ef5d649a1542
Some file types are containers for several files. Until now, such containers were parsed in a way that merged the parsed content of all contained files into one single document. With that parser infrastructure it was not possible to treat the files inside a container as individual documents. An example is an rss file, where the rss messages can be treated as individual documents with their own url reference. Another example is a surrogate file, which was handled by a special operation outside of the parser infrastructure.
This commit introduces a redesigned parser interface and a new abstract parser implementation. The new parser interface now has only one entry point and always returns a set of parsed documents. For a single document, the parser method returns a set containing one document.
To comply with the new interface, the zip and tar parsers have also been completely redesigned. All parsers are now much simpler and cleaner in their structure. The switchboard operations have been extended to operate on sets of parsed files, not single parsed files.
Additionally, parsing of jar manifest files has been added.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6955 6c8d7289-2bf4-0310-a012-ef5d649a1542
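The single-entry-point parser design described in the commit above could look roughly like the following. This is only a minimal sketch: `ParsedDocument`, `SketchParser`, `PlainTextParser` and `LineFeedParser` are hypothetical names invented for illustration, not the actual YaCy classes, and a `List` is used where the commit speaks of a set. The toy feed format (one `url<TAB>text` item per line) is also an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified document type; not the actual YaCy class.
final class ParsedDocument {
    final String url;
    final String text;
    ParsedDocument(String url, String text) { this.url = url; this.text = text; }
}

// One entry point: every parser returns a collection of documents.
interface SketchParser {
    List<ParsedDocument> parse(String url, String content);
}

// A plain document yields a collection with exactly one element.
final class PlainTextParser implements SketchParser {
    public List<ParsedDocument> parse(String url, String content) {
        return Collections.singletonList(new ParsedDocument(url, content));
    }
}

// A container-like source yields one document per item, each with its own URL.
// Assumed toy format: one item per line, "url<TAB>text".
final class LineFeedParser implements SketchParser {
    public List<ParsedDocument> parse(String url, String content) {
        List<ParsedDocument> docs = new ArrayList<>();
        for (String line : content.split("\n")) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) docs.add(new ParsedDocument(parts[0], parts[1]));
        }
        return docs;
    }
}
```

With this shape, the caller can loop over the returned documents without knowing whether the source was a single file or a container.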
- a site-operation heuristic that loads all direct links from a portal page if the site-operator is used
- a direct crawl for search results from scroogle for the given search terms
The configuration page can be found directly beside the network configuration page
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6951 6c8d7289-2bf4-0310-a012-ef5d649a1542
Here a new concept called 'search heuristics' is introduced. In IT, a heuristic is a kind of 'shortcut' to good results; here, to good search results. In this case it is used to get a very transparent way to compare what YaCy is able to produce as a search result with what g**gle produces as a search result. Here is what you can do now:
- add the phrase 'heuristic:scroogle' to your search query, like 'oil spill heuristic:scroogle', and a call to scroogle is made to get anonymous search results from g**gle.
- these results are _not_ taken as meta-search results, but are used to instantly feed a crawling and indexing process. This happens very fast: 20 results from scroogle are taken and loaded simultaneously, parsed and indexed immediately, and the search result is fed from the parsed content, alongside the normal p2p search.
- when new results from that heuristic (more to come) become part of the search results, it is verified whether such results are redundant to existing ones (they had been part of the normal YaCy search result anyway) or whether they are completely new to YaCy.
- in the search results, new search results from heuristics are marked with a 'H ++' and search results from heuristics that had already been found by YaCy are marked with a 'H ='. That means:
- you can now see YaCy and Scroogle search results in one result page, but you also see that you would not have 'missed' the g**gle results if you had only used YaCy.
- to make it short: YaCy now subsumes g**gle results. If you use only YaCy, you miss nothing.
To come: a configuration page that lets you configure the usage of heuristics and enable this feature by default.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6944 6c8d7289-2bf4-0310-a012-ef5d649a1542
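The 'H ++' / 'H =' marking from the commit above can be illustrated with a small sketch. It assumes results are identified by their URL string; the class `HeuristicMarker` and its method are hypothetical and only show the comparison idea, not how YaCy implements it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class HeuristicMarker {
    // Returns url -> marker for every result that came from a heuristic:
    // "H =" if the normal YaCy search had already found the URL, "H ++" if it is new.
    static Map<String, String> mark(Set<String> yacyResults, List<String> heuristicResults) {
        Map<String, String> marks = new LinkedHashMap<>();
        for (String url : heuristicResults) {
            marks.put(url, yacyResults.contains(url) ? "H =" : "H ++");
        }
        return marks;
    }
}
```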
implemented a hint from dulcedo "use site: - operator as crawl start point".
YaCy was already able to search using a site constraint. This function is now extended with an instant crawling feature.
When you now use the site-operator, the landing page of the site and every page that is linked from this page are loaded, indexed and selected for the search result within that search request. When the remote server responds quickly enough, this process can deliver search results during the normal search result preparation, within a few seconds.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6941 6c8d7289-2bf4-0310-a012-ef5d649a1542
Cache access shall no longer be made directly to the cache; all loading attempts shall use the LoaderDispatcher.
To control the usage of the cache, an enum instance from CrawlProfile.CacheStrategy shall be used.
Some direct loading methods that did not use a cache strategy have been removed. This also affects the verify-option
of the yacysearch servlet. After this commit, 'verify=false' does not necessarily mean that no snippets
are generated. Instead, all snippets that can be retrieved using only the cache are presented. The search hit is still not verified, because the snippet was generated from the cache. If a cache-based generation of the snippet is not possible, then verify=false means that the link is not rejected.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6936 6c8d7289-2bf4-0310-a012-ef5d649a1542
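The verify=false behaviour described in the commit above might be sketched as follows. Everything here is a simplified stand-in: the nested `CacheStrategy` enum and its constant names, the `load` method (standing in for a single loading entry point such as the LoaderDispatcher) and the `Hit` type are assumptions made for illustration. The verify=true branch (load remotely if needed, reject the hit if loading fails) is also an assumption, since the commit message only describes the verify=false case.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; not the actual YaCy classes or signatures.
final class SnippetSketch {
    // Assumed stand-in for CrawlProfile.CacheStrategy.
    enum CacheStrategy { CACHEONLY, IFEXIST }

    static final class Hit {
        final String url;
        final String snippet; // null when no snippet could be generated
        Hit(String url, String snippet) { this.url = url; this.snippet = snippet; }
    }

    final Map<String, String> cache = new HashMap<>();  // url -> cached text
    final Map<String, String> remote = new HashMap<>(); // stands in for loading from the web

    // Single loading entry point: the strategy decides whether remote loading is allowed.
    Optional<String> load(String url, CacheStrategy strategy) {
        String cached = cache.get(url);
        if (cached != null) return Optional.of(cached);
        if (strategy == CacheStrategy.CACHEONLY) return Optional.empty();
        return Optional.ofNullable(remote.get(url));
    }

    Optional<Hit> result(String url, boolean verify) {
        CacheStrategy strategy = verify ? CacheStrategy.IFEXIST : CacheStrategy.CACHEONLY;
        Optional<String> text = load(url, strategy);
        if (text.isPresent()) return Optional.of(new Hit(url, text.get()));
        // verify=false: keep the link even though no snippet is available.
        // verify=true (assumed): reject the hit when the content cannot be loaded.
        return verify ? Optional.empty() : Optional.of(new Hit(url, null));
    }
}
```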
- a new news db will be created (news1024.db), the old one (news.db) can be deleted
- peers with too large a news payload are no longer ignored (they may have been invisible because their news payload was too large!)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6917 6c8d7289-2bf4-0310-a012-ef5d649a1542
- it is now possible to crawl the local file system with an intranet peer
- redesign of URL handling
- refactoring: created LGPLed package cora: 'content retrieval api' which may be used externally by other applications without yacy core elements because it has no dependencies on other parts of yacy
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6902 6c8d7289-2bf4-0310-a012-ef5d649a1542
- order locations by (primary) population and (secondary) longitude (reverse ordering, both)
- added population from GeoNames, OpenGeoDB does not have that information
- changed default viewpoint of map to (30,15); shows more land, with Europe in the center
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6893 6c8d7289-2bf4-0310-a012-ef5d649a1542
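The location ordering named in the commit above (population first, longitude second, both reversed) can be expressed with a chained comparator. This is a sketch under assumptions: the `Location` type and its fields are hypothetical and only illustrate the ordering rule.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical location record, for illustration only.
final class Location {
    final String name;
    final long population;
    final double longitude;
    Location(String name, long population, double longitude) {
        this.name = name; this.population = population; this.longitude = longitude;
    }
}

final class LocationOrder {
    // Primary key: population; secondary key: longitude.
    // reversed() on the chained comparator reverses both keys (largest first).
    static final Comparator<Location> BY_POPULATION_THEN_LONGITUDE =
        Comparator.comparingLong((Location l) -> l.population)
                  .thenComparingDouble(l -> l.longitude)
                  .reversed();

    static List<Location> sorted(List<Location> locations) {
        List<Location> copy = new ArrayList<>(locations);
        copy.sort(BY_POPULATION_THEN_LONGITUDE);
        return copy;
    }
}
```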
Added alt tag to page tabs in yacysearch.java for HTML validity
Added new German translations for geo search phrase in de.lng
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6890 6c8d7289-2bf4-0310-a012-ef5d649a1542
- used that to display two layers on map: cities and search result locations
- added many marker graphics for the display of the markers on the map
- some refactoring of the yacy news code plus bugfixes for latest move from Tree to Table data structure
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6889 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added downloader option in DictionaryLoader
- added generalization (interfaces and overarching localization)
- more abstraction using the libraries
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6879 6c8d7289-2bf4-0310-a012-ef5d649a1542
- fixes to http file server session handling
- fixes and enhancements to metadata date/time handling
- added dc:publisher metadata field and updated all document parsers
- fixed bug in metadata read procedure
- enhanced dublin core and rss parser to understand more fields more properly
- enhanced url selection in case that multiple urls are given in surrogates
- fix for condenser; failure when last word does not end with termination symbol
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6863 6c8d7289-2bf4-0310-a012-ef5d649a1542
The result should be less usage of new String() and less memory usage (since a String-encapsulated byte[] has 40 bytes overhead)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6815 6c8d7289-2bf4-0310-a012-ef5d649a1542
pass value as byte[], not as String. This should result in fewer
byte[] <-> String conversions during time-critical tasks.
This redesign is not yet complete, more to come ..
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6775 6c8d7289-2bf4-0310-a012-ef5d649a1542
So far only search requests at the remote search interface had been counted.
This was done to protect the privacy of searchers: searches at a peer's own search interface were neither counted nor published.
As a result, no search requests of robinson peers were counted, because they cannot be counted at a remote peer.
This change distinguishes search requests made locally at the local search interface from search requests that arrive at the local interface but were submitted from a remote IP without authentication.
Now 3 counters are maintained:
- partial count of remote searches
- total count of local searches on robinson peers from non-authenticated clients
- total count of local searches on robinson peers from localhost or authenticated clients
In the global statistic of search requests, the first two of these three counters are now added.
Because we have a large number of robinson peers with a large number of remote non-authenticated requests, the statistic should show at least three times the number of search requests.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6696 6c8d7289-2bf4-0310-a012-ef5d649a1542
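The three counters from the commit above, and the rule that only the first two enter the global statistic, can be sketched like this. The class, enum and method names are hypothetical and chosen for illustration; the sketch only shows the counting rule, not where YaCy performs the counting.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the three search counters.
final class SearchCounters {
    enum Origin { REMOTE, LOCAL_UNAUTHENTICATED, LOCAL_AUTHENTICATED }

    private final AtomicLong remote = new AtomicLong();
    private final AtomicLong localUnauthenticated = new AtomicLong();
    private final AtomicLong localAuthenticated = new AtomicLong();

    void count(Origin origin) {
        switch (origin) {
            case REMOTE: remote.incrementAndGet(); break;
            case LOCAL_UNAUTHENTICATED: localUnauthenticated.incrementAndGet(); break;
            case LOCAL_AUTHENTICATED: localAuthenticated.incrementAndGet(); break;
        }
    }

    // Only remote searches and non-authenticated local searches are published.
    long globalStatistic() {
        return remote.get() + localUnauthenticated.get();
    }

    // Searches from localhost or authenticated clients stay out of the global statistic.
    long privateCount() {
        return localAuthenticated.get();
    }
}
```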
- increased sort limit, now 3000 entries, before: 1000
so that more results can be shown under strongly
limiting constraints, like domain navigation
- enhanced the sort process
- check against domain navigator bugs
- fix in sort stack
- now showing all navigation pages at first search (not only next page)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6569 6c8d7289-2bf4-0310-a012-ef5d649a1542
- some refactoring in search process
- fixed image search for json and rss output
- search navigation on bottom of search result page in cases where there are more than 6 results on page
- fixes for number of displayed documents
- disabled pseudostemming
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6504 6c8d7289-2bf4-0310-a012-ef5d649a1542