some file types are containers for several files. These containers had been parsed in such a way that the set of resulting parsed content was merged into one single document before parsing. Using this parser infrastructure it is not possible to parse document containers that contain individual files. An example is a rss file where the rss messages can be treated as individual documents with their own url reference. Another example is a surrogate file which was treated with a special operation outside of the parser infrastructure.
This commit introduces a redesigned parser interface and a new abstract parser implementation. The new parser interface has now only one entry point and returns always a set of parsed documents. In case of single documents the parser method returns a set of one documents.
To be compliant with the new interface, the zip and tar parser had been also completely redesigned. All parsers are now much more simple and cleaner in its structure. The switchboard operations had been extended to operate with sets of parsed files, not single parsed files.
additionally, parsing of jar manifest files had been added.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6955 6c8d7289-2bf4-0310-a012-ef5d649a1542
- found and fixed a possible memory leak in YaCy internal RSS feed system
- some refactoring in RSS feed mechanisms to make this possible
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6950 6c8d7289-2bf4-0310-a012-ef5d649a1542
- cleaned uo the code. The new eclipse helios provided new warnings for dead code. This change cleans up most of these warnings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6945 6c8d7289-2bf4-0310-a012-ef5d649a1542
Here a new concept called 'search heuristics' is introduced. A heuristic is a kind of 'shortcut' to good results in IT, here for good search results. In this case it will be used to get a very transparent way to compare what YaCy is able to produce as search result and what g**gle produces as search result. Here is what your can do now:
- add the phrase 'heuristic:scroogle' to your search query, like 'oil spill heuristic:scroogle' and then a call to scroogle is made to get anonymous search results from g**gle.
- these results are _not_ taken as meta-search results, but are used to instantly feed a crawling and indexing process. This happens very fast, here 20 results from scroogle are taken and loaded all simultanously, parsed and indexed immediately and from the results of the parsed content the search result is feeded, along to the normal p2p search
- when new results from that heuristic (more to come) get part of the search results, then it is verified if such results are redundant to existing (they had been part of the normal YaCy search result anyway) or if they had been completely new to YaCy.
- in the search results the new search results from heuristics are marked with a 'H ++' and search results from heuristics that had been already found by YaCy are marked with a 'H ='. That means:
- you can now see YaCy and Scroogle search results in one result page but you also see that you would not have 'missed' the g**gle results when you would only have used YaCy.
- to make it short: YaCy now subsumes g**gle results. If you use only YaCy, you miss nothing.
to come: a configuration page that let you configure the usage of heuristics and get this feature by default.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6944 6c8d7289-2bf4-0310-a012-ef5d649a1542
- removed concurrency overhead for small number of index normalizations as it happens during remote search
- removed 'load only parseable' constraint for snippet fetch because some resources may not have any url file extension and these had therefore not been parseable and searcheable since they may become parseable after loading when their mime type is known
- this partly fixes some problems with http://forum.yacy-websuche.de/viewtopic.php?p=20300#p20300 but more changes are necessary to get all expected search results
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6926 6c8d7289-2bf4-0310-a012-ef5d649a1542
- a new news db will be created (news1024.db), the old one (news.db) can be deleted
- peers with too large news payload are not ignored any more (they may have been invisible because they had a too large news payload!)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6917 6c8d7289-2bf4-0310-a012-ef5d649a1542
- it is now possible to crawl the local file system with an intranet peer
- redesign of URL handling
- refactoring: created LGPLed package cora: 'content retrieval api' which may be used externally by other applications without yacy core elements because it has no dependencies to other parts of yacy
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6902 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added number of online peers at the last day and the last week
- changed design of statistic table
- network picture now shows exactly those peers that are counted in the statistic overview for one day
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6897 6c8d7289-2bf4-0310-a012-ef5d649a1542
- changed description text to 'title' entity (subject is a list of keywords and was very messed)
- added ViewFile in location pop-up
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6891 6c8d7289-2bf4-0310-a012-ef5d649a1542
- used that to display two layers on map: cities and search result locations
- added many marker grafics for the display of the markers on the map
- some refactoring of the yacy news code plus bugfixes for latest move from Tree to Table data structure
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6889 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added downloader option in DictionaryLoader
- added generalization (interfaces and overarching localization)
- more abstraction using the libraries
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6879 6c8d7289-2bf4-0310-a012-ef5d649a1542
- fix for initial generation of crawl profiles (one more reason to remove your crawl profiles)
- more String -> byte[] migration
- more logging for cache store/hit
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6874 6c8d7289-2bf4-0310-a012-ef5d649a1542
- fixes to http file server session handling
- fixes and enhancements to metadata date/time handling
- added dc:publisher metadata field and updated all document parser
- fixed bug in metdata read procedure
- enhanced dublin core and rss parser to understand more fields more properly
- enhanced url selection in case that multiple urls are given in surrogates
- fix for condenser; failure when last word does not end with termination symbol
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6863 6c8d7289-2bf4-0310-a012-ef5d649a1542
for such cases the delete() operation was introduced which is sometimes much cheaper in operation since it does not need to create objects to hold the removed content and it does not need to read those objects.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6824 6c8d7289-2bf4-0310-a012-ef5d649a1542
- removed usage of URL-Caches which could have been a memory leak
- removed unused classes and methods
- removed not necessary synchronizations
- added synchronization hacks where possible
- fine-tuned crawling speed to prevent IO of balancer
- fixed a bug in IODispatcher that may have caused that no merges were done
- reduced number of parameters in very often called methods (compare methods)
- reduced complexity of data structures of now massively used HandleSet class
- reduction of new String() and getBytes() usage / new methods to support this transition
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6820 6c8d7289-2bf4-0310-a012-ef5d649a1542
The result should be a less usage of new String() and less memory usage (since a String-encapsulated byte[] has 40 bytes overhead)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6815 6c8d7289-2bf4-0310-a012-ef5d649a1542
pass value as byte[], not as String. This should cause that less
byte[] <-> String conversions are made during time-critical tasks.
This redesign is not yet complete, more to come ..
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6775 6c8d7289-2bf4-0310-a012-ef5d649a1542