- each crawl start now has separate flags for text and for media crawling
- the localCrawl flag is superfluous
- added new crawl profiles
- when an image search is performed, only media links are crawled for the snippets (see the sketch below)
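How such a flag pair might look on a crawl profile, as a hedged sketch; all class and method names here are illustrative, not YaCy's actual API:

```java
// illustrative sketch only: names are assumptions, not YaCy's real classes
public class CrawlProfileSketch {
    private final boolean crawlText;   // follow/index text links
    private final boolean crawlMedia;  // follow/index media links (images etc.)

    public CrawlProfileSketch(final boolean crawlText, final boolean crawlMedia) {
        this.crawlText = crawlText;
        this.crawlMedia = crawlMedia;
    }

    // an image search would use a media-only profile, so the snippet
    // fetcher follows media links exclusively
    public static CrawlProfileSketch forImageSearch() {
        return new CrawlProfileSketch(false, true);
    }

    public boolean crawlText()  { return this.crawlText; }
    public boolean crawlMedia() { return this.crawlMedia; }
}
```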
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3100 6c8d7289-2bf4-0310-a012-ef5d649a1542
- moved all URL and index (RWI) entries to the index package
- better naming to distinguish RWI entries from URL entries
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2937 6c8d7289-2bf4-0310-a012-ef5d649a1542
- added a test migration method to migrate the old LURL to a new LURL
the new LURL is split into a separate table for each month
this solves several problems (see the sketch after this list):
- the biggest table in YaCy is split into several parts and can
also be managed on filesystems that are limited to 2 GB
- the oldest entries can easily be identified, re-crawled, and
deleted
- the complete database can be limited to a specific size (as requested many times)
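A minimal sketch of the month-based routing idea, assuming entries are keyed by their load date; the file naming is hypothetical:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// sketch of month-based table routing; naming scheme is an assumption
final class MonthlyTableRouter {
    // each URL entry goes into the table of its load month, e.g.
    // "urls.200611.table"; old months can then be re-crawled or
    // deleted wholesale, and no single file outgrows a 2 GB limit
    static String tableNameFor(final Date loadDate) {
        return "urls." + new SimpleDateFormat("yyyyMM").format(loadDate) + ".table";
    }
}
```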
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2755 6c8d7289-2bf4-0310-a012-ef5d649a1542
- method loadResourceContent is now deprecated.
Do not use this function; it can cause OutOfMemory exceptions
when loading large files
- new function getResourceContentStream to get an InputStream for a cached file
- new function getResourceContentLength to get the size of a cached file (usage sketch below)
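A hedged usage sketch of the stream-based access pattern; the interface below is a stand-in for whatever class gained the two new methods, and the signatures are assumed:

```java
import java.io.IOException;
import java.io.InputStream;

// stand-in for the class that gained the two new methods
interface ResourceCacheSketch {
    InputStream getResourceContentStream(String url) throws IOException;
    long getResourceContentLength(String url) throws IOException;
}

final class StreamedResourceReader {
    // reads the cached file chunk by chunk instead of loading the
    // whole content into a single byte array (the deprecated path)
    static void process(final ResourceCacheSketch cache, final String url)
            throws IOException {
        final InputStream is = cache.getResourceContentStream(url);
        try {
            final byte[] buffer = new byte[8192];
            int read;
            while ((read = is.read(buffer)) != -1) {
                // hand each chunk to the consumer (parser, hasher, ...)
            }
        } finally {
            is.close();
        }
    }
}
```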
*) httpc.java:
- Bugfix: resource content was loaded into memory even if this was not requested
*) Crawler:
- new option to hold loaded resource content in memory
- adding option to use the worker class without the worker pool
(needed by the snippet fetcher)
*) plasmaSnippetCache
- the snippet loader no longer takes a crawl-worker from the pool but
uses a newly created instance, to avoid being blocked by normal
crawling activity
- now operates on streams instead of byte arrays to avoid OutOfMemory
exceptions when operating on large files
- the snippet loader now forces the crawl-worker to keep the loaded
resource in memory to avoid IO (see the sketch below)
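A sketch of the snippet-loading pattern just described; CrawlWorkerSketch is a stand-in, not YaCy's real worker class, and its methods are assumptions:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// stand-in for the real crawl worker; method names are assumptions
class CrawlWorkerSketch {
    private boolean keepInMemory = false;

    void setKeepResourceInMemory(final boolean keep) { this.keepInMemory = keep; }

    InputStream load(final String url) throws IOException {
        // ...fetch the url; with keepInMemory set, the content stays
        // in RAM so the snippet extractor needs no second disk read
        return new ByteArrayInputStream(new byte[0]); // placeholder body
    }
}

final class SnippetLoaderSketch {
    static InputStream loadForSnippet(final String url) throws IOException {
        // a fresh instance per request, never one borrowed from the
        // shared crawl pool: snippet fetching cannot be starved by
        // normal crawling activity
        final CrawlWorkerSketch worker = new CrawlWorkerSketch();
        worker.setKeepResourceInMemory(true);
        return worker.load(url);
    }
}
```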
*) plasmaCondenser: added a new function getWords that can operate directly on input streams (see the sketch below)
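A sketch of stream-based word extraction in the spirit of the new getWords; the real plasmaCondenser signature and tokenizer rules may differ:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

// stream-based word counting; signature is an assumption
final class WordStreamSketch {
    static Map<String, Integer> getWords(final InputStream is, final String charset)
            throws IOException {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        final BufferedReader reader =
                new BufferedReader(new InputStreamReader(is, charset));
        String line;
        while ((line = reader.readLine()) != null) {
            // tokenise line by line so large documents never have to
            // be held in memory as one string
            for (final String token : line.toLowerCase().split("\\W+")) {
                if (token.length() == 0) continue;
                final Integer old = counts.get(token);
                counts.put(token, old == null ? Integer.valueOf(1)
                                              : Integer.valueOf(old.intValue() + 1));
            }
        }
        return counts;
    }
}
```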
*) Parsers
- keep resource in memory whenever possible (to avoid IO)
- when parsing from a stream, the content length must now be passed to the parser function.
The parsers need this value to decide whether the resource content is too large
to hold in memory and must be stored in a file
- AbstractParser.java: new function to pass the contentLength of a resource to the parsers (see the sketch below)
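A sketch of the length-based decision a parser can now make; the threshold value and the method names here are illustrative only:

```java
// illustrative sketch; threshold and names are assumptions
abstract class AbstractParserSketch {
    private static final long IN_MEMORY_LIMIT = 5L * 1024 * 1024; // assumed 5 MB cap

    private long contentLength = -1;

    // new in this change: callers pass the resource size before parsing
    public void setContentLength(final long contentLength) {
        this.contentLength = contentLength;
    }

    // parse from memory only when the size is known and small enough;
    // an unknown length (-1) is treated as "too large" to be safe
    protected boolean fitsInMemory() {
        return this.contentLength >= 0 && this.contentLength <= IN_MEMORY_LIMIT;
    }
}
```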
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2701 6c8d7289-2bf4-0310-a012-ef5d649a1542
*) better logging of parser failures
*) simplified usage of plasmaparser through switchboard
*) restructuring of crawler
- crawler now returns an error message if it is used in sync mode (e.g. by snippet fetcher)
*) snippet-fetcher: more verbose error messages
*) serverByteBuffer.java: adding new function append(String,encoding)
*) serverFileUtils.java: adding functions to copy only a given number of bytes between streams (see the sketch below)
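A sketch of such a bounded stream copy; the real serverFileUtils signature may differ:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// bounded copy between streams; signature is an assumption
final class BoundedCopySketch {
    // copies at most 'count' bytes and returns how many were copied;
    // stops early if the source ends first
    static long copy(final InputStream in, final OutputStream out, final long count)
            throws IOException {
        final byte[] buffer = new byte[4096];
        long copied = 0;
        while (copied < count) {
            final int chunk = (int) Math.min(buffer.length, count - copied);
            final int read = in.read(buffer, 0, chunk);
            if (read == -1) break;
            out.write(buffer, 0, read);
            copied += read;
        }
        return copied;
    }
}
```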
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2641 6c8d7289-2bf4-0310-a012-ef5d649a1542
This should be seen as an experiment to exclude all cases where
a DNS lookup could occur during URL comparison (see the sketch below).
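For context: java.net.URL.equals() and hashCode() may resolve host names via DNS, which can stall comparisons. A sketch of comparing normalised components instead, which avoids any lookup; the class name is illustrative:

```java
import java.net.URL;

// DNS-free URL comparison by normalised components
final class UrlCompareSketch {
    static boolean sameUrl(final URL a, final URL b) {
        return a.getProtocol().equalsIgnoreCase(b.getProtocol())
            && a.getHost().equalsIgnoreCase(b.getHost())
            && effectivePort(a) == effectivePort(b)
            && a.getFile().equals(b.getFile());
    }

    private static int effectivePort(final URL u) {
        // -1 means "no explicit port", so fall back to the protocol default
        return u.getPort() == -1 ? u.getDefaultPort() : u.getPort();
    }
}
```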
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2290 6c8d7289-2bf4-0310-a012-ef5d649a1542