yacy_search_server

Commit Graph

Author	SHA1	Message	Date
orbiter	1969522dc1	removed lowercase of snippets (and other things): - added new sentence parser to condenser - sentence parsing can now handle charsets to do: charsets must be handed over to new sentence parser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2712 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
theli	a2e3095044	*) Bugfix. Add missing plasmaParserDocument.close() calls git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2680 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
theli	cd5f349666	*) Better handling of large files during parsing Extracted text of files that are larger than 5MB is stored in a temp file instead of keeping it in memory *) plasmaParserDocument.java: getText now returns an inputStream instead of a byte array *) plasmaParserDocument.java: new function getTextBytes returns the parsed content as byte array Attention: the caller of this function has to ensure that enough memory is available to do this to avoid OutOfMemory Exceptions *) httpd.java: better error handling if the soaphandler is not installed *) pdfParser.java: - better handling of documents with exotic charsets - better handling of large documents - better error logging of encrypted documents *) rtfParser.java: Bugfix for UTF-8 support *) tarParser.java: better handling of large documents *) zipParser.java: better handling of large documents *) plasmaCrawlEURL.java: new errorcode for encrypted documents *) plasmaParserDocument.java: the extracted text can now be passed to this object as byte array or temp file git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2679 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	df1629b05a	- code cleanup - version 0.471 - moved surftipps to own web page git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2676 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	3aac5b26da	- added automatic tag generation when a web page from the search results is added - added new image 'B' in front of search results for bookmark generation - added news generation when a public bookmark is added - the '+' in front of search results has new meaning: positive rating for that result - added news generation when a '+' is hit git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2613 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
theli	74c3e7cf29	*) storing document charset into plasmaParserDocument object (is needed later by the condenser) *) htmlFilterContentScraper.java: using proper charset for document title *) serverByteBuffer.java: adding new toString which allows to specify the charset for byte encoding git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2593 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	abf22f6e60	removed url normalform computation from htmlFilterContentScraper. This method was implemented in de.anomic.net.URL git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2377 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	3879a0ecd0	replaced java.net.URL usage by use of new class de.anomic.net.URL This shall be seen as an experiment to exclude all cases where there could be a DNS lookup during URL comparison. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2290 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	f77775220b	fixed parser error git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1999 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	83e0e765ec	redesigned some parts of the html scanner & parser to better support image tags git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1995 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	6b63e26cbb	- removed search function from index.html/java, only input left - added media fetcher/crawler class (not ready yet) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1992 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	37f88b4017	code cleanup git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1176 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	44fa94ac52	*) Modifications for dbImport functionality - dbImporter threads are now shut down by the switchboard on server shutdown - adding possibility to pause an importer thread via GUI - Bugfix for abort function See: http://www.yacy-forum.de/viewtopic.php?p=13363#13363 *) Modification of content parser configuration - now it's possible to configure which parsers should be enabled for the proxy, crawler, icap, etc. separately *) htmlFilterContentScraper.java - adding regular expression to normalize URLs containing /../ and /./ parts *) httpc.java - adding functionality to unzip gzipped content - requested by roland: should be used later to allow gzipped seed lists git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1170 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	3d8a5ae652	code cleanup git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1166 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	d2731418bf	added creation of global ranking files and changed url normal form usage git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1046 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	9b7f37fc37	*) Minor changes - more debugging output: storageTime for indexed document is logged now - saving memory in plasmaParserDocument.java, plasmaWordIndexEntryContainer.java (not a big deal) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@798 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
theli	6c722706b7	*) Moving yacyDebugMode initialization to switchboard git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@660 6c8d7289-2bf4-0310-a012-ef5d649a1542	19 years ago
orbiter	d8fdc2526e	added experimental snippet generation (to be disabled for 0.38) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@206 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
theli	361f05978d	Multiple updates regarding the yacy seedUpload facility, optional content parsers, thread pool configuration ... Please help me test if everything works correctly. *) Migration of yacy seedUpload functionality See: http://www.yacy-forum.de/viewtopic.php?t=256 - new uploaders can now be easily introduced because of a new modular uploader system - default uploaders are: none, file, ftp - adding optional uploader for scp - each uploader provides its own configuration file that will be included into the settings page using the new template include feature - Each uploader can define its libx dependencies. If not all needed libs are available, the uploader is deactivated automatically. *) Migration of optional parsers See: http://www.yacy-forum.de/viewtopic.php?t=198 - Parsers can now also define their libx dependencies - adding parser for bzip compressed content - adding parser for gzip compressed content - adding parser for zip files - adding parser for tar files - adding parser to detect the mime-type of a file this is needed by the bzip/gzip Parser.java - adding parser for rtf files - removing extra configuration file yacy.parser the list of enabled parsers is now stored in the main config file *) Adding configuration option in the performance dialog to configure See: http://www.yacy-forum.de/viewtopic.php?t=267 - maxActive / maxIdle / minIdle values for httpd-session-threadpool - maxActive / maxIdle / minIdle values for crawler-threadpool *) Changing Crawling Filter behaviour See: http://www.yacy-forum.de/viewtopic.php?p=2631 *) Replacing some hardcoded strings with the proper constants of the httpHeader class *) Adding new libs to libx directory. These libs are - needed by new content parsers - needed by new optional seed uploader - needed by SOAP API (which will be committed later) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@126 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	1d7fed87dc	redesign of index caching - removed indexCache.db git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@86 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
theli	351c86d5d9	*) Migration of optional Content Parser integration - each additional parser must be in a subpackage of plasma.parser - each parser must have its own ant build file (which will be called automatically from the main build file) - Calling the main build file results in building a separate zip file for each optional parser. This zip file includes: + sources of the Parser.java + compiled classes of the Parser.java + needed additional libs (libx) - To install an additional parser the user simply needs to extract the zip file listed above into his/her yacy directory. - The configuration (enabling/disabling) of a parser can be done via the webinterface (currently the settings dialog) and is done "on-the-fly". The installation can not be done "on-the-fly" at the moment because of classpath issues. - The classpath of the linux startup/stop scripts is generated automatically now (including all libraries from lib and libx). *) Bugfix: File Extension was not calculated correctly by the crawler e.g.: file extension was accidentally: .php?param=value Corrected. *) Adding additional parser for parsing of rss/atom feeds - added needed libs to do this. TODO: - automatic building classpath for windows startup scripts git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@78 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
orbiter	995673d795	several bugfixes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@71 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
theli	fd584c113c	*) some minor changes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@49 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
theli	f44b219e44	*) Eclipse has accidentally copied in the wrong file header into the new files (because these headers were accidentally set as default for the whole workspace instead of the project) Fixed. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@48 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago
theli	58b1a0ba40	*) adding a new package for extra content parsers *) adding content parser for - pdf (using the pdf-box library) - doc (using the textmining.org library) *) adding an interface for content parsers *) adding a configuration file which can be used to configure which parser is used for which mimeType *) Semaphore class was moved and renamed to serverSemaphore *) Changing yacy shutdown behaviour Busy waiting loop for shutdown was removed and replaced with a blocking call (using the semaphore class mentioned above) to the new switchboard.waitForShutdown method. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@46 6c8d7289-2bf4-0310-a012-ef5d649a1542	20 years ago

25 Commits (74f09a05102b19bf91a81c2376a2890423f4af52)