77 Commits (29fa17bd406675c19b21242212e5b2d53a4a67e9)

Author SHA1 Message Date
theli 92f774edd1 *) Better charset encoding detection
18 years ago
theli decb09df6d *) Trying to be more tolerant against wrong charset names
18 years ago
theli e9afe39cbb *) Trying to be more tolerant against wrong charset names
18 years ago
theli 7526c831a8 *) Suppressing stack trace
18 years ago
theli 22649408ad *) Better error handling for charset encoding problem during content parsing
18 years ago
orbiter 1969522dc1 removed lowercase of snippets (and other things):
18 years ago
theli f17ce28b6d *) plasmaHTCache:
18 years ago
theli a2e3095044 *) Bugfix. Add missing plasmaParserDocument.close() calls
18 years ago
theli cd5f349666 *) Better handling of large files during parsing
18 years ago
theli 813a8a8179 *) migration of mimeTypeParser to jmimemagic 0.1
18 years ago
theli b6c7b91582 *) Parser now throws a ParserException instead of returning null on parsing errors (e.g. needed by snippet fetcher)
18 years ago
theli 5c6251bced *) some improvements for extended html document charset support
18 years ago
orbiter f453c14b5d removed unreachable catch blocks and unused imports
18 years ago
theli ad7f600f25 *) Bugfix. re-enabling inheritance of serverCharBuffer from writer class
18 years ago
theli 97d2a08ef1 *) restructuring needed to support parsing of documents using various charsets
18 years ago
orbiter 3aac5b26da - added automatic tag generation when a web page from the search results is added
18 years ago
allo 2fd610b556 http://www.yacy-forum.de/viewtopic.php?p=25611#25611
18 years ago
theli 06fa891152 *) htmlFilterContentScraper.java: using proper charset for document title
18 years ago
theli 74c3e7cf29 *) storing document charset into plasmaParserDocument object (is needed later by the condenser)
18 years ago
theli c5d3020941 *) better error handling for last commit
18 years ago
theli d0a5a53789 *) changes needed for multi-language support
18 years ago
theli eb9b138986 *) next step of restructuring for new crawlers
18 years ago
theli f3ac4dbbb9 *) better handling of server shutdown
18 years ago
orbiter abf22f6e60 removed url normalform computation from htmlFilterContentScraper.
19 years ago
orbiter 3879a0ecd0 replaced java.net.URL usage by use of new class de.anomic.net.URL
19 years ago
orbiter 015d044c25 tried to fix some problems with latest changes to httpc
19 years ago
orbiter 47b541b2d1 added better option handling in yacysearch
19 years ago
orbiter 22de954a57 added some log output to parser
19 years ago
orbiter 83e0e765ec redesigned some parts of the html scanner & parser
19 years ago
orbiter b21b9df2d0 added section headlines generation to html parser
19 years ago
theli 79667a172e *) Bugfix for additional parser problem
19 years ago
theli e7d16ef831 *) Corrections in jMimeMagic MagicRule-file to detect some special rss feeds
19 years ago
theli 5a1d45715d *) Bugfix for parser configuration bug
19 years ago
orbiter ec2b39c1ce code cleanup
19 years ago
theli 44fa94ac52 *) Modifications for dbImport functionality
19 years ago
orbiter 3d8a5ae652 code cleanup
19 years ago
theli 8ed0aaae8d *) Adding content Parser for RPM Files
19 years ago
theli bdf30117c1 *) Redesign of parser configuration
19 years ago
orbiter 40621a5663 enhancements in ranking preparation and fixed problem with parser/mime recognition
19 years ago
theli c2fe3a1670 *) Updating jMimeMagic Ruleset
19 years ago
theli 445e3a620f *) Avoid rejection of HTML content by the crawler when the file extension is not set properly
19 years ago
orbiter d2731418bf added creation of global ranking files and changed url normal form usage
19 years ago
theli b8ceb1ffde *) Adding better https support for crawler
19 years ago
hydrox cb69047b91 *) cleanup of access to static methods and fields
19 years ago
theli a2fa75e688 *) Asynchronous queuing of crawl job URLs (stackCrawl)
19 years ago
theli 0fd9aa6c6e *) Bugfix: supportedFileExt Function didn't detect the file extension correctly because of missing conversion to lower case
19 years ago
theli 8a33c9b309 *) Bugfix: supportedFileExt Function didn't detect the file extension correctly if there was a dot
19 years ago
theli 2b3f964037 *) Bugfix: supportedFileExt Function didn't chop http parameters before trying to detect the file extension
19 years ago
theli b990dc1ad1 *) Replacing jsch 0.1.19 lib with newer version 0.1.21
19 years ago
theli 4fd5b95b1f *) Renaming Logger function names to reflect the proper Java Logging API Loglevels
19 years ago