Commit Graph

136 Commits (59b767eebd2ed73249618d4434736d9f914dd985)

Author SHA1 Message Date
orbiter 299af4943c added another memory protection hack
14 years ago
orbiter b06faab9d3 do not allocate a StringBuilder object in case that there is not enough memory for that
14 years ago
orbiter bda3eec0ff added parsing of canonical link element to html parser
14 years ago
orbiter 9706fc55aa enhanced content scraper (should discover urls much faster in case of very large plain texts)
14 years ago
orbiter 77fe69395d added jempbox-1.5.0.jar which is required by pdfbox-1.5 as stated in http://pdfbox.apache.org/dependencies.html
14 years ago
orbiter 0c1b29f3c9 - applied many small performance hacks
14 years ago
orbiter 4bea3f9714 hack to reduce resource contention caused by massive UTF8 decodings which use java.nio resources:
14 years ago
orbiter e28bd0d038 fix for some possible causes of memory leaks
14 years ago
orbiter 10e2f588f8 - enhanced ybr ranking computation
14 years ago
orbiter 3ed4a09368 small features, some bug fixes and performance hacks
14 years ago
orbiter 021840e5ba removed (almost) deadlocks and unnecessary CPU load
14 years ago
orbiter 9248a4eef4 reduce the effect of 'Bildersuche findet generierte HTML-Seiten als Bilder' (image search finds generated HTML pages as images)
14 years ago
orbiter 4e8fa03514 added more attributes to html evaluation
14 years ago
orbiter 528da7c9ea removed unused class and added license header for new class
14 years ago
orbiter f6077b3cc0 added more attributes for html parser and enhanced data structures
14 years ago
orbiter d8e934c085 better abstraction of http client identification
14 years ago
orbiter b77b8cac0c - enhanced html parser: recognized much more details in the content
14 years ago
orbiter 3d5104d357 - fixed a bug in crawl start with file name (npe in new url)
14 years ago
orbiter 958ff4778e enhanced location search:
14 years ago
orbiter 4c013d9088 more UTF8 getBytes() performance hacks
14 years ago
orbiter 0430a94eaa the location search now shows not re-evaluated locations but only locations that are attached as metadata to web pages
14 years ago
orbiter 9b25d07295 - added geo information parsing to html parser
14 years ago
orbiter 78d4c45d09 enhancement during search process: fast fail of search in case that all index feeders have terminated.
14 years ago
orbiter 1989ebc24b removed more warnings
14 years ago
orbiter 694fa3a2a5 - replaced more direct string-based UTF-8 conversions by predefined UTF-8 conversion
14 years ago
orbiter 30aed9824a moved getBytes() to UTF8.getBytes() to use a default String encoding
14 years ago
lotus cb6d307bba adding extension for parser
14 years ago
orbiter e1b6916423 always try to guess the size of a StringBuilder to prevent too many memory re-allocations
14 years ago
orbiter cb1f49d0f2 replaced all 'new String' with default encoding (missing) or UTF-8 encoding with a String generation method that uses a pre-defined Charset constant for UTF-8. This avoids a cache-lookup for the Charset object using String hashing of the String 'UTF-8'.
14 years ago
orbiter 8d14916c74 more patches for a better out-of-memory management
14 years ago
orbiter f8d0454c53 small bug fixes and experiments with search speed enhancement
14 years ago
orbiter a92d80a545 performance enhancements using an alternative to an insensitive collator (a complex string compare):
14 years ago
orbiter e717bf74ba more logging, more care about OOMs
14 years ago
orbiter 4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
14 years ago
orbiter 88773e4daa changed the default port from 8080 to 8090
14 years ago
low012 9f38c0023d *) Minor changes, mainly cleaning up a little bit, no functional changes.
14 years ago
orbiter 10ae8d961b - cora package now has no dependencies on other yacy packages and becomes a 'base' package (refactoring)
14 years ago
low012 9eae33f886 *) Ooops...
14 years ago
low012 a001e8075c *) minor enhancements
14 years ago
low012 11ea966f9e *) added SID file (Commodore 64) sound file parser
14 years ago
low012 936e976c23 *) added FreeMind (http://freemind.sourceforge.net/) mindmap parser
14 years ago
low012 3d95981f7d *) cleaning up the code a little bit
14 years ago
low012 2a6499364d *) minor changes
14 years ago
low012 c0274bd123 *) minor changes
14 years ago
orbiter 59b70a5a92 another fix to the ftp crawler: now correct directory listings according to rfc2640 (path with spaces) and better title names for such files
14 years ago
orbiter 9b25a33fd9 - fixed numerous bugs
14 years ago
orbiter 7bdb13bf7f more fixes to smb crawling: better file names
14 years ago
orbiter c288fcf634 redesigned CrawlStartScanner user interface and added more features:
14 years ago
orbiter 56264dcc17 - added CamelCase parser to MultiProtocolURI: generate better to-be-indexed words from urls
14 years ago
orbiter c36da90261 added a very fast ftp file list generator to site crawler:
14 years ago