Commit Graph

468 Commits (c21966bb439034431b5ca7344ae260c983dc413d)

Author SHA1 Message Date
sixcooler ce248cc8dd less byte-arrays of response-content, less byte-array <-> stream conversation
14 years ago
sixcooler 59b767eebd stop loading via http at defined maximum of bytes - even size is unknown before loading
14 years ago
orbiter 1912d0cccc changed handling of RowSet element retrieval: until today all elements had been copied from the underlying byte[] arrays into a new Entry object that again had a copy of a portion of that byte[] in its own bye[]. There was an option to just refer to the underlying byte[] with a pointer but that was almost never used. This commit now changes an interface to the Row class where it is now necessary to tell if a copy is always required. Fortunately the copy is only needed in very rare cases. That means that this change should cause much less memory allocation; it is expected that this happens especially during search situations.
14 years ago
low012 c7b95e8c81 *) Invalid crawl profiles (containing invalid mustmatch/mustnotmatch filters) will be moved from active crawls to invalid crawls (new file: DATA/INDEX/freeworld/QUEUES/crawlProfilesInvalid.heap). This file can not be edited yet, but it shoudl be easy to extend the CrawlProfileEditor accordingly.
14 years ago
orbiter 719777b2a7 replaced method to call getUsableSpace using reflection with direct call since we now use java 1.6
14 years ago
orbiter 115abc8917 - more attributes for search progress bar
14 years ago
orbiter 4bea3f9714 hack to reduce resource contention caused by massive UTF8 decodings which use java.nio resources:
14 years ago
orbiter 746e3c3b06 Replaced a widely-used Property Object in the httpd with HashMap<String, Object> which is not synchronized like Properties
14 years ago
orbiter 10e2f588f8 - enhanced ybr ranking computation
14 years ago
orbiter bd55dcee50 - commented out experimental distributed ranking loading
14 years ago
orbiter 3ed4a09368 small features, some bug fixes and performance hacks
14 years ago
orbiter 5b579e21a3 code cleanup
14 years ago
orbiter 6fa439c82b - refactoring of robots
14 years ago
f1ori 0b02083e97 * function for simple crawl of one url
14 years ago
orbiter d8e934c085 better abstraction of http client identification
14 years ago
orbiter b77b8cac0c - enhanced html parser: recognized much more details in the content
14 years ago
orbiter 3d5104d357 - fixed a bug in crawl start with file name (npe in new url)
14 years ago
orbiter 958ff4778e enhanced location search:
14 years ago
orbiter 19fd13d3bc Added federated index storage to solr.
14 years ago
orbiter 4c013d9088 more UTF8 getBytes() performance hacks
14 years ago
orbiter 17530ca7b5 fix for bug http://bugs.yacy.net/view.php?id=10
14 years ago
orbiter 96c32e87b0 fixes to crawler and new user-agent crawl-delay handling
14 years ago
orbiter b2fe4b7b1a added a handling of appearances of yacy bot entries in robots.txt if this entry addresses the yacy peer
14 years ago
orbiter 9b25d07295 - added geo information parsing to html parser
14 years ago
f1ori efcf37a953 * show info in log, if robots.txt is rejected due to wrong mime-type
14 years ago
orbiter b1a8d0c020 enhancements to web cache and less strict caching rules
14 years ago
orbiter f3baaca920 - enhancements to DNS IP caching and crawler speed
14 years ago
orbiter 2b5f8585bf performance hack for Balancer and ip address parsing
14 years ago
orbiter 8f11d3a5bb redesigned the ScoreMap classes:
14 years ago
orbiter dc0db3550e avoid string conversion
14 years ago
orbiter 694fa3a2a5 - replaced more direct string-based UTF-8 conversions by predefined UTF-8 conversion
14 years ago
orbiter 30aed9824a moved getBytes() to UTF8.getBytes() to use a default String encoding
14 years ago
orbiter 1214615185 fix for 'invisible entry', see http://forum.yacy-websuche.de/viewtopic.php?p=22133#p22133
14 years ago
orbiter 3820525464 more memory protection: auto-flush of caches in case of memory shortage
14 years ago
orbiter 7962d35425 - removed file upload function in crawl start and replaced it with an input field for a file path where the crawl start file is loaded. This was necessary to support the API steering for file crawl starts, for two reasons:
14 years ago
low012 3b40b98256 *) set SVN properties
14 years ago
orbiter cb1f49d0f2 replaced all 'new String' with default encoding (missing) or UTF-8 encoding with a String generation method that uses a pre-defined Charset constant for UTF-8. This avoids a cache-lookup for the Charset object using String hashing of the String 'UTF-8'.
14 years ago
orbiter 1110d16af9 performance hack: replaced generic row.getColBytes() call with row.getPrimaryKeyBytes() where the column is 0
14 years ago
low012 ce012e11aa *) deleted LogStatistics since the page did not work anymore and it seemed to be obsolete, tell me if you miss it and I will add it again
14 years ago
low012 c5051c4020 *) fixed bug which caused entries to not be deleted when deleting by URL on IndexCreateWWWLocalQueue_p.html (I hope this did not break anything else)
14 years ago
orbiter 4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
14 years ago
orbiter 10ae8d961b - cora package has now no dependencies to other yacy packages and becomes a 'base' package (refactoring)
14 years ago
lotus b1484299b2 same units for memory observer configuration (MiB)
14 years ago
orbiter 0769f4caa6 added search suggestions for interactive search: is only shown if there are no search results
14 years ago
f1ori e4aabaa1c3 * fix negative filelength for files >2G
14 years ago
f1ori ee3cef91e8 * fix filesize in ftp crawls
14 years ago
low012 3d95981f7d *) cleaning up the code a little bit
14 years ago
orbiter e88c428008 fix to ftp loader
14 years ago
orbiter 9b25a33fd9 - fixed numerous bugs
14 years ago
orbiter 7bdb13bf7f more fixes to smb crawling: better file names
14 years ago
orbiter 94c48500cc several fixes
14 years ago
f1ori 9d2159582f * fix system update if urls are in blacklist (for example for very general blacklists like *.de)
14 years ago
orbiter 56264dcc17 - added CamelCase parser to MultiProtocolURI: generate better to-be-indexed words from urls
14 years ago
orbiter a563b05b60 enhanced crawler:
14 years ago
orbiter c36da90261 added a very fast ftp file list generator to site crawler:
14 years ago
orbiter 4e2c14efbb fixed bugs in parser and ftp client
14 years ago
orbiter fffb91447a fixed crawl queue delete function
14 years ago
orbiter b769cce433 - added a catch-all parser for all documents that cannot be parsed: they will contributed with their document url for the search index only
14 years ago
f1ori 741a87a3e9 * make .yacy-domains crawlable (.yacy-domains are local domains, so only in custom networks/peers)
14 years ago
f1ori dca9e16f51 * don't index pages, which redirect, twice
14 years ago
orbiter 09badc697b - low-memory patch for crawler
14 years ago
orbiter 93c535d111 fixed http://forum.yacy-websuche.de/viewtopic.php?p=21113#p21113
14 years ago
orbiter 4c72885cba added a sitemap entry parser and loader for sitemaps
14 years ago
f1ori def4253555 * add option to network definition to provide a domainlist (syntax like in blacklists)
14 years ago
f1ori 7d8de34778 * add a bit documentation to DigestURI, use DigestURI(string) instead of DigestURI(string, null)
14 years ago
orbiter e3e3b49d52 - enhanced main release recognition
14 years ago
orbiter ca738ac924 - added a tag cloud to search results (using the topics)
15 years ago
orbiter e4d561971e added more score cluster options and made score cluster usage more transparent
15 years ago
orbiter e8f90201a5 fix for scheduling of rss feeds
15 years ago
orbiter 6a166c2040 patches for bad proxy behaviour
15 years ago
orbiter 45b1ab3d07 custom + generic skins:
15 years ago
orbiter 0d363a94d7 more performance hacks
15 years ago
orbiter 091dd3f6ec - enhanced intranet search speed
15 years ago
orbiter aacf572a26 - enhancements for search speed
15 years ago
orbiter 2c549ae341 fixed a number of small bugs:
15 years ago
orbiter f6eebb6f99 replaced auto-dom filter with easy-to-understand Site Link-List crawler option
15 years ago
orbiter d2fd93135c - moved yacybot user agent string definition to MultiProtocolURI since there are basic access mechanisms where the bot string is needed
15 years ago
orbiter 48c0d508ac fixes for crawling of smb links (file length not always available)
15 years ago
orbiter 461a2a6ec7 enhanced remote crawling:
15 years ago
orbiter 5870b13f3a - code cleanup / added debug line for further investigation in HTTPDemon.parseMultipart
15 years ago
sixcooler 17eebd4ef8 counting crawler traffic again:
15 years ago
orbiter 348dece62f redesign of the SortStack and SortStore classes:
15 years ago
orbiter 114bdd8ba7 fixed old sitemap importer which was not able to parse urls containing post elements
15 years ago
orbiter 5fe828fa06 - replaced pdfbox and fontbox version 1.1.0 with 1.2.1
15 years ago
orbiter 22047ffad5 enhanced computation speed of many replaceAll string operations
15 years ago
orbiter 9d080f387e change in handling of the all-visible home path for storage in YaCy:
15 years ago
orbiter 65eaf30f77 redesign of crawl profiles data structure. target will be:
15 years ago
orbiter 104318d58a - added nice colors to feed indexing state messages
15 years ago
orbiter 0f276dd63f - MapHeap now implements Map<byte[], Map<String, String>>
15 years ago
orbiter c60d0282fd more abstraction for tables stored in heaps:
15 years ago
orbiter 3197ca42ed preparations to move the HTCache into cora:
15 years ago
orbiter 844f158686 - removed dependencies in header framework:
15 years ago
orbiter 5e7081cd19 refactoring towards a unified loading mechanism for MultiProtocolURIs
15 years ago
orbiter caece04f26 removed System.err and System.out usage from FTPClient; changed logging to log4j (preferred in yacy.cora)
15 years ago
orbiter 90531f78ff refactoring of the cora package to get subpackages for http and ftp (smb to come)
15 years ago
sixcooler 661867923a ... migrating to HttpComponents-Client-4.x ...
15 years ago
orbiter 7aa860c505 - more logging
15 years ago
orbiter 70dd26ec95 added the new crawl scheduling function to the crawl start menu:
15 years ago
orbiter 7fdb17bb96 redirect uncaught exceptions to logging + small other changes
15 years ago
orbiter 87b1684211 additional double-check in balancer
15 years ago