{"id":87225,"date":"2019-01-10T08:00:06","date_gmt":"2019-01-10T16:00:06","guid":{"rendered":"https:\/\/www.backblaze.com\/blog\/?p=87225"},"modified":"2025-12-12T07:06:38","modified_gmt":"2025-12-12T15:06:38","slug":"wide-partitions-in-apache-cassandra-3-11","status":"publish","type":"post","link":"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/","title":{"rendered":"How We Optimized Storage and Performance of Apache Cassandra at Backblaze"},"content":{"rendered":"<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/blog-guest-post-header.jpg\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-87375 size-full\" title=\"\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/blog-guest-post-header.jpg\" alt=\"Guest post by Mick Semb Wever\" width=\"1440\" height=\"810\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/blog-guest-post-header.jpg 1440w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/blog-guest-post-header-300x169.jpg 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/blog-guest-post-header-1024x576.jpg 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/blog-guest-post-header-768x432.jpg 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/blog-guest-post-header-730x411.jpg 730w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/blog-guest-post-header-560x315.jpg 560w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/blog-guest-post-header-220x124.jpg 220w\" sizes=\"auto, (max-width: 1440px) 100vw, 1440px\" \/><\/a><\/p>\n<div class=\"abstract\" style=\"line-height: 1.8; padding: 10px 15px 18px 15px;\">\n<p>Backblaze uses Apache Cassandra, a high-performance, scalable distributed database to help 
manage hundreds of <a href=\"\/blog\/petabytes-on-a-budget-10-years-and-counting\/\">petabytes<\/a> of data. We engaged the folks at The Last Pickle to use their extensive experience to optimize the capabilities and performance of our Cassandra 3.11 cluster, and now they want to share their experience with a wider audience to explain what they found. We agree; enjoy!<\/p>\n<p style=\"margin: 0 0 0 70%; padding: 0;\">&#8212; Andy<\/p>\n<\/div>\n<h2 style=\"text-align: center; margin: 32px auto 18px auto;\">Wide Partitions in Apache Cassandra 3.11<\/h2>\n<p style=\"text-align: center;\">by Mick Semb Wever, Consultant, <a href=\"http:\/\/thelastpickle.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">The Last Pickle<\/a><\/p>\n<p>Wide partitions in Cassandra can put tremendous pressure on the Java heap and garbage collector, impact read latencies, and can cause issues ranging from load shedding and dropped messages to crashed and downed nodes.<\/p>\n<p>While the theoretical limit on the number of cells per partition has always been two billion cells, the reality has been quite different, as the impacts of heap pressure show. To mitigate these problems, the community has offered a standard recommendation for Cassandra users to keep partitions under 400MB, and preferably under 100MB.<\/p>\n<p>However, in version 3 many improvements were made that affected how Cassandra handles wide partitions. 
Memtables, caches, and SSTable components were moved off-heap, the storage engine was rewritten in <a href=\"https:\/\/issues.apache.org\/jira\/browse\/CASSANDRA-8099\" target=\"_blank\" rel=\"noopener noreferrer\">CASSANDRA-8099<\/a>, and Robert Stupp made a number of other improvements listed under <a href=\"https:\/\/issues.apache.org\/jira\/browse\/CASSANDRA-11206\" target=\"_blank\" rel=\"noopener noreferrer\">CASSANDRA-11206<\/a>.<\/p>\n<p>While working with <a href=\"https:\/\/www.backblaze.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">Backblaze<\/a> and operating a Cassandra version 3.11 cluster, we had the opportunity to test and validate how Cassandra actually handles partitions with this latest version. We will demonstrate that well-designed data models can go beyond the existing 400MB recommendation without nodes crashing from heap pressure.<\/p>\n<p>Below, we walk through how Cassandra writes partitions to disk in 3.11, look at how wide partitions impact read latencies, and then present our testing and verification of wide partition impacts on the cluster using the work we did with Backblaze.<\/p>\n<h2 class=\"b2\">The Art and Science of Writing Wide Partitions to Disk<\/h2>\n<p>First we need to understand what a partition is and how Cassandra writes partitions to disk in version 3.11.<\/p>\n<p>Each SSTable contains a set of files, and the <code>-Data.db<\/code> file contains numerous partitions.<\/p>\n<p>The layout of a partition in the <code>-Data.db<\/code> file has three components: a header, followed by zero or one static rows, which is followed by zero or more ordered <a href=\"https:\/\/github.com\/apache\/cassandra\/blob\/cassandra-3.11\/src\/java\/org\/apache\/cassandra\/db\/Clusterable.java\" target=\"_blank\" rel=\"noopener noreferrer\">Clusterable<\/a> objects. Each Clusterable object in this file is either a row or a RangeTombstone that deletes data, and a wide partition contains many Clusterable objects. 
For an excellent in-depth examination of this, see Aaron\u2019s blog post <a href=\"http:\/\/thelastpickle.com\/blog\/2016\/03\/04\/introductiont-to-the-apache-cassandra-3-storage-engine.html\" target=\"_blank\" rel=\"noopener noreferrer\">Cassandra 3.x Storage Engine<\/a>.<\/p>\n<p>The <code>-Index.db<\/code> file stores offsets for the partitions, as well as the serialized <code>IndexInfo<\/code> objects for each partition. These indices facilitate locating the data on disk within the <code>-Data.db<\/code> file. Stored partition offsets are represented by <a href=\"https:\/\/github.com\/apache\/cassandra\/blob\/cassandra-3.11\/src\/java\/org\/apache\/cassandra\/db\/RowIndexEntry.java\" target=\"_blank\" rel=\"noopener noreferrer\">RowIndexEntry<\/a> or one of its subclasses. The class is chosen by the <a href=\"https:\/\/github.com\/apache\/cassandra\/blob\/cassandra-3.11\/src\/java\/org\/apache\/cassandra\/db\/RowIndexEntry.java#L207-L230\" target=\"_blank\" rel=\"noopener noreferrer\">ColumnIndex<\/a> and depends on the size of the partition:<\/p>\n<ul>\n<li><code>RowIndexEntry<\/code> is used when there are no Clusterable objects in the partition, such as when there is only a static row. In this case there are no <code>IndexInfo<\/code> objects to store and so the parent <code>RowIndexEntry<\/code> class is used rather than a subclass.<\/li>\n<li>The <code>IndexedEntry<\/code> subclass holds the <code>IndexInfo<\/code> objects in memory until the partition has finished writing to disk. It is used in partitions where the total serialized size of the <code>IndexInfo<\/code> objects is <strong>less<\/strong> than the <code>column_index_cache_size_in_kb<\/code> configuration setting (which defaults to 2KB).<\/li>\n<li>The <code>ShallowIndexedEntry<\/code> subclass serializes <code>IndexInfo<\/code> objects to disk as they are created and references these objects using only their position in the file. 
This is used in partitions where the total serialized size of the <code>IndexInfo<\/code> objects is <strong>more<\/strong> than the <code>column_index_cache_size_in_kb<\/code> configuration setting.<\/li>\n<\/ul>\n<p>These <a href=\"https:\/\/github.com\/apache\/cassandra\/blob\/cassandra-3.11\/src\/java\/org\/apache\/cassandra\/io\/sstable\/IndexInfo.java\" target=\"_blank\" rel=\"noopener noreferrer\">IndexInfo<\/a> objects provide a sampling of positional offsets for rows within a partition, creating an index. Each object specifies the offset the page starts at, the first row and the last row.<\/p>\n<p>So, in general, the bigger the partition, the more <code>IndexInfo<\/code> objects need to be created when writing to disk &#8212; and if they are held in memory until the partition is fully written to disk they can cause memory pressure. This is why the <code>column_index_cache_size_in_kb<\/code> setting was added in Cassandra 3.6 and the objects are now serialized as they are created.<\/p>\n<p>The relationship between partition size and the number of objects was quantified by Robert Stupp in his presentation, <a href=\"https:\/\/www.slideshare.net\/DataStax\/myths-of-big-partitions-robert-stupp-datastax-cassandra-summit-2016\" target=\"_blank\" rel=\"noopener noreferrer\">Myths of Big Partitions<\/a>.<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/TLP-index-info-numbers.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-87234\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/TLP-index-info-numbers.png\" alt=\"IndexInfo numbers from Robert Stupp\" width=\"1440\" height=\"748\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/TLP-index-info-numbers.png 1440w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/TLP-index-info-numbers-300x156.png 
300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/TLP-index-info-numbers-1024x532.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/TLP-index-info-numbers-768x399.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/TLP-index-info-numbers-560x291.png 560w\" sizes=\"auto, (max-width: 1440px) 100vw, 1440px\" \/><\/a><\/p>\n<h2 class=\"b2\">How Wide Partitions Impact Read Latencies<\/h2>\n<p>Cassandra\u2019s key cache is an optimization that is enabled by default and helps to improve the speed and efficiency of the read path by reducing the amount of disk activity per read.<\/p>\n<p>Each key cache entry is identified by a combination of the keyspace, table name, SSTable, and the partition key. The value of each key cache entry is a <code>RowIndexEntry<\/code> or one of its subclasses: either <code>IndexedEntry<\/code> or the new <code>ShallowIndexedEntry<\/code>. The size of the key cache is limited by the <code>key_cache_size_in_mb<\/code> configuration setting.<\/p>\n<p>When a read operation in the storage engine gets a cache hit it avoids having to access the <code>-Summary.db<\/code> and <code>-Index.db<\/code> SSTable components, which reduces that read request\u2019s latency. Wide partitions, however, can decrease the efficiency of this key cache optimization because fewer hot partitions will fit into the allocated cache size.<\/p>\n<p>Indeed, before the <code>ShallowIndexedEntry<\/code> was added in Cassandra version 3.6, a single wide row could fill the key cache, reducing the hit rate efficiency. 
When applied to multiple rows, this will cause greater churn of additions and evictions of cache entries.<\/p>\n<p>For example, if the <code>IndexedEntry<\/code> for a 512MB partition contains 100K+ <code>IndexInfo<\/code> objects and if these <code>IndexInfo<\/code> objects total 1.4MB, then the key cache would only be able to hold 140 entries.<\/p>\n<p>The introduction of <code>ShallowIndexedEntry<\/code> objects changed how the key cache can hold data. The <code>ShallowIndexedEntry<\/code> contains a list of file pointers referencing the serialized <code>IndexInfo<\/code> objects and can binary search through this list, rather than having to deserialize the entire <code>IndexInfo<\/code> objects list. Thus when the <code>ShallowIndexedEntry<\/code> is used no <code>IndexInfo<\/code> objects exist within the key cache. This lets the key cache store more entries in the same space, but still requires that the <code>IndexInfo<\/code> objects are binary searched and deserialized from the <code>-Index.db<\/code> file on a cache hit.<\/p>\n<p>In short, on wide partitions a key cache miss still results in two additional disk reads, as it did before Cassandra 3.6, but now a key cache hit incurs a disk read to the <code>-Index.db<\/code> file where it did not before Cassandra 3.6.<\/p>\n<h2 class=\"b2\">Object Creation and Heap Behavior with Wide Partitions in 2.2.13 vs 3.11.3<\/h2>\n<p>Introducing the <code>ShallowIndexedEntry<\/code> into Cassandra version 3.6 created a measurable improvement in the performance of wide partitions. To test the effects of this and the other performance enhancement features introduced in version 3, we compared how Cassandra 2.2.13 and 3.11.3 performed when one hundred thousand, one million, or ten million rows were each written to a single partition.<\/p>\n<p>The results and accompanying screenshots help illustrate the impact of object creation and heap behavior when inserting rows into wide partitions. 
While version 2.2.13 crashed repeatedly during this test, 3.11.3 was able to write over 30 million rows to a single partition before Cassandra crashed with an Out-of-Memory error. The test and results are reproduced below.<\/p>\n<p>Both Cassandra versions were started as single-node clusters with default configurations, except for heap customization in <code>cassandra-env.sh<\/code>:<\/p>\n<div class=\"pre-text\">MAX_HEAP_SIZE=&quot;1G&quot;<br \/>\nHEAP_NEWSIZE=&quot;600M&quot;<\/div>\n<p>In Cassandra, only the configured concurrency of memtable flushes and compactors determines how many partitions a node processes, and thus how many put pressure on its heap, at any one time. Based on this known concurrency limitation, profiling can be done by inserting data into one partition against one Cassandra node with a small heap. These results extrapolate to production environments.<\/p>\n<p>The <a href=\"http:\/\/thelastpickle.com\/blog\/2018\/10\/31\/tlp-stress-intro.html\" target=\"_blank\" rel=\"noopener noreferrer\">tlp-stress<\/a> tool inserted data in three separate profiling passes against both versions of Cassandra, creating wide partitions of one hundred thousand (<code class=\"highlighter-rouge language-bash\">100K<\/code>), one million (<code>1M<\/code>), or ten million (<code>10M<\/code>) rows.<\/p>\n<p>A <code>tlp-stress<\/code> profile for wide partitions was written, as no suitable profile existed. 
The read to write ratio used the default setting of 1:100.<\/p>\n<p>The following commands then ran the <code>tlp-stress<\/code> tool:<\/p>\n<div class=\"pre-text\"># To write 100K rows into one partition<br \/>\ntlp-stress run Wide --replication &quot;{'class':'SimpleStrategy','replication_factor': 1}&quot; -n 100K<br \/>\n# To write 1M rows into one partition<br \/>\ntlp-stress run Wide --replication &quot;{'class':'SimpleStrategy','replication_factor': 1}&quot; -n 1M<br \/>\n# To write 10M rows into one partition<br \/>\ntlp-stress run Wide --replication &quot;{'class':'SimpleStrategy','replication_factor': 1}&quot; -n 10M<\/div>\n<p>Each <code>tlp-stress<\/code> run was immediately followed by a command to ensure the full count of specified rows passed through the memtable flush and was written to disk:<\/p>\n<div class=\"pre-text\">nodetool flush<\/div>\n<p>The graphs in the sections below, taken from the <a href=\"https:\/\/twitter.com\/errcraft\/status\/1062652047395352576\" target=\"_blank\" rel=\"noopener noreferrer\">Apache NetBeans<\/a> Profiler, illustrate how the <code>ShallowIndexedEntry<\/code> in Cassandra version 3.11 avoids keeping <code>IndexInfo<\/code> objects in memory.<\/p>\n<p>Notably, the <code>IndexInfo<\/code> objects are instantiated far more often, but are referenced for much shorter periods of time. 
The Garbage Collector is more effective at removing short-lived objects, as illustrated by the GC pause times being barely present in the Cassandra 3.11 graphs compared to Cassandra 2.2 where GC pause times overwhelm the JVM.<\/p>\n<h3 class=\"b3\">Wide Partitions in Cassandra 2.2<\/h3>\n<p>Benchmarks were run against Cassandra 2.2.13.<\/p>\n<h4 class=\"b4\">One Partition with 100K Rows (2.2.13)<\/h4>\n<p>The following three screenshots show the number of <code>IndexInfo<\/code> objects instantiated during the write benchmark, the number instantiated during compaction, and a heap profile.<\/p>\n<p><b>The partition grew to be ~40MB.<\/b><\/p>\n<p>Objects created during <code>tlp-stress<\/code><\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image3.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-87239 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image3.png\" alt=\"screenshot of Cassandra 2.2 objects created during tlp-stress\" width=\"1999\" height=\"215\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image3.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image3-300x32.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image3-1024x110.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image3-768x83.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image3-1536x165.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image3-560x60.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>Objects created during subsequent major compaction<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image15.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" 
title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-87241 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image15.png\" alt=\"screenshot of Cassandra 2.2 objects created during subsequent major compaction\" width=\"1999\" height=\"215\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image15.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image15-300x32.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image15-1024x110.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image15-768x83.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image15-1536x165.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image15-560x60.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>Heap profiled during <code>tlp-stress<\/code> and major compaction<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image9.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-87240 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image9.png\" alt=\"screenshot of Cassandra 2.2 Heap profiled during tlp-stress and major compaction\" width=\"1999\" height=\"1278\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image9.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image9-300x192.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image9-1024x655.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image9-768x491.png 768w, 
https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image9-1536x982.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image9-560x358.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>The above diagrams do not have their x-axis expanded to the full width, but still encompass the startup, stress test, flush, and compaction periods of the benchmark.<\/p>\n<p>When stress testing starts with <code>tlp-stress<\/code>, the CPU Time and Surviving Generations start to climb. During this time the heap also starts to increase and decrease more frequently as it fills up and then the Garbage Collector cleans it out. In these diagrams the garbage collection intervals are easy to identify and isolate from one another.<\/p>\n<h4 class=\"b4\">One Partition with 1M Rows (2.2.13)<\/h4>\n<p>Here, the first two screenshots show the number of <code>IndexInfo<\/code> objects instantiated during the write benchmark and during the subsequent compaction process. 
The third screenshot shows the CPU &amp; GC Pause Times and the heap profile from the time writes started until compaction was completed.<\/p>\n<p><b>The partition grew to be ~400MB.<\/b><\/p>\n<p>Already at this size the Cassandra JVM is thrashing in garbage collection and has occasionally crashed with Out-of-Memory errors.<\/p>\n<p>Objects created during <code>tlp-stress<\/code><\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image12.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-87245 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image12.png\" alt=\"screenshot of Cassandra 2.2.13 Objects created during tlp-stress\" width=\"1999\" height=\"215\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image12.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image12-300x32.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image12-1024x110.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image12-768x83.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image12-1536x165.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image12-560x60.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>Objects created during subsequent major compaction<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image6.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-87244 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image6.png\" alt=\"screenshot of Cassandra 2.2.13 Objects created during subsequent major compaction\" width=\"1999\" height=\"215\" 
srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image6.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image6-300x32.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image6-1024x110.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image6-768x83.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image6-1536x165.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image6-560x60.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>Heap profiled during <code>tlp-stress<\/code> and major compaction<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image16.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-87246 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image16.png\" alt=\"screenshot of Cassandra 2.2.13 Heap profiled during tlp-stress and major compaction\" width=\"1999\" height=\"1278\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image16.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image16-300x192.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image16-1024x655.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image16-768x491.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image16-1536x982.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image16-560x358.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>The above diagrams display a longer running benchmark, with the quiet period during the startup barely 
noticeable on the very left-hand side of each diagram. Garbage collection intervals and oscillations in heap size are far more frequent. The GC Pause Time during the stress testing period is now consistently higher and comparable to the CPU Time. It only dissipates when the benchmark performs the flush and compaction.<\/p>\n<h4 class=\"b4\">One Partition with 10M Rows (2.2.13)<\/h4>\n<p>In this final test of Cassandra version 2.2.13, the results were difficult to reproduce reliably, as more often than not this test crashed with an Out-of-Memory error caused by heap pressure.<\/p>\n<p>The first two screenshots show the number of <code>IndexInfo<\/code> objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the GC Pause Time and the heap profile from the time writes started until compaction was completed.<\/p>\n<p><b>The partition grew to be ~4GB.<\/b><\/p>\n<p>Objects created during <code>tlp-stress<\/code><\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image8.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-87249\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image8.png\" alt=\"screenshot of Cassandra 2.2.13 objects created during tlp-stress\" width=\"1999\" height=\"215\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image8.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image8-300x32.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image8-1024x110.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image8-768x83.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image8-1536x165.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image8-560x60.png 560w\" sizes=\"auto, (max-width: 
1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>Objects created during subsequent major compaction<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image13.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-87250\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image13.png\" alt=\"\" width=\"1999\" height=\"215\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image13.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image13-300x32.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image13-1024x110.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image13-768x83.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image13-1536x165.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image13-560x60.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>Heap profiled during <code>tlp-stress<\/code> and major compaction<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image18.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-87251 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image18.png\" alt=\"screenshot of Cassandra Heap profiled during tlp-stress and major compaction\" width=\"1999\" height=\"1278\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image18.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image18-300x192.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image18-1024x655.png 1024w, 
https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image18-768x491.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image18-1536x982.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image18-560x358.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>The above diagrams display consistently very high GC Pause Time compared to CPU Time. Any Cassandra node under this much duress from garbage collection is not healthy. It suffers from high read latencies, could become blacklisted by other nodes due to its lack of responsiveness, and could even crash altogether from Out-of-Memory errors (as it did often during this benchmark).<\/p>\n<h3 class=\"b3\">Wide Partitions in Cassandra 3.11.3<\/h3>\n<p>Benchmarks were run against Cassandra 3.11.3.<\/p>\n<p>In this series, the graphs demonstrate how <code>IndexInfo<\/code> objects are created either from memtable flushes or from deserialization off disk. The <code>ShallowIndexedEntry<\/code> is used in Cassandra 3.11.3 when deserializing the <code>IndexInfo<\/code> objects from the <code>-Index.db<\/code> file.<\/p>\n<p>Neither form of <code>IndexInfo<\/code> object resides long in the heap, and thus the GC Pause Time is barely visible in comparison to Cassandra 2.2.13 despite the additional numbers of <code>IndexInfo<\/code> objects created via deserialization.<\/p>\n<h4 class=\"b4\">One Partition with 100K Rows (3.11.3)<\/h4>\n<p>As with the earlier version test of this size, the first two screenshots show the number of <code>IndexInfo<\/code> objects instantiated during the write benchmark and during the subsequent compaction process. 
The third screenshot shows the CPU &amp; GC Pause Time and the heap profile from the time writes started through when the compaction was completed.<\/p>\n<p><b>The partition grew to be ~40MB, the same as with Cassandra 2.2.13<\/b><\/p>\n<p>Objects created during <code>tlp-stress<\/code><\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image2.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-87253 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image2.png\" alt=\"screenshot of Cassandra 3.11.3 objects created during tlp-stress\" width=\"1999\" height=\"304\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image2.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image2-300x46.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image2-1024x156.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image2-768x117.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image2-1536x234.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image2-560x85.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>Objects created during subsequent major compaction<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image7.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-87255 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image7.png\" alt=\"screenshot of Cassandra 3.11.3 objects created during subsequent major compaction\" width=\"1999\" height=\"185\" 
srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image7.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image7-300x28.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image7-1024x95.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image7-768x71.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image7-1536x142.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image7-560x52.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>Heap profiled during <code>tlp-stress<\/code> and major compaction<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image4.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-87254 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image4.png\" alt=\"screenshot of Cassandra 3.11.3 Heap profiled during tlp-stress and major compaction\" width=\"1999\" height=\"1103\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image4.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image4-300x166.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image4-1024x565.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image4-768x424.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image4-1536x848.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image4-560x309.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>The diagrams above are roughly comparable to the first diagrams presented under Cassandra 2.2.13, except here the 
x-axis is expanded to full width. Note that there are significantly more instantiated <code>IndexInfo<\/code> objects, yet barely any noticeable GC Pause Time.<\/p>\n<h4 class=\"b4\">One Partition with 1M Rows (3.11.3)<\/h4>\n<p>Again, the first two screenshots show the number of <code>IndexInfo<\/code> objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the CPU &amp; GC Pause Time and the heap profile from the time writes started until the compaction was completed.<\/p>\n<p><b>The partition grew to be ~400MB, the same as with Cassandra 2.2.13<\/b><\/p>\n<p>Objects created during <code>tlp-stress<\/code><\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image14.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-87262\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image14.png\" alt=\"\" width=\"1999\" height=\"304\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image14.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image14-300x46.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image14-1024x156.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image14-768x117.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image14-1536x234.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image14-560x85.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>Objects created during subsequent major compaction<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image3-1.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" 
class=\"aligncenter size-full wp-image-87261\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image3-1.png\" alt=\"\" width=\"1999\" height=\"215\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image3-1.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image3-1-300x32.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image3-1-1024x110.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image3-1-768x83.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image3-1-1536x165.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image3-1-560x60.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>Heap profiled during <code>tlp-stress<\/code> and major compaction<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image20.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-87263\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image20.png\" alt=\"\" width=\"1999\" height=\"1103\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image20.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image20-300x166.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image20-1024x565.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image20-768x424.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image20-1536x848.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image20-560x309.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>The 
above diagrams show a wildly oscillating heap as many <code>IndexInfo<\/code> objects are created, along with many garbage collection intervals, yet the GC Pause Time remains low, if noticeable at all.<\/p>\n<h4 class=\"b4\">One Partition with 10M Rows (3.11.3)<\/h4>\n<p>Here again, the first two screenshots show the number of <code>IndexInfo<\/code> objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the CPU &amp; GC Pause Time and the heap profile from the time writes started until the compaction was completed.<\/p>\n<p><b>The partition grew to be ~4GB, the same as with Cassandra 2.2.13<\/b><\/p>\n<p>Objects created during <code>tlp-stress<\/code><\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image11.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-87273\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image11.png\" alt=\"\" width=\"1999\" height=\"276\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image11.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image11-300x41.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image11-1024x141.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image11-768x106.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image11-1536x212.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image11-560x77.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>Objects created during subsequent major compaction<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image19.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img 
loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-87274\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image19.png\" alt=\"\" width=\"1999\" height=\"202\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image19.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image19-300x30.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image19-1024x103.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image19-768x78.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image19-1536x155.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image19-560x57.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<p>Heap profiled during <code>tlp-stress<\/code> and major compaction<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image5.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-87272\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image5.png\" alt=\"\" width=\"1999\" height=\"1103\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image5.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image5-300x166.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image5-1024x565.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image5-768x424.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image5-1536x848.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image5-560x309.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" 
\/><\/a><\/p>\n<p>Unlike the same profile in 2.2.13, the cluster remains stable, as it was when running 1M rows per partition. The above diagrams display an oscillating heap as <code>IndexInfo<\/code> objects are created, along with many garbage collection intervals, yet GC Pause Time remains low, if noticeable at all.<\/p>\n<h4 class=\"b4\">Maximum Rows in 1GB Heap (3.11.3)<\/h4>\n<p>In an attempt to push Cassandra 3.11.3 to the limit, we ran a test to see how much data could be written to a single partition before Cassandra crashed with an Out-of-Memory error.<\/p>\n<p><b>The result was 30M+ rows, which is ~12GB of data on disk.<\/b><\/p>\n<p>This is similar to the limit of 17GB of data written to a single partition that Robert Stupp found in <a href=\"https:\/\/issues.apache.org\/jira\/browse\/CASSANDRA-9754\" target=\"_blank\" rel=\"noopener noreferrer\">CASSANDRA-9754<\/a> when using a 5GB Java heap.<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image1.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-87279 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image1.png\" alt=\"screenshot of Cassandra 3.11.3 memory usage\" width=\"1999\" height=\"1103\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image1.png 1999w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image1-300x166.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image1-1024x565.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image1-768x424.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image1-1536x848.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image1-560x309.png 560w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/a><\/p>\n<h4 
class=\"b4\">What about Reads<\/h4>\n<p>The following graph reruns the benchmark on Cassandra version 3.11.3 over a longer period of time with a read to write ratio of 10:1. It illustrates that reads of wide partitions do not create the heap pressure that writes do.<\/p>\n<p><a href=\"\/blog\/wp-content\/uploads\/2019\/01\/image21.png\" data-rel=\"lightbox-gallery-sFsQPakb\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-87283 size-full\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/image21.png\" alt=\"screenshot of Cassandra 3.11.3 read functions\" width=\"2816\" height=\"940\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image21.png 2816w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image21-300x100.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image21-1024x342.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image21-768x256.png 768w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image21-1536x513.png 1536w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image21-2048x684.png 2048w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image21-1600x533.png 1600w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/image21-560x187.png 560w\" sizes=\"auto, (max-width: 2816px) 100vw, 2816px\" \/><\/a><\/p>\n<h3 class=\"b3\">Conclusion<\/h3>\n<p>While the 400MB community recommendation for partition size is clearly appropriate for version 2.2.13, version 3.11.3 shows that performance improvements have created a tremendous ability to handle wide partitions and they can easily be an order of magnitude larger than earlier versions of Cassandra without nodes crashing through heap pressure.<\/p>\n<p>The trade-off for better 
supporting wide partitions in Cassandra 3.11.3 is increased read latency, as row offsets now need to be read off disk. However, modern SSDs and kernel page caches, which take advantage of larger configurations of physical memory, provide enough IO improvement to compensate for the read latency trade-off.<\/p>\n<p>The improved stability, together with falling back on better hardware to deal with the read latency issue, allows Cassandra operators to worry less about how to store massive amounts of data in different schemas and about unexpected data growth patterns on those schemas.<\/p>\n<p>Some <a href=\"https:\/\/issues.apache.org\/jira\/browse\/CASSANDRA-9754\" target=\"_blank\" rel=\"noopener noreferrer\">CASSANDRA-9754<\/a> custom B+ tree structures will be used to more effectively look up the deserialized row offsets and further avoid the deserialization and instantiation of short-lived unused <code>IndexInfo<\/code> objects.<\/p>\n<hr \/>\n<table style=\"margin: 0;\" border=\"0\" cellpadding=\"8\">\n<tbody>\n<tr>\n<td><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-87301\" style=\"border-radius: 50%;\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2019\/01\/mick-semb-wever.jpg\" alt=\"Mick Semb Wever\" width=\"340\" height=\"340\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/mick-semb-wever.jpg 380w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/mick-semb-wever-300x300.jpg 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/mick-semb-wever-150x150.jpg 150w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/mick-semb-wever-80x80.jpg 80w\" sizes=\"auto, (max-width: 340px) 100vw, 340px\" \/><\/td>\n<td style=\"text-align: left; vertical-align: middle;\">Mick Semb Wever designs, builds, and is an evangelist for distributed systems, from data-driven backends using Cassandra, Hadoop, Spark, to enterprise microservices 
platforms.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<hr \/>\n","protected":false},"excerpt":{"rendered":"<p>Conventional wisdom for Apache\u2019s high-performance, scalable database has been to avoid wide partitions for risk of impacting database performance and storage limitations. Backblaze\u2019s work with a Cassandra consultant shows how version 3.11 can dramatically decrease garbage collection, lower latencies, and require nearly 30% less space when using wide partitions versus version 2.2.<\/p>\n","protected":false},"author":12,"featured_media":87375,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[131,25],"tags":[471,373],"class_list":["post-87225","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-backblaze-bits","category-partners","tag-businessbackup","tag-developer","entry"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Apache Cassandra Performance: Database Optimization Tips<\/title>\n<meta name=\"description\" content=\"We will demonstrate that well designed data models can go beyond the existing 400MB recommendation without nodes crashing through heap pressure.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Apache Cassandra Performance: Database Optimization Tips\" \/>\n<meta property=\"og:description\" content=\"We will demonstrate that well designed data models can go beyond the existing 400MB recommendation without nodes crashing through heap pressure.\" 
\/>\n<meta property=\"og:url\" content=\"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/\" \/>\n<meta property=\"og:site_name\" content=\"Backblaze Blog | Cloud Storage &amp; Cloud Backup\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/backblaze\" \/>\n<meta property=\"article:published_time\" content=\"2019-01-10T16:00:06+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-12-12T15:06:38+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/blog-guest-post-header.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1440\" \/>\n\t<meta property=\"og:image:height\" content=\"810\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Andy Klein\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@backblaze\" \/>\n<meta name=\"twitter:site\" content=\"@backblaze\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Andy Klein\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"14 minutes\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Apache Cassandra Performance: Database Optimization Tips","description":"We will demonstrate that well designed data models can go beyond the existing 400MB recommendation without nodes crashing through heap pressure.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/","og_locale":"en_US","og_type":"article","og_title":"Apache Cassandra Performance: Database Optimization Tips","og_description":"We will demonstrate that well designed data models can go beyond the existing 400MB recommendation without nodes crashing through heap pressure.","og_url":"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/","og_site_name":"Backblaze Blog | Cloud Storage &amp; Cloud Backup","article_publisher":"https:\/\/www.facebook.com\/backblaze","article_published_time":"2019-01-10T16:00:06+00:00","article_modified_time":"2025-12-12T15:06:38+00:00","og_image":[{"width":1440,"height":810,"url":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/blog-guest-post-header.jpg","type":"image\/jpeg"}],"author":"Andy Klein","twitter_card":"summary_large_image","twitter_creator":"@backblaze","twitter_site":"@backblaze","twitter_misc":{"Written by":"Andy Klein","Est. 
reading time":"14 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/#article","isPartOf":{"@id":"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/"},"author":{"name":"Andy Klein","@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#\/schema\/person\/9ac7e0bf0bd16852f8bfef352ce5fa8c"},"headline":"How We Optimized Storage and Performance of Apache Cassandra at Backblaze","datePublished":"2019-01-10T16:00:06+00:00","dateModified":"2025-12-12T15:06:38+00:00","mainEntityOfPage":{"@id":"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/"},"wordCount":2605,"commentCount":0,"publisher":{"@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/#primaryimage"},"thumbnailUrl":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/blog-guest-post-header.jpg","keywords":["BusinessBackup","Developer"],"articleSection":["Backblaze Bits","Partners"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/","url":"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/","name":"Apache Cassandra Performance: Database Optimization 
Tips","isPartOf":{"@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/#primaryimage"},"image":{"@id":"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/#primaryimage"},"thumbnailUrl":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/blog-guest-post-header.jpg","datePublished":"2019-01-10T16:00:06+00:00","dateModified":"2025-12-12T15:06:38+00:00","description":"We will demonstrate that well designed data models can go beyond the existing 400MB recommendation without nodes crashing through heap pressure.","breadcrumb":{"@id":"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/#primaryimage","url":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/blog-guest-post-header.jpg","contentUrl":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/blog-guest-post-header.jpg","width":1440,"height":810},{"@type":"BreadcrumbList","@id":"https:\/\/www.backblaze.com\/blog\/wide-partitions-in-apache-cassandra-3-11\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/"},{"@type":"ListItem","position":2,"name":"How We Optimized Storage and Performance of Apache Cassandra at Backblaze"}]},{"@type":"WebSite","@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#website","url":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/","name":"Backblaze Cloud Solutions Blog","description":"Cloud Storage &amp; Cloud 
Backup","publisher":{"@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#organization","name":"Backblaze","url":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/i0.wp.com\/www.backblaze.com\/blog\/wp-content\/uploads\/2017\/12\/backblaze_icon_transparent.png?fit=512%2C512&ssl=1","contentUrl":"https:\/\/i0.wp.com\/www.backblaze.com\/blog\/wp-content\/uploads\/2017\/12\/backblaze_icon_transparent.png?fit=512%2C512&ssl=1","width":512,"height":512,"caption":"Backblaze"},"image":{"@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/backblaze","https:\/\/x.com\/backblaze","https:\/\/www.youtube.com\/user\/Backblaze","https:\/\/en.wikipedia.org\/wiki\/Backblaze"]},{"@type":"Person","@id":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/#\/schema\/person\/9ac7e0bf0bd16852f8bfef352ce5fa8c","name":"Andy Klein","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/04\/andy.jpg","url":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/04\/andy.jpg","contentUrl":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/04\/andy.jpg","caption":"Andy Klein"},"description":"Andy Klein is the Principal Cloud Storage Storyteller at Backblaze. 
He has over 25 years of experience in technology marketing and during that time, he has shared his expertise in cloud storage and computer security at events, symposiums, and panels at RSA, SNIA SDC, MIT, the Federal Trade Commission, and hundreds more. He currently writes and rants about drive stats, Storage Pods, cloud storage, and more.","url":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/author\/andy\/"}]}},"jetpack_featured_media_url":"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2019\/01\/blog-guest-post-header.jpg","_links":{"self":[{"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/posts\/87225","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/users\/12"}],"replies":[{"embeddable":true,"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/comments?post=87225"}],"version-history":[{"count":0,"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/posts\/87225\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/media\/87375"}],"wp:attachment":[{"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/media?parent=87225"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/categories?post=87225"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/backblazeprod.wpenginepowered.com\/blog\/wp-json\/wp\/v2\/tags?post=87225"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}