Introduction
I was recently involved in an investigation into a Memcached cluster running on AWS ElastiCache and discovered a few interesting details about Memcached. This post looks at the memory overhead of storing items in Memcached and demonstrates the impact of disabling CAS (check and set) on memory utilisation.
Background - Memcached item overhead
Before getting into the details, some background might be useful. Starting with the basics, how much memory does Memcached actually need to store an item? Let's spin up a cluster to check. For simplicity I am using a single m3.large ElastiCache node (running Memcached 1.4.24) which, according to the AWS docs, provides around 6.2 GB of memory for cache items.
The easiest way to look at our new cache is by telnetting to the cluster (from within the VPC) and running commands directly. To start, what does the slab allocation look like:
> stats slabs
< STAT active_slabs 0
< STAT total_malloced 0
< END
As expected, our empty cache has 0 active slabs and no memory allocated. Let's add a value and see how this changes things:
> set k 0 500 1
> v
< STORED
Above, we have set an item with a key of 'k', no flags, a 500 second expiry time and a 1 byte value of 'v'. Looking at the slab stats we now see:
> stats slabs
< STAT 1:chunk_size 96
< STAT 1:chunks_per_page 10922
< STAT 1:total_pages 1
< STAT 1:total_chunks 10922
< STAT 1:used_chunks 1
< STAT 1:free_chunks 10921
< STAT 1:free_chunks_end 0
< STAT 1:mem_requested 67
< STAT 1:get_hits 0
< STAT 1:cmd_set 1
< STAT 1:delete_hits 0
< STAT 1:incr_hits 0
< STAT 1:decr_hits 0
< STAT 1:cas_hits 0
< STAT 1:cas_badval 0
< STAT 1:touch_hits 0
< STAT active_slabs 1
< STAT total_malloced 1048512
< END
Lots of information here, but let's focus on the chunk and memory stats for the moment. We now have 1 active slab with a chunk size of 96 bytes, and one of these chunks (used_chunks) has been used to store our item. This is all as expected, apart from the relatively large size of the chunk used. The actual memory requested was 67 bytes, which means that 65 bytes of overhead are needed to store the 2 bytes of item data (a 1 byte key and a 1 byte value). Looking into the Memcached code, the item struct defined in memcached.h requires 48 bytes plus another 8 bytes for CAS. The remaining 11 bytes consist of:
- The null terminated key (2 bytes, 1 byte of overhead, "k\0" in our example)
- The CRLF terminated data (3 bytes, 2 bytes of overhead, "v\r\n" in our example)
- The item header, a formatted string containing the flags and data length, terminated with CRLF (6 bytes, the minimum, in our example "_0_1\r\n" with underscores standing in for the spaces)
The header is created in the item_make_header function in items.c, the relevant part being:
*nsuffix = (uint8_t) snprintf(suffix, 40, " %d %d\r\n", flags, nbytes - 2);
The header formats the flags and data length as integers in a string, so this overhead will increase as the data size and flag values increase. For example, setting the flags to 10 will require an extra byte to store the extra digit in the header:
> set k 10 500 1
> v
< STORED
> stats slabs
< STAT 1:chunk_size 96
< STAT 1:used_chunks 1
< STAT 1:free_chunks 10921
< STAT 1:mem_requested 68
...
< STAT active_slabs 1
< STAT total_malloced 1048512
< END
So, with CAS enabled, you can expect at least 65 bytes of overhead per item.
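To make the overhead easier to reason about, here is a rough sketch of the calculation in Python, based on the breakdown above (the 48 byte item struct size is the 64-bit 1.4.x layout, and the suffix mirrors the item_make_header format string):
def item_memory(key, value, flags=0, cas=True):
    """Rough estimate of the memory Memcached requests for one item,
    based on the breakdown above."""
    size = 48                                        # item struct in memcached.h
    if cas:
        size += 8                                    # 64 bit CAS value
    size += len(key) + 1                             # null terminated key
    size += len(value) + 2                           # CRLF terminated data
    size += len(" %d %d\r\n" % (flags, len(value)))  # header built by item_make_header
    return size

print(item_memory("k", "v"))             # 67, matching mem_requested above
print(item_memory("k", "v", flags=10))   # 68, the extra digit in the header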
Disable CAS if you don't need it
CAS stands for 'Check And Set' which the Memcached protocol docs describe as: "store this data but only if no one else has updated since I last fetched it."
For a fair number of Memcached use cases CAS is never used, and it can therefore be disabled with no impact. You can check whether your cluster uses it by running the stats command and looking at the cas_* values:
> stats
...
< STAT cas_misses 0
< STAT cas_hits 0
< STAT cas_badval 0
...
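If you prefer to check this programmatically, a minimal sketch using the pymemcache client might look like the following (the endpoint is a placeholder for your own cluster's endpoint); if the cas_* counters stay at zero over time, nothing is using CAS:
from pymemcache.client.base import Client

# Placeholder endpoint; substitute your cluster's node or configuration endpoint.
client = Client(("my-cluster.abc123.cfg.use1.cache.amazonaws.com", 11211))

stats = client.stats()
print({key: value for key, value in stats.items() if key.startswith(b"cas_")})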
Sounds simple enough, so let's test it out. After setting cas_disabled to 1 on our cluster and rebooting for the change to take effect, we set our key again:
> set k 0 500 1
> v
< STORED
> stats slabs
< STAT 1:chunk_size 96
< STAT 1:used_chunks 1
< STAT 1:free_chunks 10921
< STAT 1:mem_requested 59
...
< STAT active_slabs 1
< STAT total_malloced 1048512
< END
Not much difference, except that the requested memory has dropped by 8 bytes (from 67 to 59). Disabling CAS will NOT reduce the wasted memory on its own, but it does allow an item to hold an extra 8 bytes of data before it spills over into a larger chunk, which might waste significantly more memory. Let's look at an example.
Assume that you want to cache a 32 byte item consisting of a 16 byte key and a 16 byte value. From the background section above we know that this will incur 66 bytes of overhead, pushing the total item size to 98 bytes, 2 bytes more than the 96 byte chunk size of the first slab (using Memcached defaults). On a CAS enabled node this shows:
> set 0123456789ABCDEF 0 500 16
> 0123456789ABCDEF
< STORED
> stats slabs
< STAT 2:chunk_size 120
< STAT 2:used_chunks 1
< STAT 2:free_chunks 8737
< STAT 2:mem_requested 98
...
< STAT active_slabs 1
< STAT total_malloced 1048560
< END
Whereas with CAS disabled the overhead is reduced to 58 bytes, allowing the item to fit into the original slab:
> set 0123456789ABCDEF 0 500 16
> 0123456789ABCDEF
< STORED
> stats slabs
< STAT 1:chunk_size 96
< STAT 1:used_chunks 1
< STAT 1:free_chunks 10921
< STAT 1:mem_requested 90
...
< STAT active_slabs 1
< STAT total_malloced 1048512
< END
This means that with CAS enabled we are wasting 18% of the memory in each chunk of this slab (22 out of 120 bytes lost per item), compared to just 6% (6 out of 96 bytes) with CAS disabled. If the entire cache consisted of items this size, that would mean more than 1 GB of wasted memory with CAS enabled compared to around 372 MB with it disabled. Put differently, disabling CAS would allow you to store approximately 7 million additional 90 byte items, resulting in better cache hit rates and lower eviction rates. This is of course an artificial example, so let's look at a more realistic one where the cache item sizes are not quite so uniform.
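The chunk sizes follow a simple progression, so the per-item waste is easy to estimate. The sketch below is a simplified reproduction of the default chunk size progression (a 96 byte first chunk, 1.25 growth factor, 8 byte alignment, 1 MB maximum) and uses it to pick the chunk an item of a given total size would land in; the numbers it prints match the 22 and 6 bytes of waste discussed above:
def chunk_sizes(min_chunk=96, growth=1.25, max_chunk=1048576):
    """Default Memcached slab chunk sizes: grow by the growth factor,
    rounded up to an 8 byte boundary, with a final max sized class."""
    sizes = []
    size = min_chunk
    while size <= max_chunk / growth:
        sizes.append(size)
        size = int(size * growth)
        if size % 8:
            size += 8 - size % 8
    sizes.append(max_chunk)
    return sizes

def wasted_per_item(item_size):
    """Bytes lost per chunk when storing an item of the given total size."""
    chunk = next(c for c in chunk_sizes() if c >= item_size)
    return chunk - item_size

print(wasted_per_item(98))   # CAS enabled: lands in a 120 byte chunk, 22 bytes wasted
print(wasted_per_item(90))   # CAS disabled: fits a 96 byte chunk, 6 bytes wasted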
The following stats are based on a production CAS enabled node's stats, scaled down by a factor of ten to fit on an m3.large (the production node had around 59 million items in slab 11; this was scaled to 5.9 million for testing). The test code used to generate the data is relatively simple: it distributes items fairly evenly across the item sizes of each slab, with each item's size slightly weighted to match the average item size seen in production.
Slab | Chunk Size | Chunks Used | Memory Allocated | Memory Requested | Average Item Size | Memory Wasted |
4 | 192 | 139,810 | 26,843,520 | 25,706,546 | 184 | 1,136,974 |
5 | 240 | 55,924 | 13,421,760 | 11,399,776 | 204 | 2,021,984 |
10 | 752 | 232,025 | 174,482,800 | 162,673,515 | 701 | 11,809,285 |
11 | 944 | 5,985,735 | 5,650,533,840 | 4,643,546,639 | 776 | 1,006,987,201 |
12 | 1,184 | 10,951 | 12,965,984 | 11,429,965 | 1,044 | 1,536,019 |
13 | 1,480 | 7,177 | 10,621,960 | 9,421,613 | 1,313 | 1,200,347 |
14 | 1,856 | 6,071 | 11,267,776 | 10,148,664 | 1,672 | 1,119,112 |
15 | 2,320 | 5,325 | 12,354,000 | 11,094,291 | 2,083 | 1,259,709 |
16 | 2,904 | 4,621 | 13,419,384 | 11,947,574 | 2,585 | 1,471,810 |
17 | 3,632 | 3,695 | 13,420,240 | 11,906,634 | 3,222 | 1,513,606 |
18 | 4,544 | 5,201 | 23,633,344 | 21,025,609 | 4,043 | 2,607,735 |
19 | 5,680 | 4,523 | 25,690,640 | 23,038,012 | 5,094 | 2,652,628 |
20 | 7,104 | 3,778 | 26,838,912 | 23,778,017 | 6,294 | 3,060,895 |
21 | 8,880 | 3,022 | 26,835,360 | 23,630,979 | 7,820 | 3,204,381 |
22 | 11,104 | 2,417 | 26,838,368 | 23,754,944 | 9,828 | 3,083,424 |
23 | 13,880 | 1,933 | 26,830,040 | 23,092,970 | 11,947 | 3,737,070 |
24 | 17,352 | 1,107 | 19,208,664 | 15,554,075 | 14,051 | 3,654,589 |
Total | | 6,473,315 | 6,115,206,592 | 5,063,149,823 | | 1,052,056,769 |
The table shows that we are caching around 6.4 million items, with the majority of them falling into slabs 4, 10 and 11, and that around 1 GB of memory is being wasted. Average item size is computed as memory requested divided by chunks used (rounded to the nearest byte) and indicates the actual item sizes (including overhead) within each slab. An interesting point for the CAS discussion is that the average item size in slab 11 is 776 bytes, only marginally bigger than slab 10's 752 byte chunk size.
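For anyone wanting to reproduce this table against their own cluster, the columns fall straight out of 'stats slabs': memory allocated is used_chunks multiplied by chunk_size, memory wasted is the difference between that and mem_requested, and average item size is mem_requested divided by used_chunks. A rough sketch, again using pymemcache with a placeholder endpoint:
from collections import defaultdict
from pymemcache.client.base import Client

client = Client(("my-cluster.abc123.cfg.use1.cache.amazonaws.com", 11211))  # placeholder

# Group the '<slab>:<stat>' entries returned by 'stats slabs' per slab class.
slabs = defaultdict(dict)
for key, value in client.stats("slabs").items():
    name = key.decode()
    if ":" in name:
        slab, stat = name.split(":", 1)
        slabs[int(slab)][stat] = int(value)

for slab, s in sorted(slabs.items()):
    if not s["used_chunks"]:
        continue
    allocated = s["used_chunks"] * s["chunk_size"]
    wasted = allocated - s["mem_requested"]
    average = round(s["mem_requested"] / s["used_chunks"])
    print(slab, s["chunk_size"], s["used_chunks"], allocated,
          s["mem_requested"], average, wasted)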
Running the test code against a node with CAS disabled yields the following results:
Slab | Chunk Size | Chunks Used | Memory Allocated | Memory Requested | Average Item Size | Memory Wasted |
4 | 192 | 165,990 | 31,870,080 | 29,471,729 | 178 | 2,398,351 |
5 | 240 | 29,744 | 7,138,560 | 6,068,721 | 204 | 1,069,839 |
10 | 752 | 3,365,926 | 2,531,176,352 | 2,496,542,132 | 742 | 34,634,220 |
11 | 944 | 2,853,079 | 2,693,306,576 | 2,261,103,795 | 793 | 432,202,781 |
12 | 1,184 | 10,331 | 12,231,904 | 10,910,829 | 1,056 | 1,321,075 |
13 | 1,480 | 6,603 | 9,772,440 | 8,703,301 | 1,318 | 1,069,139 |
14 | 1,856 | 6,116 | 11,351,296 | 10,202,507 | 1,668 | 1,148,789 |
15 | 2,320 | 5,469 | 12,688,080 | 11,429,195 | 2,090 | 1,258,885 |
16 | 2,904 | 4,651 | 13,506,504 | 12,137,620 | 2,610 | 1,368,884 |
17 | 3,632 | 3,629 | 13,180,528 | 11,834,384 | 3,261 | 1,346,144 |
18 | 4,544 | 5,089 | 23,124,416 | 20,661,849 | 4,060 | 2,462,567 |
19 | 5,680 | 4,557 | 25,883,760 | 23,299,194 | 5,113 | 2,584,566 |
20 | 7,104 | 3,786 | 26,895,744 | 23,983,965 | 6,335 | 2,911,779 |
21 | 8,880 | 3,056 | 27,137,280 | 24,146,469 | 7,901 | 2,990,811 |
22 | 11,104 | 2,376 | 26,383,104 | 23,654,235 | 9,955 | 2,728,869 |
23 | 13,880 | 2,297 | 31,882,360 | 28,479,830 | 12,399 | 3,402,530 |
24 | 17,352 | 616 | 10,688,832 | 8,733,548 | 14,178 | 1,955,284 |
Total | | 6,473,315 | 5,508,217,816 | 5,011,363,303 | | 496,854,513 |
As expected, the extra 8 bytes available per item has allowed a large number of items to move into slabs with smaller chunk sizes: more than half of the items in slab 11 now fit into slab 10, saving almost 600 MB of memory. The average item sizes being closer to the chunk sizes is another indicator that we are making more efficient use of the memory in each slab.
For reference, the sizes of the items each slab can contain are shown below. The chunk sizes are displayed when running 'memcached -vv'; reformatting this output into a table and calculating the maximum item size gives:
Slab | Chunk size | Max data | Min data |
1 | 96 | 30 | 1 |
2 | 120 | 54 | 31 |
3 | 152 | 86 | 55 |
4 | 192 | 125 | 87 |
5 | 240 | 173 | 126 |
6 | 304 | 237 | 174 |
7 | 384 | 317 | 238 |
8 | 480 | 413 | 318 |
9 | 600 | 533 | 414 |
10 | 752 | 685 | 534 |
11 | 944 | 877 | 686 |
12 | 1184 | 1116 | 878 |
13 | 1480 | 1412 | 1117 |
14 | 1856 | 1788 | 1413 |
15 | 2320 | 2252 | 1789 |
16 | 2904 | 2836 | 2253 |
17 | 3632 | 3564 | 2837 |
18 | 4544 | 4476 | 3565 |
19 | 5680 | 5612 | 4477 |
20 | 7104 | 7036 | 5613 |
21 | 8880 | 8812 | 7037 |
22 | 11104 | 11035 | 8813 |
23 | 13880 | 13811 | 11036 |
24 | 17352 | 17283 | 13812 |
25 | 21696 | 21627 | 17284 |
26 | 27120 | 27051 | 21628 |
27 | 33904 | 33835 | 27052 |
28 | 42384 | 42315 | 33836 |
29 | 52984 | 52915 | 42316 |
30 | 66232 | 66163 | 52916 |
31 | 82792 | 82723 | 66164 |
32 | 103496 | 103426 | 82724 |
33 | 129376 | 129306 | 103427 |
34 | 161720 | 161650 | 129307 |
35 | 202152 | 202082 | 161651 |
36 | 252696 | 252626 | 202083 |
37 | 315872 | 315802 | 252627 |
38 | 394840 | 394770 | 315803 |
39 | 493552 | 493482 | 394771 |
40 | 616944 | 616874 | 493483 |
41 | 771184 | 771114 | 616875 |
42 | 1048576 | 1048505 | 771115 |
This table uses the default Memcached config settings (48 byte minimum item size, 1.25 growth factor, CAS enabled). The "Max data" column is the maximum number of bytes that can be used to store both the key and the value for an item in that slab. The "Min data" column is the minimum data size required for an item to be placed in a slab, and is simply the maximum data size of the previous slab plus 1. With CAS disabled the max and min data values would increase by 8 bytes for every slab, except that the minimum for slab 1 remains 1 byte. The following Python function can be used to calculate the maximum item size (key + value) per slab:
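A sketch along these lines, reusing the chunk_sizes helper from the earlier example and assuming flags of 0 and a single byte key (the worst case for the number of digits in the header's length field):
def max_data_per_slab(cas=True, min_chunk=96, growth=1.25, max_chunk=1048576):
    """Maximum key + value bytes guaranteed to fit in each slab class."""
    # Fixed overhead: item struct, optional CAS, key null terminator, data CRLF.
    fixed = 48 + (8 if cas else 0) + 1 + 2
    result = {}
    for slab, chunk in enumerate(chunk_sizes(min_chunk, growth, max_chunk), start=1):
        data = chunk - fixed
        # Shrink until the ' 0 <length>\r\n' header also fits in the chunk.
        while data + fixed + len(" 0 %d\r\n" % (data - 1)) > chunk:
            data -= 1
        result[slab] = data
    return result

print(max_data_per_slab()[1])    # 30, matching the table above
print(max_data_per_slab()[11])   # 877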
Conclusion
In this post we have looked at the details of Memcached item overhead and identified disabling CAS as a simple way of improving memory utilisation, which will reduce evictions and increase the hit rate. In the next post we will look at how adjusting the growth factor and minimum item size can further improve memory utilisation.