forked from tikv/rocksdb
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
New blog post for Ribbon filter (facebook#8992)
Summary: new blog post for Ribbon filter Pull Request resolved: facebook#8992 Test Plan: markdown render in GitHub, Pages on my fork Reviewed By: jay-zhuang Differential Revision: D33342496 Pulled By: pdillinger fbshipit-source-id: a0a7c19100abdf8755f8a618eb4dead755dfddae
- Loading branch information
1 parent
aa2b3bf
commit 26a238f
Showing
1 changed file
with
281 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,281 @@ | ||
--- | ||
title: Ribbon Filter | ||
layout: post | ||
author: pdillinger | ||
category: blog | ||
--- | ||
|
||
## Summary | ||
Since version 6.15 last year, RocksDB supports Ribbon filters, a new | ||
alternative to Bloom filters that save space, especially memory, at | ||
the cost of more CPU usage, mostly in constructing the filters in the | ||
background. Most applications with long-lived data (many hours or | ||
longer) will likely benefit from adopting a Ribbon+Bloom hybrid filter | ||
policy. Here we explain why and how. | ||
|
||
[Ribbon filter on RocksDB wiki](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#ribbon-filter) | ||
|
||
[Ribbon filter paper](https://arxiv.org/abs/2103.02515) | ||
|
||
## Problem & background | ||
Bloom filters play a critical role in optimizing point queries and | ||
some range queries in LSM-tree storage systems like RocksDB. Very | ||
large DBs can use 10% or more of their RAM memory for (Bloom) filters, | ||
so that (average case) read performance can be very good despite high | ||
(worst case) read amplification, [which is useful for lowering write | ||
and/or space | ||
amplification](http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-pick-2_23.html). | ||
Although the `format_version=5` Bloom filter in RocksDB is extremely | ||
fast, all Bloom filters use around 50% more space than is | ||
theoretically possible for a hashed structure configured for the same | ||
false positive (FP) rate and number of keys added. What would it take | ||
to save that significant share of “wasted” filter memory, and when | ||
does it make sense to use such a Bloom alternative? | ||
|
||
A number of alternatives to Bloom filters were known, especially for | ||
static filters (not modified after construction), but all the | ||
previously known structures were unsatisfying for SSTs because of some | ||
combination of | ||
* Not enough space savings for CPU increase. For example, [Xor | ||
filters](https://arxiv.org/abs/1912.08258) use 3-4x more CPU than | ||
Bloom but only save 15-20% of | ||
space. [GOV](https://arxiv.org/pdf/1603.04330.pdf) can save around | ||
30% space but requires around 10x more CPU than Bloom. | ||
* Inconsistent space savings. [Cuckoo | ||
filters](https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf) | ||
and Xor+ filters offer significant space savings for very low FP | ||
rates (high bits per key) but little or no savings for higher FP | ||
rates (low bits per key). ([Higher FP rates are considered best for | ||
largest levels of | ||
LSM.](https://stratos.seas.harvard.edu/files/stratos/files/monkeykeyvaluestore.pdf)) | ||
[Spatially-coupled Xor | ||
filters](https://arxiv.org/pdf/2001.10500.pdf) require very large | ||
number of keys per filter for large space savings. | ||
* Inflexible configuration. No published alternatives offered the same | ||
continuous configurability of Bloom filters, where any FP rate and | ||
any fractional bits per key could be chosen. This flexibility | ||
improves memory efficiency with the `optimize_filters_for_memory` | ||
option that minimizes internal fragmentation on filters. | ||
|
||
## Ribbon filter development and implementation | ||
The Ribbon filter came about when I developed a faster, simpler, and | ||
more adaptable algorithm for constructing a little-known [Xor-based | ||
structure from Dietzfelbinger and | ||
Walzer](https://arxiv.org/pdf/1907.04750.pdf). It has very good space | ||
usage for required CPU time (~30% space savings for 3-4x CPU) and, | ||
with some engineering, Bloom-like configurability. The complications | ||
were managable for use in RocksDB: | ||
* Ribbon space efficiency does not naturally scale to very large | ||
number of keys in a single filter (whole SST file or partition), but | ||
with the current 128-bit Ribbon implementation in RocksDB, even 100 | ||
million keys in one filter saves 27% space vs. Bloom rather than 30% | ||
for 100,000 keys in a filter. | ||
* More temporary memory is required during construction, ~230 bits per | ||
key for 128-bit Ribbon vs. ~75 bits per key for Bloom filter. A | ||
quick calculation shows that if you are saving 3 bits per key on the | ||
generated filter, you only need about 50 generated filters in memory | ||
to offset this temporary memory usage. (Thousands of filters in | ||
memory is typical.) Starting in RocksDB version 6.27, this temporary | ||
memory can be accounted for under block cache using | ||
`BlockBasedTableOptions::reserve_table_builder_memory`. | ||
* Ribbon filter queries use relatively more CPU for lower FP rates | ||
(but still O(1) relative to number of keys added to filter). This | ||
should be OK because lower FP rates are only appropriate when then | ||
cost of a false positive is very high (worth extra query time) or | ||
memory is not so constrained (can use Bloom instead). | ||
|
||
Future: data in [the paper](https://arxiv.org/abs/2103.02515) suggests | ||
that 32-bit Balanced Ribbon (new name: [Bump-Once | ||
Ribbon](https://arxiv.org/pdf/2109.01892.pdf)) would improve all of | ||
these issues and be better all around (except for code complexity). | ||
|
||
## Ribbon vs. Bloom in RocksDB configuration | ||
Different applications and hardware configurations have different | ||
constraints, but we can use hardware costs to examine and better | ||
understand the trade-off between Bloom and Ribbon. | ||
|
||
### Same FP rate, RAM vs. CPU hardware cost | ||
Under ideal conditions where we can adjust our hardware to suit the | ||
application, in terms of dollars, how much does it cost to construct, | ||
query, and keep in memory a Bloom filter vs. a Ribbon filter? The | ||
Ribbon filter costs more for CPU but less for RAM. Importantly, the | ||
RAM cost directly depends on how long the filter is kept in memory, | ||
which in RocksDB is essentially the lifetime of the filter. | ||
(Temporary RAM during construction is so short-lived that it is | ||
ignored.) Using some consumer hardware and electricity prices and a | ||
predicted balance between construction and queries, we can compute a | ||
“break even” duration in memory. To minimize cost, filters with a | ||
lifetime shorter than this should be Bloom and filters with a lifetime | ||
longer than this should be Ribbon. (Python code) | ||
|
||
``` | ||
# Commodity prices based roughly on consumer prices and rough guesses | ||
# Upfront cost of a CPU per hardware thread | ||
upfront_dollars_per_cpu_thread = 30.0 | ||
# CPU average power usage per hardware thread | ||
watts_per_cpu_thread = 3.5 | ||
# Upfront cost of a GB of RAM | ||
upfront_dollars_per_gb_ram = 8.0 | ||
# RAM average power usage per GB | ||
# https://www.crucial.com/support/articles-faq-memory/how-much-power-does-memory-use | ||
watts_per_gb_ram = 0.375 | ||
# Estimated price of power per kilowatt-hour, including overheads like conversion losses and cooling | ||
dollars_per_kwh = 0.35 | ||
# Assume 3 year hardware lifetime | ||
hours_per_lifetime = 3 * 365 * 24 | ||
seconds_per_lifetime = hours_per_lifetime * 60 * 60 | ||
# Number of filter queries per key added in filter construction is heavily dependent on workload. | ||
# When replication is in layer above RocksDB, it will be low, likely < 1. When replication is in | ||
# storage layer below RocksDB, it will likely be > 1. Using a rough and general guesstimate. | ||
key_query_per_construct = 1.0 | ||
#================================== | ||
# Bloom & Ribbon filter performance | ||
typical_bloom_bits_per_key = 10.0 | ||
typical_ribbon_bits_per_key = 7.0 | ||
# Speeds here are sensitive to many variables, especially query speed because it | ||
# is so dependent on memory latency. Using this benchmark here: | ||
# for IMPL in 2 3; do | ||
# ./filter_bench -impl=$IMPL -quick -m_keys_total_max=200 -use_full_block_reader | ||
# done | ||
# and "Random filter" queries. | ||
nanoseconds_per_construct_bloom_key = 32.0 | ||
nanoseconds_per_construct_ribbon_key = 140.0 | ||
nanoseconds_per_query_bloom_key = 500.0 | ||
nanoseconds_per_query_ribbon_key = 600.0 | ||
#================================== | ||
# Some constants | ||
kwh_per_watt_lifetime = hours_per_lifetime / 1000.0 | ||
bits_per_gb = 8 * 1024 * 1024 * 1024 | ||
#================================== | ||
# Crunching the numbers | ||
# on CPU for constructing filters | ||
dollars_per_cpu_thread_lifetime = upfront_dollars_per_cpu_thread + watts_per_cpu_thread * kwh_per_watt_lifetime * dollars_per_kwh | ||
dollars_per_cpu_thread_second = dollars_per_cpu_thread_lifetime / seconds_per_lifetime | ||
dollars_per_construct_bloom_key = dollars_per_cpu_thread_second * nanoseconds_per_construct_bloom_key / 10**9 | ||
dollars_per_construct_ribbon_key = dollars_per_cpu_thread_second * nanoseconds_per_construct_ribbon_key / 10**9 | ||
dollars_per_query_bloom_key = dollars_per_cpu_thread_second * nanoseconds_per_query_bloom_key / 10**9 | ||
dollars_per_query_ribbon_key = dollars_per_cpu_thread_second * nanoseconds_per_query_ribbon_key / 10**9 | ||
dollars_per_bloom_key_cpu = dollars_per_construct_bloom_key + key_query_per_construct * dollars_per_query_bloom_key | ||
dollars_per_ribbon_key_cpu = dollars_per_construct_ribbon_key + key_query_per_construct * dollars_per_query_ribbon_key | ||
# on holding filters in RAM | ||
dollars_per_gb_ram_lifetime = upfront_dollars_per_gb_ram + watts_per_gb_ram * kwh_per_watt_lifetime * dollars_per_kwh | ||
dollars_per_gb_ram_second = dollars_per_gb_ram_lifetime / seconds_per_lifetime | ||
dollars_per_bloom_key_in_ram_second = dollars_per_gb_ram_second / bits_per_gb * typical_bloom_bits_per_key | ||
dollars_per_ribbon_key_in_ram_second = dollars_per_gb_ram_second / bits_per_gb * typical_ribbon_bits_per_key | ||
#================================== | ||
# How many seconds does it take for the added cost of constructing a ribbon filter instead | ||
# of bloom to be offset by the added cost of holding the bloom filter in memory? | ||
break_even_seconds = (dollars_per_ribbon_key_cpu - dollars_per_bloom_key_cpu) / (dollars_per_bloom_key_in_ram_second - dollars_per_ribbon_key_in_ram_second) | ||
print(break_even_seconds) | ||
# -> 3235.1647730256936 | ||
``` | ||
|
||
So roughly speaking, filters that live in memory for more than an hour | ||
should be Ribbon, and filters that live less than an hour should be | ||
Bloom. This is very interesting, but how long do filters live in | ||
RocksDB? | ||
|
||
First let's consider the average case. Write-heavy RocksDB loads are | ||
often backed by flash storage, which has some specified write | ||
endurance for its intended lifetime. This can be expressed as *device | ||
writes per day* (DWPD), and supported DWPD is typically < 10.0 even | ||
for high end devices (excluding NVRAM). Roughly speaking, the DB would | ||
need to be writing at a rate of 20+ DWPD for data to have an average | ||
lifetime of less than one hour. Thus, unless you are prematurely | ||
burning out your flash or massively under-utilizing available storage, | ||
using the Ribbon filter has the better cost profile *on average*. | ||
|
||
### Predictable lifetime | ||
But we can do even better than optimizing for the average case. LSM | ||
levels give us very strong data lifetime hints. Data in L0 might live | ||
for minutes or a small number of hours. Data in Lmax might live for | ||
days or weeks. So even if Ribbon filters weren't the best choice on | ||
average for a workload, they almost certainly make sense for the | ||
larger, longer-lived levels of the LSM. As of RocksDB 6.24, you can | ||
specify a minimum LSM level for Ribbon filters with | ||
`NewRibbonFilterPolicy`, and earlier levels will use Bloom filters. | ||
|
||
### Resident filter memory | ||
The above analysis assumes that nearly all filters for all live SST | ||
files are resident in memory. This is true if using | ||
`cache_index_and_filter_blocks=0` and `max_open_files=-1` (defaults), | ||
but `cache_index_and_filter_blocks=1` is popular. In that case, | ||
if you use `optimize_filters_for_hits=1` and non-partitioned filters | ||
(a popular MyRocks configuration), it is also likely that nearly all | ||
live filters are in memory. However, if you don't use | ||
`optimize_filters_for_hits` and use partitioned filters, then | ||
cold data (by age or by key range) can lead to only a portion of | ||
filters being resident in memory. In that case, benefit from Ribbon | ||
filter is not as clear, though because Ribbon filters are smaller, | ||
they are more efficient to read into memory. | ||
|
||
RocksDB version 6.21 and later include a rough feature to determine | ||
block cache usage for data blocks, filter blocks, index blocks, etc. | ||
Data like this is periodically dumped to LOG file | ||
(`stats_dump_period_sec`): | ||
|
||
``` | ||
Block cache entry stats(count,size,portion): DataBlock(441761,6.82 GB,75.765%) FilterBlock(3002,1.27 GB,14.1387%) IndexBlock(17777,887.75 MB,9.63267%) Misc(1,0.00 KB,0%) | ||
Block cache LRUCache@0x7fdd08104290#7004432 capacity: 9.00 GB collections: 2573 last_copies: 10 last_secs: 0.143248 secs_since: 0 | ||
``` | ||
|
||
This indicates that at this moment in time, the block cache object | ||
identified by `LRUCache@0x7fdd08104290#7004432` (potentially used | ||
by multiple DBs) uses roughly 14% of its 9GB, about 1.27 GB, on filter | ||
blocks. This same data is available through `DB::GetMapProperty` with | ||
`DB::Properties::kBlockCacheEntryStats`, and (with some effort) can | ||
be compared to total size of all filters (not necessarily in memory) | ||
using `rocksdb.filter.size` from | ||
`DB::Properties::kAggregatedTableProperties`. | ||
|
||
### Sanity checking lifetime | ||
Can we be sure that using filters even makes sense for such long-lived | ||
data? We can apply [the current 5 minute rule for caching SSD data in | ||
RAM](http://renata.borovica-gajic.com/data/adms2017_5minuterule.pdf). A | ||
4KB filter page holds data for roughly 4K keys. If we assume at least | ||
one negative (useful) filter query in its lifetime per added key, it | ||
can satisfy the 5 minute rule with a lifetime of up to about two | ||
weeks. Thus, the lifetime threshold for “no filter” is about 300x | ||
higher than the lifetime threshold for Ribbon filter. | ||
|
||
### What to do with saved memory | ||
The default way to improve overall RocksDB performance with more | ||
available memory is to use more space for caching, which improves | ||
latency, CPU load, read IOs, etc. With | ||
`cache_index_and_filter_blocks=1`, savings in filters will | ||
automatically make room for caching more data blocks in block | ||
cache. With `cache_index_and_filter_blocks=0`, consider increasing | ||
block cache size. | ||
|
||
Using the space savings to lower filter FP rates is also an option, | ||
but there is less evidence for this commonly improving existing | ||
*optimized* configurations. | ||
|
||
## Generic recommendation | ||
If using `NewBloomFilterPolicy(bpk)` for a large persistent DB using | ||
compression, try using `NewRibbonFilterPolicy(bpk)` instead, which | ||
will generate Ribbon filters during compaction and Bloom filters | ||
for flush, both with the same FP rate as the old setting. Once new SST | ||
files are generated under the new policy, this should free up some | ||
memory for more caching without much effect on burst or sustained | ||
write speed. Both kinds of filters can be read under either policy, so | ||
there's always an option to adjust settings or gracefully roll back to | ||
using Bloom filter only (keeping in mind that SST files must be | ||
replaced to see effect of that change). |