%common lsm stores RocksDB, scylladb, HyperLevelDB, LevelDB, hbase, cassandra
The vast majority of mainstream industrial NoSQL KV-stores are implemented as LSM trees~\cite{hbase,
RocksDB, scylladb, Bigtable2008, cassandra2010}, building on the foundations laid by O'Neil
et al.~\cite{O'Neil1996, Muth1998}.
Owing to the popularity of the LSM design, much effort has been invested in working around its bottlenecks.
A variety of compaction strategies have been implemented in production systems~\cite{CallaghanCompaction,
ScyllaCompaction} and research prototypes~\cite{triad, PebblesDB}. Other approaches include storage
optimizations~\cite{WiscKey, PebblesDB}, increased in-memory parallelism~\cite{scylladb, clsm2015}, and exploiting
workload redundancy to defer disk flushes~\cite{triad, accordion}.
For example, PebblesDB~\cite{PebblesDB} introduces fragmented LSM trees, in which level files are
sliced into {\em guards\/} of increasing granularity and organized in a skiplist-like layout. This structure
reduces write amplification. In contrast, \sys\/ eliminates the concept of levels altogether
and employs a flat storage layout instead. WiscKey~\cite{WiscKey} separates key storage from value storage
in SSTables, also in order to reduce amplification. This optimization is orthogonal to \sys's design
and could benefit our work as well. Accordion~\cite{accordion} splits the LSM in-memory buffer into mutable
and immutable levels, which are periodically merged in the background, similarly to traditional compaction,
in order to reduce disk flushes. This mechanism is similar to munk rebalance in \sys, but the latter
rebalances data at the chunk level and thus benefits from spatial locality.
%maintains a small dynamic to absorb write operation. This active component is frequently merged with a larger immutable in-memory component by running at the background.
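To make the contrast with \sys's munk rebalance concrete, the following C++ sketch (our simplified illustration with hypothetical names, not Accordion's actual code) shows an Accordion-style in-memory buffer: a small mutable segment absorbs writes, sealed segments accumulate in an immutable pipeline, and a background merge compacts them in memory rather than flushing each segment to disk.
\begin{verbatim}
// Illustrative sketch (hypothetical names, not Accordion's code): a mutable
// segment absorbs writes; sealed segments are merged in memory in the
// background, deferring and shrinking the eventual flush to disk.
#include <cstddef>
#include <map>
#include <string>
#include <vector>

class CompactingMemBuffer {
 public:
  void put(const std::string& key, const std::string& value) {
    active_[key] = value;
    if (active_.size() >= kActiveLimit) {
      pipeline_.push_back(std::move(active_));  // seal: becomes immutable
      active_.clear();
    }
  }

  // Background step: merge immutable segments into one, dropping
  // overwritten versions, instead of flushing each segment to disk.
  void compactInMemory() {
    std::map<std::string, std::string> merged;
    for (const auto& segment : pipeline_)   // oldest first
      for (const auto& kv : segment)
        merged[kv.first] = kv.second;       // newer segments overwrite older
    pipeline_.clear();
    pipeline_.push_back(std::move(merged));
  }

 private:
  static constexpr std::size_t kActiveLimit = 1024;
  std::map<std::string, std::string> active_;                 // mutable segment
  std::vector<std::map<std::string, std::string>> pipeline_;  // immutable segments
};
\end{verbatim}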
\sys's design resembles classic B-trees~\cite{Knuth:1998:ACP:280635} in supporting direct random updates to leaf blocks (chunks).
It addresses the B-tree write-throughput problem through munks and in-memory compaction. {\inred{B-trees, which were designed
before multi-version concurrency control (MVCC) methods, rely heavily on locks for operation consistency; this approach is
inferior to MVCC in terms of performance.}}
%heavily rely on exclusive locks for concurrency control~\cite{Lehman81}. \sys\ operations use locks in shared mode in normal flow; exclusive mode is only used by background rebalance operations. }
In recent years, $B^{\epsilon}$-trees~\cite{Brodal:2003:LBE:644108.644201} have emerged
as a promising way to organize data. They are used in production KV-stores~\cite{TokuDB} and filesystems~\cite{BetrFS}.
The $B^{\epsilon}$-tree is a B-tree variant that uses overflow buffers in internal nodes as well as leaves, replacing
random I/O with batched buffer compactions. \sys\/ applies similar ideas to leaf-level storage. \inred{TokuDB's~\cite{TokuDB} performance
is on par with RocksDB's in many use cases, but its disk image tends to be larger~\cite{tokudb-vs-rocksdb}.}
%We are not aware of any $B^{\epsilon}$-tree implementation that provides atomic scans and zero-time consistent recovery as \sys\ does.
%Similarly to \sys, keys and values are appended to buffers, and pointers to keys are sorted in contiguous segments,
%thereby improving read performance.
Tucana~\cite{tucana} is a $B^{\epsilon}$-tree optimization that uses three techniques to reduce overheads: copy-on-write,
private allocation, and memory-mapped I/O. \sys\/ could incorporate some of these to further improve performance.
\inred{However, Tucana does not provide consistent semantics for scans that span multiple leaf segments.}
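For illustration, the following C++ sketch (ours, with a hypothetical node layout; not the structure used by TokuDB or Tucana) captures the core $B^{\epsilon}$-tree mechanism: each internal node keeps an overflow buffer of pending updates, and a full buffer is flushed down to children in batches, amortizing many random updates into one child access.
\begin{verbatim}
// Illustrative sketch of a B^epsilon-tree node (hypothetical layout):
// updates are buffered at internal nodes and pushed down in batches.
#include <algorithm>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct BeNode {
  bool leaf = true;
  std::vector<std::string> pivots;                // routing keys (internal only)
  std::vector<std::unique_ptr<BeNode>> children;  // pivots.size() + 1 children
  std::map<std::string, std::string> buffer;      // pending updates / leaf records
  static constexpr std::size_t kBufferLimit = 64;

  void upsert(const std::string& key, const std::string& value) {
    buffer[key] = value;                          // buffer the update locally
    if (!leaf && buffer.size() >= kBufferLimit) flush();
  }

  // Push each buffered update to the child owning its key range.
  void flush() {
    for (const auto& kv : buffer) {
      std::size_t i = std::upper_bound(pivots.begin(), pivots.end(), kv.first)
                      - pivots.begin();
      children[i]->upsert(kv.first, kv.second);
    }
    buffer.clear();
  }
};
\end{verbatim}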
In-memory KV-stores~\cite{ignite, redis, memcached, Srinivasan:2016:AAR:3007263.3007276} originally emerged as fast volatile
data stores, e.g., web and application caches. Over time, most of them have evolved to support durability,
albeit as a second-class citizen in most cases.
%For example, Ignite~\cite{ignite} uses a B-tree index with in-place random updates.
These systems resemble \sys\/ in their memory-centric approach.
However, we are not aware of consistency guarantees or performance optimizations that these systems provide for disk-resident data.
%Their persistent storage support is based on either B-tree~\cite{ignite} or LSM-tree design~\cite{redis}.
%used for application data caching, but also as building blocks for distributed database with optional durability~\cite{ignite,redis}. These are not comparable with \sys\ as they are either not persistent~\cite{memcached}, do not support atomic scans~\cite{redis} or resemble relational DBMS with a centralized WAL, , and in-place updates~\cite{ignite} more than a kv lsm store.