-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathprinciples.tex
76 lines (63 loc) · 4.34 KB
/
principles.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
\sys\ is a persistent key-value store that provides, similarly to other KV-stores~\cite{hbase,leveldb,RocksDB},
strong consistency guarantees:
%\begin{enumerate}\itemsep0pt
%\item
\emph{Atomic API} -- \emph{put, get}, and \emph{range scan} (or scan) operations.
Scans are atomic in the sense that all key-value pairs returned by a single scan belong to a consistent
snapshot reflecting the state of the data store at a unique point in time.
%\item
\emph{Consistent recovery.} Following a crash, the system recovers to a well defined execution
point some time before the crash. The exact recovery point depends on the put persistency model.
\emph{Asynchronous} persistence, where puts are buffered and persisted to disk in the background, allows
applications to trade durability for speed. Data consistency is preserved following recovery,
in the sense that if some put is lost, then all ensuing (and thus possibly dependent) puts are lost as well.
%\end{enumerate}
Our key design goals are the following:
\begin{enumerate}\itemsep0pt
\item \emph{Focus on spatial locality and range scans.}
Multiple NoSQL applications embed multi-dimensional data in a single-dimension composite key.
This design provides high spatial locality on the primary dimension (key prefix). We strive
to express this locality in physical data organization in order to exploit it efficiently for scans
by the primary dimension.
%In addition, a query that retrieves data pertaining to a particular dimension thus needs to atomically
%retrieve the values pertaining to a range of keys.
%To favor analytics queries, it is important to optimize such scans.
\item \emph{High performance with memory-resident working sets.}
To sustain high speed, key-value stores nowadays leverage increasing DRAM sizes
where they can hold most of the active data set. We strive for maximum performance
in this ``hyper-local'' case.
\item \emph{Low write amplification.} We seek to minimize disk writes in order to boost performance
and reduce disk wear, especially for SSD devices.
\item \emph{Fast recovery.} Because crashes are inevitable,
the mean-time-to-recover should be kept very short.
\end{enumerate}
%\subsection{Design choices}
Given the aforementioned requirements, we make the following design choices in \sys:
\begin{enumerate}\itemsep0pt
\item \emph{Chunk-based organization.}
We organize data, both on-disk and in-memory, in large chunks pertaining to key ranges.
%This enables efficient support of range scans, with high cache locality and minimal loading of new memory pages.
%Both the read and the write path go through chunks.
Each chunk has a file representation called \emph{funk} (file chunk), and may be cached in a memory data structure called \emph{munk} (memory chunk).
This organization exploits spatial locality and is friendly to range scans.
We use a number of techniques to optimize in-memory access, including partially sorting keys in each chunk and
indexing munks.
To expedite access to keys whose chunks are only on-disk (i.e., have no munks),
individual popular keys are cached in a \emph{row cache},
and \emph{Bloom filters} are used to limit excessive access to disk.
\item \emph{Infrequent disk compactions.}
As long as a chunk is cached (has a munk), its funk's organization does not have to be optimized since
queries do not access it. Therefore, \sys\ infrequently performs reorganization (compaction) on such funks.
Conversely, when a funk holds cold data, its organization hardly deteriorates, and therefore compaction is not necessary.
Note that this is unlike LSM-trees, where all disk components are compacted, regardless of which keys reside in memory and whether
keys are hot or cold.
\item \emph{Multi-versioning for atomic scans.}
\sys\ employs multi-versioning along with
copy-on-write to keep data versions required by atomic scans.
In other words, if a put attempts to overwrite a key required by an active scan, then a new version is created alongside the
existing one, whereas versions that are not needed by any scan are not retained.
Thus, version management incurs a low overhead (as it occurs only on scans).
%It also defines a simple rule for garbage collecting old versions.
\item \emph{In-funk WALs.}
\sys\ logs writes within funks and avoids duplicating the updates in a separate WAL. This reduces write amplification and expedites recovery.
\end{enumerate}