% Removed RT analytics - not sure this is our pitch
Real-time analytics applications are on the rise in the big data world.
Modern decision support and machine intelligence engines strive to continuously ingest large volumes of data, while providing up-to-date insights with minimum delay.
Examples of such systems include Google's F1~\cite{Shute2013}, which powers its AdWords, and
Oath's Flurry Analytics\footnote{\url{https://developer.yahoo.com/analytics/}}, which
provides mobile developers with analysis tools to explore user characteristics (age, gender, location, app context, etc.) and behavior, e.g., which code paths they follow and how they churn.
As of late 2017, the Flurry SDK is installed on 2.6B devices and monitors 1M+ mobile apps\footnote{\url{http://flurrymobile.tumblr.com/post/169545749110/state-of-mobile-2017-mobile-stagnates}}.
Analytics workloads combine high ingestion rates with (concurrently executed) large range scans; their
data -- both ingested and queried -- is highly skewed towards popular applications, locations, users, etc.
Moreover, since applications like Flurry often aggregate statistics over multi-dimensional composite keys (e.g., application-user-date)
and scan them in ranges, the skew manifests in key ranges rather than in individual keys.
Underlying today's real-time analytics systems are large \emph{key-value (KV)-stores}, such as RocksDB~\cite{rocks},
to which data is ingested at a high rate, and which simultaneously serve
\emph{range scans} for analytics queries. Such data stores need to be \emph{persistent} following crashes.
% Motivation
The booming NoSQL market, which broadly encompasses technologies for the storage, serving, and analysis of very large volumes of unstructured data, is expected to reach \$4.2B by 2020~\cite{alliedmarketresearch}. %\footnote{\url{https://www.alliedmarketresearch.com/press-release/NoSQL-market-is-expected-to-reach-4-2-billion-globally-by-2020-allied-market-research.html}}.
Key-value (KV)-stores~\cite{Aerospike, Bigtable, Cassandra, DynamoDB, HBase, Redis, RocksDB, Scylla} are the fastest-growing market segment among the NoSQL platforms. They offer a simple yet powerful data abstraction that is attractive for a variety of applications, e.g., e-commerce~\cite{DynamoDB}, search indexing~\cite{Percolator}, product recommendation~\cite{NetflixCassandraSpark}, user modeling~\cite{AirbnbHBase}, etc. Modern KV-storage technologies provide scalability and speed far beyond the reach of traditional RDBMS technologies, at a fraction of the cost. The new generation of relational databases~\cite{MyRocks, Phoenix} uses KV-store infrastructure as a building block, in order to extend these benefits to structured data management.
% Analytics, DRAM, persistence
The drop in DRAM prices (more than $6.6$x since 2010~\cite{StatistaReport}) has driven
KV-stores towards making better use of memory. One extreme approach is taken by in-memory data stores
that keep all their data in RAM. While appealing for many use cases (e.g., web caching~\cite{RedisWebCaching}, online fraud detection~\cite{Hazelcast}, and real-time bidding~\cite{AerospikeUseCRaases}), they fall short when it comes to scalability and reliability requirements. Recently, some of these products~\cite{Aerospike,Ignite,Redis,Tarantul} have started supporting non-volatile
(primarily solid-state) drives for both cold data storage and crash recovery. They strive to provide the user experience
of in-memory databases (ultra-low latency, flexibility for diverse workloads) while serving very large datasets with near-zero
recovery time. Our paper steps up to the challenge of building such {\em memory-centric\/} KV-stores.
\remove{
One domain that is on the rise is real-time analytics, and it uses big data and requires long scans [support by examples/citations].
High performance is achieved by striving to keep almost all of the working data set in-memory, which is becoming increasingly feasible
with the drop in DRAM prices [citations].
Nevertheless, disks are still required for data persistence.
Note that even in cache-only KV-stores that rely on a separate cold storage tier for persistent archiving,
using a RAM-only solution for real-time query processing is undesirable, as the penalty for crashes is extremely high; for example,
Facebook reported that it would take days to recover a crashed cluster from the backend persistent store, and this is expedited to take hours
by recovering from so-called warm clusters that hold the data in memory~\cite{Scaling-Memcache}.
%(try to find a citation - maybe something Facebook wrote about this).
This motivates optimizing key-value stores for workloads where all or most data fits in memory, while supporting persistence with fast recovery.
}
% KV stores today are LSM
KV-stores provide an {\em ordered map\/} API, with random read (get), random write (put), and range query
(scan) methods. By and large, their implementations are based on the \emph{Log-Structured Merge (LSM)}
tree paradigm~\cite{LSM}, which optimizes write performance by replacing random writes with sequential ones.
An LSM tree is organized as a collection of components, partitioned according to data update times.
The most recently written data resides in the in-memory map ({\em memtable\/}), and earlier-written data
resides in a collection of sorted immutable on-disk ({\em level}) files. Every write creates a data version,
which is inserted into the memtable and, for persistence, is also logged to a \emph{write-ahead log} ({\em WAL}). Periodically, the whole memtable is flushed to disk, creating a new level file. Reads search the memtable first
and, if the key is not found there, the level files; the latest version is returned. To expedite reads, popular data blocks
from level files are cached in memory, and Bloom filters reduce redundant disk accesses. Scans traverse the
memtable and the level files in parallel, and merge the data versions in ascending (or descending) key order.
In order to reduce disk reads, level files are periodically merged by a background process named \emph{compaction}.
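To make the abstraction concrete, the following minimal Java sketch illustrates the ordered-map API described above; the interface and method signatures are ours and do not correspond to any particular KV-store.
\begin{verbatim}
import java.util.Iterator;
import java.util.Map;

// Illustrative ordered-map KV API (names are ours, not those of any product).
interface OrderedKVStore {
  // Random write: inserts a new data version into the memtable
  // and appends it to the WAL for persistence.
  void put(byte[] key, byte[] value);

  // Random read: searches the memtable first, then the level files,
  // and returns the latest version (or null if the key is absent).
  byte[] get(byte[] key);

  // Range query: merges versions from the memtable and the level files
  // and iterates over them in ascending key order.
  Iterator<Map.Entry<byte[], byte[]>> scan(byte[] fromKey, byte[] toKey);
}
\end{verbatim}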
% LSM drawbacks
Despite its popularity, the LSM paradigm suffers from a number of drawbacks.
First, it is subject to \emph{write amplification} -- namely, each key-value pair is written multiple times.
WAL logging implies that updates are written at least twice, and may be written many more times due
to compactions. Write amplification increases disk wear (especially for SSD storage), in addition to hurting
performance. Second, LSMs incur \emph{read amplification} -- namely, reads may need
to search in multiple level files. This is partially mitigated by compaction, caching, and Bloom filters.
These mechanisms, however, are less effective for scans, which cannot exploit Bloom filters and suffer from
cache invalidations caused by compactions. Though their impact can be reduced, both kinds of amplification
are inherent to the LSM design and cannot be avoided altogether. Additionally, recovery from crashes
may be slow because the WAL needs to be replayed before the system can serve new requests.
Moreover, recovery time grows with the amount of available memory: increasing the memtable size
reduces the flush rate, which slows down WAL recycling and consequently prolongs recovery.
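As a rough, back-of-the-envelope illustration (our own estimate, not a measurement of any specific system): in a leveled LSM tree with $L$ disk levels and a size ratio of $T$ between adjacent levels, each key-value pair is written once to the WAL, once on flush, and may be rewritten up to $T$ times per level during compactions, i.e.,
\[
\mathit{WA} \;\approx\; 1 + 1 + T \cdot L .
\]
With common settings such as $T=10$ and $L=4$, this yields a write amplification on the order of tens.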
% PiWi
%In this work we specifically target analytics workloads, with high ingestion rates and range scans of popular key ranges.
We present \sys, an experimental {\em persistent key-value store}, which pursues the following goals:
\begin{enumerate}
\item{\bf Consistency} -- strict linearizability~\cite{Herlihy} of all operations, in particular range scans.
\item {\bf High performance}, in terms of throughput and latency, by exploiting the workload's
locality better than the LSM design does. \sys's advantage is greatest when the working set fits in main memory.
\item {\bf Low write amplification}, by applying in-memory compaction whenever possible to avoid
on-disk compaction.
\item {\bf Near-instant recovery}, by avoiding the WAL replay process upon start-up.
\end{enumerate}
\sys\/ introduces the following principles in order to achieve these goals:
\begin{enumerate}
\item Instead of managing log-structured storage for the whole dataset,
\sys\/ organizes data as a collection of micro-partitions (chunks) holding contiguous key ranges,
both in memory and on disk. Whereas the LSM design keeps the write and read paths separate,
\sys\/ unifies them by maintaining a read-write cache of popular chunks in RAM (a code sketch of this layout follows the list).
\item In-memory chunks (named {\em munks\/}) accommodate both write and read operations.
They are kept mostly ordered through periodic background munk {\em rebalances},
in the spirit of KiWi~\cite{KiWi}. Gets and scans that access these munks are therefore
served very efficiently. A second-tier shared {\em row cache\/} speeds up
gets that cannot be served from munks.
\item Every chunk is backed by two private files: an ordered data store (named {\em funk\/})
and a linear WAL. The latter logs the latest writes, to facilitate recovery. In contrast
with the LSM approach, {\em both\/} the funk and the WAL are searched by reads that cannot
be served from RAM. Such reads access just these two locations -- in contrast with LSM reads,
which have to search multiple levels. \sys\/ strives to minimize the latency and volume of sequential
searches by keeping its WALs trimmed and maintaining Bloom filters for WAL data.
\item Since the WALs are inherently searchable, there is no need to replay them upon recovery.
Therefore, \sys\/ becomes instantly available to serve data.
\item There is no need to perform costly on-disk compaction for funks that have in-memory
munks; they are simply re-written when the munk snapshots are flushed back to disk. For unpopular chunks,
a background funk rebalance process infrequently merges the funk with its WAL.
All in all, write amplification is reduced, especially for large datasets.
\end{enumerate}
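The following Java sketch illustrates the chunk/munk/funk layout just described, under our own naming and simplifying assumptions (string keys, stubbed disk accesses, no row cache); it is not \sys's actual code.
\begin{verbatim}
import java.io.File;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative sketch of a chunk and its read path (all names are ours).
class Chunk {
  final String minKey;     // chunks cover contiguous key ranges
  final File funk;         // ordered on-disk data store
  final File wal;          // per-chunk log of recent writes;
                           // searchable, never replayed on recovery

  // In-memory copy ("munk"), present only for popular chunks.
  volatile ConcurrentSkipListMap<String, byte[]> munk;

  Chunk(String minKey, File funk, File wal) {
    this.minKey = minKey;
    this.funk = funk;
    this.wal = wal;
    this.munk = null;      // populated when the chunk becomes popular
  }

  byte[] get(String key) {
    ConcurrentSkipListMap<String, byte[]> m = munk;
    if (m != null) return m.get(key);  // popular chunk: served from memory
    byte[] v = searchWal(key);         // recent writes; a Bloom filter can
                                       // often skip this step
    return (v != null) ? v : searchFunk(key); // one lookup in the ordered store
  }

  // Disk-access details are omitted in this sketch.
  private byte[] searchWal(String key)  { return null; }
  private byte[] searchFunk(String key) { return null; }
}
\end{verbatim}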