Commit e1c68cc

Merge pull request #58 from apache/tdigest_docs
t-digest changes
2 parents d3e7bb6 + e2a48d2 commit e1c68cc

8 files changed, 114 additions and 16 deletions

CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -117,7 +117,7 @@ cmake_policy(SET CMP0097 NEW)
 include(ExternalProject)
 ExternalProject_Add(datasketches
 GIT_REPOSITORY https://github.com/apache/datasketches-cpp.git
-GIT_TAG 5.1.0
+GIT_TAG 5.2.0
 GIT_SHALLOW true
 GIT_SUBMODULES ""
 INSTALL_DIR /tmp/datasketches

NOTICE

Lines changed: 2 additions & 2 deletions
@@ -1,9 +1,9 @@
 Apache DataSketches Python
-Copyright 2024 The Apache Software Foundation
+Copyright 2025 The Apache Software Foundation

 Copyright 2015-2018 Yahoo Inc.
 Copyright 2019-2020 Verizon Media
-Copyright 2021 Yahoo Inc.
+Copyright 2021- Yahoo Inc.

 This product includes software developed at
 The Apache Software Foundation (http://www.apache.org/).

README.md

Lines changed: 4 additions & 1 deletion
@@ -75,10 +75,13 @@ The unit tests are mostly structured in a tutorial style and can be used as a re
 - `vector_of_kll_floats_sketches`
 - Kolmogorov-Smirnov Test
 - `ks_test` applied to a pair of matched-type Absolute Error quantiles sketches
-- Density
+- Kernel Density
 - `density_sketch`
 - Count-min sketch
 - `count_min_sketch`
+- t-digest
+- tdigest_float
+- tdigest_double

 ## Known Differences from C++

docs/source/quantiles/index.rst

Lines changed: 9 additions & 5 deletions
@@ -10,17 +10,21 @@ in the stream.
 These sketches may be used to compute approximate histograms, Probability Mass Functions (PMFs), or
 Cumulative Distribution Functions (CDFs).

-The library provides three types of quantiles sketches, each of which has generic items as well as versions
-specific to a given numeric type (e.g. integer or floating point values). All three types provide error
-bounds on rank estimation with proven probabilistic error distributions.
+The library provides four types of quantiles sketches, three of which have generic items as well as versions
+specific to a given numeric type (e.g. integer or floating point values). Those three types provide error
+bounds on rank estimation with proven probabilistic error distributions. t-digest is a heuristic-based sketch
+that works only on numeric data, and while the error properties are not guaranteed, the sketch typically
+does a good job with small storage.

-* KLL: Provides uniform rank estimation error over the entire range
+* KLL: Provides uniform rank estimation error over the entire range.
 * REQ: Provides relative rank error estimates, which decreases approaching either the high or low end values.
+* t-digest: Relative rank error estimates, heuristic-based without guarantees but quite compact with generally very good error properties.
 * Classic quantiles: Largely deprecated in favor of KLL, also provides uniform rank estimation error. Included largely for backwards compatibility with historic data.

 .. toctree::
 :maxdepth: 1
-
+
 kll
 req
+tdigest
 quantiles_depr

docs/source/quantiles/kll.rst

Lines changed: 0 additions & 5 deletions
@@ -14,10 +14,6 @@ The analysis is obtained using `get_quantile()` function or the
 inverse functions `get_rank()`, `get_pmf()` (Probability Mass Function), and `get_cdf()`
 (Cumulative Distribution Function).

-As of May 2020, this implementation produces serialized sketches which are binary-compatible
-with the equivalent Java implementation only when template parameter `T = float`
-(32-bit single precision values).
-
 Given an input stream of `N` items, the `natural rank` of any specific
 item is defined as its index `(1 to N)` in inclusive mode
 or `(0 to N-1)` in exclusive mode
@@ -168,4 +164,3 @@ Additionally, the interval may be quite large for certain distributions.
 .. rubric:: Non-static Methods:

 .. automethod:: __init__
-

docs/source/quantiles/tdigest.rst

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+t-digest
+--------
+
+.. currentmodule:: datasketches
+
+The implementation in this library is based on the MergingDigest described in
+`Computing Extremely Accurate Quantiles Using t-Digests <https://arxiv.org/abs/1902.04023>`_ by Ted Dunning and Otmar Ertl.
+
+The implementation in this library has a few differences from the reference implementation associated with that paper:
+
+* Merge does not modify the input
+* Serialization similar to other sketches in this library, although reading the reference implementation format is supported
+
+Unlike all other algorithms in the library, t-digest is empirical and has no mathematical basis for estimating its error
+and its results are dependent on the input data. However, for many common data distributions, it can produce excellent results.
+t-digest also operates only on numeric data and, unlike the quantiles family algorithms in the library which return quantile
+approximations from the input domain, t-digest interpolates values and will hold and return data points not seen in the input.
+
+The closest alternative to t-digest in this library is REQ sketch. It prioritizes one chosen side of the rank domain:
+either low rank accuracy or high rank accuracy. t-digest (in this implementation) prioritizes both ends of the rank domain
+and has lower accuracy towards the middle of the rank domain (median).
+
+Measurements show that t-digest is slightly biased (tends to underestimate low ranks and overestimate high ranks), while still
+doing very well close to the extremes. The effect seems to be more pronounced with more input values.
+
+For more information on the performance characteristics, see `the Datasketches page on t-digest <https://datasketches.apache.org/docs/tdigest/tdigest.html>`_.
+
+.. autoclass:: tdigest_float
+:members:
+:undoc-members:
+:exclude-members: deserialize
+
+.. rubric:: Static Methods:
+
+.. automethod:: deserialize
+
+.. rubric:: Non-static Methods:
+
+.. automethod:: __init__
+
+.. autoclass:: tdigest_double
+:members:
+:undoc-members:
+:exclude-members: deserialize
+
+.. rubric:: Static Methods:
+
+.. automethod:: deserialize
+
+.. rubric:: Non-static Methods:
+
+.. automethod:: __init__
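
Reviewer note: the new page says t-digest "will hold and return data points not seen in the input". A toy sketch of why that happens is below. It is not the library's MergingDigest: `Centroid` and `merge_centroids` are hypothetical names used only for illustration, and a real t-digest also uses a scale function to decide which centroids may be merged. The only point shown is that a weighted-mean centroid can sit between input values.

```python
from dataclasses import dataclass


@dataclass
class Centroid:
    """Hypothetical summary of a cluster of inputs: a mean and a count."""
    mean: float
    weight: int


def merge_centroids(a: Centroid, b: Centroid) -> Centroid:
    """Combine two centroids into one located at their weighted mean."""
    total = a.weight + b.weight
    mean = (a.mean * a.weight + b.mean * b.weight) / total
    return Centroid(mean, total)


# Two single-value centroids collapse into one whose mean (1.5)
# was never present in the input stream.
merged = merge_centroids(Centroid(1.0, 1), Centroid(2.0, 1))
print(merged)  # Centroid(mean=1.5, weight=2)
```

Because queries are answered from centroid means rather than stored items, results can be values such as 1.5 here, which is the behaviour the paragraph contrasts with the other quantiles sketches.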

src/tdigest_wrapper.cpp

Lines changed: 28 additions & 1 deletion
@@ -24,6 +24,7 @@
 #include <nanobind/nanobind.h>
 #include <nanobind/make_iterator.h>
 #include <nanobind/stl/string.h>
+#include <nanobind/stl/vector.h>
 #include <nanobind/ndarray.h>

 #include "tdigest.hpp"
@@ -44,7 +45,7 @@ void bind_tdigest(nb::module_ &m, const char* name) {
 .def("__copy__", [](const tdigest<T>& sk) { return tdigest<T>(sk); })
 .def("update", (void(tdigest<T>::*)(T)) &tdigest<T>::update, nb::arg("item"),
 "Updates the sketch with the given value")
-.def("merge", (void(tdigest<T>::*)(tdigest<T>&)) &tdigest<T>::merge, nb::arg("sketch"),
+.def("merge", (void(tdigest<T>::*)(const tdigest<T>&)) &tdigest<T>::merge, nb::arg("sketch"),
 "Merges the provided sketch into this one")
 .def("__str__", [](const tdigest<T>& sk) { return sk.to_string(); },
 "Produces a string summary of the sketch")
@@ -71,6 +72,32 @@ void bind_tdigest(nb::module_ &m, const char* name) {
 .def("get_serialized_size_bytes", &tdigest<T>::get_serialized_size_bytes,
 nb::arg("with_buffer")=false,
 "Returns the size of the serialized sketch, in bytes")
+.def(
+"get_pmf",
+[](const tdigest<T>& sk, const std::vector<T>& split_points) {
+return sk.get_PMF(split_points.data(), split_points.size());
+},
+nb::arg("split_points"),
+"Returns an approximation to the Probability Mass Function (PMF) of the input stream "
+"given a set of split points (values).\n"
+"If the sketch is empty this returns an empty vector.\n"
+"split_points is an array of m unique, monotonically increasing float values "
+"that divide the real number line into m+1 consecutive disjoint intervals.\n"
+"It is not necessary to include either the min or max values in these split points."
+)
+.def(
+"get_cdf",
+[](const tdigest<T>& sk, const std::vector<T>& split_points) {
+return sk.get_CDF(split_points.data(), split_points.size());
+},
+nb::arg("split_points"),
+"Returns an approximation to the Cumulative Distribution Function (CDF), which is the "
+"cumulative analog of the PMF, of the input stream given a set of split points (values).\n"
+"If the sketch is empty this returns an empty vector.\n"
+"split_points is an array of m unique, monotonically increasing float values "
+"that divide the real number line into m+1 consecutive disjoint intervals.\n"
+"It is not necessary to include either the min or max values in these split points."
+)
 ;

 add_serialization<T>(tdigest_class);
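
Reviewer note: the docstrings above describe `m` split points producing `m+1` intervals. A small exact (non-sketch) reference in plain Python may help readers check the output shape. `exact_cdf` and `exact_pmf` are hypothetical helpers, not part of the bindings, and the boundary rule (values strictly below a split point count toward it) is an assumption made for this illustration; the sketch's own convention may differ.

```python
from bisect import bisect_left


def exact_cdf(values, split_points):
    """Exact CDF at each split point, plus a final 1.0 for the last interval.

    Returns an empty list for empty input, mirroring the docstring's
    "empty sketch returns an empty vector".
    """
    if not values:
        return []
    ordered = sorted(values)
    n = len(ordered)
    # bisect_left counts the values strictly below each split point
    # (assumed boundary convention for this illustration).
    cdf = [bisect_left(ordered, s) / n for s in split_points]
    cdf.append(1.0)
    return cdf


def exact_pmf(values, split_points):
    """Mass in each of the m+1 intervals: successive differences of the CDF."""
    cdf = exact_cdf(values, split_points)
    return [cur - prev for prev, cur in zip([0.0] + cdf, cdf)]


data = [-1.0, -0.25, 0.25, 1.0]
splits = [-0.5, 0.0, 0.5]
print(exact_cdf(data, splits))  # [0.25, 0.5, 0.75, 1.0]
print(exact_pmf(data, splits))  # [0.25, 0.25, 0.25, 0.25]
```

Three split points give four entries in both results, which is the same shape the new unit tests assert for `get_pmf` (`len(pmf) == 4`) and `get_cdf` (`len(cdf) == 2` for one split point).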

tests/tdigest_test.py

Lines changed: 18 additions & 1 deletion
@@ -50,7 +50,16 @@ def test_tdigest_double_example(self):
 self.assertFalse(td.is_empty())
 self.assertEqual(td.get_total_weight(), n)

-# we can define a new tdiget with a different distribution, then merge them
+# we can get the PMF and CDF
+pmf = td.get_pmf([-0.5, 0.0, 0.5])
+self.assertEqual(len(pmf), 4)
+self.assertAlmostEqual(sum(pmf), 1.0)
+
+cdf = td.get_cdf([0.0])
+self.assertEqual(len(cdf), 2)
+self.assertAlmostEqual(cdf[0], 0.5, delta = 0.05)
+
+# we can define a new tdigest with a different distribution, then merge them
 td2 = tdigest_double()
 td2.update(np.random.normal(loc=2.0, size=n))
 td.merge(td2)
@@ -89,6 +98,14 @@ def test_tdigest_float_example(self):
 self.assertFalse(td.is_empty())
 self.assertEqual(td.get_total_weight(), n)

+pmf = td.get_pmf([-0.5, 0.0, 0.5])
+self.assertEqual(len(pmf), 4)
+self.assertAlmostEqual(sum(pmf), 1.0)
+
+cdf = td.get_cdf([0.0])
+self.assertEqual(len(cdf), 2)
+self.assertAlmostEqual(cdf[0], 0.5, delta = 0.05)
+
 td2 = tdigest_float()
 td2.update(np.random.normal(loc=2.0, size=n))
 td.merge(td2)
