The QML package provides an implementation of manifold learning via quantum dynamics, as introduced in [SLoD] and [QCC]. This implementation is serial, without any parallelization.
The QML package consists of a single file, `qml_serial.py`, which contains all required functions.
The QML code is executed on data stored in a CSV, h5, sql, xlsx, or json file. The file must contain only a matrix; more complex formats are not processed. The code is called with an input file that specifies all parameters, e.g.,
python3 qml_serial.py QML_test_input.dat
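The data file referenced by the input parameters must hold a bare numeric matrix (rows as samples, columns as ambient coordinates). A minimal sketch of writing such a CSV with NumPy, using a standard Swiss-roll point cloud purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Standard Swiss-roll parametrization: (t cos t, h, t sin t)
t = 1.5 * np.pi * (1.0 + 2.0 * rng.random(n))   # roll arc parameter
h = 21.0 * rng.random(n)                        # height along the roll axis
X = np.column_stack([t * np.cos(t), h, t * np.sin(t)])

# Plain 2000 x 3 matrix with no header, as the code expects
np.savetxt("swissroll.csv", X, delimiter=",")
```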
For an example input file and required parameters, see `QML_test_input.dat`. Some useful parameters are described below. For a full description and interpretation of the code and parameters, refer to the pseudocode in Appendix VI of [SLoD].
- `logepsilon`: Sets the scale length (`logepsilon` $= 2 \log \sqrt{\epsilon}$) at which the $\sqrt{\epsilon}$-nearest-neighbour graph, and hence the graph Laplacian, is constructed from the dataset.
- `alpha`: Scales with `logepsilon` to determine the error term $h$, which bounds the error on the geodesics found through propagation.
- `dt`: Determines the time step size for propagating geodesics. Paths are found at unit speed, so the time step is equivalent to distance.
- `nProp`: Number of time steps taken. Taking more steps along geodesics increases the number of paths found between points, as more distance is covered.
- `nColl`: Number of initial momenta for geodesics. Each starting momentum is a direction towards one of the point's `nColl` nearest neighbours. Increasing this parameter starts the propagation in more directions.
- `H_test`: Boolean to automatically tune the error terms. The optimal values may not be at the minimum found by this test, but they are often close to it.
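Putting these together, an input file is a plain-text list of parameter settings. The sketch below is illustrative only: the `parameter = value` syntax and the `datafile` key are assumptions made here, and the values are made up; `QML_test_input.dat` in the repository is the authoritative reference.

```text
datafile = swissroll.csv
logepsilon = -3
alpha = 1
dt = 0.1
nProp = 100
nColl = 4
H_test = 0
SHOW_EMBEDDING = 3
```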
The outputs of QML are:

- the geodesic distance matrix, saved to file `f.out`, where `f` is the input file name; and
- optionally, a visualization of the embedding of the graph, shown as a plot if `SHOW_EMBEDDING` = 2 or 3 (this number sets the embedding dimension) in the input file.
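Once the geodesic distance matrix is on disk, it can also be embedded outside the package. The sketch below uses classical multidimensional scaling, a generic technique rather than the embedding method QML itself applies, and assumes the saved matrix can be loaded into a square NumPy array:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed a symmetric distance matrix D into `dim` dimensions via classical MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    w, V = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]          # keep the largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Toy distance matrix: three collinear points at positions 0, 1, 3
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
Y = classical_mds(D, dim=1)   # recovers the line up to shift and reflection
```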
For development purposes, the package can be obtained by cloning the repository locally:
git clone https://github.com/sandialabs/QML.git
For the Swiss roll dataset of 2000 points, QML is better able to unroll the dataset than other manifold learning techniques when embedding into three dimensions. QML works by approximating wave dynamics on the dataset. Using this, a wavepacket localized at a sample point, with large oscillations in a prescribed direction, can be propagated to reveal the geodesic flow on the manifold. The hitting time between points gives a sparse graph of the data that can be embedded, giving a low-dimensional representation; here, we use the method of (1) for the embedding.
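The scale-dependent graph construction this rests on can be illustrated generically. The sketch below builds a plain unnormalized Laplacian of an ε-neighbourhood graph on a toy point set; it is not the package's own construction:

```python
import numpy as np

def graph_laplacian(X, eps):
    """Unnormalized Laplacian L = D - W of the eps-neighbourhood graph on rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    W = (dist2 <= eps ** 2).astype(float)    # connect points within distance eps
    np.fill_diagonal(W, 0.0)                 # no self-loops
    return np.diag(W.sum(axis=1)) - W

# Three nearby points on a line plus one distant outlier
pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
L = graph_laplacian(pts, eps=1.5)            # the outlier ends up isolated
```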
We compare this with other popular graph embedding techniques: UMAP, t-SNE, Laplacian Eigenmaps, and Diffusion Maps.
- UMAP (3) aims to preserve the topological structure of the manifold and approximates it with a fuzzy topological structure, which it then embeds into a lower-dimensional Euclidean space. The assumptions are that the data is uniformly distributed on the manifold and that the manifold is locally connected and its Riemannian metric is a scalar matrix in ambient coordinates in some neighbourhood of each point.
- t-SNE (4) aims to reduce dimensionality by placing Gaussian affinities between points in the high-dimensional ambient space and finding lower-dimensional coordinates that minimize the Kullback-Leibler divergence between the high-dimensional Gaussian distributions and lower-dimensional Student $t$-distributions.
- Laplacian Eigenmaps (5) produces lower-dimensional coordinates by using the values of some of the eigenfunctions of the graph Laplacian of a graph on the dataset, constructed by connecting points within a certain scale length based on the density of point samples.
- Diffusion Maps (6) works in the same way as Laplacian Eigenmaps, but scales each coordinate eigenfunction by a power of the corresponding eigenvalue, corresponding to the diffusion time of the underlying diffusive process.
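As a concrete illustration of the Laplacian Eigenmaps idea, the sketch below embeds a small point set using the first non-trivial generalized eigenvector of a Gaussian-affinity graph Laplacian; the kernel width is an ad hoc choice, not a value from any of the papers above:

```python
import numpy as np

def laplacian_eigenmaps(X, dim=1, sigma=0.5):
    """Embed rows of X with the first non-trivial eigenvectors of L v = lambda D v."""
    dist2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-dist2 / (2.0 * sigma ** 2))  # Gaussian affinities
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    L = np.diag(deg) - W
    # Solve the generalized problem via the symmetrically normalized Laplacian
    Dm = np.diag(1.0 / np.sqrt(deg))
    w, U = np.linalg.eigh(Dm @ L @ Dm)       # ascending eigenvalues
    V = Dm @ U                               # generalized eigenvectors of (L, D)
    return V[:, 1:dim + 1]                   # skip the constant eigenvector

X = np.linspace(0.0, 1.0, 8)[:, None]        # points along a segment
Y = laplacian_eigenmaps(X, dim=1)
```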
On the Swiss roll with 2000 uniformly sampled points, QML, UMAP, Laplacian Eigenmaps, and Diffusion Maps all embed the dataset into a roughly flat two-dimensional submanifold of three-dimensional coordinate space. The embeddings produced by Laplacian Eigenmaps and Diffusion Maps lose an additional dimension of the ground truth, because the first two chosen eigenfunctions depend on only one of the directions of the manifold. This is a noted drawback of these methods (see, e.g., (7)).
Quantum Manifold Learning (QML) computes geodesics using quantum dynamics derived from a diffusion process. The accuracy of the geodesics scales with the amount of data in the input, and QML scales to higher-dimensional data more efficiently than common geodesic methods. We compare the resolution of geodesics by QML to the Heat Method (our terminology) of (2) and to shortest paths on local neighbourhood graphs. The Heat Method attempts to recover geodesic distances to a given point by approximately solving the heat equation on the dataset with a singular point source as the initial condition, computing its normalized gradient vector field, and then approximately solving the Poisson equation with respect to this gradient field. This method needs structured approximations to the manifold, or proceeds by an initial Voronoi cell decomposition from the dataset. Dijkstra's algorithm can also be used on a nearest-neighbours graph of Euclidean distances to approximate geodesic distance.
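The last baseline can be sketched directly with SciPy: Dijkstra's algorithm on a k-nearest-neighbour graph of Euclidean distances. On points sampled along the unit circle, the graph distance between antipodal points should approach the true geodesic (arc) length π. This is a generic sketch, not the comparison code used here:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra
from scipy.spatial import cKDTree

n, k = 200, 4
theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
pts = np.column_stack([np.cos(theta), np.sin(theta)])   # unit-circle samples

# Symmetric k-nearest-neighbour graph weighted by Euclidean distance
tree = cKDTree(pts)
dists, idx = tree.query(pts, k=k + 1)        # column 0 is each point itself
W = np.zeros((n, n))
for i in range(n):
    for d, j in zip(dists[i, 1:], idx[i, 1:]):
        W[i, j] = W[j, i] = d

geo = dijkstra(csr_matrix(W), directed=False)  # all-pairs graph geodesics
```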
We apply these methods to two sets of discrete point coordinates on the unit sphere.
If you use or refer to this project in any publication, please cite the corresponding papers:
[SLoD] Akshat Kumar, Mohan Sarovar. Shining light on data: Geometric data analysis through quantum dynamics. arXiv:2212.00682 (2022).
[QCC] Akshat Kumar. On a quantum-classical correspondence: from graphs to manifolds. arXiv:2112.10748 (2022).
(1) Peter Oostema, Franz Franchetti. Leveraging High Dimensional Spatial Graph Embedding as a Heuristic for Graph Algorithms. IEEE IPDPSW (2021).
(2) Keenan Crane, Clarisse Weischedel, Max Wardetzky. The Heat Method for Distance Computation. CACM (2017).
(3) Leland McInnes, John Healy, James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 (2018).
(4) L.J.P. van der Maaten, G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR (2008).