dadac

Description

Code repository for the thesis project Density-based clustering application to substructures identification in cosmological simulations, Francesco Tomba, 2023 @ University of Trieste.

dadac is the porting and optimization of the implementation of ADP (Laio et al. 2021) which is present in the python package dadaPy. In particular dadac implements at the moment:

TWO-NN Intrinsic Dimension estimator
k-NN search using a kd-tree (or vp-tree with custom callable metric)
k*-NN density estimator
ADP Heuristics

On the same input dadac achieves a one to one match on the results, obtaining up to a factor 40 speedup on the whole procedure w.r.t. Python/Cython implementation.

Benchmarks

Validation of the procedure is done against the implementation of the original package with and without the computation of the halo. Hereafter some of the results, complete ones in the directory benchmarks.

On AMD Ryzen 7 7735HS @ 4.80GHz (my laptop) (8 cores - 16 Threads, 16GB RAM)

from dadapy examples

Fig1.dat: N = 20.0k, D = 2

Method	part	time
py	ngbh and density	1.53s
py	ADP	0.13s
C	ngbh and density	0.26s
C	ADP	0.04s

from dadapy examples

Fig2.dat: N = 38.4k, D = 2

Method	part	time
py	ngbh and density	2.76s
py	ADP	0.94s
C	ngbh and density	0.22s
C	ADP	0.03s

CosmoSim (sub)Set1: N = 100.0k, D = 5 (~500 Clusters)

Method	part	time
py	ngbh and density	21.93s
py	ADP	4.08s
C	ngbh and density	1.37s
C	ADP	0.07s

On Intel Xeon Gold 5118 CPU @ 2.30GHz (4 sockets x 12 cores - 48 Threads, 512GB RAM)

CosmoSim Set1: N = 1.8M, D = 5 (~2000 Clusters)

Method	part	time
py	ngbh and density	414.83s
py	ADP	3282.80s
C	ngbh and density	47.52s
C	ADP	7.46s

MNIST N = 70.0k, D = 784

Method	part	time
py	ngbh and density	11.62s
py	ADP	2.21s
C	ngbh and density	12.07s
C	ADP	0.18s

Usage

dadac comes with an example driver file driver.c. Data is expected to be a matrix of floats of type FLOAT_TYPE (defined at compile time) of dimensions N x d.

It returns a text file where for each point are reported:

k*: number of neighbors used for computing the density value
cluster_idx: cluster assignement of the point
rho: value of the density
is_center: flag for identifying cluster centers

Once parameters are set dadac can be launched using: ./driver i=[INPUT_FILE] o=[OUTPUT_FILE] d=[d] t=[t] z=[Z] h=[HALO] k=[k] s=[s] t=[t]

INPUT_FILE : input file, file path
OUTPUT_FILE: output file, file path
d : Lenght of the data vectors (number of columns of the data matrix)
Z : Z value, float
HALO : Assign halo, y/n [yes/no]
k : Number of neighbors to use, int (>0)
s : Use sparse borders implementation, y/n [sparse/dense]
t : Input binary is in Float32, y/n [float/double]

Relies on kd-Tree or vantage point tree alogirthms in order to perform neighborhood search. V2 versions are optimized ones, use them. KNN search is validated against scipy.spatial.KDtree query time is comparable with state of the art libraries except for bruteforce search used when the dimensionality D > 15.

Python interface

dadac comes also with a python interface build with ctypes which leverages the capabilities of the C-compiled library. In order to use it, build the package and then import dadac module from python

REQUIRES make build system and a C compiler to properly install (on linux system this works fine)

dadac can be installed with pip on linux systems using:

pip install git+https://github.com/lykos98/dadaC

On MacOS this cannot compile the required shared library if gcc and make are not installed I am still working on that.

If you want to use it on windows use it under WLS

Compiling

dadac comes with a make file which compiles the executable driver and the shared library bin/libdadac.so from which ADP methods can be linked to. dadac supports compilation to use float or double to store data and kNN distances, and uint32 or uint64 to store indexes. Add -DUSE_FLOAT32 or -DUSE_INT32 to compile with support to 32bit types. By default 64bit ones are used. This feature is important for the application on big datasets, allowing of course some sort of rounding error to happen when processing data. Implementation with 64bit types results are binary equal w.r.t. dadapy.

TODO

MUCH MORE

Complete porting of all density estimation methods in dadaPy (i.e. pAK)
Adaptive strategies on k-NN search and density estimation

Name	Name	Last commit message	Last commit date
Latest commit lykos98 Added ml alg for id estimation, corrected verbose output Jun 4, 2024 3ec0a9c · Jun 4, 2024 History 193 Commits
.github/workflows	.github/workflows	raising an error to usupported platforms to integrate to dadapy	Feb 23, 2024
benchmarks	benchmarks	linted code	Jan 25, 2024
dadac	dadac	Added ml alg for id estimation, corrected verbose output	Jun 4, 2024
tests	tests	linted code	Jan 25, 2024
.gitignore	.gitignore	removed dist	Jan 23, 2024
README.md	README.md	updated readme	Feb 23, 2024
demo.ipynb	demo.ipynb	fixed fields	Jan 13, 2024
requirements.txt	requirements.txt	updating CI/CD	Jan 19, 2024
setup.py	setup.py	linted code	Jan 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

dadac

Description

Benchmarks

Usage

Python interface

Compiling

TODO

About

Releases 2

Packages

Languages

lykos98/dadaC

Folders and files

Latest commit

History

Repository files navigation

dadac

Description

Benchmarks

Usage

Python interface

Compiling

TODO

About

Resources

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Packages