Refactor C++ code into modular files and switch to setuptools build system #9

Merged
merged 4 commits on Mar 16, 2025
2 changes: 1 addition & 1 deletion .github/workflows/build.yml

```diff
@@ -88,7 +88,7 @@ jobs:
       - name: Install Python dependencies
         run: |
           python -m pip install --upgrade pip
-          pip install setuptools build meson meson-python pybind11
+          pip install setuptools build wheel pybind11

       - name: Build sdist
         run: python -m build --sdist --outdir wheelhouse/
```
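Since the PR swaps the meson-python backend for setuptools, a minimal `setup.py` for a pybind11 extension might look like the sketch below. The extension name and source file list are placeholders, not taken from this repository; `pybind11.setup_helpers` supplies the compiler flags.

```python
# Hypothetical setup.py sketch for a setuptools + pybind11 build.
# The module name and source paths below are placeholders.
from setuptools import setup
from pybind11.setup_helpers import Pybind11Extension, build_ext

ext_modules = [
    Pybind11Extension(
        "pyensmallen._pyensmallen",  # import path of the compiled module (assumed)
        ["src/module.cpp"],          # placeholder source list
        cxx_std=17,
    ),
]

setup(
    name="pyensmallen",
    ext_modules=ext_modules,
    cmdclass={"build_ext": build_ext},  # pybind11 helper picks sensible flags
)
```

With this in place, `python -m build` (as in the workflow above) produces both the sdist and a binary wheel.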
12 changes: 8 additions & 4 deletions README.md

```diff
@@ -1,14 +1,19 @@
 # `pyensmallen`: python bindings for the [`ensmallen`](https://ensmallen.org/) library for numerical optimization

-Very minimal python bindings for `ensmallen` library. Currently supports
+Minimal python bindings for the `ensmallen` library. Currently supports
 + L-BFGS, with intended use for optimisation of smooth objectives for m-estimation
 + ADAM (and variants with different step-size routines) - makes use of ensmallen's templatization.
 + Frank-Wolfe, with intended use for constrained optimization of smooth losses
   - constraints are either lp-ball (lasso, ridge, elastic-net) or simplex

-See [ensmallen docs](https://ensmallen.org/docs.html) for details.
+See [ensmallen docs](https://ensmallen.org/docs.html) for details. The `notebooks/` directory walks through several statistical examples.

-Installation:
+## speed
+`pyensmallen` is very fast. A comprehensive set of benchmarks is available in the `benchmarks` directory. The benchmarks were run on an Intel 12th-gen Framework laptop; they vary data size (sample size and number of covariates) and parametric family (linear, logistic, Poisson), and compare `pyensmallen` with `scipy` and `statsmodels` (`cvxpy` was initially included in the comparison set but was far too slow to be in the running). At large data sizes, `pyensmallen` is roughly an order of magnitude faster than `scipy`, which in turn is an order of magnitude faster than `statsmodels`. A single `statsmodels` run therefore takes about as long as a `pyensmallen` run that naively uses the nonparametric bootstrap for inference, which makes the bootstrap a viable option in large-data settings.
+
+![](benchmarks/library_performance_comparison.png)
+
+## Installation:

 __from pypi__
@@ -25,4 +30,3 @@ __from source__
 __from wheel__
 - download the appropriate `.whl` for your system from the most recent release listed in `Releases` and run `pip install ./pyensmallen...` OR
 - copy the download url and run `pip install https://github.com/apoorvalal/pyensmallen/releases/download/<version>/pyensmallen-<version>-<pyversion>-linux_x86_64.whl`
-
```
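The L-BFGS use case the README describes (m-estimation on smooth objectives) requires supplying a smooth loss and its gradient. The sketch below shows what such an objective looks like in plain numpy for a logit model; the function name and signature are illustrative only, not pyensmallen's actual API.

```python
import numpy as np

def logistic_loss_and_grad(beta, X, y):
    """Negative log-likelihood of a logit model and its gradient.

    This is the kind of smooth objective an L-BFGS routine consumes;
    the name and calling convention here are illustrative, not
    pyensmallen's API.
    """
    eta = X @ beta
    # log(1 + exp(eta)) computed stably as logaddexp(0, eta)
    loss = np.sum(np.logaddexp(0.0, eta)) - y @ eta
    p = 1.0 / (1.0 + np.exp(-eta))   # fitted probabilities
    grad = X.T @ (p - y)
    return loss, grad

# Synthetic data with a fixed seed, mirroring the benchmark setup.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
beta_true = np.array([0.5, -1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
loss, grad = logistic_loss_and_grad(np.zeros(3), X, y)
```

Any optimizer that accepts a callable returning (loss, gradient) pairs can then drive this objective.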
121 changes: 121 additions & 0 deletions benchmarks/BENCHMARK_RESULTS.md
---
title: pyensmallen is very fast
author: Apoorva Lal
documentclass: amsart
amsart: true
numbersections: true
geometry: "margin=0.5in"
fontsize: 12pt
---

This document summarizes comprehensive performance benchmarking of
pyensmallen against other popular optimization libraries across
various regression models and dataset sizes.

## Performance Analysis

### Overall Performance

- pyensmallen consistently outperforms both SciPy and statsmodels
across all regression models and data sizes
- The performance advantage becomes more dramatic as dataset size
increases
- pyensmallen shows particularly strong performance with
high-dimensional data (k=20)

### Model-Specific Performance

#### Linear Regression
- pyensmallen is 5-11× faster than SciPy for large datasets (n=10M)
- pyensmallen is 3-4× faster than statsmodels for large datasets
- The speed advantage increases with both dataset size and
dimensionality

#### Logistic Regression
- pyensmallen achieves 11-15× speedup over SciPy for large datasets
- pyensmallen is 2-4.5× faster than statsmodels
- Extremely efficient with high-dimensional data

#### Poisson Regression
- pyensmallen is 13× faster than SciPy for high-dimensional large
datasets
- pyensmallen is 30× faster than statsmodels in these cases
- statsmodels sometimes fails to converge on large Poisson problems

### Scaling Properties
- All libraries show roughly linear scaling with data size (on the
log-log plots)
- pyensmallen maintains its performance advantage across the entire
range of data sizes
- The gap between pyensmallen and other libraries widens as data size
increases
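"Roughly linear scaling on a log-log plot" means the slope of log runtime against log n is close to 1. That slope can be estimated from timing data with an ordinary linear fit; the numbers below are synthetic stand-ins, not the benchmark's measurements.

```python
import numpy as np

# Synthetic (n, seconds) pairs generated with a known exponent of 1.02;
# illustrative only, not measured benchmark timings.
n = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
t = 2e-6 * n ** 1.02

# Degree-1 fit of log t on log n: the leading coefficient is the
# estimated scaling exponent.
slope, intercept = np.polyfit(np.log(n), np.log(t), 1)
print(round(slope, 2))  # → 1.02
```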

### Accuracy
- All libraries achieve essentially identical parameter accuracy (MSE
values match closely)
- This confirms that pyensmallen's speed advantage doesn't come at the
cost of accuracy
- For linear regression, the MSE values decrease predictably as sample
size increases

### Dimensionality Effects
- Higher dimensionality (k=20 vs k=5) impacts all libraries, but
pyensmallen handles it much better
- SciPy shows the steepest performance degradation with increased
dimensions
- pyensmallen's relative advantage is greater for high-dimensional
problems

## Key Takeaways

1. **Best-in-Class Performance**: pyensmallen consistently delivers
the fastest optimization across all regression types, especially at
scale.

2. **Excellent Scaling**: pyensmallen scales much better than
competitors with both dataset size and dimensionality.

3. **Perfect for Large Datasets**: The performance advantage is most
pronounced for datasets with millions of observations, making
pyensmallen particularly valuable for big data applications.

4. **No Accuracy Tradeoff**: The speed gains don't compromise solution
quality - all libraries converge to essentially the same
parameters.

5. **Reliability**: Unlike statsmodels, which occasionally fails to
   converge on Poisson problems, pyensmallen converges robustly
   across all test cases.

6. **High-Dimensional Strength**: pyensmallen's superior handling of
high-dimensional data makes it especially suitable for complex
modeling tasks.

## Visualization

![Library Performance Comparison](library_performance_comparison.png)

### Detailed Time Comparisons

![Linear Regression Performance](linear_time_comparison.png)

![Logistic Regression Performance](logistic_time_comparison.png)

![Poisson Regression Performance](poisson_time_comparison.png)

## Methodology

Benchmarks were conducted using the `benchmark_performance.py` script,
which tests each library on synthetic datasets of varying sizes
(n=1,000 to n=10,000,000) and dimensions (k=5 and k=20). Each
optimization algorithm was run on identical data and parameter
initialization to ensure a fair comparison.

For each model type (linear, logistic, and Poisson regression), we
measured:
1. Execution time
2. Parameter accuracy (MSE compared to true parameters)
3. Convergence reliability

The benchmark script is designed to be reproducible, with all random
seeds fixed for consistent data generation across test runs.
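The measurement loop can be sketched as follows. This is a simplified stand-in for `benchmark_performance.py` (a hypothetical condensation, using a closed-form linear fit rather than the real solvers), showing the three measured quantities: wall time, parameter MSE against the true coefficients, and a fixed seed for reproducibility.

```python
import time
import numpy as np

def run_linear_benchmark(n, k, seed=0):
    """Time one linear-regression fit on synthetic data and report MSE
    against the true coefficients. Simplified illustration of the
    methodology; the real script times scipy/statsmodels/pyensmallen."""
    rng = np.random.default_rng(seed)        # fixed seed for reproducibility
    X = rng.normal(size=(n, k))
    beta_true = rng.normal(size=k)
    y = X @ beta_true + rng.normal(size=n)   # unit-variance noise

    start = time.perf_counter()
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    elapsed = time.perf_counter() - start

    mse = np.mean((beta_hat - beta_true) ** 2)
    return elapsed, mse

elapsed, mse = run_linear_benchmark(n=10_000, k=5)
```

Repeating this over the grid of (n, k, family) cells and libraries yields the comparison tables summarized above.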
Binary file added benchmarks/BENCHMARK_RESULTS.pdf