12 changes: 12 additions & 0 deletions baselines/fedht/.gitignore
@@ -0,0 +1,12 @@
__pycache__/
*.py[cod]
*.egg-info/
dist/
build/
.eggs/
results/
.venv
*.log
.DS_Store
.mypy_cache/
.ruff_cache/
142 changes: 142 additions & 0 deletions baselines/fedht/README.md
@@ -0,0 +1,142 @@
---
title: Federated Nonconvex Sparse Learning
url: https://arxiv.org/abs/2101.00052
labels: [sparse learning, hard thresholding, non-IID, linear regression, logistic regression]
dataset: [Simulation I, Simulation II, MNIST, E2006-tfidf, RCV1]
---

# FedHT: Federated Nonconvex Sparse Learning

> Note: If you use this baseline in your work, please remember to cite the original authors of the paper as well as the Flower paper.

**Paper:** [arxiv.org/abs/2101.00052](https://arxiv.org/abs/2101.00052)

**Authors:** Qianqian Tong, Guannan Liang, Tan Zhu, Jinbo Bi (University of Connecticut)

**Abstract:** Nonconvex sparse learning plays an essential role in many areas, such as signal processing and deep network compression. Iterative hard thresholding (IHT) methods are the state-of-the-art for nonconvex sparse learning due to their capability of recovering true support and scalability with large datasets. Theoretical analysis of IHT is currently based on centralized IID data. In realistic large-scale situations, however, data are distributed, hardly IID, and private to local edge computing devices. In this paper, we propose two IHT methods: Federated Hard Thresholding (Fed-HT) and Federated Iterative Hard Thresholding (FedIter-HT). We prove that both algorithms enjoy a linear convergence rate and have strong guarantees to recover the optimal sparse estimator, similar to traditional IHT methods, but now with decentralized non-IID data. Empirical results demonstrate that the Fed-HT and FedIter-HT outperform their competitor, a distributed IHT, in terms of decreasing the objective values with lower requirements on communication rounds and bandwidth.

## About this baseline

**What is implemented:** The code in this directory replicates the simulation experiments in *Federated Nonconvex Sparse Learning* (Tong et al., 2021), which proposed the Fed-HT and FedIter-HT algorithms. Both algorithms extend iterative hard thresholding to federated settings with non-IID data. The baseline covers the paper's two synthetic experiments (Simulation I: sparse linear regression; Simulation II: sparse logistic regression) and supports MNIST for the softmax regression experiment. Distributed-IHT (K=1) is included as the comparison method.

**Datasets:** Simulation I (synthetic), Simulation II (synthetic), MNIST

**Hardware Setup:** These experiments were run on a MacBook with Apple Silicon (ARM). Any machine with 4 CPU cores should reproduce the results in reasonable time. Simulation I uses 100 clients with 100 samples each, Simulation II uses 100 clients with 1000 samples each, and MNIST uses 100 clients with 600 samples each.

**Contributors:** Harshal Manerikar

## Experimental Setup

**Task:** Sparse parameter estimation under a cardinality constraint

**Model:** Linear models with no hidden layers. The cardinality constraint is enforced externally by the strategy via hard thresholding rather than inside the model.

| Experiment | Model | Loss |
| :--- | :--- | :--- |
| Simulation I | Sparse linear regression | Mean squared error |
| Simulation II | Sparse logistic regression | Binary cross-entropy |
| MNIST | Sparse softmax regression | Cross-entropy |
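
The three losses in the table can be sketched in plain Python as follows. This is an illustrative sketch only: the function names, the choice to pass logits rather than probabilities, and the plain mean (without a 1/2 factor) in the squared error are assumptions, and the baseline's own code may differ.

```python
import math
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error for the linear regression experiment."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true: Sequence[int], logits: Sequence[float]) -> float:
    """Mean binary cross-entropy computed from logits in a numerically stable form."""
    total = 0.0
    for y, z in zip(y_true, logits):
        # Equivalent to -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
        total += max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))
    return total / len(y_true)


def softmax_cross_entropy(label: int, logits: Sequence[float]) -> float:
    """Cross-entropy of one sample for softmax regression (log-sum-exp trick)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]
```

With all-zero logits over 10 classes, `softmax_cross_entropy` evaluates to `log(10)`, which is the expected starting objective for MNIST under the zero initialization used here.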

**Algorithms:**

Both algorithms follow the same outer loop structure (T communication rounds, N clients, K local SGD steps each). The difference is where the hard thresholding operator H_tau is applied:

| Algorithm | Local update | Server aggregation |
| :--- | :--- | :--- |
| Fed-HT | Plain SGD, no thresholding | H_tau applied after weighted average |
| FedIter-HT | H_tau applied after each SGD step | H_tau applied after weighted average |
| Distributed-IHT (baseline) | K=1, communicates every step | H_tau applied after weighted average |
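
A minimal sketch of the hard thresholding operator and of where each algorithm applies it, written in plain Python for illustration. The function names, the `grad_fn` callback, the tie-breaking rule, and weighting the average by sample count are assumptions of this sketch, not the baseline's actual code.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def hard_threshold(x: Sequence[float], tau: int) -> Vector:
    """H_tau: keep the tau largest-magnitude entries of x and zero the rest."""
    if tau >= len(x):
        return list(x)
    # Rank indices by decreasing magnitude; ties go to the lower index (assumption).
    keep = set(sorted(range(len(x)), key=lambda i: (-abs(x[i]), i))[:tau])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def local_update(
    x: Sequence[float],
    grad_fn: Callable[[Vector], Vector],
    lr: float,
    local_steps: int,
    tau: int,
    threshold_each_step: bool,
) -> Vector:
    """Run K local SGD steps; FedIter-HT thresholds after every step, Fed-HT never."""
    x = list(x)
    for _ in range(local_steps):
        g = grad_fn(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
        if threshold_each_step:
            x = hard_threshold(x, tau)
    return x


def server_aggregate(
    client_models: Sequence[Sequence[float]], num_samples: Sequence[int], tau: int
) -> Vector:
    """Sample-weighted average of the client models, followed by H_tau."""
    total = sum(num_samples)
    dim = len(client_models[0])
    avg = [
        sum(n * m[j] for m, n in zip(client_models, num_samples)) / total
        for j in range(dim)
    ]
    return hard_threshold(avg, tau)
```

For example, `hard_threshold([0.5, -2.0, 0.1, 1.5], tau=2)` returns `[0.0, -2.0, 0.0, 1.5]`. Setting `local_steps=1` and `threshold_each_step=False` in `local_update` corresponds to the Distributed-IHT row of the table.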

**Dataset partitioning:**

| Experiment | Clients | Samples per client | Partition method |
| :--- | :---: | :---: | :--- |
| Simulation I | 100 | 100 | Synthetic generation with alpha=0.1, beta=0.1 |
| Simulation II | 100 | 1000 (binary thresholded) | Synthetic generation with alpha=1.0, beta=1.0 |
| MNIST | 100 | 600 | Pathological (each client holds 2 of 10 digit classes) |
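
One common way to produce a pathological split like the MNIST row above is the shard-based scheme: sort samples by label, cut them into equal shards, and give each client two shards. The sketch below illustrates that scheme in plain Python. Whether this baseline uses exactly this procedure is an assumption, and `pathological_partition` is a hypothetical helper name.

```python
import random
from typing import Dict, List, Sequence


def pathological_partition(
    labels: Sequence[int],
    num_clients: int = 100,
    shards_per_client: int = 2,
    seed: int = 0,
) -> Dict[int, List[int]]:
    """Sort indices by label, cut into equal shards, deal shards to clients."""
    rng = random.Random(seed)
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    num_shards = num_clients * shards_per_client
    shard_size = len(labels) // num_shards
    shards = [order[s * shard_size:(s + 1) * shard_size] for s in range(num_shards)]

    # Shuffle shard ids so each client receives two random label-sorted shards.
    shard_ids = list(range(num_shards))
    rng.shuffle(shard_ids)

    partition: Dict[int, List[int]] = {}
    for client in range(num_clients):
        own = shard_ids[client * shards_per_client:(client + 1) * shards_per_client]
        partition[client] = [idx for s in own for idx in shards[s]]
    return partition
```

With 60,000 samples this gives 200 shards of 300 samples, so each client holds 600 samples. When class sizes are not exact multiples of the shard size, a shard can straddle two classes, so "2 classes per client" holds only approximately under this scheme.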

**Training hyperparameters (defaults):**

| Description | Default value |
| :--- | :--- |
| Total clients | 100 |
| Fraction sampled per round | 1.0 (all clients) |
| Number of rounds | 100 |
| Local steps K | 5 |
| Learning rate | 0.001 |
| Sparsity tau | 200 (simulations), 500 (MNIST) |
| Initialization | Zero (x_0 = 0 as in the paper) |
| Client resources | 2 CPUs, 0 GPUs |

**Hyperparameter search ranges (from the paper):**

The paper uses grid search to select the best K and learning rate per experiment. The search ranges are:

| Parameter | Search range |
| :--- | :--- |
| K (local steps) | {3, 5, 8, 10} |
| Learning rate | {10, 1, 0.6, 0.3, 0.1, 0.06, 0.03, 0.01, 0.001} |

## Environment Setup

The Flower venv must be placed **outside** the project directory. PyTorch contains files nested more than 10 directories deep, and `flwr run .` will reject the project if it finds such paths inside the project tree.

```bash
# Create a Python 3.12 environment outside the project
python3.12 -m venv ~/fedht-venv
# On macOS with Homebrew Python, if this fails with a pyexpat error, run instead:
# DYLD_LIBRARY_PATH=$(brew --prefix expat)/lib python3.12 -m venv ~/fedht-venv

# Activate
source ~/fedht-venv/bin/activate

# Install dependencies
pip install -e /path/to/baselines/fedht
```

**macOS note:** Homebrew Python 3.12 may fail with a `pyexpat` symbol error on older macOS versions. If it does, install Homebrew expat with `brew install expat`, then prefix every Python or `flwr` command with `DYLD_LIBRARY_PATH=$(brew --prefix expat)/lib`.

**Federation setup:** These experiments simulate 100 clients. The `local-simulation` federation must be configured with `num-supernodes = 100`. Add the following to `~/.flwr/config.toml` (create the file if it does not exist):

```toml
[superlink.local-simulation]
address = ":local:"
options.num-supernodes = 100
options.backend.client-resources.num-cpus = 2
options.backend.client-resources.num-gpus = 0.0
```

## Running the Experiments

Make sure the venv is active, then from inside the `fedht` directory:

```bash
# Simulation I: sparse linear regression with Fed-HT (default)
flwr run . local-simulation

# Switch to FedIter-HT
flwr run . local-simulation --run-config 'algorithm.name="FedIterHT"'

# Simulation II: sparse logistic regression
flwr run . local-simulation --run-config 'dataset.name="simulation_II" model.input-dim=1000 dataset.batch-size=20'

# MNIST: sparse softmax regression
flwr run . local-simulation --run-config 'dataset.name="mnist" dataset.batch-size=64 algorithm.tau=500 model.input-dim=784 model.num-classes=10'

# Override K and learning rate for grid search
flwr run . local-simulation --run-config "algorithm.local-steps=10 algorithm.learning-rate=0.01"

# Reduce rounds for a quick smoke test (client count is set by the federation num-supernodes)
flwr run . local-simulation --run-config "algorithm.num-server-rounds=10"
```

## Expected Results

The key result from the paper is that Fed-HT and FedIter-HT reach the same objective value as Distributed-IHT using significantly fewer communication rounds:

| Experiment | Algorithm | Rounds to match Distributed-IHT |
| :--- | :--- | :--- |
| Simulation I (linear) | Fed-HT (K=3, lr=0.003) | ~28 rounds vs 100 for Distributed-IHT (3.5x fewer) |
| Simulation I (linear) | FedIter-HT (K=3) | TBD |
| Simulation II (logistic) | All algorithms | TBD |

Plots of objective value vs communication rounds for each experiment will be added to `_static/` after grid search is complete. See `docs/EXTENDED_README.md` for detailed per-experiment results and plots.
121 changes: 121 additions & 0 deletions baselines/fedht/docs/EXTENDED_README.md
@@ -0,0 +1,121 @@
# FedHT Extended Results

This document contains detailed per-experiment results for the Fed-HT and FedIter-HT baselines.
Results are structured to match Figures 3 and 5 from the paper.

---

## Simulation I: Sparse Linear Regression

**Setup:** 100 clients, 100 samples each, feature dimension d=1000, tau=200, alpha=0.1, beta=0.1.
The first 100 elements of each local coefficient vector are drawn from N(u_i, 1), the rest are zero.
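
As a rough illustration of the coefficient structure described above, the sketch below draws one client's local coefficient vector. It assumes that the client mean u_i is itself drawn from N(0, alpha) with alpha treated as a variance, which follows the common synthetic(alpha, beta) convention; this detail is not stated in this document, and feature generation (which involves beta) is omitted. `local_coefficients` is a hypothetical helper name.

```python
import math
import random
from typing import List


def local_coefficients(
    alpha: float, dim: int = 1000, num_nonzero: int = 100, seed: int = 0
) -> List[float]:
    """Draw one client's coefficient vector: N(u_i, 1) on the first entries, zero elsewhere."""
    rng = random.Random(seed)
    # Assumption: u_i ~ N(0, alpha), with alpha interpreted as a variance.
    u_i = rng.gauss(0.0, math.sqrt(alpha))
    dense_part = [rng.gauss(u_i, 1.0) for _ in range(num_nonzero)]
    return dense_part + [0.0] * (dim - num_nonzero)


w = local_coefficients(alpha=0.1)
print(len(w), sum(1 for v in w if v != 0.0))
```

Each client's true vector has 100 nonzeros, so the sparsity level tau=200 leaves room for supports that differ across clients after averaging.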

**Objective function:** Mean squared error.

**Command:**
```bash
flwr run . local-simulation --run-config "algorithm.num-server-rounds=100 algorithm.tau=200"
```

**Results (objective value vs communication rounds):**

Best configurations: Distributed-IHT lr=0.01 K=1, Fed-HT lr=0.003 K=3, FedIter-HT lr=0.03 K=1.

| Rounds | Distributed-IHT | Fed-HT (K=3) | FedIter-HT (K=1) |
| :---: | :---: | :---: | :---: |
| 1 | 55.09 | 31.65 | 46.80 |
| 20 | 14.28 | 5.90 | 23.48 |
| 40 | 8.44 | 5.60 | 17.43 |
| 60 | 6.73 | 5.61 | 15.48 |
| 80 | 5.98 | 5.60 | 13.18 |
| 100 | 5.71 | 5.62 | 10.09 |

Fed-HT (K=3) reaches the round-100 objective of Distributed-IHT (5.71) in approximately 28 communication rounds, a roughly 3.5x reduction in communication.
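
The rounds-to-match figure can be computed from per-round logs with a helper along these lines. This is a sketch: `rounds_to_reach` and `communication_speedup` are hypothetical names, and the short history in the usage comment is a toy list, not measured data.

```python
from typing import Optional, Sequence


def rounds_to_reach(objective_per_round: Sequence[float], target: float) -> Optional[int]:
    """Return the first 1-indexed round whose objective is at or below target."""
    for round_idx, value in enumerate(objective_per_round, start=1):
        if value <= target:
            return round_idx
    return None  # the target was never reached


def communication_speedup(baseline_rounds: int, method_rounds: int) -> float:
    """Ratio of rounds needed by the comparison method to rounds needed by ours."""
    return baseline_rounds / method_rounds


# Toy history for illustration only: the target 5.71 is first met at round 3.
print(rounds_to_reach([9.0, 7.0, 5.5, 5.0], 5.71))  # 3

# Figures quoted in this section: 100 rounds vs about 28 rounds.
print(round(communication_speedup(100, 28), 2))  # 3.57
```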

FedIter-HT achieves its best result with K=1 (lr=0.03), reaching 10.09 at round 100. With K=3, the local hard thresholding after each step causes clients to develop misaligned sparse supports; averaging 100 such sparse vectors injects noise, producing oscillation around 14–17. With K=1 there is no drift accumulation across local steps, allowing a higher learning rate and steady (if slow) convergence.
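
The support-misalignment effect can be seen in a toy example with two clients, d=6 and tau=2. The vectors below are made up for illustration and the helper names are hypothetical; the point is only that averaging sparse vectors with different supports gives a denser vector, and the server-side threshold then discards part of what each client kept.

```python
from typing import List, Sequence, Set


def support(x: Sequence[float]) -> Set[int]:
    """Indices of the nonzero entries of x."""
    return {i for i, v in enumerate(x) if v != 0.0}


def hard_threshold(x: Sequence[float], tau: int) -> List[float]:
    """Keep the tau largest-magnitude entries of x and zero the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: (-abs(x[i]), i))[:tau])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


# Two 2-sparse client models with disjoint supports (toy values).
client_a = [3.0, 2.0, 0.0, 0.0, 0.0, 0.0]
client_b = [0.0, 0.0, 0.0, 2.5, 1.5, 0.0]

average = [(a + b) / 2 for a, b in zip(client_a, client_b)]
print(len(support(average)))       # 4: the average is no longer 2-sparse
print(hard_threshold(average, 2))  # [1.5, 0.0, 0.0, 1.25, 0.0, 0.0]
```

After the server applies the threshold, each client has lost one of its two selected coordinates, and the surviving entries are halved by the averaging.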

Best hyperparameters found by grid search:

| Algorithm | K | Learning rate |
| :--- | :---: | :---: |
| Fed-HT | 3 | 0.003 |
| FedIter-HT | 1 | 0.03 |
| Distributed-IHT | 1 | 0.01 |

Plot: `../_static/simulation_I_comparison.png`

---

## Simulation II: Sparse Logistic Regression

**Setup:** 100 clients, 1000 samples each, feature dimension d=1000, tau=200, alpha=1.0, beta=1.0.
Binary labels: top-100 samples per client by sigmoid score are assigned label 1.
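
The labeling rule above can be sketched as follows: score every sample with a sigmoid, rank the scores, and mark the top k as positive. The function names and the tie-breaking by index are assumptions; how the baseline computes the underlying logits is not shown here.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def top_k_binary_labels(logits: Sequence[float], k: int = 100) -> List[int]:
    """Assign label 1 to the k samples with the highest sigmoid score, 0 to the rest."""
    scores = [sigmoid(z) for z in logits]
    # Rank by decreasing score; ties go to the lower index (assumption).
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    positive = set(ranked[:k])
    return [1 if i in positive else 0 for i in range(len(scores))]


print(top_k_binary_labels([0.2, -1.0, 3.0, 0.5], k=2))  # [0, 0, 1, 1]
```

With 1000 samples per client and k=100, this rule yields a fixed 10% positive rate on every client regardless of the client's score distribution.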

**Objective function:** Binary cross-entropy.

**Command:**
```bash
flwr run . local-simulation --run-config 'dataset.name="simulation_II" model.input-dim=1000 algorithm.tau=200'
```

**Results (objective value vs communication rounds):**

| Rounds | Distributed-IHT | Fed-HT (K=5) | FedIter-HT (K=5) |
| :---: | :---: | :---: | :---: |
| 0 | TBD | TBD | TBD |
| 50 | TBD | TBD | TBD |
| 100 | TBD | TBD | TBD |
| 150 | TBD | TBD | TBD |
| 200 | TBD | TBD | TBD |

Best hyperparameters found by grid search:

| Algorithm | K | Learning rate |
| :--- | :---: | :---: |
| Fed-HT | TBD | TBD |
| FedIter-HT | TBD | TBD |
| Distributed-IHT | 1 | TBD |

Plot: `../_static/simulation_II_comparison.png`

---

## MNIST: Sparse Softmax Regression

**Setup:** 100 clients, 600 samples each, feature dimension d=784 (flattened images), tau=500, 10 classes.
Each client holds data from 2 digit classes (non-IID).

**Objective function:** Cross-entropy.

**Command:**
```bash
flwr run . local-simulation --run-config 'dataset.name="mnist" dataset.batch-size=64 algorithm.tau=500 model.input-dim=784 model.num-classes=10'
```

**Results (objective value vs communication rounds):**

| Rounds | Distributed-IHT | Fed-HT (K=5) | FedIter-HT (K=5) |
| :---: | :---: | :---: | :---: |
| 0 | TBD | TBD | TBD |
| 50 | TBD | TBD | TBD |
| 100 | TBD | TBD | TBD |

Plot: `../_static/mnist_comparison.png`

---

## Notes on Reproducing the Paper

The paper does not report exact final hyperparameter values for the grid search, only the search ranges.
The best values for each experiment were obtained by the baseline authors via grid search:

- K searched over {3, 5, 8, 10}
- Learning rate searched over {10, 1, 0.6, 0.3, 0.1, 0.06, 0.03, 0.01, 0.001}
- All algorithms initialized with x_0 = 0

For Simulations I and II, the paper plots the objective value against both communication rounds and
total internal iterations (Figure 3, with iterations in the right panels). Communication rounds are the
primary metric here because they reflect the practical advantage of federated methods.
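
The two x-axes are related by a simple count. The sketch below assumes that "total internal iterations" means local SGD steps per client, that is, rounds multiplied by K; this reading of the paper's axis is an assumption, and the helper name is hypothetical.

```python
def total_local_iterations(rounds: int, local_steps: int) -> int:
    """Local SGD steps taken by each client after the given number of rounds."""
    return rounds * local_steps


# Default settings of this baseline: 100 rounds with K=5 local steps.
print(total_local_iterations(100, 5))  # 500

# Distributed-IHT communicates after every step (K=1).
print(total_local_iterations(100, 1))  # 100
```

Under this reading, a method with larger K can use more local computation while still needing fewer communication rounds, which is why the two plots can rank the algorithms differently.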

The paper also reports results on real datasets (E2006-tfidf, RCV1, MNIST) in Figure 5.
Loaders for E2006-tfidf and RCV1 are in progress; both datasets must be downloaded from the LIBSVM website.
1 change: 1 addition & 0 deletions baselines/fedht/fedht/__init__.py
@@ -0,0 +1 @@
"""FedHT baseline for federated nonconvex sparse learning."""