Dataset Anonymization and Utility Preservation

Overview

This repository contains implementations of methods for dataset anonymization and utility preservation. The primary focus is on anonymizing network data features while maintaining their utility for machine learning tasks. Two approaches have been implemented:

- Feature Anonymization with Differential Privacy and Clustering
- Utility-Preserving Dataset Condensation with Gradient Matching

Each approach lives in its own notebook, which contains the implementation together with usage notes.

Features

- Anonymization of timestamps, duration, and packet size using clustering and differential privacy.
- Anonymization of IP addresses with prefix-preserving hashing and clustering.
- Implementation of Dataset Condensation with Gradient Matching to synthesize utility-preserving datasets.
- Testing and evaluation of anonymized datasets using KNN classifiers and other models.

Feature Anonymization with Differential Privacy and Clustering

Methodology

Timestamps, Duration, and Packet Size Anonymization:
- Clustering algorithms were employed to anonymize numerical features.
- Differential Privacy (DP) was implemented by adding Laplace noise to the clusters.
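
The two steps above can be sketched as follows. This is only an illustration, not the notebook's code: it assumes a simple 1-D k-means for the clustering step and Laplace noise added to the cluster centers. The function name `anonymize_numeric` and the values of `k`, `epsilon`, and `sensitivity` are hypothetical choices made for the example.

```python
import numpy as np

def anonymize_numeric(values, k=4, epsilon=1.0, sensitivity=1.0, n_iter=20, seed=0):
    """Replace each value with the Laplace-noised center of its cluster (illustrative)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Start the centers on evenly spaced quantiles so they lie inside the data range.
    centers = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(n_iter):
        # Assignment step of 1-D k-means: nearest center for every value.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            members = values[labels == j]
            if members.size:  # an empty cluster keeps its previous center
                centers[j] = members.mean()
    labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
    # Laplace noise with scale = sensitivity / epsilon (both are assumed parameters).
    noisy_centers = centers + rng.laplace(0.0, sensitivity / epsilon, size=k)
    return noisy_centers[labels], labels

# Toy packet sizes in bytes (made-up values, not taken from the dataset).
packet_sizes = np.array([60, 64, 66, 500, 512, 520, 1400, 1460, 1500], dtype=float)
anonymized, labels = anonymize_numeric(packet_sizes, k=3, epsilon=1.0)
```

Every value in a cluster is published as the same noisy center, so individual values are no longer distinguishable within a cluster. Whether this amounts to a formal epsilon-DP guarantee depends on how the sensitivity is bounded, which this README does not specify.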
IP Address Anonymization:
- The network part of each IP address was hashed with the SHA algorithm, so that addresses from the same network still share a common anonymized prefix (prefix-preserving).
- The host part was anonymized by clustering host addresses and replacing each with the mean value of its cluster.
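
A minimal sketch of the IP scheme described above, under several assumptions that are not stated in the source: IPv4 addresses, a fixed /24 split between network and host part, salted SHA-256 as the hash, and equal-width bins standing in for the host clustering. The name `anonymize_ips` and all parameters are hypothetical.

```python
import hashlib
import ipaddress
from collections import defaultdict

def anonymize_ips(ips, salt=b"example-salt", prefix_len=24, n_bins=4):
    """Hash the network part and replace the host part with its cluster mean (illustrative)."""
    host_bits = 32 - prefix_len
    host_mask = (1 << host_bits) - 1
    parsed = [int(ipaddress.IPv4Address(ip)) for ip in ips]
    hosts = [value & host_mask for value in parsed]

    # Stand-in clustering: equal-width bins over the host range.
    bin_width = max(1, (host_mask + 1) // n_bins)
    groups = defaultdict(list)
    for host in hosts:
        groups[host // bin_width].append(host)
    bin_mean = {b: round(sum(members) / len(members)) for b, members in groups.items()}

    result = []
    for value, host in zip(parsed, hosts):
        network = value >> host_bits
        # Salted SHA-256 of the network part, truncated to prefix_len bits.
        digest = hashlib.sha256(salt + network.to_bytes(4, "big")).digest()
        hashed_network = int.from_bytes(digest[:4], "big") >> (32 - prefix_len)
        new_value = (hashed_network << host_bits) | bin_mean[host // bin_width]
        result.append(str(ipaddress.IPv4Address(new_value)))
    return result

# Example addresses (made up for illustration).
ips = ["192.168.1.10", "192.168.1.20", "192.168.1.200", "10.0.0.10"]
anon = anonymize_ips(ips)
```

In this sketch, addresses from the same /24 network receive the same hashed prefix, and nearby hosts collapse onto one value. A plain hash of the whole network part does not preserve nested prefix relations (for example, two networks sharing a /16), so a scheme that needs that property would have to hash the prefix bit by bit or use a dedicated prefix-preserving algorithm.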

Results

Accuracy of the dataset before anonymization: 99.52% (KNN classifier).
Accuracy of the dataset after anonymization: 99.51% (KNN classifier).
Correlation matrices for anonymized features showed minimal deviation from original values.
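
The kind of utility check reported above can be sketched as follows. This example does not reproduce the reported numbers: it runs on toy data, uses a small brute-force NumPy k-nearest-neighbour classifier as a stand-in for the KNN classifier, and uses coarse rounding as a stand-in for the anonymization step. The helper names `knn_accuracy` and `max_correlation_shift` are hypothetical.

```python
import numpy as np

def knn_accuracy(X_train, y_train, X_test, y_test, k=5):
    """Accuracy of a brute-force k-nearest-neighbour classifier (Euclidean distance)."""
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]  # shape (n_test, k), integer class labels
    predictions = np.array([np.bincount(row).argmax() for row in votes])
    return float((predictions == y_test).mean())

def max_correlation_shift(X_original, X_anonymized):
    """Largest absolute change between the two feature correlation matrices."""
    diff = np.corrcoef(X_original, rowvar=False) - np.corrcoef(X_anonymized, rowvar=False)
    return float(np.abs(diff).max())

rng = np.random.default_rng(0)
# Toy data: two well-separated classes with three features (not the real dataset).
X = np.vstack([rng.normal(0.0, 1.0, (200, 3)), rng.normal(4.0, 1.0, (200, 3))])
y = np.array([0] * 200 + [1] * 200)
# Stand-in anonymization: snap every feature to a grid with step 0.5.
X_anon = np.round(X / 0.5) * 0.5

# Use the same train/test split for both versions of the data.
order = rng.permutation(len(X))
train, test = order[:300], order[300:]
acc_before = knn_accuracy(X[train], y[train], X[test], y[test])
acc_after = knn_accuracy(X_anon[train], y[train], X_anon[test], y[test])
shift = max_correlation_shift(X, X_anon)
print(f"accuracy before: {acc_before:.4f}, after: {acc_after:.4f}, max corr shift: {shift:.4f}")
```

Keeping the split identical for the original and anonymized data means any accuracy difference comes from the anonymization rather than from a different test set.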

Utility-Preserving Dataset Condensation with Gradient Matching

Methodology

Initialization:
- Synthetic data points (e.g., 1000) were initialized randomly.
- A set of simple neural networks, including convolutional models, was defined.
Outer and Inner Steps:
- At each "outer step," a random model and real data points (256 per class) were selected.
- The synthetic data points were then updated for 10 "inner steps," each minimizing a gradient-matching loss based on the cosine similarity between the gradients computed on the real and on the synthetic data points.
Generalization:
- Model gradients were reset to zero at each step, so that updates did not accumulate across steps, to prevent overfitting.
- The synthesized dataset was designed for general use, not limited to any specific use case (e.g., IDS).
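
The outer and inner steps described above can be sketched in a few lines. This is a heavily simplified illustration and not the notebook's implementation: it assumes a randomly initialised logistic-regression model in place of the neural networks, toy two-class data, far fewer points than the 1000 synthetic and 256-per-class real points mentioned above, and central finite differences as a stand-in for automatic differentiation. All function names and the step size are hypothetical; only the 10 inner steps follow the description.

```python
import numpy as np

def model_gradient(X, y, w, b):
    """Gradient of the mean logistic loss w.r.t. (w, b), flattened into one vector."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    err = p - y
    return np.append(X.T @ err / len(y), err.mean())

def matching_loss(X_syn, y_syn, g_real, w, b):
    """Cosine distance (1 - cosine similarity) between real and synthetic gradients."""
    g_syn = model_gradient(X_syn, y_syn, w, b)
    denom = np.linalg.norm(g_real) * np.linalg.norm(g_syn) + 1e-12
    return 1.0 - float(g_real @ g_syn) / denom

def numeric_grad(f, X, eps=1e-5):
    """Central finite differences; a stand-in for autograd in this sketch."""
    grad = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        step = np.zeros_like(X)
        step[idx] = eps
        grad[idx] = (f(X + step) - f(X - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
# Toy "real" data: two classes, two features (not the network dataset).
X_real = np.vstack([rng.normal(-2.0, 1.0, (256, 2)), rng.normal(2.0, 1.0, (256, 2))])
y_real = np.array([0.0] * 256 + [1.0] * 256)
# Randomly initialised synthetic points, three per class.
X_syn = rng.normal(0.0, 1.0, (6, 2))
y_syn = np.array([0.0] * 3 + [1.0] * 3)

history = []
for outer in range(20):
    # Outer step: draw a fresh random model and a class-balanced real batch.
    w, b = rng.normal(0.0, 1.0, 2), rng.normal()
    batch = np.concatenate([rng.choice(256, 64, replace=False),
                            256 + rng.choice(256, 64, replace=False)])
    g_real = model_gradient(X_real[batch], y_real[batch], w, b)
    loss_fn = lambda X: matching_loss(X, y_syn, g_real, w, b)
    for inner in range(10):
        # Inner step: move the synthetic points down the matching-loss gradient.
        X_syn = X_syn - 0.1 * numeric_grad(loss_fn, X_syn)
    history.append(loss_fn(X_syn))
```

Because a new model is drawn at every outer step, the synthetic points are pushed to produce useful gradients for many models rather than for a single one. In a real implementation the finite-difference helper would be replaced by the automatic differentiation of a deep learning framework.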

Results

Synthesized datasets maintained utility comparable to the original dataset.
Testing on separate models confirmed similar accuracy and performance metrics.

Overall Results

Accuracy before anonymization: 99.55%.
Accuracy after anonymization with DP and clustering: 97.57%.
Accuracy after dataset condensation: comparable to the original dataset, with utility that is not tied to a specific use case.

File Structure

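|-- DP_implementation.ipynb                          # Implementation of Differential Privacy and Clustering
|-- Dataset_condensation_gradient_matching.ipynb     # Implementation of Gradient Matching Condensation
|-- Gradient_mathcing_with_DP.ipynb                  # Gradient matching combined with DP (description inferred from the file name)
|-- README.md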

Repository: Khushmagrawal/data-anonymization
