Commit 2682bed

Initial commit
0 parents  commit 2682bed


48 files changed (+8571, -0 lines)

.gitignore

+70
```gitignore
# These are some examples of commonly ignored file patterns.
# You should customize this list as applicable to your project.
# Learn more about .gitignore:
# https://www.atlassian.com/git/tutorials/saving-changes/gitignore

# Node artifact files
node_modules/
dist/

# Compiled Java class files
*.class

# Compiled Python bytecode
*.py[cod]

# Log files
*.log

# Package files
*.jar

# Maven
target/
dist/

# JetBrains IDE
.idea/

# Unit test reports
TEST*.xml

# Generated by macOS
.DS_Store

# Generated by Windows
Thumbs.db

# Applications
*.app
*.exe
*.war

# Large media files
*.mp4
*.tiff
*.avi
*.flv
*.mov
*.wmv
*.jpeg
*.jpg
*.png
*.bmp

# Data files
*.pickle
*.ubyte
*-ubyte
*-ubyte.gz
*.meta
*.html
*.csv
.ruff_cache/

# Test notebooks
activelearning/notebooks/

# Exceptions: don't ignore what is here
!docs/iris_random.png
!docs/iris_comparison.png
```

.pre-commit-config.yaml

+42
```yaml
fail_fast: true

repos:
  - repo: https://github.com/pre-commit/mirrors-prettier
    rev: v2.7.1
    hooks:
      - id: prettier
        name: (prettier) Reformat YAML files with prettier
        types: [yaml]
  # Utilities to avoid common mistakes
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: check-added-large-files
        name: Check for large files added to git
        args: ["--maxkb=500"]
      - id: check-merge-conflict
        name: Check for files that contain merge conflict strings
      - id: mixed-line-ending
        name: Check for mixed line endings
      - id: no-commit-to-branch
        name: Prevent commits to protected branches
        args: ["--branch", "main", "--branch", "master"]
  - repo: https://github.com/srstevenson/nb-clean
    rev: 3.2.0
    hooks:
      - id: nb-clean
        args:
          - --remove-empty-cells
          - --preserve-cell-metadata
          - tags
          - --
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.1
    hooks:
      - id: ruff
        types_or: [python, pyi, jupyter]
        args: ["--config", "pyproject.toml", "--fix"]
        exclude: "activelearning/notebooks/"
      - id: ruff-format
        types_or: [python, pyi, jupyter]
        args: ["--config", "pyproject.toml"]
```

CHANGELOG.md

+6
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

README.md

+112
# Active learning

[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)

Active Learning is a subfield of machine learning in which the model iteratively queries the most relevant unlabeled data points, optimizing performance with minimal labeled data.
This project implements a variety of Active Learning query strategies, enabling easy application and comparison of different types of acquisition functions. The Active Learning framework is based on [modAL](https://github.com/modAL-python), a popular Python package.
## Structure

The repository includes the following features:

- The main scripts `AL_cycle.py` and `AL_selection.py`, which contain functions to execute an active learning cycle with the specified parameters and to compare the performance of different query strategies.
- The `activelearning/queries` folder contains the implemented query strategies, for both the pool based and stream based scenarios. More detail on these below.
- The `activelearning/utils` folder contains helper functions for the main scripts and examples.
- The `examples` folder contains demonstrative notebooks showing how the main features work.
- The `docs` folder contains additional documentation.

## Installation

The repository is set up as a Poetry project and by default requires Python 3.10 or later.
To install it, follow these steps:

First, install `poetry` if you haven't already, following the instructions on the [Poetry installation page](https://python-poetry.org/docs/).
Then, clone the repository to your local machine:
```bash
git clone https://github.com/orobix/active-learning
cd active-learning
```
Use Poetry to install the project dependencies:
```bash
poetry install
```
Finally, activate the virtual environment created by Poetry:
```bash
poetry shell
```

## Introduction to Active Learning

Active Learning aims to save time and labeling costs by reducing the amount of labeled data required to train models, as annotation is often an expensive and laborious task. The idea is to iteratively select a small set of the most relevant samples from the unlabeled data and query an oracle for their labels. This makes it possible to train a model to high accuracy while spending fewer resources on the construction of the dataset.

For example, when using a random forest classifier on the _Iris_ dataset and randomly choosing one instance to be labeled at every iteration, it's possible to reach the same accuracy the model would achieve on the whole training set (96%) with only 12 data points.

![AL accuracy on Iris](docs/iris_random.png)

## Query strategies

Query strategies, also called acquisition functions, are the criteria by which data points are selected to be labeled. **Representation based** query strategies try to explore the whole feature space to find samples that are representative of the data as a whole. They are model-agnostic methods, as they don't require training a model.
Implemented representation based query strategies are:
* Information density query
* K-Means cluster-based query
* Diversity query
* Coreset query
* ProbCover query

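As a concrete illustration of a representation based strategy, the sketch below selects the pool points nearest to k-means centroids as representative samples to label. The function name, signature, and the plain Lloyd's iteration are illustrative assumptions, not the repository's actual API:

```python
# Minimal sketch of a k-means cluster-based query: cluster the unlabeled
# pool, then pick the point closest to each centroid as a representative
# sample to label. `query_kmeans` and its signature are illustrative only.
import numpy as np

def query_kmeans(X_pool: np.ndarray, k: int = 3, n_iter: int = 20, seed: int = 0) -> np.ndarray:
    """Return indices of the k pool points nearest to the k-means centroids."""
    rng = np.random.default_rng(seed)
    centers = X_pool[rng.choice(len(X_pool), size=k, replace=False)].copy()
    for _ in range(n_iter):  # plain Lloyd's iterations
        # Distances of every point to every centroid, shape (n_samples, k)
        d = np.linalg.norm(X_pool[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):  # recompute centroids (keep the old one if a cluster empties)
            if (labels == j).any():
                centers[j] = X_pool[labels == j].mean(axis=0)
    d = np.linalg.norm(X_pool[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=0)  # one representative pool index per cluster

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(100, 4))
query_idx = query_kmeans(X_pool, k=3)  # indices of 3 points to send to the oracle
```

No model is trained anywhere in this procedure, which is what makes such strategies model-agnostic.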
**Information based** query strategies rely on a model trained on a small labeled set of data, and search for the most informative unlabeled samples according to the model's predictions, measured for example with uncertainty criteria. This category also includes **Query by committee** methods, which measure informativeness with the predictions of a committee of models.
Implemented information based query strategies are:
* Least Confident uncertainty sampling
* Margin uncertainty sampling
* Entropy uncertainty sampling
* QBC vote entropy sampling
* QBC consensus entropy sampling
* QBC max disagreement sampling

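The three classic uncertainty criteria can be sketched directly on predicted class probabilities; the function names below are illustrative, not the repository's API:

```python
# Sketch of the three classic uncertainty criteria, computed on a
# (n_samples, n_classes) array of predicted class probabilities.
import numpy as np

def least_confident(probs):
    # 1 - max probability: higher means more uncertain
    return 1.0 - probs.max(axis=1)

def margin(probs):
    # Difference between the two highest probabilities: smaller means more uncertain
    part = np.sort(probs, axis=1)
    return part[:, -1] - part[:, -2]

def entropy(probs):
    # Shannon entropy of the predictive distribution (epsilon avoids log(0))
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

probs = np.array([[0.90, 0.05, 0.05],   # confident prediction
                  [0.40, 0.35, 0.25]])  # uncertain prediction
query_idx = entropy(probs).argmax()  # index of the most uncertain sample
print(int(query_idx))  # 1
```

All three criteria agree here: the second sample is the more informative one to label.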
**Bayesian Optimization** based strategies rely on stochastic forward passes in a neural network classifier, referred to as Monte Carlo Dropout, to approximate Bayesian posterior probabilities and measure uncertainty.
Implemented Bayesian query strategies are:
* MC max entropy
* BALD (Bayesian Active Learning by Disagreement)
* Max variation ratios
* Max mean std

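Given T stochastic forward passes per sample, MC max entropy and BALD reduce to a few array operations. The sketch below simulates the T passes with random softmax outputs instead of a real dropout network; shapes and function names are illustrative assumptions:

```python
# Sketch of MC max entropy and BALD from T stochastic forward passes
# (e.g. with dropout kept active at inference time).
# `mc_probs` has shape (T, n_samples, n_classes); here it is simulated.
import numpy as np

def mc_max_entropy(mc_probs):
    """Entropy of the mean predictive distribution over the T passes."""
    mean = mc_probs.mean(axis=0)
    return -(mean * np.log(mean + 1e-12)).sum(axis=1)

def bald(mc_probs):
    """Mutual information: predictive entropy minus expected per-pass entropy."""
    per_pass = -(mc_probs * np.log(mc_probs + 1e-12)).sum(axis=2).mean(axis=0)
    return mc_max_entropy(mc_probs) - per_pass

rng = np.random.default_rng(0)
logits = rng.normal(size=(20, 5, 3))  # T=20 passes, 5 samples, 3 classes
mc_probs = np.exp(logits) / np.exp(logits).sum(axis=2, keepdims=True)
query_idx = bald(mc_probs).argmax()  # the sample the stochastic passes disagree on most
```

BALD is always non-negative: it is highest for samples where individual passes are confident but mutually inconsistent, which is exactly the disagreement signal these strategies exploit.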
## Stream based scenario

When data points arrive one at a time from a stream, instead of being selected from a pool of unlabeled data, there are two options: in the **Batch** setting, samples are saved until a batch is complete, and the classical query strategies are then performed on the batch; in the pure **Stream** setting, a criterion is used to decide whether to query each new point or discard it.
Implemented stream based query strategies are:
* Stream diversity query
* Stream Coreset query
* Stream ProbCover query
* Stream LC uncertainty sampling
* Stream Margin uncertainty sampling
* Stream Entropy uncertainty sampling

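The pure stream setting can be sketched as a simple threshold rule: query an incoming point only when the model's uncertainty exceeds a threshold, otherwise discard it. The function name and the threshold value are illustrative assumptions:

```python
# Sketch of a pure stream-based query rule using the least-confident
# criterion on a single incoming sample's predicted probabilities.
import numpy as np

def should_query(probs: np.ndarray, threshold: float = 0.5) -> bool:
    """Query if 1 - max probability of this single sample exceeds the threshold."""
    return bool((1.0 - probs.max()) > threshold)

stream = [np.array([0.95, 0.03, 0.02]),   # confident -> discard
          np.array([0.40, 0.35, 0.25])]   # uncertain -> query the oracle
decisions = [should_query(p) for p in stream]
print(decisions)  # [False, True]
```

The threshold trades off labeling budget against model improvement; in practice it can be tuned or adapted online as the model gets more confident.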
## Example

The functions in this repository can be used to compare the effectiveness of different query strategies on a labeled dataset, in order to choose the appropriate one for real applications with unlabeled data. With the following script we can compare some representation-based strategies on the _Iris_ dataset:

```python
scores = strategy_comparison(
    X_train=None, y_train=None,
    X_pool=X_pool, y_pool=y_pool,
    X_test=X_test, y_test=y_test,
    classifier="randomforest",
    query_strategies=[query_kmeans_foreach, query_density, query_coreset, query_random],
    n_instances=n_instances,
    K=3,  # number of clusters for k-means query
    metric="euclidean",  # metric for density query
    goal_acc=0.96,
)
```

![AL comparison on Iris](docs/iris_comparison.png)

Details of this implementation can be found in `examples/1_iris.ipynb`.

## Resources

- [modAL documentation](https://modal-python.readthedocs.io/en/latest/)
- [A Survey on Active Learning: State-of-the-Art, Practical Challenges and Research Directions (Tharwat & Schenck, 2023)](https://www.mdpi.com/2227-7390/11/4/820)
- [A Survey on Deep Active Learning: Recent Advances and New Frontiers (Li et al., 2024)](https://ieeexplore.ieee.org/abstract/document/10537213)
