A multi-stage hyperparameter optimization engine for binary classifiers, built from scratch. BlueTuna combines gradient-based search, Latin Hypercube Sampling, and a specialized single-layer perceptron to guide hyperparameter values toward performance optima.
BlueTuna is a hyperparameter tuning system that breaks ML optimization into three coordinated stages, each informing the next. More on these stages can be found in the "Technical Details" section.
Stage 1 - Gradient-based search region identification
- Individual parameter zones are scored and filtered by their likelihood of high performance
- Each parameter has its future weight pre-initialized according to its performance ceiling and sensitivity
Stage 2 - Weight learning
- A specialized single-layer perceptron is trained on LHS-sampled configurations and data from the previous stage
- Pairwise interactions are analyzed, providing the SL-perceptron with a more holistic 'view' of the param. performance landscape
Stage 3 - Fixed-weight gradient descent
- Learned weights are fixed, and gradient descent is applied to the hyperparameter values, which are tuned proportionally to their learned influence
- Gradients from interaction terms are decomposed back to the base parameters via the chain rule, so every update is informed by the full interaction structure
As a personal project and more of an experimental framework, BlueTuna is not available as a pip-installable package. You can clone the repository and install dependencies with:
```bash
git clone https://github.com/Elliot-Chan-120/BlueTuna.git
cd BlueTuna
pip install -r requirements.txt
```
```python
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from BlueTuna.Tuna import Tuna

X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    flip_y=0.05,
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# parameter grid zone
bluetuna_space = {
    "learning_rate": (0.001, 0.3, 3),  # (min, max, n_splits)
    "max_depth": (2, 12, 3),
    "subsample": (0.5, 1.0, 3),
    "colsample_bytree": (0.5, 1.0, 3),
    "min_child_weight": (1, 20, 3),
    "gamma": (0.0, 5.0, 3)
}

tuner = Tuna(X_train=X_train,
             X_test=X_test,
             y_train=y_train,
             y_test=y_test,
             search_space=bluetuna_space,
             base_model=XGBClassifier,
             test_metric='F1',  # options: F1 and Accuracy
             verbose=True,
             record=True)  # record for benchmarking functions

optima_params, optima_res, optimized_search_space = tuner.bluetuna_run(grad_nudge=0.5,
                                                                       training_reps=100,
                                                                       n_epochs=1000,
                                                                       tune_lr=0.05,
                                                                       tune_reps=100,
                                                                       f_activation="sigmoid")
print(optima_params)
print(optima_res)
print(optimized_search_space)
print(tuner.complete_history)
```
BlueTuna search space format:
```python
{
    'param_name': (min_value, max_value, n_splits)
}
```
BlueTuna parameter notes:
- n_splits determines how many interior points are sampled during preprocessing (stage 1)
- more splits -> finer resolution and more model evals
- grad_nudge determines the forward step, as a percentage of each region's width, used when sampling gradients at the splits
- tune_lr determines how aggressively each hyperparameter is tuned during stage 3
- it's essentially a learning rate applied to the hyperparameter value instead of its weight
Benchmarked against Optuna (TPE sampler) across 20 randomized seeds on a binary classification task with synthetic data, using XGBoost as the base model. Evaluated using F1 (macro). Runner configurations are in the "benchmarking" folder; 6 hyperparameters were tuned in combination, with n_splits set to 5 for each.
| Seed | BlueTuna | Optuna | Delta (BT-Opt) |
|---|---|---|---|
| 1 | 0.9049 | 0.9175 | -0.0126 |
| 2 | 0.9242 | 0.9378 | -0.0136 |
| 3 | 0.9372 | 0.9395 | -0.0023 |
| 4 | 0.9590 | 0.9710 | -0.012 |
| 5 | 0.9340 | 0.9519 | -0.0179 |
| 6 | 0.8759 | 0.8845 | -0.0086 |
| 7 | 0.9100 | 0.9195 | -0.0095 |
| 8 | 0.9091 | 0.9065 | 0.0026 |
| 9 | 0.9302 | 0.9340 | -0.0038 |
| 10 | 0.9293 | 0.9341 | -0.0048 |
| 11 | 0.9173 | 0.9155 | 0.0018 |
| 12 | 0.8947 | 0.9021 | -0.0074 |
| 13 | 0.8579 | 0.8782 | -0.0203 |
| 14 | 0.9268 | 0.9246 | 0.0022 |
| 15 | 0.9282 | 0.9262 | 0.002 |
| 16 | 0.8990 | 0.9173 | -0.0183 |
| 17 | 0.9395 | 0.9466 | -0.0071 |
| 18 | 0.8800 | 0.8750 | 0.005 |
| 19 | 0.9266 | 0.9394 | -0.0128 |
| 20 | 0.9540 | 0.9685 | -0.0145 |
Table 1: Raw Max F1 Scores (4 sf) obtained from BlueTuna and Optuna across 20 random-seeded runs
BlueTuna beat Optuna on 5/20 seeds (8, 11, 14, 15, 18) and finished within 0.005 of Optuna on 3 additional runs, making it competitive on 8/20, or 40%, of benchmark trials (Table 1). The average delta was -0.007595, reflecting an asymmetric variance profile (Table 1). BlueTuna's upper end is competitive with Optuna; its lower end, however, diverges sharply due to sensitivity to preprocessing quality.
Fig 1: Banded Convergence Curve plot showing the max F1 value obtained per evaluation between BlueTuna and Optuna's full runs.
Fig 2: Banded Convergence Curve plot showing the max F1 value obtained per evaluation between BlueTuna after preprocessing (Stage 1) and Optuna's full run.
Fig 3: Boxplot representing max F1 scores obtained between BlueTuna and Optuna across 20 random-seeded runs
Fig 4: Boxplot representing walltime between BlueTuna and Optuna across 20 random-seeded runs
Stages 1 and 2 (preprocessing + exploration) in BlueTuna comprised 318 trials on their own (Fig 1), with 168 going to stage 1 and 150 to stage 2. As expected, this exposed BlueTuna's main limiting attribute: walltime explosion as preprocessing becomes more thorough (Fig 1 and Fig 4).
However, when preprocessing was completed, run convergence showed BlueTuna catching up as shown by the greater degree of banded region overlap, demonstrating a competitive performance ceiling on favourable seeds (Fig 2).
The notable acceleration in optima-breaking rate (faster climbing) starting at evaluation 150 in BlueTuna's curve in Figure 2 (and at evaluation 318 in Fig 1) corresponds to the transition from stage 2 (training on the LHS-sampled dataset) to stage 3 (fixed-weight gradient descent tuning), where the learned weights guide hyperparameter value updates. The improved rate of performance gains at this transition suggests the perceptron's learned interaction weights are informative: the exploitation phase is making consistent, structurally guided moves that produce measurable gains.
In terms of raw results, BlueTuna was outperformed by Optuna with an average delta of -0.007595, despite being competitive on 8/20 seeds and beating it on 5/20. Surprisingly, the boxplot shows BlueTuna's median performance is slightly higher than Optuna's, while Optuna's box is tighter and higher overall (Fig 3). It is also noteworthy that on bad seeds such as 5, 13, and 16, BlueTuna's lower performance dragged the average down significantly (Table 1 and Fig 3). Together, this implies that although BlueTuna is capable of finding configurations competitive with, or slightly better than, Optuna's (shown by the median), it carries much more severe negative variance, as demonstrated by the aforementioned bad seeds.
Put more concisely, BlueTuna's main performance limitation is consistency, not performance ceiling (walltime aside).
Overall, this was expected as Optuna's TPE is exploitation-heavy and reliable, consistently exploiting good regions once found. It appears that BlueTuna behaves more as a diverse explorer, occasionally locating optima otherwise overlooked by traditional sampling methods. Future scope would involve employing more exploitative methods to increase performance and decrease walltime (as much as I can).
The results suggest that the hyperparameter landscape filtering and characterization that BlueTuna employs allows it to identify high-performing parameter value regions, potentially setting the exploitation phase up for competitive performance against current top-performing TPE samplers.
For each hyperparameter to be tuned, its search space is essentially "split" evenly into regions by the number of n_splits chosen (Fig 5).
- e.g. param1: (2, 10, 3) -> 3 "splits" / lines placed along the range of 2-10 such that each area is equal
- the lines would end up falling on 4, 6, 8 -> therefore the regions created by these splits are: (2, 4), (4, 6), (6, 8), (8, 10)
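The splitting step above can be sketched as follows (a hypothetical helper for illustration, not BlueTuna's actual code):

```python
def split_regions(lo, hi, n_splits):
    """Place n_splits evenly spaced interior lines along [lo, hi],
    yielding n_splits + 1 equal-width regions."""
    width = (hi - lo) / (n_splits + 1)
    lines = [lo + width * i for i in range(1, n_splits + 1)]
    points = [lo] + lines + [hi]
    regions = list(zip(points[:-1], points[1:]))
    return lines, regions

lines, regions = split_regions(2, 10, 3)
# lines   -> [4.0, 6.0, 8.0]
# regions -> [(2, 4.0), (4.0, 6.0), (6.0, 8.0), (8.0, 10)]
```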
At each region boundary, the base model is evaluated at that point and at a small forward step of grad_nudge% of the region width, producing a local pseudo-gradient. The result is a list of points (region starting points) and their pseudo-gradients.
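A minimal sketch of that pseudo-gradient, assuming `evaluate` is a stand-in for fitting and scoring the base model at a given parameter value:

```python
def pseudo_gradient(evaluate, x, region_width, grad_nudge=0.5):
    """Finite-difference slope: step forward by grad_nudge (a fraction of
    the region width) and measure the change in the test metric."""
    step = grad_nudge * region_width
    return (evaluate(x + step) - evaluate(x)) / step
```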
Each region is then scored by two components:
Raw output: the performance area the region already encompasses, computed geometrically as a trapezoid:
square_area = length × min(Y_left, Y_right)
triangle_area = length × |Y_right - Y_left| × 0.5
raw_output = square_area + triangle_area
Projection score: what the gradients we obtained imply optimal performance is within this zone:
projection_score = grad_left + (-grad_right)
The intuition behind this:
- +grad → -grad: a positive gradient implies greater values ahead of it, a negative gradient implies greater values behind it. Both suggest an optimum lies between them.
- -grad → -grad or +grad → +grad: one boundary implies an interior optimum → the score reflects that gradient's magnitude.
- -grad → +grad: neither implies an interior optimum → the score is negative, and the region will almost certainly be filtered out.
The "projection_output" used for region filtering is raw_output × projection_score. Regions scoring below average are filtered out, concentrating subsequent sampling in zones that are both high-performing and likely to contain a high-performing optimum.
A visualization of this stage is shown below; it is a conceptual illustration rather than a direct plot of the example above:
Fig 5: Stage 1 Preprocessing Concept Visualization showing an individual hyperparameter's search space being split by n_splits=2 as shown by the dotted lines. The dashed lines indicate the gradients of all region points which are used to create performance projections
Traditional neural structures (and ML models in general) typically have thousands to millions of training samples to learn from, enough to comfortably move past randomly initialized weights. BlueTuna's learning stage generates hundreds at most, which is too few training samples for meaningful weight convergence. BlueTuna works around this by pre-initializing hyperparameter weights from the data gained during the earlier search region filtering.
During stage 1, each hyperparameter's sensitivity is determined from its average gradient, and its performance ceiling (max_Y) is also recorded.
A composite score, sensitivity * max_Y, is computed per hyperparameter, and each hyperparameter's weight is initialized as its relative share of the total of all scores.
```
total = sum(scores)
weight = score / total   # for each hyperparameter's score
```
The intuition behind the scoring was that combining both the sensitivity and maximal performance ceiling into one term "fills in the gaps" behind what each individual term lacks.
- sensitivity alone cannot tell you if the parameter actually improves the model significantly
- max_Y alone cannot tell you if altering it changes performance by a significant degree
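A short sketch of this pre-initialization, with made-up sensitivity and max_Y values purely for illustration:

```python
stage1_stats = {
    # param: (sensitivity = average gradient, performance ceiling = max_Y)
    "learning_rate": (0.40, 0.93),
    "max_depth":     (0.15, 0.91),
    "gamma":         (0.05, 0.88),
}

# Composite score per hyperparameter: sensitivity * max_Y.
scores = {p: sens * max_y for p, (sens, max_y) in stage1_stats.items()}
total = sum(scores.values())

# Each weight is that parameter's relative share of the total score.
init_weights = {p: s / total for p, s in scores.items()}
```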
These weights then become the starting point for the perceptron rather than being randomly initialized as in traditional neural structures. This provides a stable, meaningful, and informative starting point for the perceptron to learn from compared to random initialization, and is also handy for reproducibility.
Potential limitation: I should note that feature interaction term weights are initialized randomly during stage 1, since the only data at this stage concerns parameters tuned in isolation, not in combination. Future improvements would involve finding ways to pre-initialize feature interaction weights.
Because the perceptron is generally trained on a small dataset with hand-crafted initialization, learning rate has a large effect on the weights that are to be fixed for stage 3. Choosing a learning rate that is too high would cause weight overshoot whereas one that's too low would result in the wrong weights being finalized.
To address this, _autoconverge runs 5 independent training runs across a wide range of learning rates, picking the one resulting in minimal variance in learned weights. If a learning rate produces similar relative weights and rankings across independent runs, then that is more likely to reflect genuine signal from training data rather than noise from the optimization path. The lower the variance, the more signal is actually being captured by the model and thus the more likely it is to be used in the online tuning stage.
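Conceptually, the selection criterion of `_autoconverge` can be sketched like this (`train_perceptron` is a hypothetical stand-in that returns a learned weight vector for a given learning rate and seed):

```python
import statistics

def autoconverge(train_perceptron, candidate_lrs, n_runs=5):
    """Pick the learning rate whose independent runs produce the most
    stable (lowest-variance) learned weights."""
    best_lr, best_var = None, float("inf")
    for lr in candidate_lrs:
        runs = [train_perceptron(lr, seed) for seed in range(n_runs)]
        # Variance of each weight across runs, averaged over all weights.
        var = statistics.mean(
            statistics.pvariance(w[i] for w in runs) for i in range(len(runs[0]))
        )
        if var < best_var:
            best_lr, best_var = lr, var
    return best_lr
```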
During online tuning (stage 3), the weights become fixed and gradient updates are applied to the hyperparameter values themselves. However, a base parameter's own weight alone cannot capture the full influence of the hyperparameter landscape when pairwise interaction terms are present. These interaction terms need to be decomposed back into their constituent parameters via the chain rule so that every update reflects the full interaction structure.
Feature interaction term calculations: a*b and a/b
- a*b: gradient with respect to 'a' is 'b', and vice versa
- a/b -> log(a) - log(b): gradient with respect to 'a' is 1/a, with respect to 'b' is -1/b (the log form avoids numerical underflow)
Each base parameter accumulates gradients from its own weight and every partial derivative of every feature interaction term it's involved in.
This means that a parameter appearing in multiple high-weighted interaction terms can receive a much larger accumulated gradient (and thus a substantial update) even if its own weight is lower, which is intended since interaction importance should influence how aggressively a base param gets tuned.
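A hedged sketch of this accumulation using the two interaction forms above; in this simplified version the fixed learned weights stand in for the upstream gradient contributions:

```python
def accumulate_gradients(values, base_weights, interactions):
    """values: {param: current value}; base_weights: {param: fixed weight};
    interactions: list of (param_a, param_b, kind, weight), where kind is
    "mul" for a*b or "logratio" for log(a) - log(b)."""
    # Start from each parameter's own-weight contribution.
    grads = dict(base_weights)
    for a, b, kind, w in interactions:
        if kind == "mul":       # d(a*b)/da = b, d(a*b)/db = a
            grads[a] += w * values[b]
            grads[b] += w * values[a]
        else:                   # d(log a - log b)/da = 1/a, d/db = -1/b
            grads[a] += w / values[a]
            grads[b] -= w / values[b]
    return grads
```

A parameter appearing in several heavily weighted interaction terms therefore accumulates a large total gradient even when its own weight is small, matching the intended behaviour described above.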




