A multi-stage hyperparameter optimization engine for binary classifiers, built from scratch. BlueTuna combines gradient-based search, Latin Hypercube Sampling, and a specialized single-layer perceptron to guide hyperparameter values toward performance optima.
BlueTuna is a hyperparameter tuning system that breaks ML optimization into three coordinated stages, each informing the next. More on these stages can be found in the "Technical Details" section.
Stage 1 - Gradient-based search region identification
- Individual parameter zones are scored and filtered by their likelihood of high performance
- Each parameter has its future weight pre-initialized according to its performance ceiling and sensitivity
Stage 2 - Weight learning
- A specialized single-layer perceptron is trained on LHS-sampled configurations and data from the previous stage
- Pairwise interactions are analyzed, providing the SL-perceptron with a more holistic 'view' of the param. performance landscape
Stage 3 - Fixed-weight gradient descent
- Learned weights are fixed, and gradient descent is applied to the hyperparameter values, which are tuned proportionally to their learned influence
- Gradients from interaction terms are decomposed back to the base parameters via the chain rule, so every update is informed by the full interaction structure
As a personal project and more of an experimental framework, BlueTuna is not available as a pip-installable package. You can clone the repository and install dependencies with:
```bash
git clone https://github.com/Elliot-Chan-120/BlueTuna.git
cd BlueTuna
pip install -r requirements.txt
```
```python
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from BlueTuna.Tuna import Tuna

X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    flip_y=0.05,
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# parameter grid zone
bluetuna_space = {
    "learning_rate": (0.001, 0.3, 3),  # (min, max, n_splits)
    "max_depth": (2, 12, 3),
    "subsample": (0.5, 1.0, 3),
    "colsample_bytree": (0.5, 1.0, 3),
    "min_child_weight": (1, 20, 3),
    "gamma": (0.0, 5.0, 3)
}

tuner = Tuna(X_train=X_train,
             X_test=X_test,
             y_train=y_train,
             y_test=y_test,
             search_space=bluetuna_space,
             base_model=XGBClassifier,
             test_metric='F1',  # options: F1 and Accuracy
             verbose=True,
             record=True)  # record for benchmarking functions

optima_params, optima_res, optimized_search_space = tuner.bluetuna_run(grad_nudge=0.5,
                                                                       training_reps=100,
                                                                       n_epochs=1000,
                                                                       tune_lr=0.05,
                                                                       tune_reps=100,
                                                                       f_activation="sigmoid")
print(optima_params)
print(optima_res)
print(optimized_search_space)
print(tuner.complete_history)
```
BlueTuna search space format:
```python
{
    'param_name': (min_value, max_value, n_splits)
}
```
BlueTuna parameter notes:
- n_splits determines how many interior points are sampled during preprocessing (stage 1)
- more splits -> finer resolution and more model evals
- grad_nudge determines the forward step, as a percentage of each region's width, used when sampling gradients at the splits
- tune_lr determines how aggressively each hyperparameter is tuned during stage 3
- it's essentially a learning rate applied to the hyperparameter value instead of its weight
Benchmarked against Optuna (TPE sampler) across 20 randomized seeds on a binary classification task with synthetic data, using XGBoost as the base model. Evaluated using F1 (macro). Runner configurations are in the "benchmarking" folder; 6 hyperparameters were tuned in combination, with n_splits set to 5 for each.
| Seed | BlueTuna | Optuna | Delta (BT-Opt) |
|---|---|---|---|
| 1 | 0.9049 | 0.9175 | -0.0126 |
| 2 | 0.9242 | 0.9378 | -0.0136 |
| 3 | 0.9372 | 0.9395 | -0.0023 |
| 4 | 0.9590 | 0.9710 | -0.012 |
| 5 | 0.9340 | 0.9519 | -0.0179 |
| 6 | 0.8759 | 0.8845 | -0.0086 |
| 7 | 0.9100 | 0.9195 | -0.0095 |
| 8 | 0.9091 | 0.9065 | 0.0026 |
| 9 | 0.9302 | 0.9340 | -0.0038 |
| 10 | 0.9293 | 0.9341 | -0.0048 |
| 11 | 0.9173 | 0.9155 | 0.0018 |
| 12 | 0.8947 | 0.9021 | -0.0074 |
| 13 | 0.8579 | 0.8782 | -0.0203 |
| 14 | 0.9268 | 0.9246 | 0.0022 |
| 15 | 0.9282 | 0.9262 | 0.002 |
| 16 | 0.8990 | 0.9173 | -0.0183 |
| 17 | 0.9395 | 0.9466 | -0.0071 |
| 18 | 0.8800 | 0.8750 | 0.005 |
| 19 | 0.9266 | 0.9394 | -0.0128 |
| 20 | 0.9540 | 0.9685 | -0.0145 |
Table 1: Raw Max F1 Scores (4 sf) obtained from BlueTuna and Optuna across 20 random-seeded runs
BlueTuna beat Optuna on 5/20 seeds (8, 11, 14, 15, 18) and finished within 0.005 of Optuna on 3 additional runs, making it competitive on 8/20, or 40%, of benchmark trials (Table 1). The average delta was -0.007595, reflecting an asymmetric variance profile (Table 1). BlueTuna's upper end is competitive with Optuna; its lower end, however, diverges sharply due to sensitivity to preprocessing quality.
Fig 1: Banded Convergence Curve plot showing the max F1 value obtained per evaluation between BlueTuna and Optuna's full runs.
Fig 2: Banded Convergence Curve plot showing the max F1 value obtained per evaluation between BlueTuna after preprocessing (Stage 1) and Optuna's full run.
Fig 3: Boxplot representing max F1 scores obtained between BlueTuna and Optuna across 20 random-seeded runs
Fig 4: Boxplot representing walltime between BlueTuna and Optuna across 20 random-seeded runs
Stages 1 and 2 (preprocessing + exploration) in BlueTuna comprised 318 trials on their own (Fig 1), with 168 going to stage 1 and 150 to stage 2. As expected, this exposed BlueTuna's main limiting attribute: walltime explosion as preprocessing becomes more thorough (Fig 1 and Fig 4).
However, when preprocessing was completed, run convergence showed BlueTuna catching up as shown by the greater degree of banded region overlap, demonstrating a competitive performance ceiling on favourable seeds (Fig 2).
The notable acceleration in optima-breaking rate (faster climbing) starting at evaluation 150 in BlueTuna's curve in Figure 2 (and at evaluation 318 in Fig 1) corresponds to the transition from stage 2 (training on the LHS-sampled dataset) to stage 3 (fixed-weight gradient descent tuning), where the learned weights guide hyperparameter value updates. The improved rate of performance gains at this transition suggests the perceptron's learned interaction weights are informative: the exploitation phase is making consistent, structurally guided moves that produce measurable gains.
In terms of raw results, BlueTuna was outperformed by Optuna with an average delta of -0.007595, despite being competitive on 8/20 seeds and beating it on 5/20. Surprisingly, the boxplot shows BlueTuna's median performance is slightly higher than Optuna's, while Optuna's box is tighter and higher overall (Fig 3). It is also noteworthy that on bad seeds such as 5, 13, and 16, BlueTuna's lower performance dragged the average down significantly (Table 1 and Fig 3). Together, this implies that although BlueTuna is capable of finding configurations competitive with, or slightly better than, Optuna's (shown by the median), it carries much more severe negative variance, as demonstrated by the aforementioned bad seeds.
Put more concisely, BlueTuna's main performance limitation is consistency, not performance ceiling (walltime aside).
Overall, this was expected as Optuna's TPE is exploitation-heavy and reliable, consistently exploiting good regions once found. It appears that BlueTuna behaves more as a diverse explorer, occasionally locating optima otherwise overlooked by traditional sampling methods. Future scope would involve employing more exploitative methods to increase performance and decrease walltime (as much as I can).
The results suggest that the hyperparameter landscape filtering and characterization that BlueTuna employs allows it to identify high-performing parameter value regions, potentially setting the exploitation phase up for competitive performance against current top-performing TPE samplers.
For each hyperparameter to be tuned, its search space is essentially "split" evenly into regions by the number of n_splits chosen (Fig 5).
- e.g. param1: (2, 10, 3) -> 3 "splits" / lines placed along the range of 2-10 such that each area is equal
- the lines would end up falling on 4, 6, 8 -> therefore the regions created by these splits are: (2, 4), (4, 6), (6, 8), (8, 10)
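The splitting step above can be sketched as follows (a hypothetical helper for illustration, not BlueTuna's actual code):

```python
def split_regions(lo, hi, n_splits):
    """Place n_splits evenly spaced interior lines along [lo, hi],
    yielding n_splits + 1 equal-width regions."""
    width = (hi - lo) / (n_splits + 1)
    lines = [lo + width * i for i in range(1, n_splits + 1)]
    points = [lo] + lines + [hi]
    regions = list(zip(points[:-1], points[1:]))
    return lines, regions

lines, regions = split_regions(2, 10, 3)
# lines   -> [4.0, 6.0, 8.0]
# regions -> [(2, 4.0), (4.0, 6.0), (6.0, 8.0), (8.0, 10)]
```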
At each region boundary, the base model is evaluated at that point and at a small forward step of grad_nudge% of the region width, producing a local pseudo-gradient. The result is a list of points (region starting points) and their pseudo-gradients.
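A minimal sketch of that pseudo-gradient, assuming `evaluate` is a stand-in for fitting and scoring the base model at a given parameter value:

```python
def pseudo_gradient(evaluate, x, region_width, grad_nudge=0.5):
    """Finite-difference slope: step forward by grad_nudge (a fraction of
    the region width) and measure the change in the test metric."""
    step = grad_nudge * region_width
    return (evaluate(x + step) - evaluate(x)) / step
```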
Each region is then scored by two components:
Raw output: the performance area the region already encompasses, computed geometrically as a trapezoid:
square_area = length × min(Y_left, Y_right)
triangle_area = length × |Y_right - Y_left| × 0.5
raw_output = square_area + triangle_area
Projection score: what the gradients we obtained imply optimal performance is within this zone:
projection_score = grad_left + (-grad_right)
The intuition behind this:
- +grad → -grad: a positive gradient implies greater values ahead of it, a negative gradient implies greater values behind it. Both suggest an optimum lies between them.
- -grad → -grad or +grad → +grad: one boundary implies an interior optimum → the score reflects that gradient's magnitude.
- -grad → +grad: neither implies an interior optimum → the score is negative, and the region will almost certainly be filtered out.
The "projection_output" used for region filtering is raw_output × projection_score. Regions scoring below average are filtered out, concentrating subsequent sampling in zones that are both high-performing and likely to contain a high-performing optimum.
A visualization of this stage is shown below; it is a conceptual illustration rather than a direct plot of the example above:
Fig 5: Stage 1 Preprocessing Concept Visualization showing an individual hyperparameter's search space being split by n_splits=2 as shown by the dotted lines. The dashed lines indicate the gradients of all region points which are used to create performance projections
Traditional neural structures (and ML models in general) typically have thousands to millions of training samples to learn from, enough to comfortably move past randomly initialized weights. BlueTuna's learning stage generates hundreds at most, which is too few training samples for meaningful weight convergence. BlueTuna works around this by pre-initializing hyperparameter weights from the data gained during the earlier search region filtering.
During stage 1, each hyperparameter's sensitivity is determined from its average gradient, and its performance ceiling (max_Y) is also recorded.
A composite score, sensitivity * max_Y, is computed per hyperparameter, and each hyperparameter's weight is initialized as its relative share of the total of all scores.
```
total = sum(scores)
weight = score / total   # for each hyperparameter's score
```
The intuition behind the scoring was that combining both the sensitivity and maximal performance ceiling into one term "fills in the gaps" behind what each individual term lacks.
- sensitivity alone cannot tell you if the parameter actually improves the model significantly
- max_Y alone cannot tell you if altering it changes performance by a significant degree
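A short sketch of this pre-initialization, with made-up sensitivity and max_Y values purely for illustration:

```python
stage1_stats = {
    # param: (sensitivity = average gradient, performance ceiling = max_Y)
    "learning_rate": (0.40, 0.93),
    "max_depth":     (0.15, 0.91),
    "gamma":         (0.05, 0.88),
}

# Composite score per hyperparameter: sensitivity * max_Y.
scores = {p: sens * max_y for p, (sens, max_y) in stage1_stats.items()}
total = sum(scores.values())

# Each weight is that parameter's relative share of the total score.
init_weights = {p: s / total for p, s in scores.items()}
```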
These weights then become the starting point for the perceptron rather than being randomly initialized as in traditional neural structures. This provides a stable, meaningful, and informative starting point for the perceptron to learn from compared to random initialization, and is also handy for reproducibility.
Potential limitation: I should note that feature interaction term weights are initialized randomly during stage 1, since the only data at this stage concerns parameters tuned in isolation, not in combination. Future improvements would involve finding ways to pre-initialize feature interaction weights.
Because the perceptron is generally trained on a small dataset with hand-crafted initialization, learning rate has a large effect on the weights that are to be fixed for stage 3. Choosing a learning rate that is too high would cause weight overshoot whereas one that's too low would result in the wrong weights being finalized.
To address this, _autoconverge runs 5 independent training runs across a wide range of learning rates, picking the one resulting in minimal variance in learned weights. If a learning rate produces similar relative weights and rankings across independent runs, then that is more likely to reflect genuine signal from training data rather than noise from the optimization path. The lower the variance, the more signal is actually being captured by the model and thus the more likely it is to be used in the online tuning stage.
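Conceptually, the selection criterion of `_autoconverge` can be sketched like this (`train_perceptron` is a hypothetical stand-in that returns a learned weight vector for a given learning rate and seed):

```python
import statistics

def autoconverge(train_perceptron, candidate_lrs, n_runs=5):
    """Pick the learning rate whose independent runs produce the most
    stable (lowest-variance) learned weights."""
    best_lr, best_var = None, float("inf")
    for lr in candidate_lrs:
        runs = [train_perceptron(lr, seed) for seed in range(n_runs)]
        # Variance of each weight across runs, averaged over all weights.
        var = statistics.mean(
            statistics.pvariance(w[i] for w in runs) for i in range(len(runs[0]))
        )
        if var < best_var:
            best_lr, best_var = lr, var
    return best_lr
```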
During online tuning (stage 3), the weights become fixed and gradient updates are applied to the hyperparameter values themselves. However, a base parameter's own weight alone cannot capture the full influence of the hyperparameter landscape when pairwise interaction terms are present. These interaction terms need to be decomposed back into their constituent parameters via the chain rule so that every update reflects the full interaction structure.
Feature interaction term calculations: a*b and a/b
- a*b: gradient with respect to 'a' is 'b', and vice versa
- a/b -> log(a) - log(b): gradient with respect to 'a' is 1/a, with respect to 'b' is -1/b (the log form avoids numerical underflow)
Each base parameter accumulates gradients from its own weight and every partial derivative of every feature interaction term it's involved in.
This means that a parameter appearing in multiple high-weighted interaction terms can receive a much larger accumulated gradient (and thus a substantial update) even if its own weight is lower, which is intended since interaction importance should influence how aggressively a base param gets tuned.
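A hedged sketch of this accumulation using the two interaction forms above; in this simplified version the fixed learned weights stand in for the upstream gradient contributions:

```python
def accumulate_gradients(values, base_weights, interactions):
    """values: {param: current value}; base_weights: {param: fixed weight};
    interactions: list of (param_a, param_b, kind, weight), where kind is
    "mul" for a*b or "logratio" for log(a) - log(b)."""
    # Start from each parameter's own-weight contribution.
    grads = dict(base_weights)
    for a, b, kind, w in interactions:
        if kind == "mul":       # d(a*b)/da = b, d(a*b)/db = a
            grads[a] += w * values[b]
            grads[b] += w * values[a]
        else:                   # d(log a - log b)/da = 1/a, d/db = -1/b
            grads[a] += w / values[a]
            grads[b] -= w / values[b]
    return grads
```

A parameter appearing in several heavily weighted interaction terms therefore accumulates a large total gradient even when its own weight is small, matching the intended behaviour described above.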




