By Krv Labs.
Welcome to Thema, our Topological Hyperparameter Evaluation and Mapping Algorithm!
Thema systematically explores hyperparameter spaces for unsupervised learning through topological data analysis. Instead of tuning preprocessing and embedding parameters by hand, Thema generates candidate models across a parameter grid and uses curvature-based graph distances to identify diverse, high-quality representatives.
Thema analyzes the distribution of representations that emerge from different preprocessing and hyperparameter choices, helping you identify the most salient patterns and features in your data for unsupervised tasks.
Thema operates through three distinct modules:
The foundational system that transforms raw data into topological representations:
- Planet (Preprocessing): Generates multiple clean data versions with different imputation, scaling, and encoding strategies
- Oort (Embeddings): Creates low-dimensional projections across parameter grids (t-SNE, PCA)
- Galaxy (Graph Construction): Builds Mapper graphs, computes topological distances, and selects representatives
Specialized tools for extended analysis capabilities:
- Realtor: Real estate and geographic data analysis tools
- Utils: Utility functions for specialized data processing workflows
Install Thema using pip:
pip install thema
Verify the installation:
pip show thema
Get started with Thema in just a few lines of code! See params.yaml.sample as a template for defining your own representation grid search.
import thema
from thema import Thema
# Enable logging to see progress
thema.enable_logging()
# Initialize Thema with your configuration
my_thema = Thema(YAML_PATH='path/to/custom.yaml')
# Run the complete pipeline
my_thema.genesis()
# Access the selected representative model files
print(my_thema.selected_model_files)
That's it! Thema will systematically process your data through preprocessing, embedding, and graph construction stages, automatically selecting the most representative models.
Clean, encode, and impute your raw data with multiple strategies:
from thema.multiverse import Planet
# Initialize Planet with your configuration
planet = Planet(YAML_PATH='path/to/params.yaml')
# Generate multiple cleaned datasets
planet.fit()
Planet creates various versions of your cleaned data with different:
- Scaling methods (standard, minmax, robust)
- Encoding strategies (one_hot, label, ordinal)
- Imputation methods (mean, median, mode, sampleNormal)
- Random seeds for reproducible sampling
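To get a feel for how quickly these choices multiply, the sketch below enumerates every combination of the options listed above. It is only an illustration: the dictionary keys, the `preprocessing_grid` helper, and the assumption that every option is crossed with every other are hypothetical and do not describe Thema's configuration schema or Planet's internals.

```python
from itertools import product

# Option names come from the list above. The grid layout is an illustrative
# assumption, not Thema's actual configuration schema.
scalers = ["standard", "minmax", "robust"]
encodings = ["one_hot", "label", "ordinal"]
imputers = ["mean", "median", "mode", "sampleNormal"]
seeds = [42]


def preprocessing_grid(scalers, encodings, imputers, seeds):
    """Return one dict per combination of preprocessing choices."""
    return [
        {"scaler": s, "encoding": e, "imputation": i, "seed": seed}
        for s, e, i, seed in product(scalers, encodings, imputers, seeds)
    ]


grid = preprocessing_grid(scalers, encodings, imputers, seeds)
print(len(grid))  # 3 * 3 * 4 * 1 = 36 combinations
print(grid[0])
```

Even a small grid like this yields dozens of cleaned datasets, which is why selecting a handful of representatives later in the pipeline matters.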
Generate low-dimensional projections from your cleaned data:
from thema.multiverse import Oort
# Create embeddings across parameter grids
oort = Oort(YAML_PATH='path/to/params.yaml')
oort.fit()
Oort produces embeddings using:
- t-SNE: With various perplexity values and dimensions
- PCA: With different dimensionality settings
- Multiple random seeds for robustness
Build Mapper graphs and select representatives:
from thema.multiverse import Galaxy
# Generate graph models across hyperparameter space
galaxy = Galaxy(YAML_PATH='path/to/params.yaml')
galaxy.fit()
# Cluster and select representative models
representatives = galaxy.collapse()
Galaxy creates and analyzes:
- Mapper graphs: Using various cover resolutions and overlap parameters
- Topological distances: Computing curvature-based similarity metrics
- Representative selection: Choosing diverse, high-quality models using clustering
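As a rough illustration of representative selection, the sketch below picks one model per cluster from a precomputed distance matrix by choosing the medoid (the member with the smallest total distance to the rest of its cluster). The medoid criterion, the `select_representatives` helper, and the toy data are assumptions made for this example; Thema's own selection rule may differ. The `star` and `cluster_size` keys simply mirror the fields read from `galaxy.selection` elsewhere in this README.

```python
from collections import defaultdict


def select_representatives(distances, labels, names):
    """Pick the medoid of each cluster from a symmetric distance matrix.

    distances: square list of lists; distances[i][j] is the distance
               between models i and j
    labels:    cluster label for each model
    names:     identifier (for example a file name) for each model
    """
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)

    selection = {}
    for label, indices in members.items():
        # The medoid minimizes the summed distance to its cluster mates.
        # On ties, min() keeps the first member encountered.
        medoid = min(indices, key=lambda i: sum(distances[i][j] for j in indices))
        selection[label] = {"star": names[medoid], "cluster_size": len(indices)}
    return selection


# Toy data: models 0-2 are close to each other, as are models 3-4.
distances = [
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.1, 0.9, 0.9],
    [0.2, 0.1, 0.0, 0.8, 0.9],
    [0.9, 0.9, 0.8, 0.0, 0.3],
    [0.8, 0.9, 0.9, 0.3, 0.0],
]
labels = [0, 0, 0, 1, 1]
names = ["star_a", "star_b", "star_c", "star_d", "star_e"]

selection = select_representatives(distances, labels, names)
for cluster_id, info in selection.items():
    print(f"Cluster {cluster_id}: {info['star']} ({info['cluster_size']} models)")
```

A medoid is a convenient choice here because it is always an actual model rather than an average, so the representative can be loaded and inspected directly.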
Generate a 2D embedding space of your models for analysis:
# Get 2D coordinates of all models in the galaxy
coordinates = galaxy.get_galaxy_coordinates()
# Access the selected representatives
for cluster_id, info in galaxy.selection.items():
    print(f"Cluster {cluster_id}: {info['star']} ({info['cluster_size']} models)")
- Systematic Exploration: Automatically explores preprocessing and embedding parameter combinations
- Representative Selection: Uses topological distance metrics to identify diverse, high-quality models
- Robust Analysis: Generates multiple models per configuration for statistical reliability
- Flexible Configuration: YAML-based configuration for easy parameter management
- Parallel Processing: Efficient multiprocessing for large parameter grids
- Topological Insights: Leverage graph topology and curvature for model comparison
Thema organizes outputs hierarchically:
{outDir}/{runName}/
├── clean/                  # Preprocessed datasets (Moon files)
│   ├── moon_42_0.pkl
│   ├── moon_42_1.pkl
│   └── ...
├── projections/            # Low-dimensional embeddings (Comet files)
│   ├── tsne_perplexity30_dims2_seed42_moon_42_0.pkl
│   ├── pca_dims2_seed42_moon_42_0.pkl
│   └── ...
└── models/                 # Mapper graphs (Star files)
    ├── star_tsne_perplexity30_nCubes10_overlap0.6.pkl
    ├── star_pca_dims2_nCubes10_overlap0.6.pkl
    └── ...
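If you want to check what a run produced, a small standard-library helper can walk this layout. The sketch below assumes only the three directory names and the `.pkl` extension shown in the tree; `list_outputs` and the example path are hypothetical and not part of Thema.

```python
from pathlib import Path


def list_outputs(run_dir):
    """Map each stage directory to the sorted .pkl file names it contains."""
    run_dir = Path(run_dir)
    stages = ("clean", "projections", "models")
    return {
        stage: sorted(p.name for p in (run_dir / stage).glob("*.pkl"))
        for stage in stages
    }


# Hypothetical path; substitute your own {outDir}/{runName}.
# A missing stage directory simply yields an empty list.
for stage, files in list_outputs("path/to/outDir/runName").items():
    print(f"{stage}: {len(files)} files")
```

Comparing the counts per stage is a quick sanity check that each step of the pipeline ran over the full grid.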
Good Use Cases:
- Exploring preprocessing choices for unsupervised learning
- Comparing embedding methods systematically
- Finding robust data representations across hyperparameter grids
- Identifying diverse graph topologies in your data
- Validating clustering stability across multiple configurations
Not Ideal For:
- Supervised learning (Thema focuses on unsupervised tasks)
- Single fixed preprocessing pipeline
- Real-time inference (Thema generates models offline)
For comprehensive guides and tutorials, visit our documentation.
Transform the way you explore and interpret your data with Thema - where the topology of your analysis reveals the hidden stories in your data!