THEMA 🔮


By Krv Labs.


Welcome to Thema, our Topological Hyperparameter Evaluation and Mapping Algorithm! 🌟


Thema systematically explores hyperparameter spaces for unsupervised learning through topological data analysis. Instead of manually tuning preprocessing and embedding parameters, Thema generates candidate models across a parameter grid and uses curvature-based graph distances to identify diverse, high-quality representatives.

By examining the distribution of representations that emerge from different preprocessing and hyperparameter choices, Thema helps you identify the most salient patterns and features in your data. 🧠🔍

Architecture

Thema is organized into the following modules:

Multiverse - Core Data Processing Pipeline

The foundational system that transforms raw data into topological representations:

  • Planet (Preprocessing): Generates multiple clean data versions with different imputation, scaling, and encoding strategies
  • Oort (Embeddings): Creates low-dimensional projections across parameter grids (t-SNE, PCA)
  • Galaxy (Graph Construction): Builds Mapper graphs, computes topological distances, and selects representatives

🚀 Expansion - Advanced Analytics Extensions

Specialized tools for extended analysis capabilities:

  • Realtor: Real estate and geographic data analysis tools
  • Utils: Utility functions for specialized data processing workflows

Installation

Install Thema using pip:

pip install thema

Verify the installation:

pip show thema
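
If you prefer to check from Python, for example at the top of a notebook, the standard library can report the installed version. This is a minimal sketch that assumes only that the distribution is named `thema`, as in the `pip install` command above; the helper name `installed_version` is ours.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("thema")
    print(f"thema version: {found}" if found else "thema is not installed")
```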

Quick Start

Get started with Thema in just a few lines of code! See params.yaml.sample as a template for defining your own representation grid search.

import thema
from thema import Thema

# Enable logging to see progress
thema.enable_logging()

# Initialize Thema with your configuration
my_thema = Thema(YAML_PATH='path/to/custom.yaml')

# Run the complete pipeline
my_thema.genesis()

# Access the selected representative model files
print(my_thema.selected_model_files)

That's it! Thema will systematically process your data through preprocessing, embedding, and graph construction stages, automatically selecting the most representative models.
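
For scripted runs it can help to fail early when the configuration file is missing. The wrapper below is a small sketch built only from the calls shown above (`Thema(YAML_PATH=...)`, `genesis()`, `selected_model_files`); the function name `run_thema` is hypothetical and not part of the package.

```python
from pathlib import Path


def run_thema(yaml_path):
    """Run the full pipeline after checking that the config file exists."""
    config = Path(yaml_path)
    if not config.is_file():
        raise FileNotFoundError(f"Thema config not found: {config}")

    # Imported lazily so the path check above runs before the package loads.
    from thema import Thema

    model = Thema(YAML_PATH=str(config))
    model.genesis()
    return model.selected_model_files
```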


Pipeline Components

Step 1: Preprocessing with Planet

Clean, encode, and impute your raw data with multiple strategies:

from thema.multiverse import Planet

# Initialize Planet with your configuration
planet = Planet(YAML_PATH='path/to/params.yaml')

# Generate multiple cleaned datasets
planet.fit()

Planet creates various versions of your cleaned data with different:

  • Scaling methods (standard, minmax, robust)
  • Encoding strategies (one_hot, label, ordinal)
  • Imputation methods (mean, median, mode, sampleNormal)
  • Random seeds for reproducible sampling
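
To see why random seeds matter here, consider what the imputation methods do to a single column. The sketch below is an illustration in plain Python, not Planet's implementation: `impute` is a hypothetical helper, and we assume `sampleNormal` means drawing each missing value from a normal distribution fitted to the observed values.

```python
import random
import statistics
from typing import List, Optional


def impute(values: List[Optional[float]], method: str, seed: int = 42) -> List[float]:
    """Fill None entries using one of the strategies named in the list above."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")

    if method == "sampleNormal":
        # Assumption: draw each missing value from N(mean, std) of the observed data.
        rng = random.Random(seed)
        mu = statistics.mean(observed)
        sigma = statistics.stdev(observed) if len(observed) > 1 else 0.0
        return [rng.gauss(mu, sigma) if v is None else v for v in values]

    constant_fills = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    if method not in constant_fills:
        raise ValueError(f"unknown imputation method: {method}")
    fill = constant_fills[method](observed)
    return [fill if v is None else v for v in values]
```

The constant strategies always produce the same column, while `sampleNormal` produces a different column for each seed, so fixing the seed is what makes a sampled dataset reproducible.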

Step 2: Embedding with Oort ☄️

Generate low-dimensional projections from your cleaned data:

from thema.multiverse import Oort

# Create embeddings across parameter grids
oort = Oort(YAML_PATH='path/to/params.yaml')
oort.fit()

Oort produces embeddings using:

  • t-SNE: With various perplexity values and dimensions
  • PCA: With different dimensionality settings
  • Multiple random seeds for robustness
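
The number of embeddings grows multiplicatively with each parameter you vary. The sketch below expands a small grid with `itertools.product` to make that concrete; the key names (`projector`, `perplexity`, `dimensions`, `seed`) and values are hypothetical and do not reflect the actual schema of `params.yaml`.

```python
from itertools import product


def expand_grid(grid):
    """Return one dict per combination of the values in `grid`."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


# Hypothetical grids for illustration only.
tsne_grid = {"perplexity": [5, 30, 50], "dimensions": [2], "seed": [42, 43]}
pca_grid = {"dimensions": [2, 3], "seed": [42]}

configs = [{"projector": "tsne", **c} for c in expand_grid(tsne_grid)]
configs += [{"projector": "pca", **c} for c in expand_grid(pca_grid)]

# 3 * 1 * 2 t-SNE settings plus 2 * 1 PCA settings
print(len(configs))  # 8
```

Each of these settings is applied to every cleaned dataset from the previous step, so the total number of projections is this count multiplied by the number of cleaned datasets.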

Step 3: Graph Construction with Galaxy 🌌

Build Mapper graphs and select representatives:

from thema.multiverse import Galaxy

# Generate graph models across hyperparameter space
galaxy = Galaxy(YAML_PATH='path/to/params.yaml')
galaxy.fit()

# Cluster and select representative models
representatives = galaxy.collapse()

Galaxy creates and analyzes:

  • Mapper graphs: Using various cover resolutions and overlap parameters
  • Topological distances: Computing curvature-based similarity metrics
  • Representative selection: Choosing diverse, high-quality models using clustering
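
One common way to pick a representative from each cluster is to take its medoid, the member with the smallest total distance to the other members. The sketch below shows that idea on a precomputed distance matrix. It is a generic illustration, not a description of what `collapse()` does internally; the function name and the toy distances are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence


def select_medoids(
    distances: Sequence[Sequence[float]], labels: Sequence[int]
) -> Dict[int, int]:
    """Map each cluster label to the index of its medoid."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    medoids = {}
    for label, idxs in members.items():
        # The medoid minimizes the summed distance to all members of its cluster.
        medoids[label] = min(idxs, key=lambda i: sum(distances[i][j] for j in idxs))
    return medoids


# Toy pairwise distances between four graphs (symmetric, zero diagonal).
toy_distances = [
    [0, 1, 4, 9],
    [1, 0, 2, 9],
    [4, 2, 0, 9],
    [9, 9, 9, 0],
]
print(select_medoids(toy_distances, labels=[0, 0, 0, 1]))  # {0: 1, 1: 3}
```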

Coordinate Space Generation

Generate a 2D embedding space of your models for analysis:

# Get 2D coordinates of all models in the galaxy
coordinates = galaxy.get_galaxy_coordinates()

# Access the selected representatives
for cluster_id, info in galaxy.selection.items():
    print(f"Cluster {cluster_id}: {info['star']} ({info['cluster_size']} models)")
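
If you want the representatives ordered by how many models they stand for, the selection can be flattened into rows and sorted. This sketch assumes only the structure used in the loop above (a mapping from cluster id to a dict with `star` and `cluster_size` keys); the helper name and the example values are made up for illustration.

```python
def summarize_selection(selection):
    """Return (cluster_id, star, cluster_size) rows, largest cluster first."""
    rows = [
        (cluster_id, info["star"], info["cluster_size"])
        for cluster_id, info in selection.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Stand-in for galaxy.selection with invented values.
example_selection = {
    0: {"star": "models/star_a.pkl", "cluster_size": 3},
    1: {"star": "models/star_b.pkl", "cluster_size": 12},
}

for cluster_id, star, size in summarize_selection(example_selection):
    print(f"Cluster {cluster_id}: {star} ({size} models)")
```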

Key Features

✨ Systematic Exploration: Automatically explores preprocessing and embedding parameter combinations

🎯 Representative Selection: Uses topological distance metrics to identify diverse, high-quality models

📊 Robust Analysis: Generates multiple models per configuration for statistical reliability

🔧 Flexible Configuration: YAML-based configuration for easy parameter management

🚀 Parallel Processing: Efficient multiprocessing for large parameter grids

📈 Topological Insights: Leverage graph topology and curvature for model comparison


Output Structure

Thema organizes outputs hierarchically:

{outDir}/{runName}/
├── clean/                  # Preprocessed datasets (Moon files)
│   ├── moon_42_0.pkl
│   ├── moon_42_1.pkl
│   └── ...
├── projections/            # Low-dimensional embeddings (Comet files)
│   ├── tsne_perplexity30_dims2_seed42_moon_42_0.pkl
│   ├── pca_dims2_seed42_moon_42_0.pkl
│   └── ...
└── models/                 # Mapper graphs (Star files)
    ├── star_tsne_perplexity30_nCubes10_overlap0.6.pkl
    ├── star_pca_dims2_nCubes10_overlap0.6.pkl
    └── ...
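
To take stock of a finished run, you can list what each stage wrote. The sketch below relies only on the directory names shown in the tree above and on the `.pkl` extension; `list_outputs` is a hypothetical helper, and it does not open the files, since their contents are not described here.

```python
from pathlib import Path

STAGES = ("clean", "projections", "models")


def list_outputs(run_dir):
    """Return the sorted .pkl file names found in each stage directory."""
    run_path = Path(run_dir)
    outputs = {}
    for stage in STAGES:
        stage_dir = run_path / stage
        if stage_dir.is_dir():
            outputs[stage] = sorted(p.name for p in stage_dir.glob("*.pkl"))
        else:
            # A stage that has not run yet simply has no files.
            outputs[stage] = []
    return outputs
```

Calling `list_outputs("{outDir}/{runName}")` with your own paths substituted returns a dictionary keyed by stage, which makes it easy to confirm that each stage produced the number of files you expect.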

When to Use Thema

✅ Good Use Cases:

  • Exploring preprocessing choices for unsupervised learning
  • Comparing embedding methods systematically
  • Finding robust data representations across hyperparameter grids
  • Identifying diverse graph topologies in your data
  • Validating clustering stability across multiple configurations

❌ Not Ideal For:

  • Supervised learning (Thema focuses on unsupervised tasks)
  • Workflows with a single, fixed preprocessing pipeline
  • Real-time inference (Thema generates models offline)

Documentation

For comprehensive guides and tutorials, visit our documentation.

Transform the way you explore and interpret your data with Thema - where the topology of your analysis reveals the hidden stories in your data! 🌠✨
