
Commit 3e91550

merged updated main
2 parents 1f054c7 + df143a8 commit 3e91550

13 files changed: +771 / -680 lines

.env.example

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# API Keys for LLM Providers
+# Copy this file to .env and fill in your actual API keys
+# DO NOT commit your .env file with real keys!
+
+# OpenAI (for GPT models and o1/o3 reasoning models)
+OPENAI_API_KEY=sk-...
+
+# Anthropic (for Claude models)
+ANTHROPIC_API_KEY=sk-ant-api03-...
+
+# Google Gemini
+GEMINI_API_KEY=...
+
+# DeepSeek
+DEEPSEEK_API_KEY=sk-...
+
+# Together AI
+TOGETHER_API_KEY=...
+
+# Fireworks AI
+FIREWORKS_AI_API_KEY=...
+
+# Local Server Deployment (SGLang, vLLM, Tokasaurus)
+SGLANG_API_KEY=...

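For orientation, here is a minimal sketch of how these keys are typically loaded at runtime, assuming the `dotenv` package listed in `requirements.txt`; the repo's own scripts may load them differently:

```python
# Minimal sketch, assuming python-dotenv; not necessarily how the repo loads keys.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a local .env file into the environment
if os.environ.get("OPENAI_API_KEY") is None:
    raise RuntimeError("OPENAI_API_KEY not set; copy .env.example to .env and fill it in")
```
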
README.md

Lines changed: 19 additions & 11 deletions
@@ -1,16 +1,19 @@
 # KernelBench: Can LLMs Write Efficient GPU Kernels? [ICML '25]
-[arXiv](https://arxiv.org/html/2502.10517v1) | [blog post](https://scalingintelligence.stanford.edu/blogs/kernelbench/) | [HuggingFace Dataset](https://huggingface.co/datasets/ScalingIntelligence/KernelBench) |
+A benchmark for evaluating LLMs' ability to generate efficient GPU kernels
+
+[arXiv](https://arxiv.org/html/2502.10517v1) | [blog post](https://scalingintelligence.stanford.edu/blogs/kernelbench/) | [HuggingFace Dataset](https://huggingface.co/datasets/ScalingIntelligence/KernelBench)
+
+<img src="./assets/figures/KernelBenchMascot.png" width="200">

 ## Versions
-The huggingface dataset is updated to v0.1.
-- [v0.1](https://github.com/ScalingIntelligence/KernelBench/tree/v0.1) - Latest version (also main branch)
+The latest stable version is on the `main` branch. We continue to update and improve the repo.
+- [v0.1](https://github.com/ScalingIntelligence/KernelBench/tree/v0.1) - See [blog](https://scalingintelligence.stanford.edu/blogs/kernelbenchv01/)
 - [v0](https://github.com/ScalingIntelligence/KernelBench/tree/v0) - Original Release

-A benchmark for evaluating LLMs' ability to generate efficient GPU kernels

-<img src="./assets/figures/KernelBenchMascot.png" width="200">
+The Huggingface [dataset](https://huggingface.co/datasets/ScalingIntelligence/KernelBench) is updated to v0.1.

-<!-- See [blog post](https://scalingintelligence.stanford.edu/blogs/kernelbench/) and [arXiv paper](https://arxiv.org/html/2502.10517v1) for more details. -->
+This repo provides core functionality for KernelBench and an easy-to-use set of scripts for evaluation. It is not intended to provide complex agentic scaffolds that solve this task; we recommend cloning and modifying this repo for your experiment, or using it as a git submodule.

 ## 👋 Task Description
 We structure the problem as asking an LLM to transpile operators described in PyTorch into CUDA kernels, at whatever level of granularity it desires.
@@ -26,7 +29,7 @@ We construct KernelBench to have 4 Levels of categories:
 - **Level 4 🤗**: Level Hugging Face
 Optimize whole model architectures from HuggingFace

-We are actively extending KernelBench to other DSLs beyond `cuda` as well.
+We are actively extending KernelBench to other DSLs beyond `cuda` as well (see below).

 ## ⚖️ Evaluation
 #### Methodology
@@ -36,7 +39,7 @@ To evaluate model-generated kernels, we need to check if they:

 Check out `src/eval.py` for details on how we implement correctness checks and timing.

-We provide a convenient script `scripts/run_and_check.py` to evaluate one single sample source code against a reference source code, check correctness and compute speedup. You can use this to evaluate a model-generated kernel.
+We provide a convenient script `scripts/run_and_check.py` to evaluate a single sample's source code against a reference source code, check correctness, and compute speedup. You can use this to evaluate a kernel either locally or remotely by setting `eval_mode=local` or `eval_mode=modal`.

 #### Overall Benchmark Metric

@@ -80,7 +83,7 @@ pip install -r requirements.txt
 pip install -e .
 ```

-To call LLM API providers, set your `{INFERENCE_SERVER_PROVIDER}_API_KEY` API key.
+We use `litellm` for API calls. Please set your keys by creating a `.env` file following our `.env.example`.

 Running and profiling kernels require a GPU.
 If you don't have a GPU available locally, you can set up [Modal](https://modal.com/). Set up your modal token after creating an account by running `modal token new`. Then, use the `generate_and_eval_single_sample_modal.py` script.
@@ -98,7 +101,12 @@ python3 scripts/generate_and_eval_single_sample.py dataset_src="huggingface" lev
 # add .verbose_logging for more visibility
 ```

-We are also supporting other GPU programming languages beyond `cuda`. Simply specify `backend=triton`. For now we support (`cuda`, `triton`, `cute`).
+**What you might need to modify**
+* **`gpu_arch`** - Depending on your GPU, you might need to adjust the `gpu_arch` argument to reflect your hardware.
+* **`precision`** - You can specify the tensor precision with `precision=fp32`. Currently all of our reported results are `fp32`, but we have added support for `fp16` & `bf16`.
+* **`backend`** - We also support other GPU programming languages beyond `cuda`. Simply specify `backend=triton`. For now we support the DSLs `cuda`, `triton`, `cute`, and `tilelang`.
+
+Check the config fields for a comprehensive set of options.

 ### Run on all problems

@@ -122,7 +130,7 @@ If you are using a different hardware, you can generate the baseline time with `s
 We provide some reference baseline times for a variety of NVIDIA GPUs across generations in `results/timing`, but we recommend generating your own baseline times for more accurate results (cluster power and software versions all affect timing). See `results/timing/README.md` for more details.

 ### Multi-Turn Framework
-We have also releaed the test-time framework [Caesar](https://github.com/simonguozirui/caesar) that are used in the multi-turn / iterative refinement experiments in our paper. You can use or modify this framework for high-throughput test-time scaling (both sequential and parallel) targeting KernelBench problems.
+We have also released the test-time framework [Caesar](https://github.com/ScalingIntelligence/caesar), which is used in the multi-turn / iterative refinement experiments in our paper. You can use or modify this framework for high-throughput test-time scaling (both sequential and parallel) targeting KernelBench problems.

 ## 🛣️ Upcoming Roadmap
 Check out our [roadmap](https://github.com/ScalingIntelligence/KernelBench/issues/74) for what we plan to add as features. We welcome community contributions in these directions.

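As context for the `gpu_arch` knob mentioned in the README diff above, a helper like `set_gpu_arch` usually just pins the CUDA architectures that torch extensions compile for. The sketch below is an assumption for illustration; the repo's actual implementation in `src/utils.py` may differ:

```python
# Assumed sketch of a gpu_arch helper; the repo's set_gpu_arch may differ.
import os

def set_gpu_arch(arch_list: list[str]) -> None:
    # torch.utils.cpp_extension honors TORCH_CUDA_ARCH_LIST when building kernels,
    # so listing only your GPU's architecture (e.g. ["Ada"] or ["Hopper"]) keeps
    # compilation targeted at the hardware you actually have.
    os.environ["TORCH_CUDA_ARCH_LIST"] = ";".join(arch_list)
```
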
requirements.txt

Lines changed: 7 additions & 6 deletions
@@ -1,12 +1,15 @@
 # Frameworks
-torch==2.5.0
+# we use latest PyTorch stable release
+torch==2.9.0
+
 # we shall upgrade torch for blackwell when it is stable
 transformers
 datasets
 modal

 # DSLs
 nvidia-cutlass-dsl
+tilelang

 # helper
 tqdm
@@ -20,9 +23,7 @@ einops
 dotenv
 numpy

-# to deprecate with litellm
-google-generativeai
-together
-openai
-anthropic
+# use litellm for cloud providers and openai for local
+openai
+litellm[proxy]

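For context on the switch to `litellm`: a single provider-agnostic completion call looks roughly like the sketch below. The model string and prompt are placeholders, and the repo's actual inference helpers in `src/` may wrap this differently:

```python
# Rough sketch of a litellm call; model name and prompt are placeholders.
import litellm

response = litellm.completion(
    model="openai/gpt-4o",  # any provider/model string supported by litellm
    messages=[{"role": "user", "content": "Write a CUDA kernel for elementwise add."}],
    temperature=0.0,
    max_tokens=1024,
)
print(response.choices[0].message.content)  # API keys are read from the environment
```
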
results/timing/README.md

Lines changed: 3 additions & 1 deletion
@@ -6,7 +6,9 @@ This folder contains a set of baseline timing results for the KernelBench proble
 Since KernelBench measures the speedup between Runtime(reference architecture) and Runtime(LLM-generated architecture), it is important to measure the baseline reference module runtime.

 We have provided a set of baseline results for the KernelBench problems on a variety of hardware as well as various PyTorch configurations.
-All baseline are ran with PyTorch `2.5.0+cu124` and CUDA `12.4`.
+All (current) baselines are run with PyTorch `2.5.0+cu124` and CUDA `12.4`.
+
+Note: we will update these soon with PyTorch `2.9.0` and CUDA `12.8`.

 For timing, we measure wall clock time. We warm up 3 times and collect runtime statistics for 100 trials.

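As a rough illustration of the protocol stated above (3 warmups, 100 timed trials, wall clock time), a hedged sketch follows; the benchmark's actual timing code lives in `src/eval.py` and may differ in synchronization details and the statistics it reports:

```python
# Illustrative sketch of the stated protocol; not the repo's exact timing code.
import time
import torch

def time_callable(fn, warmup: int = 3, trials: int = 100) -> float:
    for _ in range(warmup):          # warm up to exclude compilation/caching effects
        fn()
    torch.cuda.synchronize()

    times_ms = []
    for _ in range(trials):
        torch.cuda.synchronize()
        start = time.perf_counter()  # wall clock
        fn()
        torch.cuda.synchronize()
        times_ms.append((time.perf_counter() - start) * 1e3)
    return sum(times_ms) / len(times_ms)  # mean runtime in milliseconds
```
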
scripts/eval_from_generations.py

Lines changed: 32 additions & 28 deletions
@@ -55,7 +55,7 @@
 app = modal.App("eval_from_generations_modal")
 gpu_arch_mapping = {"L40S": ["Ada"], "H100": ["Hopper"], "A100": ["Ampere"], "L4": ["Ada"], "T4": ["Turing"], "A10G": ["Ampere"]}

-cuda_version = "12.4.0" # should be no greater than host CUDA version
+cuda_version = "12.8.0" # should be no greater than host CUDA version
 flavor = "devel" # includes full CUDA toolkit
 operating_sys = "ubuntu22.04"
 tag = f"{cuda_version}-{flavor}-{operating_sys}"
@@ -67,23 +67,7 @@
 "g++-10",
 "clang"
 )
-.pip_install(
-"anthropic",
-"numpy",
-"openai",
-"packaging",
-"pydra_config",
-"torch==2.5.0",
-"tqdm",
-"datasets",
-"transformers",
-"google-generativeai",
-"together",
-"pytest",
-"ninja",
-"utils",
-"python-dotenv",
-)
+.pip_install_from_requirements(os.path.join(REPO_TOP_DIR, "requirements.txt"))
 .add_local_dir(
 KERNEL_BENCH_PATH,
 remote_path="/root/KernelBench"
@@ -145,6 +129,10 @@ def __init__(self):

 # Backend to use for kernel implementation (cuda or triton)
 self.backend = "cuda"
+
+# Precision for computation: "fp32", "fp16", "bf16"
+self.precision = "fp32"
+
 # Number of samples per problem to evaluate for pass@k analysis
 self.num_samples_per_problem = 1 # Default to 1 sample per problem

@@ -165,17 +153,18 @@ class WorkArgs:
 # Modal Evaluation Class
 # GPU must be specified here for all instances
 # Retries are configured at the class level to handle GPU attachment failures
-# @modal.concurrent: Each container handles exactly ONE evaluation at a time - prevents memory leaks
+# scaledown_window=5 kills idle containers after 5 seconds
+# Combined with 10s sleep between batches, this prevents container reuse and GPU corruption spread
 @app.cls(
-image=image,
+image=image,
 gpu="A10G",
+scaledown_window=5, # Kill idle containers after 5 seconds
 retries=modal.Retries(
 max_retries=3,
 backoff_coefficient=2.0,
 initial_delay=1.0,
 )
 )
-@modal.concurrent(max_inputs=1) # One input per container - prevents GPU memory leaks
 class ModalEvaluator:

 @modal.method()
@@ -188,11 +177,13 @@ def evaluate_single_sample_modal(
 num_perf_trials: int = 100,
 measure_performance: bool = True,
 verbose: bool = False,
+backend: str = "cuda",
+precision: str = "fp32",
 ):
 """
 Evaluate a single sample on Modal GPU with automatic retries for GPU attachment failures
 """
-from src.eval import eval_kernel_against_ref
+from src.eval import eval_kernel_against_ref, get_torch_dtype_from_string
 from src.utils import set_gpu_arch
 import torch
 import time
@@ -225,12 +216,14 @@ def evaluate_single_sample_modal(
 num_perf_trials=num_perf_trials,
 build_dir=None, # Modal doesn't need persistent build dir
 device=torch.device("cuda:0"), # Modal has one GPU per container
+backend=backend,
+precision=get_torch_dtype_from_string(precision),
 )
-
-# Force cleanup and exit to prevent container reuse and memory leaks
+
+# Cleanup GPU cache before returning
 torch.cuda.empty_cache()
-
-return result # Never reached, but needed for type checking
+
+return result


 def fetch_ref_arch_from_problem_id(
@@ -321,6 +314,7 @@ def evaluate_single_sample(
 build_dir=build_dir,
 device=device,
 backend=configs.backend,
+precision=eval.get_torch_dtype_from_string(configs.precision),
 )
 return eval_result
 except Exception as e:
@@ -477,7 +471,8 @@ def batch_eval_modal(
 evaluator_cls = ModalEvaluator.with_options(gpu=config.gpu) if config.gpu != "A10G" else ModalEvaluator

 # Spawn all tasks in parallel
-# Each spawn creates a NEW container instance with a GPU
+# Modal assigns these to available containers (may reuse warm containers from previous batches)
+# To prevent GPU corruption spread, we sleep between batches to ensure containers scale down
 futures = []
 for item in work_items:
 if item is None:
@@ -491,6 +486,8 @@
 num_perf_trials=config.num_perf_trials,
 measure_performance=config.measure_performance,
 verbose=config.verbose,
+backend=config.backend,
+precision=config.precision,
 )
 futures.append(future)

@@ -531,7 +528,14 @@

 print("-" * 128)
 print(f"[Modal Batch] Evaluation took {end_time - start_time:.2f} seconds")
-
+
+# Wait for containers to scale down before next batch
+# This prevents container reuse and GPU corruption from spreading between batches
+if len(total_work) > 0: # Only sleep if there are more batches
+scaledown_wait = 10 # Wait 10 seconds (2x the scaledown_window) to ensure containers are killed
+print(f"[Modal] Waiting {scaledown_wait}s for containers to scale down before next batch...")
+time.sleep(scaledown_wait)
+
 pbar.update(len(curr_work_batch))

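The hunks above import `get_torch_dtype_from_string` from `src.eval` without showing its body. A plausible minimal version, assumed here only to make the precision plumbing concrete (the repo's helper may handle more cases):

```python
# Assumed sketch of the helper referenced above; not the repo's actual implementation.
import torch

def get_torch_dtype_from_string(precision: str) -> torch.dtype:
    # Map the config-level precision strings to torch dtypes.
    mapping = {"fp32": torch.float32, "fp16": torch.float16, "bf16": torch.bfloat16}
    if precision not in mapping:
        raise ValueError(f"Unsupported precision: {precision}")
    return mapping[precision]
```
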
scripts/generate_and_eval_single_sample.py

Lines changed: 35 additions & 8 deletions
@@ -18,7 +18,7 @@
 read_file,
 set_gpu_arch,
 )
-
+from src.eval import get_torch_dtype_from_string
 """
 Generate and evaluate a single sample
 Easiest way to get started, to test a single problem for experimentation or debugging
@@ -48,12 +48,18 @@ def __init__(self):
 # Construct this from mapping from architecture name to torch cuda arch list in the future
 # you can either specify SM version or just use the name
 self.gpu_arch = ["Ada"]
+self.precision = "fp32" # options ["fp32", "fp16", "bf16"]

 # Inference config
-self.server_type = "deepseek"
-self.model_name = "deepseek-coder"
-self.max_tokens = 4096
-self.temperature = 0.0
+self.server_type = None
+self.model_name = None
+self.max_tokens = None
+self.temperature = None
+
+# Reasoning model specific parameters
+self.is_reasoning_model = False # set to True for o1, o3, Gemini 2.5 thinking, etc.
+self.reasoning_effort = None # for o1/o3: "low", "medium", "high"
+self.budget_tokens = 0 # for Claude extended thinking mode

 # Logging
 self.logdir = os.path.join(REPO_TOP_DIR, "results/eval_logs")
@@ -81,6 +87,21 @@ def main(config: EvalConfig):
 """
 Keep it simple: Generate and evaluate a single sample
 """
+from src.utils import SERVER_PRESETS
+
+if config.server_type and config.server_type in SERVER_PRESETS:
+preset = SERVER_PRESETS[config.server_type]
+if config.model_name is None or config.model_name == "None":
+config.model_name = preset.get("model_name", "None")
+if config.max_tokens is None or config.max_tokens == "None":
+config.max_tokens = preset.get("max_tokens", "None")
+if config.temperature is None or config.temperature == "None":
+config.temperature = preset.get("temperature", "None")
+
+# Convert string boolean to actual boolean for reasoning model flag
+if isinstance(config.is_reasoning_model, str):
+config.is_reasoning_model = config.is_reasoning_model.lower() in ['true', '1', 'yes']
+
 print(f"Starting Eval with config: {config}")

 # Configurations
@@ -143,14 +164,19 @@ def main(config: EvalConfig):
 max_tokens=config.max_tokens,
 verbose=config.verbose,
 time_generation=True,
+is_reasoning_model=config.is_reasoning_model,
+reasoning_effort=config.reasoning_effort,
+budget_tokens=config.budget_tokens,
 )

 # Use appropriate prompt constructor based on backend
-if config.backend in ["cuda", "triton", "cute"]:
-custom_prompt = get_prompt_for_language(ref_arch_src, language=config.backend, option="few_shot")
+if config.backend == "cuda":
+custom_prompt = prompt_generate_custom_cuda_from_prompt_template(ref_arch_src)
+elif config.backend in ["triton", "tilelang", "cute"]:
+custom_prompt = get_prompt_for_backend(ref_arch_src, config.backend)
 else:
 raise ValueError(
-f"Unsupported backend: {config.backend}. Must be 'cuda', 'triton', or 'cute'."
+f"Unsupported backend: {config.backend}. Must be 'cuda', 'triton', 'tilelang', or 'cute'."
 )

 if config.log_prompt:
@@ -194,6 +220,7 @@ def main(config: EvalConfig):
 num_correct_trials=5,
 num_perf_trials=100,
 backend=config.backend,
+precision=get_torch_dtype_from_string(config.precision),
 )

 print(

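The preset-fallback logic above pulls defaults from `SERVER_PRESETS` in `src.utils`, which this diff does not show. Purely as a hypothetical illustration of the dict shape that code expects (keys `model_name`, `max_tokens`, `temperature`; the repo's real presets will differ):

```python
# Hypothetical illustration only; the real SERVER_PRESETS lives in src/utils.py.
SERVER_PRESETS = {
    "anthropic": {  # example server_type key
        "model_name": "claude-sonnet-example",  # placeholder model string
        "max_tokens": 4096,
        "temperature": 0.0,
    },
}
# With server_type="anthropic" and model_name/max_tokens/temperature left unset,
# the fallback fills each field via preset.get(...), as in the hunk above.
```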