LILO: Bayesian Optimization with Natural Language Feedback

Official implementation of LILO: Bayesian Optimization with Natural Language Feedback (ICML 2026).

LILO performs Bayesian optimization guided by free-form natural-language feedback from a (possibly LLM-simulated) decision maker. The NL feedback is converted into an optimizable signal via LLM-automated pairwise comparisons followed by GP modelling, which feeds a standard BO acquisition loop.

[openreview] [arXiv] [Colab notebook]

What this repo contains

The main entry point:

bo_loop.py — main BO loop on synthetic environments (DTLZ2, Vehicle Safety, CarCab, Thermo, NAS-Bench-201). Hydra-configured.

Supporting modules:

environments.py — synthetic environments and their utility functions.
gp_models.py — SimpleGPProxyModel, CategoricalGPProxyModel.
utility_approximator.py — LLM-driven scalar / pairwise utility approximation.
human_feedback_simulator.py — simulated NL human responses via an LLM.
prompts.py — prompt templates used by the LLM-side machinery.
llm_utils.py — abstract LLMClient interface that you must implement (see below).
utils.py — small helpers (sigmoid, JSON extraction).
config/ — Hydra configs (one per method).

The entry point and supporting module for the summarization task:

summary_bo_loop.py — BO over LLM-summarization hyperparameters on CNN/DailyMail.
summay_utils.py — ArticleSummarizer, SummaryFeedbackGenerator, LLMPairwiseJudge for the summarization task.

💿 Installation

The steps below set up the environment.

1️⃣ Environment

First, clone the repository and enter the directory:

git clone https://github.com/facebookresearch/lilo
cd lilo

Then, set up a conda environment as follows:

conda create --name lilo python=3.12
conda activate lilo

Finally, install the required dependencies:

pip install -r requirements.txt
# or, for an editable install:
pip install -e .

2️⃣ LLM client setup (required)

The optimizer relies on an LLM to simulate human feedback, label pairwise preferences, generate prior knowledge, etc. The repo ships with lilo.llm_utils.PlaceholderLLMClient, which raises NotImplementedError on use. Subclass LLMClient against your provider and wire it into bo_loop._make_llm_client (and summary_bo_loop._make_llm_client if you run the summarization experiments).

The only method you must implement is:

async def get_batch_llm_responses(
    self, prompts, num_responses=1, kwargs=None,
    max_retries=8, timeout_per_call=None,
) -> list[list[str]]:
    ...

For each input prompt, return a list of num_responses string completions.
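For offline smoke tests it can help to check the expected return shape before wiring in a real provider. The sketch below is a hypothetical stand-in (CannedLLMClient is not part of the repo) that follows the signature above and returns canned strings. It is kept standalone so it runs without the package installed; in the repo it would presumably subclass LLMClient.

```python
import asyncio


class CannedLLMClient:
    """Offline stand-in mimicking the get_batch_llm_responses contract.

    Hypothetical helper, not part of the repo.
    """

    def __init__(self, reply="A"):
        self.reply = reply

    async def get_batch_llm_responses(
        self, prompts, num_responses=1, kwargs=None,
        max_retries=8, timeout_per_call=None,
    ):
        # One inner list per prompt, each holding num_responses strings.
        return [[self.reply] * num_responses for _ in prompts]


out = asyncio.run(
    CannedLLMClient().get_batch_llm_responses(["p1", "p2"], num_responses=3)
)
print(len(out), len(out[0]))
```

The outer list is ordered like the input prompts, and each inner list has exactly num_responses entries.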

Example: OpenAI

import asyncio
from openai import AsyncOpenAI
from lilo.llm_utils import LLMClient


class OpenAIClient(LLMClient):
    def __init__(self, model="gpt-4o-mini", **kw):
        super().__init__(model=model)
        # AsyncOpenAI() reads OPENAI_API_KEY from the environment.
        self.client = AsyncOpenAI()

    async def get_batch_llm_responses(self, prompts, num_responses=1, kwargs=None, **_):
        # **_ absorbs max_retries and timeout_per_call, which this minimal example ignores.
        async def one(p):
            r = await self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": p}],
                n=num_responses,
                **(kwargs or {}),
            )
            return [c.message.content for c in r.choices]
        # One concurrent request per prompt; results come back in prompt order.
        return await asyncio.gather(*(one(p) for p in prompts))

Then in bo_loop.py:

def _make_llm_client(model: str) -> LLMClient:
    return OpenAIClient(model=model)

llm_utils.py contains additional skeletons for Anthropic and local vLLM/HuggingFace.
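The OpenAI example above accepts max_retries and timeout_per_call through **_ and ignores them. One way a client might honour them is sketched below, assuming a per-attempt timeout with exponential backoff is acceptable for your provider. call_with_retries and base_delay are hypothetical names introduced for illustration, not part of the repo.

```python
import asyncio


async def call_with_retries(make_call, max_retries=8, timeout_per_call=None,
                            base_delay=0.5):
    """Await make_call() with a per-attempt timeout and exponential backoff.

    make_call is a zero-argument function returning a fresh coroutine,
    because a coroutine object cannot be awaited twice.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            # With timeout=None, wait_for awaits without a deadline.
            return await asyncio.wait_for(make_call(), timeout=timeout_per_call)
        except Exception as err:  # narrow this to your provider's error types
            last_error = err
            if attempt < max_retries - 1:
                await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"LLM call failed after {max_retries} attempts") from last_error


# Demo with a fake call that fails twice before succeeding.
attempts = {"n": 0}


async def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


result = asyncio.run(call_with_retries(flaky, max_retries=5, base_delay=0.01))
print(result, attempts["n"])
```

Inside a client, the provider request would be wrapped in a lambda and passed as make_call, with max_retries and timeout_per_call forwarded from get_batch_llm_responses.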

3️⃣ Data setup (only required for NAS-Bench-201 and the summarization task)

Place the input datasets under data/ (or pass alternative paths via Hydra). See data/README.md for the exact format and download instructions:

NAS-Bench-201: data/nasbench201_dataset_<id>.jsonl, derived from https://github.com/D-X-Y/NAS-Bench-201.
CNN/DailyMail (summarization task only): data/cnn_daily_mail_{train,test}.parquet, from HuggingFace cnn_dailymail v3.0.0.

The other synthetic environments (dtlz2, vehicle_safety, carcab, thermo) need no external data — they use closed-form objectives or pymoo / botorch test functions.

⚙️ Configuration

The BO loop is configured via Hydra. The top-level entry is config/bo_loop.yaml; methods are selected via the Hydra method group.

`method=`	Description
`lilo`	Pairwise NL utility approximation (the paper's main method)
`lilo_scalar`	Scalar NL utility approximation
`true_utility_bo`	Oracle baseline using the true utility
`preferential_bo`	Preferential BO baseline (`PairwiseGP` + `qEUBO`)
`llm_direct`	LLM directly proposes the next candidate
`llm_2step`	LLM-based 2-step acquisition

`environment.name`	Utility functions used in the paper
`dtlz2`	`piecewise_linear`, `l1`, `beta_products`
`vehicle_safety`	`piecewise_linear`, `beta_products`, `vehicle_safety_llm`
`carcab`	`piecewise_linear`, `carcab_llm`
`thermo`	`thermo_A`, `thermo_B`
`nas201`	`nas201_research`, `nas201_edge`

Other useful overrides:

seed=<int> — random seed (default 0).
N_iter=<int> — number of BO trials (default 8).
bs_feedback=<int> — number of feedback datapoints: question-answer pairs in LILO, direct utility evaluations in the true-utility baseline, or pairwise comparisons in the preferential BO baseline (default 2).
bs_exp=<int> — batch size of new candidates per trial (default null = problem dimension).
acquisition_method=log_nei|log_ei|thompson|ucb_<beta>|llm_2step|llm_direct.
pair_labeling_acquisition_method=sequential_q_eubo|random — for the LILO pairwise feedback acquisition ablation.
use_prior_knowledge=True/False, prior_knowledge_type=point|area|domain — for the prior-knowledge ablation. The prior is consumed at initialization (prior_knowledge_inc_method=llm_init, the only supported value).
save_outputs=True save_dir=runs/<name> — save per-trial results.

🎮 Single-run example

After implementing your LLMClient (and, for nas201 runs only, dropping nasbench201_dataset_0.jsonl into data/):

python -m lilo.bo_loop \
  method=lilo seed=0 \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  N_iter=8 \
  save_outputs=True save_dir=runs/main_benchmark

This writes runs/main_benchmark/run_<timestamp>/{config.json, results.jsonl}.

📄 Reproducing the paper experiments

Each paper figure corresponds to a sweep over (method, environment, utility_func, seed, ...). Use Hydra's multirun mode (-m) to launch the sweep; aggregate the resulting results.jsonl files yourself (the paper plots are means ± confidence intervals across seeds — no aggregation script ships with the repo).

Main benchmark

python -m lilo.bo_loop -m \
  method=lilo,true_utility_bo,preferential_bo,llm_direct,llm_2step \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/main_benchmark

Repeat with environment.name/utility_func for each row of the env table above.
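Rather than editing the command by hand for every row, the sweep commands can be generated. The sketch below copies the environment and utility names from the configuration table and prints one multirun command per (environment, utility_func) pair. sweep_commands is a hypothetical helper, not part of the repo, and it only builds the command strings without launching anything.

```python
import shlex

# Environment -> utility functions, copied from the configuration table.
ENV_UTILITIES = {
    "dtlz2": ["piecewise_linear", "l1", "beta_products"],
    "vehicle_safety": ["piecewise_linear", "beta_products", "vehicle_safety_llm"],
    "carcab": ["piecewise_linear", "carcab_llm"],
    "thermo": ["thermo_A", "thermo_B"],
    "nas201": ["nas201_research", "nas201_edge"],
}
METHODS = ["lilo", "true_utility_bo", "preferential_bo", "llm_direct", "llm_2step"]
SEEDS = range(16)


def sweep_commands(save_dir="runs/main_benchmark"):
    """Build one Hydra multirun command per (environment, utility_func) row."""
    commands = []
    for env, utilities in ENV_UTILITIES.items():
        for utility in utilities:
            argv = [
                "python", "-m", "lilo.bo_loop", "-m",
                "method=" + ",".join(METHODS),
                f"environment.name={env}",
                f"environment.utility_func={utility}",
                "seed=" + ",".join(str(s) for s in SEEDS),
                "save_outputs=True",
                f"save_dir={save_dir}",
            ]
            commands.append(shlex.join(argv))
    return commands


commands = sweep_commands()
print(len(commands))
print(commands[0])
```

Each printed line can be pasted into a shell or handed to a job scheduler. The nas201 rows still need the dataset under data/.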

LILO with prior knowledge

For synthetic environments, the available prior knowledge types are point and area:

python -m lilo.bo_loop -m \
  method=lilo \
  use_prior_knowledge=True \
  prior_knowledge_type=point,area \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/prior_knowledge

For real-world environments, use the domain prior knowledge type:

python -m lilo.bo_loop -m \
  method=lilo \
  use_prior_knowledge=True \
  prior_knowledge_type=domain \
  environment.name=thermo environment.utility_func=thermo_A,thermo_B \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/prior_knowledge

Pairwise vs scalar

python -m lilo.bo_loop -m \
  method=lilo_scalar \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/pairwise_vs_scalar

Pair acquisition for LLM labeling via qEUBO vs. random

python -m lilo.bo_loop -m \
  method=lilo \
  pair_labeling_acquisition_method=sequential_q_eubo,random \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/bpf_ablation

Acquisition-method ablation

python -m lilo.bo_loop -m \
  method=lilo \
  acquisition_method=log_nei,ucb_0.5,thompson \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/acqf_ablation

Summarization

summary_bo_loop.py is not Hydra-wrapped; configure it via the OmegaConf.create(...) block at the bottom of the file.

Key fields:

optimizer: "fixed" or "sampled".
seed, N_trials, N_configs, N_articles, bs_feedback, acqf, persona, optimize_preamble, local_dir (where the parquet files live), llm_model.

Sweep seed and optimizer to reproduce the two summarization plots.

💾 Output format

Each run writes:

<save_dir>/run_<timestamp>/
├── config.json      # the resolved Hydra config
└── results.jsonl    # one JSON object per BO trial

Per-trial fields include:

trial_index
best_value_predicted — best predicted utility seen so far
best_value_true - best ground truth utility seen so far
value_of_best_predicted - ground truth utility of the candidate with the best predicted value so far
pairwise_accuracy — LLM pairwise-judge accuracy on the trial's labeled pool
label_exp_df, context_df, prior_df — serialized pandas DataFrames

To compare methods, group runs by (method, environment.name, environment.utility_func), then for each trial_index compute the mean and confidence interval of best_value_true across seeds.
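Since no aggregation script ships with the repo, a minimal standard-library sketch of this step is given below. It makes two assumptions: the confidence interval is a normal approximation (z = 1.96), which may differ from the interval used in the paper, and the caller supplies the group key for each run, because the layout of config.json is not described here. load_trials and aggregate are hypothetical helper names.

```python
import json
import math
import statistics
from collections import defaultdict
from pathlib import Path


def load_trials(run_dir):
    """Read results.jsonl from one run directory into a list of dicts."""
    path = Path(run_dir) / "results.jsonl"
    with path.open() as fh:
        return [json.loads(line) for line in fh if line.strip()]


def aggregate(runs, field="best_value_true", z=1.96):
    """Mean and normal-approximation CI half-width per (group, trial_index).

    runs: iterable of (group_key, trials), one entry per seed, where
    group_key identifies (method, environment.name, environment.utility_func).
    """
    values = defaultdict(list)
    for group_key, trials in runs:
        for trial in trials:
            values[(group_key, trial["trial_index"])].append(trial[field])
    summary = {}
    for key, xs in values.items():
        mean = statistics.fmean(xs)
        # The standard error needs at least two seeds; report 0.0 otherwise.
        half = z * statistics.stdev(xs) / math.sqrt(len(xs)) if len(xs) > 1 else 0.0
        summary[key] = (mean, half)
    return summary


# Toy data standing in for two seeds of one method (values are made up).
runs = [
    ("lilo", [{"trial_index": 0, "best_value_true": 1.0},
              {"trial_index": 1, "best_value_true": 2.0}]),
    ("lilo", [{"trial_index": 0, "best_value_true": 3.0},
              {"trial_index": 1, "best_value_true": 2.0}]),
]
summary = aggregate(runs)
print(summary[("lilo", 0)], summary[("lilo", 1)])
```

For real runs, build the runs list by pairing each run directory's group key with load_trials(run_dir). The same function works for best_value_predicted or value_of_best_predicted through the field argument.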

⚖️ License

The code is licensed under an MIT license.

📧 Contact

Katarzyna Kobalczyk: knk25@cam.ac.uk

Jerry Lin: zylin@meta.com

✍️ Citation

If you find this repository useful, please consider giving it a star ⭐ and citing it as:

@inproceedings{
kobalczyk2026lilo,
title={{LILO}: Bayesian Optimization with Natural Language Feedback},
author={Katarzyna Kobalczyk and Zhiyuan Jerry Lin and Benjamin Letham and Zhuokai Zhao and Maximilian Balandat and Eytan Bakshy},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=zVrbE9ZtEU}
}
