LILO: Bayesian Optimization with Natural Language Feedback

Official implementation of LILO: Bayesian Optimization with Natural Language Feedback (ICML 2026).

LILO performs Bayesian optimization guided by free-form natural-language feedback from a (possibly LLM-simulated) decision maker. The NL feedback is converted into an optimizable signal via LLM-automated pairwise comparisons followed by GP modelling, which feeds a standard BO acquisition loop.
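To make the "pairwise comparisons to optimizable signal" step concrete, here is a minimal, self-contained sketch. It is not the repo's model: it swaps the GP for a plain Bradley-Terry logistic fit, and the function name, the toy utilities, and the hyperparameters are all hypothetical. It only illustrates how a set of (winner, loser) labels can be turned into scalar scores that an optimizer could rank.

```python
import math


def fit_bradley_terry(n_items, comparisons, lr=0.1, steps=500, l2=0.01):
    """Fit one latent score per item from (winner, loser) index pairs.

    Gradient ascent on the L2-regularized Bradley-Terry log-likelihood.
    This is a stand-in for the GP preference model, not the repo's code.
    """
    scores = [0.0] * n_items
    for _ in range(steps):
        grad = [-l2 * s for s in scores]  # gradient of the L2 penalty
        for winner, loser in comparisons:
            # Probability that `winner` beats `loser` under the current scores.
            p_win = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        scores = [s + lr * g for s, g in zip(scores, grad)]
    return scores


# Toy ground truth (hypothetical); a judge labels every pair without noise.
true_utility = [0.1, 0.9, 0.4, 0.7, 0.2]
pairs = [
    (i, j) if true_utility[i] > true_utility[j] else (j, i)
    for i in range(5)
    for j in range(i + 1, 5)
]
scores = fit_bradley_terry(5, pairs)
ranking = sorted(range(5), key=scores.__getitem__)  # worst to best
```

The fitted scores are only identified up to the ordering they induce, which is what a downstream acquisition step would need from a preference model.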

[openreview] [arXiv] [Colab notebook]

What this repo contains

The main entry point:

  • bo_loop.py — main BO loop on synthetic environments (DTLZ2, Vehicle Safety, CarCab, Thermo, NAS-Bench-201). Hydra-configured.

Supporting modules:

  • environments.py — synthetic environments and their utility functions.
  • gp_models.py — SimpleGPProxyModel, CategoricalGPProxyModel.
  • utility_approximator.py — LLM-driven scalar / pairwise utility approximation.
  • human_feedback_simulator.py — simulated NL human responses via an LLM.
  • prompts.py — prompt templates used by the LLM-side machinery.
  • llm_utils.py — abstract LLMClient interface that you must implement (see below).
  • utils.py — small helpers (sigmoid, JSON extraction).
  • config/ — Hydra configs (one per method).

The entry point and supporting module for the summarization task:

  • summary_bo_loop.py — BO over LLM-summarization hyperparameters on CNN/DailyMail.
  • summay_utils.py — ArticleSummarizer, SummaryFeedbackGenerator, LLMPairwiseJudge for the summarization task.

💿 Installation

The steps below set up the environment.

1️⃣ Environment

First, clone the repository and enter the directory:

git clone https://github.com/facebookresearch/lilo
cd lilo

Then, set up a conda environment as follows:

conda create --name lilo python=3.12
conda activate lilo

Finally, install the required dependencies:

pip install -r requirements.txt
# or, for an editable install:
pip install -e .

2️⃣ LLM client setup (required)

The optimizer relies on an LLM to simulate human feedback, label pairwise preferences, generate prior knowledge, etc. The repo ships with lilo.llm_utils.PlaceholderLLMClient, which raises NotImplementedError on use. Subclass LLMClient against your provider and wire it into bo_loop._make_llm_client (and summary_bo_loop._make_llm_client if you run the summarization experiments).

The only method you must implement is:

async def get_batch_llm_responses(
    self, prompts, num_responses=1, kwargs=None,
    max_retries=8, timeout_per_call=None,
) -> list[list[str]]:
    ...

For each input prompt, return a list of num_responses string completions.
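The return shape is easy to get wrong, so here is a small offline stub that follows the signature above. It is a sketch for smoke-testing only: EchoLLMClient is a hypothetical name, and it does not inherit from lilo.llm_utils.LLMClient so that the snippet runs without the repo installed. In the repo you would subclass LLMClient instead.

```python
import asyncio


class EchoLLMClient:
    """Hypothetical offline stand-in that echoes each prompt back."""

    def __init__(self, model="echo"):
        self.model = model

    async def get_batch_llm_responses(
        self, prompts, num_responses=1, kwargs=None,
        max_retries=8, timeout_per_call=None,
    ):
        # Outer list: one entry per prompt, in input order.
        # Inner list: num_responses completions for that prompt.
        return [
            [f"[{self.model}] {p}" for _ in range(num_responses)]
            for p in prompts
        ]


out = asyncio.run(
    EchoLLMClient().get_batch_llm_responses(["a", "b"], num_responses=2)
)
```

Here out has length 2 (one entry per prompt) and each entry holds 2 strings.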

Example: OpenAI

import asyncio
from openai import AsyncOpenAI
from lilo.llm_utils import LLMClient


class OpenAIClient(LLMClient):
    def __init__(self, model="gpt-4o-mini", **kw):
        super().__init__(model=model)
        self.client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    async def get_batch_llm_responses(self, prompts, num_responses=1, kwargs=None, **_):
        # max_retries and timeout_per_call are absorbed by **_ and ignored here.
        async def one(p):
            r = await self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": p}],
                n=num_responses,
                **(kwargs or {}),
            )
            return [c.message.content for c in r.choices]
        # One concurrent request per prompt; results keep the order of `prompts`.
        return await asyncio.gather(*(one(p) for p in prompts))

Then in bo_loop.py:

def _make_llm_client(model: str) -> LLMClient:
    return OpenAIClient(model=model)

llm_utils.py contains additional skeletons for Anthropic and local vLLM/HuggingFace.
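The OpenAI example accepts max_retries and timeout_per_call through **_ without acting on them. If you want your client to honor them, one possible approach is a small retry wrapper like the sketch below. The helper name call_with_retries and the base_delay parameter are assumptions for illustration and are not part of the repo.

```python
import asyncio


async def call_with_retries(make_call, max_retries=8, timeout_per_call=None,
                            base_delay=0.5):
    """Await make_call() up to max_retries times with exponential backoff.

    make_call is a zero-argument function returning a fresh coroutine.
    timeout_per_call=None means no per-call timeout.
    """
    last_exc = None
    for attempt in range(max_retries):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_per_call)
        except Exception as exc:  # includes timeouts raised by wait_for
            last_exc = exc
            if attempt < max_retries - 1:
                await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"LLM call failed after {max_retries} attempts") from last_exc
```

Inside get_batch_llm_responses you could then wrap each request, for example `await call_with_retries(lambda: one(p), max_retries, timeout_per_call)`.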

3️⃣ Data setup (only required for NAS-Bench-201 and the summarization task)

Place the input datasets under data/ (or pass alternative paths via Hydra). See data/README.md for the exact format and download instructions:

  • NAS-Bench-201: data/nasbench201_dataset_<id>.jsonl, derived from https://github.com/D-X-Y/NAS-Bench-201.
  • CNN/DailyMail (summarization task only): data/cnn_daily_mail_{train,test}.parquet, from HuggingFace cnn_dailymail v3.0.0.

The other synthetic environments (dtlz2, vehicle_safety, carcab, thermo) need no external data — they use closed-form objectives or pymoo / botorch test functions.
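Before launching a long sweep it can help to check that the expected files are in place. The helper below is a hypothetical convenience, not part of the repo. It only checks the file names listed above and assumes dataset id 0 for NAS-Bench-201 unless told otherwise.

```python
from pathlib import Path


def missing_data_files(data_dir="data", nas_dataset_id=0,
                       need_nas=True, need_summarization=True):
    """Return the expected data file names that are absent from data_dir."""
    expected = []
    if need_nas:
        expected.append(f"nasbench201_dataset_{nas_dataset_id}.jsonl")
    if need_summarization:
        expected += ["cnn_daily_mail_train.parquet", "cnn_daily_mail_test.parquet"]
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means every file needed for the selected tasks was found.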

⚙️ Configuration

The BO loop is configured via Hydra. The top-level entry is config/bo_loop.yaml; methods are selected via the Hydra method group.

| method= | Description |
|---|---|
| lilo | Pairwise NL utility approximation (the paper's main method) |
| lilo_scalar | Scalar NL utility approximation |
| true_utility_bo | Oracle baseline using the true utility |
| preferential_bo | Preferential BO baseline (PairwiseGP + qEUBO) |
| llm_direct | LLM directly proposes the next candidate |
| llm_2step | LLM-based 2-step acquisition |

| environment.name | Utility functions used in the paper |
|---|---|
| dtlz2 | piecewise_linear, l1, beta_products |
| vehicle_safety | piecewise_linear, beta_products, vehicle_safety_llm |
| carcab | piecewise_linear, carcab_llm |
| thermo | thermo_A, thermo_B |
| nas201 | nas201_research, nas201_edge |

Other useful overrides:

  • seed=<int> — random seed (default 0).
  • N_iter=<int> — number of BO trials (default 8).
  • bs_feedback=<int> — number of feedback datapoints (default 2): question-answer pairs in LILO, direct utility evaluations in the true-utility baseline, pairwise comparisons in the preferential BO baseline.
  • bs_exp=<int> — batch size of new candidates per trial (default null = problem dimension).
  • acquisition_method=log_nei|log_ei|thompson|ucb_<beta>|llm_2step|llm_direct.
  • pair_labeling_acquisition_method=sequential_q_eubo|random — for the LILO pairwise feedback acquisition ablation.
  • use_prior_knowledge=True/False, prior_knowledge_type=point|area|domain — for the prior-knowledge ablation. The prior is consumed at initialization (prior_knowledge_inc_method=llm_init, the only supported value).
  • save_outputs=True save_dir=runs/<name> — save per-trial results.

🎮 Single-run example

After implementing your LLMClient (the dtlz2 example below needs no external data; nas201 runs additionally need nasbench201_dataset_0.jsonl in data/):

python -m lilo.bo_loop \
  method=lilo seed=0 \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  N_iter=8 \
  save_outputs=True save_dir=runs/main_benchmark

This writes runs/main_benchmark/run_<timestamp>/{config.json, results.jsonl}.

📄 Reproducing the paper experiments

Each paper figure corresponds to a sweep over (method, environment, utility_func, seed, ...). Use Hydra's multirun mode (-m) to launch the sweep; aggregate the resulting results.jsonl files yourself (the paper plots are means ± confidence intervals across seeds — no aggregation script ships with the repo).

Main benchmark

python -m lilo.bo_loop -m \
  method=lilo,true_utility_bo,preferential_bo,llm_direct,llm_2step \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/main_benchmark

Repeat with each environment.name / environment.utility_func combination listed in the environment table above.
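Typing out one multirun command per environment and utility function is tedious, so the sketch below generates them. The environment-to-utility mapping, the method list, and the 16 seeds are copied from this README; the helper name sweep_commands is hypothetical, and the snippet only builds command strings without running anything.

```python
import shlex

# Environment -> utility functions, as listed in the environment table.
ENV_UTILITIES = {
    "dtlz2": ["piecewise_linear", "l1", "beta_products"],
    "vehicle_safety": ["piecewise_linear", "beta_products", "vehicle_safety_llm"],
    "carcab": ["piecewise_linear", "carcab_llm"],
    "thermo": ["thermo_A", "thermo_B"],
    "nas201": ["nas201_research", "nas201_edge"],
}
METHODS = ["lilo", "true_utility_bo", "preferential_bo", "llm_direct", "llm_2step"]
SEEDS = range(16)


def sweep_commands(save_dir="runs/main_benchmark"):
    """Build one Hydra multirun command per (environment, utility_func) pair."""
    commands = []
    for env, utilities in ENV_UTILITIES.items():
        for util in utilities:
            args = [
                "python", "-m", "lilo.bo_loop", "-m",
                "method=" + ",".join(METHODS),
                f"environment.name={env}",
                f"environment.utility_func={util}",
                "seed=" + ",".join(str(s) for s in SEEDS),
                "save_outputs=True",
                f"save_dir={save_dir}",
            ]
            commands.append(shlex.join(args))
    return commands


commands = sweep_commands()
```

Each string in commands can be pasted into a shell or handed to a job scheduler of your choice.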

LILO with prior knowledge

For synthetic environments, the available prior knowledge types are point and area:

python -m lilo.bo_loop -m \
  method=lilo \
  use_prior_knowledge=True \
  prior_knowledge_type=point,area \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/prior_knowledge

For real-world environments, use the domain knowledge type:

python -m lilo.bo_loop -m \
  method=lilo \
  use_prior_knowledge=True \
  prior_knowledge_type=domain \
  environment.name=thermo environment.utility_func=thermo_A,thermo_B \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/prior_knowledge

Pairwise vs scalar

python -m lilo.bo_loop -m \
  method=lilo_scalar \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/pairwise_vs_scalar

Pair acquisition for LLM labeling via qEUBO vs. random

python -m lilo.bo_loop -m \
  method=lilo \
  pair_labeling_acquisition_method=sequential_q_eubo,random \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/bpf_ablation

Acquisition-method ablation

python -m lilo.bo_loop -m \
  method=lilo \
  acquisition_method=log_nei,ucb_0.5,thompson \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/acqf_ablation

Summarization

summary_bo_loop.py is not Hydra-wrapped; configure it via the OmegaConf.create(...) block at the bottom of the file.

Key fields:

  • optimizer: "fixed" or "sampled".
  • seed, N_trials, N_configs, N_articles, bs_feedback, acqf, persona, optimize_preamble, local_dir (where the parquet files live), llm_model.

Sweep seed and optimizer to reproduce the two summarization plots.

💾 Output format

Each run writes:

<save_dir>/run_<timestamp>/
├── config.json      # the resolved Hydra config
└── results.jsonl    # one JSON object per BO trial

Per-trial fields include:

  • trial_index
  • best_value_predicted — best predicted utility seen so far
  • best_value_true — best ground-truth utility seen so far
  • value_of_best_predicted — ground-truth utility of the point with the best predicted value so far
  • pairwise_accuracy — LLM pairwise-judge accuracy on the trial's labeled pool
  • label_exp_df, context_df, prior_df — serialized pandas DataFrames

To compare methods across seeds, group by (method, environment.name, environment.utility_func, seed), take best_value_true per trial_index, and compute means / CIs.
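Since no aggregation script ships with the repo, here is a minimal standard-library sketch of the grouping described above. The function names are hypothetical, the confidence interval is a normal approximation (1.96 standard errors), and how you extract the grouping key from config.json depends on the actual layout of the resolved config, which is an assumption here.

```python
import json
import math
from collections import defaultdict
from pathlib import Path
from statistics import mean, stdev


def load_runs(save_dir):
    """Yield (config, trials) for every run_<timestamp> directory in save_dir."""
    for run_dir in sorted(Path(save_dir).glob("run_*")):
        config = json.loads((run_dir / "config.json").read_text())
        with open(run_dir / "results.jsonl") as f:
            trials = [json.loads(line) for line in f if line.strip()]
        yield config, trials


def aggregate(runs, group_key):
    """Mean and 95% CI half-width of best_value_true per (group, trial_index).

    group_key maps a run's config to a hashable label, e.g. the method name.
    Runs that share a label are treated as different seeds of the same setting.
    """
    buckets = defaultdict(list)
    for config, trials in runs:
        for trial in trials:
            key = (group_key(config), trial["trial_index"])
            buckets[key].append(trial["best_value_true"])
    summary = {}
    for key, values in buckets.items():
        # With a single seed there is no spread to estimate.
        half = 1.96 * stdev(values) / math.sqrt(len(values)) if len(values) > 1 else 0.0
        summary[key] = (mean(values), half)
    return summary


# Example (assumes config.json exposes a top-level "method" entry):
# summary = aggregate(load_runs("runs/main_benchmark"), lambda c: c["method"])
```

The returned dictionary maps (group, trial_index) to (mean, half-width), which is enough to draw a mean curve with a shaded band per method.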

⚖️ License

The code is licensed under an MIT license.

📧 Contact

Katarzyna Kobalczyk: knk25@cam.ac.uk

Jerry Lin: zylin@meta.com

✍️ Citation

If you find this repository useful, please consider giving it a star ⭐ and citing:

@inproceedings{
kobalczyk2026lilo,
title={{LILO}: Bayesian Optimization with Natural Language Feedback},
author={Katarzyna Kobalczyk and Zhiyuan Jerry Lin and Benjamin Letham and Zhuokai Zhao and Maximilian Balandat and Eytan Bakshy},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=zVrbE9ZtEU}
}