Implement multiround benchmarking & enhance `LLMService` interfaces #54

GlebSolovev · 2025-01-17T04:12:05Z

Core Changes ⚡️

Multiround Benchmarking Support 🥊

Now it is possible to benchmark multiround proof generation, as is our tradition, with comprehensive error handling and logging.
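
As a rough illustration of what a multiround run involves, here is a minimal sketch. It is not the framework's implementation: `runMultiroundSketch`, `RoundResult`, `GenerateFn` and `ValidateFn` are hypothetical names, and generation is synchronous here only to keep the example short (real LLM calls would be asynchronous). It shows three behaviours touched by this PR's commits: 1-based round numbers, skipping duplicate proofs, and stopping as soon as a round succeeds.

```typescript
// Hypothetical types for illustration; none of these names come from the project.
interface RoundResult {
    roundNumber: number; // 1-based
    proofs: string[]; // new (non-duplicate) proofs checked in this round
    validProof: string | undefined;
}

// Assumed callback shapes: `generate` receives the proofs that failed in the previous round.
type GenerateFn = (previousFailures: string[], roundNumber: number) => string[];
type ValidateFn = (proof: string) => boolean;

function runMultiroundSketch(
    generate: GenerateFn,
    validate: ValidateFn,
    maxRounds: number
): RoundResult[] {
    const results: RoundResult[] = [];
    const seen = new Set<string>();
    let failures: string[] = [];

    for (let roundNumber = 1; roundNumber <= maxRounds; roundNumber++) {
        // Keep only proofs that have not been checked before (in this or earlier rounds).
        const proofs: string[] = [];
        for (const proof of generate(failures, roundNumber)) {
            if (!seen.has(proof)) {
                seen.add(proof);
                proofs.push(proof);
            }
        }

        const validProof = proofs.find((proof) => validate(proof));
        // Record the round before deciding whether to continue, so interim results survive.
        results.push({ roundNumber, proofs, validProof });

        // A successful round ends the run; later rounds are not executed.
        if (validProof !== undefined) {
            break;
        }
        failures = proofs;
    }
    return results;
}
```

The per-round results are collected in an array so that a caller could persist them after each round; how the framework actually stores interim results is not described in this PR text.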

LLMService Interface Rework 🕊

- Simplified error-handling invariants without reducing functionality.
- Extended proof-generation methods with the ProofGenerationMetadataHolder object, enabling the collection of additional metadata.
- Improved LLMService constructors for cleaner implementations.
- Updated GeneratedProof to store raw proof metadata (e.g., statistics returned by the LLMService generation method), facilitating detailed proof analysis—especially useful for benchmarks.
- Enhanced getter methods for better usability.
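
To make the metadata-holder idea concrete, here is a minimal sketch of the pattern: an optional holder object is passed into a generation call, which records raw metadata into it, and the returned proof object keeps the same metadata behind getters. All names ending in `Sketch` and the `tokensUsed` / `elapsedMillis` fields are assumptions made for illustration; the actual shape of ProofGenerationMetadataHolder and GeneratedProof is not shown in this description.

```typescript
// Hypothetical metadata shape; the real fields are not given in the PR text.
interface RawProofMetadataSketch {
    tokensUsed: number;
    elapsedMillis: number;
}

// Collects metadata produced by generation calls it is passed to.
class MetadataHolderSketch {
    private readonly collected: RawProofMetadataSketch[] = [];

    record(metadata: RawProofMetadataSketch): void {
        this.collected.push(metadata);
    }

    get all(): readonly RawProofMetadataSketch[] {
        return this.collected;
    }
}

// A proof that stores its raw generation metadata and exposes both via getters.
class GeneratedProofSketch {
    private readonly proofText: string;
    private readonly metadata: RawProofMetadataSketch;

    constructor(proofText: string, metadata: RawProofMetadataSketch) {
        this.proofText = proofText;
        this.metadata = metadata;
    }

    get proof(): string {
        return this.proofText;
    }

    get rawMetadata(): RawProofMetadataSketch {
        return this.metadata;
    }
}

// Stand-in for an LLM call: returns fixed data so the example is deterministic.
function generateProofSketch(
    goal: string,
    holder?: MetadataHolderSketch
): GeneratedProofSketch {
    const metadata: RawProofMetadataSketch = {
        tokensUsed: goal.length, // placeholder statistic, not a real token count
        elapsedMillis: 0,
    };
    holder?.record(metadata); // the holder is optional, so callers can ignore metadata
    return new GeneratedProofSketch("auto.", metadata);
}
```

Keeping the holder optional means existing callers are unaffected, while a benchmark can pass one in and inspect the statistics afterwards.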

Changes to the llm module are thoroughly covered by tests, while the new benchmarking features have been extensively tested manually.

Additional Value 🌟

- Enhanced benchmarking result interfaces: redesigned with improved typing for safer and more convenient usage.
- Custom errors: introduced specific error types (IllegalStateError, InvariantFailedError, BenchmarkingError) with throw wrappers, replacing generic `throw Error` cases. Refactored the benchmarking module accordingly and fixed the declarations of custom error classes.
- Stricter and clearer error handling: strengthened and documented error handling in the benchmarking framework.
- Utility code improvements: removed duplicated utilities and introduced better typings (e.g., for theorem rankers, colorization, LLM iterator).

Bug Fixes

- Resolved unawaited promises in affected tests.
- Fixed API key error repacking in the OpenAIService.
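
The custom error types and throw wrappers mentioned above could look roughly like the sketch below. The class names IllegalStateError and InvariantFailedError come from this PR; their constructor signatures, the wrapper names `illegalState` / `invariantFailed`, and the `firstProofOrThrow` helper are assumptions for illustration. The `Object.setPrototypeOf` call is the usual TypeScript workaround for subclassing `Error` when compiling to ES5; whether this is the exact fix applied here is not stated.

```typescript
class IllegalStateError extends Error {
    constructor(message: string) {
        super(message);
        // Restores the prototype chain when compiling to ES5, so `instanceof` works.
        Object.setPrototypeOf(this, new.target.prototype);
        this.name = "IllegalStateError";
    }
}

class InvariantFailedError extends Error {
    // Hypothetical signature: an invariant name plus free-form details.
    constructor(invariantName: string, details: string) {
        super(`invariant "${invariantName}" failed: ${details}`);
        Object.setPrototypeOf(this, new.target.prototype);
        this.name = "InvariantFailedError";
    }
}

// Hypothetical wrappers. The `never` return type documents that they always throw.
function illegalState(message: string): never {
    throw new IllegalStateError(message);
}

function invariantFailed(invariantName: string, details: string): never {
    throw new InvariantFailedError(invariantName, details);
}

// Example call site: replaces a generic `throw Error("...")`.
function firstProofOrThrow(proofs: string[]): string {
    const first = proofs[0];
    if (first === undefined) {
        return illegalState("no proofs were generated");
    }
    return first;
}
```

Wrappers of this kind keep call sites short and make the error category visible at a glance, which helps when different categories are handled differently further up the stack.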

GlebSolovev added 30 commits January 17, 2025 04:35

Implement multiround benchmarking base

e81cce1

Redesign & support comprehensive benchmarking results interfaces

420563e

Typify theorem rankers & support all implemented

e4738c8

Implement proper JSON serialization for BenchmarkedItem

d282524

Fix JSON serialization of results tree

e8cfa45

Save interim multiround benchmarking results

7811802

Refactor JSON printers

eb832d7

Improve benchmarking item logging

aca01b3

Improve ValidatedProof typing

9e3b60b

Handle nullability in multiround benchmarking

2b0160b

Introduce & use throw-error wrappers in benchmarking framework

aef3cef

Also fix custom error classes declarations

Fix errors after refactor

b9aa874

Document & fix round number

04dbc60

Document executeBenchmarkingTask properly

848a80f

Log multiround benchmarking properly

d480616

Extend root benchmarking result serialization

9603cd2

Handle duplicate generated proofs carefully

e6f3fa5

Simplify LLMService interface, modes are reworked

3a89be3

Design & support metadata object for LLMService calls

94d9c0f

Generalize GeneratedProof to store generation metadata

0e8bd03

Refactor GeneratedProof with getters

bdcd6cf

Updated benchmarks with new LLMService interface

8b64459

Note: it made possible to correctly extract proof generation metadata

Fix OpenAiService api key error repacking

e203800

Fix declaration of custom LLMServiceError-s

83ec560

Improve LLM iterator typing

125881d

Minor LLMService defaults update, mark TODOs

82cb332

Rewrite tests according to new LLMService interface

c6140db

Also: improve tests' code, fix `expect(...).toBeRejected()` not being awaited bug

Add plus_comm test example

3a82ef4

Fix round number bug in benchmarks

653ea6a

Make logging color defined, support "default"

4ec4d45

GlebSolovev added 7 commits January 17, 2025 04:35

Improve multiround benchmarking logs

d740ee9

Unify color logging in project scope

80f944f

Make LLM service identifier human-readable

1e8a67b

Improve & document error handling in benchmarks

2406462

Fix & prettify error handling in benchmarks

ea8c94c

(after comprehensive manual testing)

Fix multiround benchmarking bug: successful completion did not stop execution

f8d821f

Make multiround benchmarking logs clearer

c62d09b

GlebSolovev self-assigned this Jan 17, 2025

GlebSolovev changed the base branch from main to v2.5.0-dev January 17, 2025 04:13

GlebSolovev merged commit 2556ab3 into v2.5.0-dev Jan 29, 2025
3 checks passed

GlebSolovev deleted the benchmarking-multiround branch January 29, 2025 11:58
