Skip to content

A high-concurrency benchmark suite for Approximate Nearest Neighbor (ANN) indexes with mixed read/write workloads.

Notifications You must be signed in to change notification settings

Kwan-Yiu/ccANN-Bench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

ccANN-Bench

A high-concurrency benchmark suite for Approximate Nearest Neighbor (ANN) indexes with mixed read/write workloads.

Go C++17 CMake License


Table of Contents


Features

Feature Description
⚑ Concurrent Real-time read/write benchmarking with configurable rates
🌊 Streaming Incremental inserts with recall tracking over time
βš–οΈ Comparison Record/replay mode for lock strategy analysis
🧩 Flexible Multiple query modes: round_robin, chasing, peeking
πŸ“Š Profiling Memory tracking via jemalloc integration
🎨 Visualization Auto-generated recall plots and diff analysis

Supported Algorithms

Algorithm Description
HNSW Hierarchical Navigable Small World
ANNchor ANNchor with future-skip optimization
Vamana DiskANN's Vamana graph
ParlayANN Parallel ANN (HNSW/Vamana variants)

Datasets: SIFT (128d) Β· GIST (960d) Β· Deep1M (96d) Β· MSong (420d) Β· FreewayML


Quick Start

Prerequisites

Requirement Version
Go 1.21+
GCC/G++ C++17 support
CMake 3.16+
jemalloc -

Build

# 1. Build native libraries
cmake -S bench -B bench/build && cmake --build bench/build -j
cmake -S utils -B utils/build && cmake --build utils/build -j

# 2. Build benchmark binary
cd bench && go build -o bench .

Run

# Single config
./bench -config config/annchor/sift/ch_b20.yaml

# Batch experiments
bash run.sh

Configuration

Config path: bench/config/<algorithm>/<dataset>/*.yaml

Config File Naming

Pattern Description
rr.yaml Round-robin mode, multiple batch sizes
ch_b{N}.yaml Chasing mode, batch_size=N
pk_b{N}.yaml Peeking mode, batch_size=N
overall_only.yaml Overall recall test (no streaming)
seg.yaml Segmented index

Full Config Reference

data:
  dataset_name: sift                    # dataset identifier
  max_elements: -1                      # max index size (-1 = unlimited)
  begin_num: 50000                      # initial index size before streaming
  data_type: float                      # float | int8 | uint8
  data_path: ../data/sift/base.bin
  incr_query_path: ../data/sift/query_stream.bin
  incr_gt_path: ../data/sift/stream_i20_offset_index.txt
  overall_query_path: ../data/sift/query.bin
  overall_gt_path: ../data/sift/base.gt20

index:
  index_type: annchor                   # hnsw | annchor | vamana | segmented
  m: 16                                 # max neighbors per node
  ef_construction: 200                  # construction search depth
  sealed_type: hnsw                     # (segmented only) base index type
  seal_threshold: [50000, 100000]       # (segmented only) segment thresholds

search:
  recall_at: 10                         # k for recall@k
  ef_search: 35                         # search depth
  with_node_lock: [true, false]         # node lock during search

workload:
  batch_size: [20, 100, 200]            # vectors per batch
  num_threads: [16, 64]                 # worker threads
  queue_size: 0                         # event queue size (0 = unbounded)
  rate_groups(r/w):                     # [read_rate, write_rate] pairs
    - [20000, 20000]                    # balanced
    - [4000, 36000]                     # write-heavy
    - [36000, 4000]                     # read-heavy
  with_external_rw_lock: false          # external RW lock
  query_mode: chasing                   # round_robin | chasing | peeking

compare:
  enabled: [true, false]                # enable lock comparison mode
  simulate:
    enabled: false                      # simulation mode

result:
  output_dir: ./result/ch

profile:
  memory_monitor_interval: 100          # memory sampling interval (ms)
  enable_memory_profile: true

Sweep Parameters

Arrays in config generate multiple experiment variants:

batch_size: [20, 100, 200]    # 3 variants
num_threads: [16, 64]         # Γ— 2 variants
rate_groups(r/w):             # Γ— 3 variants
  - [20000, 20000]
  - [4000, 36000]
  - [36000, 4000]
# Total: 3 Γ— 2 Γ— 3 = 18 experiments

Query Modes

Mode Behavior
round_robin Queries cycle through query set sequentially
chasing Queries follow the insertion watermark
peeking Queries peek ahead of the watermark

Lock Strategy

Option Behavior
with_node_lock: true Lock nodes during search (default)
with_node_lock: false No node lock (faster, may have consistency issues)
with_external_rw_lock: true External RW lock (auto-disables node lock)

Output

bench/result/<algo>/<dataset>/
β”œβ”€β”€ files/
β”‚   β”œβ”€β”€ *.res              # search results
β”‚   └── *_diff.csv         # recall diff data
└── figures/
    └── *.png              # recall plots

Manual plotting:

python3 utils/scripts/plot_recall_diff.py --csv <diff.csv> --out <plot.png>

Repository Structure

ccANN-Bench/
β”œβ”€β”€ bench/
β”‚   β”œβ”€β”€ algorithms/        # Index implementations (C++)
β”‚   β”œβ”€β”€ config/            # Workload configs (YAML)
β”‚   β”œβ”€β”€ internal/          # Benchmark logic (Go)
β”‚   └── result/            # Output directory
β”œβ”€β”€ utils/
β”‚   β”œβ”€β”€ build/             # calc_recall, calc_incr_recall
β”‚   └── scripts/           # Plotting scripts (Python)
└── data/                  # Datasets & ground truth

Notes

  • run.sh sets LD_PRELOAD to jemalloc for memory tracking
  • ANNchor exposes extra stats (future-skip counters) in compare runs
  • Set compare.enabled: true for lock strategy experiments
  • When with_external_rw_lock: true, node lock is automatically disabled

License

MIT

About

A high-concurrency benchmark suite for Approximate Nearest Neighbor (ANN) indexes with mixed read/write workloads.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •