High-performance recommender output storage #495

Open
mdekstrand opened this issue Oct 25, 2024 · 2 comments
Labels: deferred, evaluation

Comments

@mdekstrand (Member)

In experiments I have been running, retrieving and saving results is a significant bottleneck in parallel batch inference. It is seriously hindering throughput: each worker is only able to use 30-40% of a CPU on my large data-crunching rig.

It is possible that item lists will speed this up, but if not, I would like to look at a more efficient way to collect batch-inference results for saving and/or measurement.

One potential solution is to save each worker's results in a separate Parquet file.
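
A minimal sketch of that approach, assuming each worker receives a stream of result batches already converted to Arrow tables (the directory layout and helper name are illustrative, not an existing LensKit API):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_worker_results(worker_id: int, batches, out_dir="results"):
    """Append each batch of results to this worker's own Parquet file.

    `batches` is an iterable of Arrow tables sharing a schema; one file
    per worker avoids cross-process write contention entirely.
    """
    path = f"{out_dir}/worker-{worker_id}.parquet"
    writer = None
    try:
        for table in batches:
            if writer is None:
                # open the file lazily, using the first batch's schema
                writer = pq.ParquetWriter(path, table.schema)
            writer.write_table(table)
    finally:
        if writer is not None:
            writer.close()
```

The per-worker files could then be read back together as a single dataset after the run, e.g. with pq.read_table(out_dir).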

Another promising direction is Arrow Flight, an IPC protocol built on top of Arrow. An ItemList can be trivially converted to an Arrow Table, which can then be sent as a flight. We could implement a Flight server, in either Python or Rust, that receives item lists and incorporates them into the results.
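
To make the Flight idea concrete, a minimal do_put sketch using pyarrow.flight might look like this (the "results" path, port, and accumulate-in-memory strategy are assumptions for illustration only):

```python
import pyarrow as pa
import pyarrow.flight as flight

class ResultCollector(flight.FlightServerBase):
    """Collects result tables uploaded by inference workers via do_put."""

    def __init__(self, location="grpc://0.0.0.0:8815"):
        super().__init__(location)
        self.tables = []  # accumulated result batches

    def do_put(self, context, descriptor, reader, writer):
        # read the full record-batch stream the client uploaded and stash it
        self.tables.append(reader.read_all())

# worker side: upload one Arrow table (e.g. a converted ItemList)
def send_results(table: pa.Table, location="grpc://localhost:8815"):
    client = flight.connect(location)
    descriptor = flight.FlightDescriptor.for_path("results")
    writer, _ = client.do_put(descriptor, table.schema)
    writer.write_table(table)
    writer.close()
```

Whether several workers can call do_put on such a server concurrently, without serializing on the Python side, is exactly the first open question below.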

Some open questions:

  • Does the Python Flight server support concurrent clients, or does one client running do_put block other clients?
  • Do we need Rust, or will Python be sufficiently performant?
@mdekstrand (Member, Author) commented Oct 26, 2024

I have done a quick benchmark, and serializing an item list to PyArrow IPC is not more efficient than pickling it.

  • Pickling 5K item lists (HIGHEST_PROTOCOL): 61ms
  • Pickling 5K item lists (default): 63ms
  • Arrow IPC 5K item lists: 85ms
  • Pickling 5K data frames: 122ms
  • Converting 5K item lists to data frames, then pickling: 313ms
  • Converting 5K item lists to data frames, then to Arrow, then to IPC: 827ms

That is with short lists; when I let lists get longer, the gap increases.
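
For reference, a microbenchmark of this shape could be structured as follows. The to_arrow() call stands in for the hypothetical ItemList-to-Table conversion described above, and the commented lines show intended usage; absolute timings will vary by machine:

```python
import pickle
import time

import pyarrow as pa

def arrow_ipc_bytes(table: pa.Table) -> bytes:
    """Serialize an Arrow table to the IPC stream format."""
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return sink.getvalue().to_pybytes()

def bench(label: str, fn, objects):
    """Time one serialization function over a collection of objects."""
    start = time.perf_counter()
    for obj in objects:
        fn(obj)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed * 1000:.0f}ms")

# `item_lists` would be the 5K ItemList objects under test:
# bench("pickle (highest)",
#       lambda il: pickle.dumps(il, pickle.HIGHEST_PROTOCOL), item_lists)
# bench("pickle (default)", pickle.dumps, item_lists)
# bench("arrow ipc", lambda il: arrow_ipc_bytes(il.to_arrow()), item_lists)
```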

@mdekstrand mdekstrand closed this as not planned Oct 26, 2024
@mdekstrand mdekstrand reopened this Oct 26, 2024
@mdekstrand (Member, Author)

With item-list pickling working much more efficiently than Pandas data frame pickling, marshalling recommendation results is no longer a bottleneck in most evaluations. This issue will be deferred until we observe a problem again.

@mdekstrand mdekstrand added the deferred label and removed the enhancement label Jan 15, 2025