High-performance recommender output storage #495

Open
mdekstrand opened this issue Oct 25, 2024 · 2 comments
Labels: deferred, evaluation

Comments

@mdekstrand (Member)

In experiments I have been running, retrieving and saving results is a significant bottleneck in parallel batch inference. It is seriously hindering throughput: each worker is only able to use 30-40% of a CPU on my large data-crunching rig.

It is possible that item lists will speed this up, but if not, I would like to look at a more efficient way to collect batch-inference results for saving and/or measurement.

One potential solution is to save each worker's results in a separate Parquet file.
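
A minimal sketch of that approach, assuming each worker receives a stream of result batches already converted to Arrow tables (the directory layout and helper name are illustrative, not an existing LensKit API):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_worker_results(worker_id: int, batches, out_dir="results"):
    """Append each batch of results to this worker's own Parquet file.

    `batches` is an iterable of Arrow tables sharing a schema; one file
    per worker avoids cross-process write contention entirely.
    """
    path = f"{out_dir}/worker-{worker_id}.parquet"
    writer = None
    try:
        for table in batches:
            if writer is None:
                # open the file lazily, using the first batch's schema
                writer = pq.ParquetWriter(path, table.schema)
            writer.write_table(table)
    finally:
        if writer is not None:
            writer.close()
```

The per-worker files could then be read back together as a single dataset after the run, e.g. with pq.read_table(out_dir).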

Another promising direction is Arrow Flight, an IPC protocol built on top of Arrow. An ItemList can be trivially converted to an Arrow Table, which can then be sent as a flight. We could implement a Flight server, in either Python or Rust, that receives item lists and incorporates them into the results.
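
To make the Flight idea concrete, a minimal do_put sketch using pyarrow.flight might look like this (the "results" path, port, and accumulate-in-memory strategy are assumptions for illustration only):

```python
import pyarrow as pa
import pyarrow.flight as flight

class ResultCollector(flight.FlightServerBase):
    """Collects result tables uploaded by inference workers via do_put."""

    def __init__(self, location="grpc://0.0.0.0:8815"):
        super().__init__(location)
        self.tables = []  # accumulated result batches

    def do_put(self, context, descriptor, reader, writer):
        # read the full record-batch stream the client uploaded and stash it
        self.tables.append(reader.read_all())

# worker side: upload one Arrow table (e.g. a converted ItemList)
def send_results(table: pa.Table, location="grpc://localhost:8815"):
    client = flight.connect(location)
    descriptor = flight.FlightDescriptor.for_path("results")
    writer, _ = client.do_put(descriptor, table.schema)
    writer.write_table(table)
    writer.close()
```

Whether several workers can call do_put on such a server concurrently, without serializing on the Python side, is exactly the first open question below.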

Some open questions:

  • Does the Python Flight server support concurrent clients, or does one client running do_put block other clients?
  • Do we need Rust, or will Python be sufficiently performant?
@mdekstrand (Member, Author) commented Oct 26, 2024

I have done a quick benchmark, and serializing an item list to PyArrow IPC is not more efficient than pickling it.

  • Pickling 5K item lists (HIGHEST_PROTOCOL): 61ms
  • Pickling 5K item lists (default): 63ms
  • Arrow IPC 5K item lists: 85ms
  • Pickling 5K data frames: 122ms
  • Converting 5K item lists to data frames, then pickling: 313ms
  • Converting 5K item lists to data frames, then to Arrow, then to IPC: 827ms

That is with short lists; when I let lists get longer, the gap increases.
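
For reference, a microbenchmark of this shape could be structured as follows. The to_arrow() call stands in for the hypothetical ItemList-to-Table conversion described above, and the commented lines show intended usage; absolute timings will vary by machine:

```python
import pickle
import time

import pyarrow as pa

def arrow_ipc_bytes(table: pa.Table) -> bytes:
    """Serialize an Arrow table to the IPC stream format."""
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return sink.getvalue().to_pybytes()

def bench(label: str, fn, objects):
    """Time one serialization function over a collection of objects."""
    start = time.perf_counter()
    for obj in objects:
        fn(obj)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed * 1000:.0f}ms")

# `item_lists` would be the 5K ItemList objects under test:
# bench("pickle (highest)",
#       lambda il: pickle.dumps(il, pickle.HIGHEST_PROTOCOL), item_lists)
# bench("pickle (default)", pickle.dumps, item_lists)
# bench("arrow ipc", lambda il: arrow_ipc_bytes(il.to_arrow()), item_lists)
```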

@mdekstrand mdekstrand closed this as not planned Oct 26, 2024
@mdekstrand mdekstrand reopened this Oct 26, 2024
@mdekstrand (Member, Author)

With item-list pickling working much more efficiently than Pandas data frame pickling, marshalling recommendation results is no longer a bottleneck in most evaluations. This issue will be deferred until we observe a problem again.

@mdekstrand mdekstrand added the deferred label and removed the enhancement label Jan 15, 2025