
@TroyGarden (Contributor)

Summary:
This diff adds support for pre-allocated, in-place copies for host-to-device data transfer in TorchRec train pipelines, addressing the CUDA memory overhead identified in production RecSys models.

https://fb.workplace.com/groups/429376538334034/permalink/1497469664858044/

## Context

As described in the [RFC on Workplace](https://fb.workplace.com/groups/429376538334034/permalink/1497469664858044/), we identified an extra 3-6 GB of CUDA memory overhead per rank on top of the active memory snapshot in most RecSys model training pipelines. The overhead stems from the PyTorch caching allocator's behavior when a side CUDA stream is used for non-blocking host-to-device transfers: the allocator associates the transferred tensors' memory with the side stream, preventing the main stream from reusing it and causing up to 13 GB of extra memory footprint per rank in production models.
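
To make the failure mode concrete, here is a minimal sketch of the standard pattern that triggers the overhead. This is not TorchRec's actual code; it assumes a pinned host input and a dedicated side stream:

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream dedicated to H2D copies


def copy_batch_to_gpu(host_batch: torch.Tensor) -> torch.Tensor:
    # host_batch should live in pinned memory for a truly non-blocking copy.
    with torch.cuda.stream(copy_stream):
        # .to() allocates while copy_stream is current, so the caching
        # allocator ties this memory to copy_stream; the main stream cannot
        # reuse those blocks later, inflating the reserved footprint.
        device_batch = host_batch.to("cuda", non_blocking=True)
    # Order the main stream after the copy, and record the tensor on it so
    # the allocator does not reclaim the memory while the copy is in flight.
    torch.cuda.current_stream().wait_stream(copy_stream)
    device_batch.record_stream(torch.cuda.current_stream())
    return device_batch
```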

The solution proposed in [D86068070](https://www.internalfb.com/diff/D86068070) pre-allocates the memory on the main stream and uses an in-place copy to reduce this overhead. In local train pipeline benchmarks with a 1 GB ModelInput (2 KJTs plus float features), this approach reduced the memory footprint by ~6 GB per rank. The optimization unblocks many memory-constrained use cases across platforms including APS, Pyper, and MVAI.
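
A matching sketch of the pre-allocation pattern, under the same assumptions and reusing `copy_stream` from the sketch above: the destination is allocated while the main stream is current, so the caching allocator ties the memory to the main stream, and the side stream performs only the in-place `copy_()`:

```python
def inplace_copy_batch_to_gpu(host_batch: torch.Tensor) -> torch.Tensor:
    # Pre-allocate on the MAIN stream: the caching allocator now associates
    # the memory with the main stream and can reuse it there afterwards.
    device_batch = torch.empty_like(host_batch, device="cuda")
    # Let the side stream see the finished allocation before copying into it.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        # Only the copy runs on the side stream; no allocation happens here.
        device_batch.copy_(host_batch, non_blocking=True)
    # The main stream waits for the copy before consuming the batch.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return device_batch
```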

## Key Changes

1. **Added `inplace_copy_batch_to_gpu` parameter**: a new boolean flag threaded through the train pipeline infrastructure that switches between standard batch copying (direct allocation on the side stream) and in-place copying (pre-allocation on the main stream).

2. **New `inplace_copy_batch_to_gpu()` method**: implemented in the `TrainPipeline` class to handle the new data transfer pattern with proper stream synchronization, using `_to_device()` with the optional `data_copy_stream` parameter (see the interface sketch after this list).

3. **Extended `Pipelineable.to()` interface**: added an optional `data_copy_stream` parameter to the abstract method, allowing implementations to specify which stream should execute the data copy operation (see #3510).

4. **Updated benchmark configuration** (`sparse_data_dist_base.yml`):
   - Increased `num_batches` from 5 to 10
   - Changed `feature_pooling_avg` from 10 to 30
   - Reduced `num_benchmarks` from 2 to 1
   - Added `num_profiles: 1` for profiling

5. **Enhanced table configuration**: added a `base_row_size` parameter (default: 100,000) to `EmbeddingTablesConfig` for more flexible embedding table sizing.
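
As referenced in items 2 and 3, the sketch below shows how the extended interface could fit together. It is a hedged approximation: the exact TorchRec signature may differ, and the docstring behavior is inferred from this summary.

```python
from abc import ABC, abstractmethod
from typing import Optional

import torch


class Pipelineable(ABC):
    """Sketch of the extended abstract interface (item 3); signature assumed."""

    @abstractmethod
    def to(
        self,
        device: torch.device,
        non_blocking: bool = False,
        data_copy_stream: Optional[torch.cuda.Stream] = None,
    ) -> "Pipelineable":
        # When data_copy_stream is provided, implementations are expected to
        # pre-allocate their tensors on the current (main) stream and run the
        # actual copy_() on data_copy_stream; otherwise they fall back to a
        # plain non-blocking .to() on whichever stream is current.
        ...
```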

These changes enable performance and memory comparisons between the standard and in-place copy strategies, with the benchmarking infrastructure needed to measure and trace the differences.
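
For illustration, toggling the new behavior from user code might look like the following. Only the flag name `inplace_copy_batch_to_gpu` comes from this diff; its exact placement in the `TrainPipelineSparseDist` constructor, and the `sharded_model`/`optimizer` placeholders, are assumptions.

```python
import torch
from torchrec.distributed.train_pipeline import TrainPipelineSparseDist

# sharded_model and optimizer stand in for an already-sharded TorchRec model
# and its optimizer; the kwarg placement below is assumed, not verified.
pipeline = TrainPipelineSparseDist(
    model=sharded_model,
    optimizer=optimizer,
    device=torch.device("cuda"),
    inplace_copy_batch_to_gpu=True,  # pre-allocate on main stream, copy in place
)
```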

Differential Revision: D86208714

meta-cla bot added the **CLA Signed** label on Nov 7, 2025.

meta-codesync bot commented Nov 7, 2025

@TroyGarden has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86208714.

meta-codesync bot closed this in da91c05 on Nov 8, 2025.
@TroyGarden deleted the export-D86208714 branch on November 8, 2025 at 06:28.
@TroyGarden changed the title from "add inplace_copy_batch_to_gpu in TrainPipeline" to "[mem optimization] add inplace_copy_batch_to_gpu in TrainPipeline" on Nov 10, 2025.