
@TroyGarden (Contributor)

Summary:
This diff adds support for pre-allocated, in-place copies for host-to-device data transfer in TorchRec train pipelines, addressing the CUDA memory overhead identified in production RecSys models.

https://fb.workplace.com/groups/429376538334034/permalink/1497469664858044/

## Context

As described in the [RFC on Workplace](https://fb.workplace.com/groups/429376538334034/permalink/1497469664858044/), we identified an extra 3-6 GB of CUDA memory overhead per rank on top of the active memory snapshot in most RecSys model training pipelines. The overhead stems from the PyTorch caching allocator's behavior when a side CUDA stream is used for non-blocking host-to-device transfers: the allocator associates the transferred tensors' memory with the side stream, preventing the main stream from reusing it and causing up to 13 GB of extra memory footprint per rank in production models.
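
To make the failure mode concrete, here is a minimal sketch of the standard pattern that triggers the overhead. This is not TorchRec's actual code; it assumes a pinned host input and a dedicated side stream:

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream dedicated to H2D copies


def copy_batch_to_gpu(host_batch: torch.Tensor) -> torch.Tensor:
    # host_batch should live in pinned memory for a truly non-blocking copy.
    with torch.cuda.stream(copy_stream):
        # .to() allocates while copy_stream is current, so the caching
        # allocator ties this memory to copy_stream; the main stream cannot
        # reuse those blocks later, inflating the reserved footprint.
        device_batch = host_batch.to("cuda", non_blocking=True)
    # Order the main stream after the copy, and record the tensor on it so
    # the allocator does not reclaim the memory while the copy is in flight.
    torch.cuda.current_stream().wait_stream(copy_stream)
    device_batch.record_stream(torch.cuda.current_stream())
    return device_batch
```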

The solution proposed in [D86068070](https://www.internalfb.com/diff/D86068070) pre-allocates the memory on the main stream and uses an in-place copy to reduce this overhead. In local train pipeline benchmarks with a 1 GB ModelInput (2 KJTs plus float features), this approach reduced the memory footprint by ~6 GB per rank. The optimization unblocks many memory-constrained use cases across platforms including APS, Pyper, and MVAI.
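
A matching sketch of the pre-allocation pattern, under the same assumptions and reusing `copy_stream` from the sketch above: the destination is allocated while the main stream is current, so the caching allocator ties the memory to the main stream, and the side stream performs only the in-place `copy_()`:

```python
def inplace_copy_batch_to_gpu(host_batch: torch.Tensor) -> torch.Tensor:
    # Pre-allocate on the MAIN stream: the caching allocator now associates
    # the memory with the main stream and can reuse it there afterwards.
    device_batch = torch.empty_like(host_batch, device="cuda")
    # Let the side stream see the finished allocation before copying into it.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        # Only the copy runs on the side stream; no allocation happens here.
        device_batch.copy_(host_batch, non_blocking=True)
    # The main stream waits for the copy before consuming the batch.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return device_batch
```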

## Key Changes

1. **Added `inplace_copy_batch_to_gpu` parameter**: a new boolean flag threaded through the train pipeline infrastructure that switches between standard batch copying (direct allocation on the side stream) and in-place copying (pre-allocation on the main stream).

2. **New `inplace_copy_batch_to_gpu()` method**: implemented in the `TrainPipeline` class to handle the new data transfer pattern with proper stream synchronization, using `_to_device()` with the optional `data_copy_stream` parameter (see the interface sketch after this list).

3. **Extended `Pipelineable.to()` interface**: added an optional `data_copy_stream` parameter to the abstract method, allowing implementations to specify which stream should execute the data copy operation (see #3510).

4. **Updated benchmark configuration** (`sparse_data_dist_base.yml`):
   - Increased `num_batches` from 5 to 10
   - Changed `feature_pooling_avg` from 10 to 30
   - Reduced `num_benchmarks` from 2 to 1
   - Added `num_profiles: 1` for profiling

5. **Enhanced table configuration**: added a `base_row_size` parameter (default: 100,000) to `EmbeddingTablesConfig` for more flexible embedding table sizing.
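
As referenced in items 2 and 3, the sketch below shows how the extended interface could fit together. It is a hedged approximation: the exact TorchRec signature may differ, and the docstring behavior is inferred from this summary.

```python
from abc import ABC, abstractmethod
from typing import Optional

import torch


class Pipelineable(ABC):
    """Sketch of the extended abstract interface (item 3); signature assumed."""

    @abstractmethod
    def to(
        self,
        device: torch.device,
        non_blocking: bool = False,
        data_copy_stream: Optional[torch.cuda.Stream] = None,
    ) -> "Pipelineable":
        # When data_copy_stream is provided, implementations are expected to
        # pre-allocate their tensors on the current (main) stream and run the
        # actual copy_() on data_copy_stream; otherwise they fall back to a
        # plain non-blocking .to() on whichever stream is current.
        ...
```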

These changes enable performance and memory comparisons between the standard and in-place copy strategies, with the benchmarking infrastructure needed to measure and trace the differences.
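
For illustration, toggling the new behavior from user code might look like the following. Only the flag name `inplace_copy_batch_to_gpu` comes from this diff; its exact placement in the `TrainPipelineSparseDist` constructor, and the `sharded_model`/`optimizer` placeholders, are assumptions.

```python
import torch
from torchrec.distributed.train_pipeline import TrainPipelineSparseDist

# sharded_model and optimizer stand in for an already-sharded TorchRec model
# and its optimizer; the kwarg placement below is assumed, not verified.
pipeline = TrainPipelineSparseDist(
    model=sharded_model,
    optimizer=optimizer,
    device=torch.device("cuda"),
    inplace_copy_batch_to_gpu=True,  # pre-allocate on main stream, copy in place
)
```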

Differential Revision: D86208714

meta-cla bot added the **CLA Signed** label on Nov 7, 2025.

meta-codesync bot commented Nov 7, 2025

@TroyGarden has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86208714.

meta-codesync bot closed this in da91c05 on Nov 8, 2025.
@TroyGarden deleted the export-D86208714 branch on November 8, 2025 at 06:28.
@TroyGarden changed the title from "add inplace_copy_batch_to_gpu in TrainPipeline" to "[mem optimization] add inplace_copy_batch_to_gpu in TrainPipeline" on Nov 10, 2025.