[mem optimization] add inplace_copy_batch_to_gpu in TrainPipeline #3526
Summary:
This diff adds support for pre-allocated, in-place host-to-device data transfer in TorchRec train pipelines, addressing the CUDA memory overhead identified in production RecSys models.
https://fb.workplace.com/groups/429376538334034/permalink/1497469664858044/
Context
As described in the RFC on Workplace, we identified an extra 3-6 GB of CUDA memory overhead per rank on top of the active memory snapshot in most RecSys model training pipelines. The overhead stems from PyTorch's caching allocator behavior when side CUDA streams are used for non-blocking host-to-device transfers: the allocator associates the transferred tensors' memory with the side stream, preventing reuse on the main stream and causing up to 13 GB of extra memory footprint per rank in production models.
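For illustration, here is a minimal sketch of the standard side-stream copy pattern that triggers this allocator behavior; the helper name and single dense tensor are hypothetical, not TorchRec code:

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream dedicated to H2D copies

def copy_batch_standard(batch_cpu: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Hypothetical helper; batch_cpu is assumed to be pinned. The GPU tensor is
    # allocated while copy_stream is current, so the caching allocator ties its
    # memory blocks to copy_stream; the main stream cannot reuse those blocks,
    # which is what inflates the per-rank footprint.
    with torch.cuda.stream(copy_stream):
        batch_gpu = batch_cpu.to(device, non_blocking=True)
    # Before the main stream consumes batch_gpu it must wait on the copy, and
    # the allocator must be told the tensor is also used on the main stream.
    torch.cuda.current_stream().wait_stream(copy_stream)
    batch_gpu.record_stream(torch.cuda.current_stream())
    return batch_gpu
```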
The solution proposed in D86068070 pre-allocates memory on the main stream and performs an in-place copy to reduce this overhead. In local train pipeline benchmarks with a 1-GB ModelInput (2 KJTs + float features), this approach reduced the memory footprint by ~6 GB per rank. The optimization unlocks many memory-constrained use cases across platforms, including APS, Pyper, and MVAI.
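A minimal sketch of the pre-allocate-then-copy-in-place alternative, under the same assumptions (hypothetical helper, one dense tensor standing in for the full ModelInput):

```python
import torch

def copy_batch_inplace(batch_cpu: torch.Tensor, device: torch.device,
                       copy_stream: torch.cuda.Stream) -> torch.Tensor:
    # Hypothetical helper sketching the D86068070 approach, not the actual
    # TorchRec implementation. The destination is pre-allocated on the main
    # (current) stream, so its blocks stay reusable there; only the copy_()
    # itself runs on the side stream.
    batch_gpu = torch.empty_like(batch_cpu, device=device)
    copy_stream.wait_stream(torch.cuda.current_stream())  # order allocation before copy
    with torch.cuda.stream(copy_stream):
        batch_gpu.copy_(batch_cpu, non_blocking=True)
    batch_gpu.record_stream(copy_stream)  # copy_stream also touches this memory
    # In the real pipeline the main stream would wait on copy_stream only right
    # before the batch is consumed, so the copy overlaps with compute.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return batch_gpu
```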
Key Changes:
- Added `inplace_copy_batch_to_gpu` parameter: new boolean flag throughout the train pipeline infrastructure that enables switching between standard batch copying (direct allocation on the side stream) and in-place copying (pre-allocation on the main stream).
- New `inplace_copy_batch_to_gpu()` method: implemented in the `TrainPipeline` class to handle the new data transfer pattern with proper stream synchronization, using `_to_device()` with the optional `data_copy_stream` parameter.
- Extended `Pipelineable.to()` interface: added an optional `data_copy_stream` parameter to the abstract method, allowing implementations to specify which stream should execute the data copy (see #3510, "Add device parameter to KeyedJaggedTensor.empty_like and copy_ method"); a rough sketch of how this could be used follows this list.
- Updated benchmark configuration (`sparse_data_dist_base.yml`): `num_batches` from 5 to 10, `feature_pooling_avg` from 10 to 30, `num_benchmarks` from 2 to 1, and `num_profiles: 1` for profiling.
- Enhanced table configuration: added a `base_row_size` parameter (default: 100,000) to `EmbeddingTablesConfig` for more flexible embedding table sizing.

These changes enable performance and memory comparison between the standard and in-place copy strategies, with benchmarking infrastructure in place to measure and trace the differences.
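As a rough illustration of how the extended interface could fit together, here is a minimal sketch; the exact signatures of `Pipelineable.to()`, `inplace_copy_batch_to_gpu()`, and `_to_device()` in the diff may differ, and `MyBatch` is a hypothetical Pipelineable-style batch:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class MyBatch:
    # Hypothetical batch; the real ModelInput carries KJTs plus float features.
    dense: torch.Tensor

    def to(self, device: torch.device, non_blocking: bool = False,
           data_copy_stream: Optional[torch.cuda.Stream] = None) -> "MyBatch":
        if data_copy_stream is None:
            # Standard path: allocate and copy on whatever stream is current.
            return MyBatch(dense=self.dense.to(device, non_blocking=non_blocking))
        # In-place path: pre-allocate on the current (main) stream, then run
        # only the copy kernel on the provided side stream.
        dense_gpu = torch.empty_like(self.dense, device=device)
        data_copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(data_copy_stream):
            dense_gpu.copy_(self.dense, non_blocking=non_blocking)
        dense_gpu.record_stream(data_copy_stream)
        return MyBatch(dense=dense_gpu)
```

Per the description above, a `TrainPipeline` constructed with `inplace_copy_batch_to_gpu=True` would route batches through a path like this and synchronize the main stream with the copy stream before the forward pass.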
Differential Revision: D86208714