LIN-Matrix/SimpleCXLMaximizeQuestion

Because the problem solved here is not linear, stage_pre.py cannot actually be run; it only serves to sketch the approach.

Mathematical Formulation and Explanation

The goal is to maximize throughput by splitting the work between GPU memory and CXL memory. The formulation is laid out step by step below.

Hardware Parameters

  1. GPU Memory: \[ \text{GPU\_MEMORY} = 16 \times 1024^3 \text{ bytes} \quad (\text{16 GiB}) \]
  2. FLOPS (GPU): \[ \text{FLOPS\_GPU} = 15.7 \times 10^{12} \text{ FLOPS} \quad (\text{15.7 TFLOPS}) \]
  3. FLOPS (CXL): \[ \text{FLOPS\_CXL} = 1.57 \times 10^{12} \text{ FLOPS} \quad (\text{1.57 TFLOPS}) \]
  4. Bandwidth: \[ \text{BAND\_WIDTH} = 16 \times 1024^3 \text{ bytes/s} \quad (\text{16 GiB/s}) \]
  5. Latency: \[ \text{LATENCY} = 1 \times 10^{-6} \text{ seconds} \quad (\text{1 µs}) \]

Model and LoRA Dimensions

  • Model Dimension: \[ d = 512 \]
  • LoRA Dimension: \[ r = 4 \]
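The parameters above can be written down as Python constants, which makes quick sanity checks easy. This is a minimal sketch: the constant names follow the symbols in the formulas, while `D`, `R`, and the helper `fp32_bytes` are names assumed here for illustration. The factor 4 is read as bytes per float32 element, which is an assumption about the formulas rather than something the repository states.

```python
# Hardware parameters, copied from the list above.
GPU_MEMORY = 16 * 1024**3   # bytes
FLOPS_GPU = 15.7e12         # floating-point operations per second
FLOPS_CXL = 1.57e12         # floating-point operations per second
BAND_WIDTH = 16 * 1024**3   # bytes per second
LATENCY = 1e-6              # seconds

# Model and LoRA dimensions (D and R are assumed names for d and r).
D = 512
R = 4


def fp32_bytes(rows, cols):
    """Size in bytes of a rows x cols matrix, assuming 4 bytes per element."""
    return 4 * rows * cols


# Largest number of rows of width D that fits in GPU memory on its own.
max_rows = GPU_MEMORY // fp32_bytes(1, D)
print(max_rows)
```

With these values a single row of width `d` takes 2048 bytes, so memory alone allows a little over eight million rows.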

Constraints and Time Computations

Stage 1:

  1. GPU Memory Usage: \[ 4 \times (b1 + b2 + b3) \times d \leq \text{GPU\_MEMORY} \] where \( b1, b2, b3 \) are the sizes (row counts) of the three batch partitions. Each partition is a \( b \times d \) matrix stored at 4 bytes per element.

  2. Stage 1 GPU Time: \[ T_{\text{GPU\_S1}} = \frac{2 \times b1 \times d \times r}{\text{FLOPS\_GPU}} \]

  3. Stage 1 CXL Time: \[ T_{\text{CXL\_S1}} = \text{LATENCY} + \frac{4 \times (b2 + b3) \times d}{\text{BAND\_WIDTH}} + \frac{2 \times (b2 + b3) \times d \times r}{\text{FLOPS\_CXL}} + \frac{4 \times b2 \times r}{\text{BAND\_WIDTH}} \]

  4. Constraints for Stage 1:
     \[ T_{\text{CXL\_S1}} \leq T_{\text{GPU\_S1}} \]
     \[ 4 \times b1 \times d + 4 \times d \times r + 4 \times b1 \times r \leq \text{GPU\_MEMORY} \]
     \[ 4 \times (b1 + b2) \times r \leq \text{GPU\_MEMORY} \]
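The Stage 1 formulas translate directly into a feasibility check. The sketch below is illustrative only and is not taken from `stage_pre.py`; the function names `stage1_times` and `stage1_feasible` and the constants `D` and `R` are assumed here.

```python
GPU_MEMORY = 16 * 1024**3   # bytes
FLOPS_GPU = 15.7e12
FLOPS_CXL = 1.57e12
BAND_WIDTH = 16 * 1024**3   # bytes per second
LATENCY = 1e-6              # seconds
D, R = 512, 4               # model and LoRA dimensions


def stage1_times(b1, b2, b3):
    """Return (T_GPU_S1, T_CXL_S1) in seconds for the given batch sizes."""
    t_gpu = 2 * b1 * D * R / FLOPS_GPU
    t_cxl = (LATENCY
             + 4 * (b2 + b3) * D / BAND_WIDTH
             + 2 * (b2 + b3) * D * R / FLOPS_CXL
             + 4 * b2 * R / BAND_WIDTH)
    return t_gpu, t_cxl


def stage1_feasible(b1, b2, b3):
    """Check the Stage 1 timing and memory constraints."""
    t_gpu, t_cxl = stage1_times(b1, b2, b3)
    return (t_cxl <= t_gpu
            and 4 * (b1 + b2 + b3) * D <= GPU_MEMORY
            and 4 * b1 * D + 4 * D * R + 4 * b1 * R <= GPU_MEMORY
            and 4 * (b1 + b2) * R <= GPU_MEMORY)


print(stage1_feasible(1, 0, 0))     # GPU time is far below the 1 µs latency
print(stage1_feasible(4000, 0, 0))  # GPU time now exceeds the latency term
```

Because the CXL time always includes the fixed latency, the timing constraint fails for very small `b1` even when nothing is offloaded to CXL.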

Stage 2:

  1. Stage 2 GPU Time: \[ T_{\text{GPU\_S2}} = \frac{2 \times (b1 + b2) \times r \times d}{\text{FLOPS\_GPU}} \]

  2. Stage 2 CXL Time: \[ T_{\text{CXL\_S2}} = \frac{2 \times b3 \times r \times d}{\text{FLOPS\_CXL}} + \frac{4 \times b3 \times d}{\text{BAND\_WIDTH}} - \frac{4 \times b2 \times r}{\text{BAND\_WIDTH}} \]

  3. Constraints for Stage 2:
     \[ T_{\text{CXL\_S2}} \leq T_{\text{GPU\_S2}} \]
     \[ 4 \times (b1 + b2) \times r + 4 \times r \times d + 4 \times (b1 + b2) \times d \leq \text{GPU\_MEMORY} \]
     \[ 4 \times (b1 + b2 + b3) \times d \leq \text{GPU\_MEMORY} \]
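Stage 2 can be checked the same way. Again this is a sketch under assumptions rather than the repository's code: `stage2_times`, `stage2_feasible`, `D`, and `R` are names introduced here, and the `b2` transfer term is subtracted exactly as in the formula above.

```python
GPU_MEMORY = 16 * 1024**3   # bytes
FLOPS_GPU = 15.7e12
FLOPS_CXL = 1.57e12
BAND_WIDTH = 16 * 1024**3   # bytes per second
D, R = 512, 4               # model and LoRA dimensions


def stage2_times(b1, b2, b3):
    """Return (T_GPU_S2, T_CXL_S2) in seconds for the given batch sizes."""
    t_gpu = 2 * (b1 + b2) * R * D / FLOPS_GPU
    t_cxl = (2 * b3 * R * D / FLOPS_CXL
             + 4 * b3 * D / BAND_WIDTH
             - 4 * b2 * R / BAND_WIDTH)
    return t_gpu, t_cxl


def stage2_feasible(b1, b2, b3):
    """Check the Stage 2 timing and memory constraints."""
    t_gpu, t_cxl = stage2_times(b1, b2, b3)
    return (t_cxl <= t_gpu
            and 4 * (b1 + b2) * R + 4 * R * D + 4 * (b1 + b2) * D <= GPU_MEMORY
            and 4 * (b1 + b2 + b3) * D <= GPU_MEMORY)


print(stage2_feasible(0, 0, 1))     # CXL does work while the GPU does none
print(stage2_feasible(4000, 0, 1))  # GPU time covers the CXL time
```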

Objective Function

Maximize throughput: \[ \text{Throughput} = \frac{b1 + b2 + b3}{T_{\text{GPU\_S1}} + T_{\text{GPU\_S2}}} \]
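One simple way to explore the objective without a solver is to evaluate it on a handful of candidate batch sizes and keep the best feasible one. The sketch below does this by brute force. It is not the method used in `stage_pre.py`, and the helper names (`gpu_times`, `cxl_times`, `feasible`, `throughput`, `best_on_grid`) as well as the candidate values are hypothetical.

```python
GPU_MEMORY = 16 * 1024**3   # bytes
FLOPS_GPU = 15.7e12
FLOPS_CXL = 1.57e12
BAND_WIDTH = 16 * 1024**3   # bytes per second
LATENCY = 1e-6              # seconds
D, R = 512, 4               # model and LoRA dimensions


def gpu_times(b1, b2):
    """Return (T_GPU_S1, T_GPU_S2) in seconds."""
    return (2 * b1 * D * R / FLOPS_GPU,
            2 * (b1 + b2) * R * D / FLOPS_GPU)


def cxl_times(b2, b3):
    """Return (T_CXL_S1, T_CXL_S2) in seconds."""
    t_s1 = (LATENCY + 4 * (b2 + b3) * D / BAND_WIDTH
            + 2 * (b2 + b3) * D * R / FLOPS_CXL + 4 * b2 * R / BAND_WIDTH)
    t_s2 = (2 * b3 * R * D / FLOPS_CXL + 4 * b3 * D / BAND_WIDTH
            - 4 * b2 * R / BAND_WIDTH)
    return t_s1, t_s2


def feasible(b1, b2, b3):
    """Check every timing and memory constraint from both stages."""
    g1, g2 = gpu_times(b1, b2)
    c1, c2 = cxl_times(b2, b3)
    memory_ok = (4 * (b1 + b2 + b3) * D <= GPU_MEMORY
                 and 4 * b1 * D + 4 * D * R + 4 * b1 * R <= GPU_MEMORY
                 and 4 * (b1 + b2) * R <= GPU_MEMORY
                 and 4 * (b1 + b2) * R + 4 * R * D + 4 * (b1 + b2) * D <= GPU_MEMORY)
    return memory_ok and c1 <= g1 and c2 <= g2


def throughput(b1, b2, b3):
    """Rows processed per second of total GPU time (0.0 if the GPU is idle)."""
    total = sum(gpu_times(b1, b2))
    return (b1 + b2 + b3) / total if total > 0 else 0.0


def best_on_grid(candidates):
    """Return the feasible (b1, b2, b3) with the highest throughput, or None."""
    feasible_points = [c for c in candidates if feasible(*c)]
    return max(feasible_points, key=lambda c: throughput(*c), default=None)


print(best_on_grid([(8000, 0, 0), (8000, 0, 1)]))
```

Substituting the two GPU times into the objective gives \( \text{Throughput} = \text{FLOPS\_GPU} \times (b1 + b2 + b3) / (2 \times d \times r \times (2 \times b1 + b2)) \), a ratio of two linear expressions in the batch sizes. That is consistent with the note that the problem is not linear, and it is why the sketch evaluates candidates directly instead of passing the objective to a linear solver.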
