GitHub - LIN-Matrix/SimpleCXLMaximizeQuestion

Because this problem is not linear, stage_pre.py cannot be run; it only sketches the approach.

Mathematical Formulation and Explanation

The problem is to maximize throughput by splitting work between the GPU and CXL memory. The formulation below lists the hardware parameters, the time and memory constraints of each stage, and the objective.

Hardware Parameters

GPU Memory: $\text{GPU\_MEMORY} = 16 \times 1024^3 \text{ bytes} \quad (16\ \text{GB})$
FLOPS (GPU): $\text{FLOPS\_GPU} = 15.7 \times 10^{12} \text{ FLOPS} \quad (15.7\ \text{TFLOPS})$
FLOPS (CXL): $\text{FLOPS\_CXL} = 1.57 \times 10^{12} \text{ FLOPS} \quad (1.57\ \text{TFLOPS})$
Bandwidth: $\text{BAND\_WIDTH} = 16 \times 1024^3 \text{ bytes/s} \quad (16\ \text{GB/s})$
Latency: $\text{LATENCY} = 1 \times 10^{-6} \text{ seconds} \quad (1\ \mu\text{s})$

Model and LoRA Dimensions

Model Dimension: $d = 512$
LoRA Dimension: $r = 4$
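To give a feel for the scale of these numbers, the sketch below turns the parameters into a few per-sample quantities. It is only an illustration: the variable names are hypothetical, and reading the factor 4 in the later formulas as 4 bytes per value (float32) and the factor 2 as two FLOPs per multiply-add is an assumption, not something the repository states.

```python
# Hardware parameters and dimensions, copied from the lists above.
GPU_MEMORY = 16 * 1024**3   # bytes
FLOPS_GPU = 15.7e12         # FLOPS
FLOPS_CXL = 1.57e12         # FLOPS
BAND_WIDTH = 16 * 1024**3   # bytes/s
LATENCY = 1e-6              # seconds
D = 512                     # model dimension d
R = 4                       # LoRA dimension r

BYTES_PER_VALUE = 4         # assumption: float32 values

# Size of one sample (one row of length d) and how many fit in GPU memory.
bytes_per_row = BYTES_PER_VALUE * D
max_rows_in_gpu = GPU_MEMORY // bytes_per_row

# Per-sample costs: moving one row over the link, and one d x r product
# (assumption: 2 FLOPs per multiply-add) on each device.
transfer_time_per_row = bytes_per_row / BAND_WIDTH
gpu_time_per_row = 2 * D * R / FLOPS_GPU
cxl_time_per_row = 2 * D * R / FLOPS_CXL

print(f"bytes per row:         {bytes_per_row}")
print(f"rows that fit on GPU:  {max_rows_in_gpu}")
print(f"transfer time per row: {transfer_time_per_row:.3e} s")
print(f"GPU compute per row:   {gpu_time_per_row:.3e} s")
print(f"CXL compute per row:   {cxl_time_per_row:.3e} s")
```

With these values, moving a row across the link takes far longer than computing on it on either device, which is why the transfer terms dominate the CXL times in the constraints that follow.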

Constraints and Time Computations

Stage 1:

GPU Memory Usage: $4 \times (b1 + b2 + b3) \times d \leq \text{GPU\_MEMORY}$, where $b1$, $b2$, $b3$ are the batch sizes (numbers of samples) of the three groups of work; every time and memory term below scales linearly with them.
Stage 1 GPU Time: $T_{\text{GPU\_S1}} = \frac{2 \times b1 \times d \times r}{\text{FLOPS\_GPU}}$
Stage 1 CXL Time: $T_{\text{CXL\_S1}} = \text{LATENCY} + \frac{4 \times (b2 + b3) \times d}{\text{BAND\_WIDTH}} + \frac{2 \times (b2 + b3) \times d \times r}{\text{FLOPS\_CXL}} + \frac{4 \times b2 \times r}{\text{BAND\_WIDTH}}$
Constraints for Stage 1: $T_{\text{CXL\_S1}} \leq T_{\text{GPU\_S1}}$, $4 \times b1 \times d + 4 \times d \times r + 4 \times b1 \times r \leq \text{GPU\_MEMORY}$, and $4 \times (b1 + b2) \times r \leq \text{GPU\_MEMORY}$
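The Stage 1 formulas can be checked numerically with a small sketch like the one below. The function names (`t_gpu_s1`, `t_cxl_s1`, `stage1_feasible`) are hypothetical and not taken from main.py, and b1, b2, b3 are treated as non-negative sample counts, which is an assumption about how the formulas are meant to be read.

```python
# Parameters copied from the Hardware Parameters and dimension lists.
GPU_MEMORY = 16 * 1024**3   # bytes
FLOPS_GPU = 15.7e12
FLOPS_CXL = 1.57e12
BAND_WIDTH = 16 * 1024**3   # bytes/s
LATENCY = 1e-6              # seconds
D, R = 512, 4               # model dimension d, LoRA dimension r


def t_gpu_s1(b1):
    """Stage 1 GPU time: 2 * b1 * d * r / FLOPS_GPU."""
    return 2 * b1 * D * R / FLOPS_GPU


def t_cxl_s1(b2, b3):
    """Stage 1 CXL time: latency, input transfer, compute, result transfer."""
    return (LATENCY
            + 4 * (b2 + b3) * D / BAND_WIDTH
            + 2 * (b2 + b3) * D * R / FLOPS_CXL
            + 4 * b2 * R / BAND_WIDTH)


def stage1_feasible(b1, b2, b3):
    """True when all Stage 1 time and memory constraints hold."""
    return (
        4 * (b1 + b2 + b3) * D <= GPU_MEMORY
        and t_cxl_s1(b2, b3) <= t_gpu_s1(b1)
        and 4 * b1 * D + 4 * D * R + 4 * b1 * R <= GPU_MEMORY
        and 4 * (b1 + b2) * R <= GPU_MEMORY
    )


if __name__ == "__main__":
    for batch in [(1000, 0, 0), (4096, 0, 0), (4096, 8, 0)]:
        print(batch, stage1_feasible(*batch))
```

Because the formula charges LATENCY even when b2 = b3 = 0, the time constraint needs b1 large enough that the GPU time exceeds one microsecond, and each sample sent to CXL raises that threshold further.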

Stage 2:

Stage 2 GPU Time: $T_{\text{GPU\_S2}} = \frac{2 \times (b1 + b2) \times r \times d}{\text{FLOPS\_GPU}}$
Stage 2 CXL Time: $T_{\text{CXL\_S2}} = \frac{2 \times b3 \times r \times d}{\text{FLOPS\_CXL}} + \frac{4 \times b3 \times d}{\text{BAND\_WIDTH}} - \frac{4 \times b2 \times r}{\text{BAND\_WIDTH}}$
Constraints for Stage 2: $T_{\text{CXL\_S2}} \leq T_{\text{GPU\_S2}}$, $4 \times (b1 + b2) \times r + 4 \times r \times d + 4 \times (b1 + b2) \times d \leq \text{GPU\_MEMORY}$, and $4 \times (b1 + b2 + b3) \times d \leq \text{GPU\_MEMORY}$
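A matching sketch for Stage 2 is shown below. As before, the function names are hypothetical rather than taken from main.py, and b1, b2, b3 are assumed to be non-negative sample counts.

```python
# Parameters copied from the Hardware Parameters and dimension lists.
GPU_MEMORY = 16 * 1024**3   # bytes
FLOPS_GPU = 15.7e12
FLOPS_CXL = 1.57e12
BAND_WIDTH = 16 * 1024**3   # bytes/s
D, R = 512, 4               # model dimension d, LoRA dimension r


def t_gpu_s2(b1, b2):
    """Stage 2 GPU time: 2 * (b1 + b2) * r * d / FLOPS_GPU."""
    return 2 * (b1 + b2) * R * D / FLOPS_GPU


def t_cxl_s2(b2, b3):
    """Stage 2 CXL time: compute and transfer for b3, minus the b2 term."""
    return (2 * b3 * R * D / FLOPS_CXL
            + 4 * b3 * D / BAND_WIDTH
            - 4 * b2 * R / BAND_WIDTH)


def stage2_feasible(b1, b2, b3):
    """True when all Stage 2 time and memory constraints hold."""
    return (
        t_cxl_s2(b2, b3) <= t_gpu_s2(b1, b2)
        and 4 * (b1 + b2) * R + 4 * R * D + 4 * (b1 + b2) * D <= GPU_MEMORY
        and 4 * (b1 + b2 + b3) * D <= GPU_MEMORY
    )


if __name__ == "__main__":
    for batch in [(4096, 0, 0), (4096, 0, 1), (1, 0, 100)]:
        print(batch, stage2_feasible(*batch))
```

Note that the subtracted $b2$ term makes $T_{\text{CXL\_S2}}$ negative when $b3 = 0$ and $b2 > 0$; the sketch keeps the formula exactly as written rather than clamping it at zero.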

Objective Function

Maximize throughput: $\text{Throughput} = \frac{b1 + b2 + b3}{T_{\text{GPU\_S1}} + T_{\text{GPU\_S2}}}$
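The objective is a ratio of two expressions in the decision variables, so it is not linear, which fits the note at the top about stage_pre.py. One simple way to explore it without a linear solver is a coarse grid search over candidate batch sizes. The sketch below is a stand-in for illustration, not the method used in main.py; the function names and the grid ranges are arbitrary assumptions, and b1, b2, b3 are treated as non-negative sample counts.

```python
from itertools import product

# Parameters copied from the Hardware Parameters and dimension lists.
GPU_MEMORY = 16 * 1024**3   # bytes
FLOPS_GPU = 15.7e12
FLOPS_CXL = 1.57e12
BAND_WIDTH = 16 * 1024**3   # bytes/s
LATENCY = 1e-6              # seconds
D, R = 512, 4               # model dimension d, LoRA dimension r


def times(b1, b2, b3):
    """Return (T_GPU_S1, T_CXL_S1, T_GPU_S2, T_CXL_S2) from the formulas."""
    t_gpu_s1 = 2 * b1 * D * R / FLOPS_GPU
    t_cxl_s1 = (LATENCY + 4 * (b2 + b3) * D / BAND_WIDTH
                + 2 * (b2 + b3) * D * R / FLOPS_CXL
                + 4 * b2 * R / BAND_WIDTH)
    t_gpu_s2 = 2 * (b1 + b2) * R * D / FLOPS_GPU
    t_cxl_s2 = (2 * b3 * R * D / FLOPS_CXL + 4 * b3 * D / BAND_WIDTH
                - 4 * b2 * R / BAND_WIDTH)
    return t_gpu_s1, t_cxl_s1, t_gpu_s2, t_cxl_s2


def feasible(b1, b2, b3):
    """Check every Stage 1 and Stage 2 constraint."""
    t_gpu_s1, t_cxl_s1, t_gpu_s2, t_cxl_s2 = times(b1, b2, b3)
    return (
        t_cxl_s1 <= t_gpu_s1
        and t_cxl_s2 <= t_gpu_s2
        and 4 * (b1 + b2 + b3) * D <= GPU_MEMORY
        and 4 * b1 * D + 4 * D * R + 4 * b1 * R <= GPU_MEMORY
        and 4 * (b1 + b2) * R <= GPU_MEMORY
        and 4 * (b1 + b2) * R + 4 * R * D + 4 * (b1 + b2) * D <= GPU_MEMORY
    )


def throughput(b1, b2, b3):
    """Samples per second: (b1 + b2 + b3) / (T_GPU_S1 + T_GPU_S2)."""
    t_gpu_s1, _, t_gpu_s2, _ = times(b1, b2, b3)
    total = t_gpu_s1 + t_gpu_s2
    return (b1 + b2 + b3) / total if total > 0 else 0.0


def grid_search(b1_values, b2_values, b3_values):
    """Return (best_throughput, (b1, b2, b3)) over feasible grid points."""
    best = None
    for b1, b2, b3 in product(b1_values, b2_values, b3_values):
        if not feasible(b1, b2, b3):
            continue
        score = throughput(b1, b2, b3)
        if best is None or score > best[0]:
            best = (score, (b1, b2, b3))
    return best


# Arbitrary illustrative grid; a real search would cover a wider range.
best = grid_search(range(4096, 65537, 4096), range(0, 129, 8), range(0, 129, 8))
print(best)
```

A grid search only reports the best point it happens to visit, so the result depends on the chosen ranges and step sizes; it is useful as a sanity check on the formulas rather than as a proof of optimality.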
