
Commit bb399cf

tdophung, pre-commit-ci[bot], and jberchtold-nvidia authored and committed
[JAX] Quickstart documentation (#2310)
* JAX quickstart guide, first commit. Signed-off-by: tdophung <[email protected]>
* Fix syntax errors and remove unnecessary comments in utils. Add some footnotes to the quickstart notebook. Signed-off-by: tdophung <[email protected]>
* Fix Greptile comments on spelling, deepcopy, and vjp function signature compatibility with speedometer. Signed-off-by: tdophung <[email protected]>
* Add copyright notice to utils and fix some more Greptile complaints. Signed-off-by: tdophung <[email protected]>
* Add comments to the alternatives of the layers. Signed-off-by: tdophung <[email protected]>
* Remove weight sharing between different iterations of the TransformerLayer. Signed-off-by: tdophung <[email protected]>
  [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci. Signed-off-by: tdophung <[email protected]>
* Add an enum for attention implementations. Fix an inconsistency between the fused and unfused TE implementations to achieve the same performance (remove the extra dropout layer in the fused layers). Also some minor wording changes. Signed-off-by: tdophung <[email protected]>
  [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci. Signed-off-by: tdophung <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix a bug where TransformerLayer's expected input shape was [sequence, batch, ...] instead of [batch, sequence, ...]. Signed-off-by: tdophung <[email protected]>
* Restructure the notebook to introduce FP8 before fusion, so that fusion can take effect once quantization exists, as suggested. Also bring TransformerLayer performance closer to the fused layer by setting hidden_dropout=0. Signed-off-by: tdophung <[email protected]>
* Add an option to choose between attention implementations in the call to BasicTETransformerLayer and demonstrate the runtime difference between Flax's and TE's attention implementations. Signed-off-by: tdophung <[email protected]>
* Fix missing attention_implementation in FuseTETransformerLayer. Signed-off-by: tdophung <[email protected]>
* Remove AttentionWrapper and the custom-built DPA, use Flax's and TE's implementations only, and remove the last mention of PyTorch. Signed-off-by: tdophung <[email protected]>
* More changes to the markdown cells to remove PyTorch. Signed-off-by: tdophung <[email protected]>
* Cosmetic fixes. Signed-off-by: tdophung <[email protected]>
* Change the names of all implementations. Signed-off-by: tdophung <[email protected]>
* Change fp8_autocast to autocast, use a causal mask, and make some wording changes. Signed-off-by: tdophung <[email protected]>

---------

Signed-off-by: tdophung <[email protected]>
Co-authored-by: tdophung <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: jberchtold-nvidia <[email protected]>
1 parent 0870ff0 commit bb399cf
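The last few bullets touch the core pattern the quickstart teaches: building a Transformer Engine Flax TransformerLayer with hidden_dropout=0 and running it inside an FP8 autocast context (the commit renames the notebook's usage from fp8_autocast to autocast). Below is a minimal, hedged sketch of that pattern using TE's public JAX API; it is not the notebook's code, and the toy sizes, the DelayedScaling recipe, and the exact keyword arguments are assumptions. Helpers mentioned in the commit message such as BasicTETransformerLayer and FuseTETransformerLayer are notebook-local wrappers and do not appear here.

```python
# Hedged sketch: a TE Flax TransformerLayer run under an FP8 autocast context.
# Sizes, the recipe choice, and keyword arguments are illustrative assumptions.
import jax
import jax.numpy as jnp
import transformer_engine.jax as te
from transformer_engine.common import recipe
from transformer_engine.jax.flax import TransformerLayer

batch, seqlen, hidden = 8, 128, 1024                  # assumed toy sizes
x = jnp.zeros((batch, seqlen, hidden), jnp.bfloat16)  # [batch, sequence, hidden]

layer = TransformerLayer(
    hidden_size=hidden,
    mlp_hidden_size=4 * hidden,
    num_attention_heads=16,
    hidden_dropout=0.0,      # the commit sets hidden_dropout=0 to match the fused layer
    attention_dropout=0.0,
)

# DelayedScaling is one available FP8 recipe; the notebook's choice may differ.
fp8_recipe = recipe.DelayedScaling()

# Newer TE releases expose te.autocast (the rename this commit adopts);
# older releases use te.fp8_autocast with the same arguments.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    rngs = {"params": jax.random.PRNGKey(0), "dropout": jax.random.PRNGKey(1)}
    variables = layer.init(rngs, x)
    # The notebook runs its timed forward/backward passes inside this same context.
    y = layer.apply(variables, x, rngs={"dropout": jax.random.PRNGKey(2)})
```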

File tree

2 files changed: +869 -0 lines changed


0 commit comments
