GAOXinyu | 0cb595ad94 | [bugfix] handle_x not define when using checkpoint_lvl = 2 (#502) | 1 year ago
Tri Dao | f1a73d0740 | Run isort and black on python files | 1 year ago
Xuechen Li | bb4cded17b | support when num_heads is not divisible by world_size; resolves #459 (#461) | 1 year ago
Tri Dao | cb0daccc41 | [FusedDense] Allow Row/ColumnParallelLinear to have uneven split | 1 year ago
Tri Dao | bcfa7c9751 | [FusedDense] Run black on fused_dense.py | 1 year ago
Tri Dao | b630aef53f | Implement GatedMlp | 1 year ago
Tri Dao | 6f6e9a9aaf | [FusedDense] Enable sqrelu activation in FusedMLP | 1 year ago
Tri Dao | dc08ea1c33 | Support H100 for other CUDA extensions | 1 year ago
Tri Dao | 88173a1aaf | [FusedDense] Support relu, rename FusedDenseGeluDense -> FusedMLP | 1 year ago
Tri Dao | 93383bd55b | [TP] Implement TensorParallel without sequence parallel | 1 year ago
Tri Dao | 1ec09ebd90 | [FusedDense] Limit matrix dims to 2M (instead of 64k) | 1 year ago
Tri Dao | 65b4064b2a | [FusedDense] Kick off input all_gather before weight dtype conversion | 1 year ago
Tri Dao | a8cfe51551 | Implement Tensor Parallel for transformer Block | 2 years ago
Tri Dao | 226a1b721d | Implement TensorParallel for FusedDense and FusedDenseGeluDense | 2 years ago
Tri Dao | e68ebbe89a | Simplify FusedDense | 2 years ago
Tri Dao | d4b320b31f | Add MLP, MHA, Block, Embedding modules | 2 years ago
Tri Dao | fa6d1ce44f | Add fused_dense and dropout_add_layernorm CUDA extensions | 2 years ago