Tri Dao | 320fb59487 | Update citation | 6 months ago
Tri Dao | 0a146185d6 | [Gen] Remove minor dead code | 1 year ago
Tri Dao | e0fbaa7016 | [Gen] Simplify decode_speculative | 1 year ago
Tri Dao | e6a8026489 | [Gen] Rename max_sequence_len->max_seqlen, sequence_len_offset->seqlen_offset | 1 year ago
Tri Dao | dfe29f5e2b | [Gen] Don't use ft_attention, use flash_attn_with_kvcache instead | 1 year ago
Tri Dao | a86442f0f3 | [Gen] Use flash_attn_with_kvcache in generation | 1 year ago
Tri Dao | 913922cac5 | [Gen] Refactor decoding function | 1 year ago
Tri Dao | 37c6e05406 | Implement flash_attn_with_kvcache | 1 year ago
Tri Dao | 8a326bbc9e | [Gen] Minor fix to modify logits for top_p | 1 year ago
Tri Dao | 9f42cb6e7a | [Gen] Clone logits before returning when cg=True | 1 year ago
Tri Dao | f8aea6ead0 | [GPT] Generalize last_token_only arg to num_last_tokens | 1 year ago
Tri Dao | 7a3bd55f1a | [Gen] Fix decode function not using top_p during iterative decoding | 1 year ago
Tri Dao | 847abe653c | [Gen] Refactor decode function a bit | 1 year ago
Tri Dao | f1a73d0740 | Run isort and black on python files | 1 year ago
Tri Dao | fcab93b43a | [Gen] Minor tweak to allocate_inference_cache | 1 year ago
Tri Dao | ba2fe7f378 | [Gen] Move allocate_inference_cache to within the model | 1 year ago
Tri Dao | 3da42d24b1 | [GPT] Add option to only return the logit for the last token | 1 year ago
Tri Dao | 311d6606bf | [Gen] Fix FT kernel smem size, CG when batch size changed | 1 year ago
Tri Dao | 605655bc66 | [Gen] Fix FT kernel when using CG | 1 year ago
Tri Dao | 1c9ef9b399 | [Gen] Measure prompt processing + decoding time, not just decoding | 1 year ago
Tri Dao | f5d0fbd468 | [FT] Fix FT's single query attention for bf16 hdim128 rotary | 1 year ago
Tri Dao | 4d87e4d875 | Implement GPT-J | 1 year ago
Tri Dao | 78b7a1dc18 | [OPT] Load fp16 weights on CPU before moving to GPU | 1 year ago
Tri Dao | f68d41ec77 | [Gen] Add OPT to generation test | 1 year ago
Tri Dao | 7c2191542a | [Gen] Make generation work with Tensor Parallel | 1 year ago
Tri Dao | f95c2fc108 | [Gen] Remove commented code | 1 year ago
Tri Dao | b48599002a | [Gen] Add timing option | 1 year ago
Tri Dao | e02fd588aa | [Gen] Implement top-k and top-p sampling | 1 year ago
Tri Dao | a668890fcd | [Gen] Add option to run generation with FT attention kernel | 1 year ago
Tri Dao | a6ec1782dc | Bump to v0.2.6 | 1 year ago