| File | Last commit | Message | Last updated |
|---|---|---|---|
| quantization | c41462cfcd | feat: exllamav2 quantization (#305) | 10 months ago |
| triton_kernel | 16615784b3 | fix: prefix cache for turing gpus | 10 months ago |
| __init__.py | 07aa2a492f | upstream: add option to specify tokenizer | 1 year ago |
| activation.py | e31c6f0b45 | feat: refactor modeling logic and support more models (#274) | 10 months ago |
| attention.py | 9810daa699 | feat: INT8 KV Cache (#298) | 10 months ago |
| layernorm.py | e31c6f0b45 | feat: refactor modeling logic and support more models (#274) | 10 months ago |
| linear.py | c2d77b1822 | chore: logging refactor (#302) | 10 months ago |
| rejection.py | 95bdd35ec9 | feat: rejection sampler (#197) | 1 year ago |
| rotary_embedding.py | ea0f57b233 | feat: allow further support for non-cuda devices (#247) | 11 months ago |
| sampler.py | 9fa99215f8 | feat: add cubic sampling (#280) | 10 months ago |
| vocab_parallel_embedding.py | 968bde81bf | fix: tensor parallel with GPTQ and AWQ quants (#307) | 10 months ago |
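The `rejection.py` entry refers to a rejection sampler (#197), the accept/resample step used in speculative decoding. As illustrative background only, here is a minimal sketch of the standard rule, not the repository's actual implementation: accept the draft token `x` with probability `min(1, p[x]/q[x])`, where `p` is the target model's distribution and `q` is the draft model's; on rejection, resample from the renormalized residual `max(0, p - q)`.

```python
import numpy as np

def rejection_sample(draft_token: int, p: np.ndarray, q: np.ndarray,
                     rng: np.random.Generator) -> int:
    """Standard speculative-decoding rejection step (illustrative sketch).

    p: target-model token probabilities, q: draft-model probabilities.
    """
    x = draft_token
    # Accept the draft token with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # On rejection, resample from the residual distribution
    # max(0, p - q), renormalized. Tokens drawn this way make the
    # overall output exactly follow the target distribution p.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual))
```

Drawing draft tokens from `q` and passing them through this step yields samples distributed exactly according to `p`, which is the property that lets speculative decoding keep the target model's output distribution unchanged.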