david/aphrodite-engine: PygmalionAI's large-scale inference engine (pygmalion.chat). It is designed to serve as the inference endpoint for the PygmalionAI website and to serve the Pygmalion models to a large number of users at high speed (thanks to vLLM's Paged Attention).


Quantizer Utilities

quantize.py: NVIDIA Quantization utilities using AMMO, ported from TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/quantization/quantize.py

Prerequisite

Install AMMO (AlgorithMic Model Optimization), nvidia-ammo 0.7.1 or later:

pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
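To confirm that the installed package satisfies the version requirement above, a small check along these lines may help. This is a minimal sketch using only the standard library; the helper names (`parse_version`, `meets_minimum`, `installed_ammo_version`) are illustrative and not part of AMMO or quantize.py, and the version parsing is deliberately simple (numeric dotted components only).

```python
from importlib import metadata

MINIMUM_AMMO = "0.7.1"  # minimum version stated in this README


def parse_version(text):
    """Turn '0.7.1' into (0, 7, 1). Any non-numeric suffix in a component is ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def meets_minimum(installed, minimum=MINIMUM_AMMO):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_ammo_version(dist_name="nvidia-ammo"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_ammo_version()
    if found is None:
        print("nvidia-ammo is not installed")
    elif meets_minimum(found):
        print(f"nvidia-ammo {found} is new enough")
    else:
        print(f"nvidia-ammo {found} is older than {MINIMUM_AMMO}")
```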

AMMO Download (code and docs)

https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.5.0.tar.gz

https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.7.1.tar.gz

Usage

For FP8, run on an H100 system for speed. The number of GPUs required depends on the model size.

Example: quantize the Llama 2 7B model from Hugging Face to FP8 with an FP8 KV cache:

python quantize.py --model_dir ./ll2-7b --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir ./ll2_7b_fp8 --calib_size 512 --tp_size 1
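When quantizing several models, it can be convenient to assemble this command programmatically. The sketch below builds the same argument list as the example above; `build_quantize_command` is a hypothetical helper, not part of quantize.py, and it uses only the flags shown in this README. The defaults simply mirror the example values.

```python
import shlex


def build_quantize_command(model_dir, output_dir, dtype="float16", qformat="fp8",
                           kv_cache_dtype="fp8", calib_size=512, tp_size=1,
                           python="python"):
    """Return the quantize.py invocation as an argv list (flags as shown in the README)."""
    return [
        python, "quantize.py",
        "--model_dir", model_dir,
        "--dtype", dtype,
        "--qformat", qformat,
        "--kv_cache_dtype", kv_cache_dtype,
        "--output_dir", output_dir,
        "--calib_size", str(calib_size),
        "--tp_size", str(tp_size),
    ]


if __name__ == "__main__":
    cmd = build_quantize_command("./ll2-7b", "./ll2_7b_fp8")
    # Print the command for inspection; pass `cmd` to subprocess.run(cmd, check=True)
    # from the directory containing quantize.py to actually execute it.
    print(shlex.join(cmd))
```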

Outputs: the model structure, the quantized model, and its parameters (with scaling factors) are written as JSON and Safetensors files. The npz file is generated for reference only.

# ll ./ll2_7b_fp8/
total 19998244
drwxr-xr-x 2 root root        4096 Feb  7 01:08 ./
drwxrwxr-x 8 1060 1061        4096 Feb  7 01:08 ../
-rw-r--r-- 1 root root      176411 Feb  7 01:08 llama_tp1.json
-rw-r--r-- 1 root root 13477087480 Feb  7 01:09 llama_tp1_rank0.npz
-rw-r--r-- 1 root root  7000893272 Feb  7 01:08 rank0.safetensors
#
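To inspect what was written without loading the weights, the header of the Safetensors file can be read directly. The Safetensors format begins with an 8-byte little-endian length followed by a JSON header describing each tensor. The sketch below relies only on that layout and the standard library; the function names are illustrative, and the tensor names that quantize.py emits are not documented here, so the example just prints whatever it finds.

```python
import json
import struct


def read_safetensors_header(path):
    """Read only the JSON header of a .safetensors file (no tensor data is loaded)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte little-endian length
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize_tensors(header):
    """Return (name, dtype, shape, size_in_bytes) for each tensor in the header."""
    rows = []
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry, not a tensor
            continue
        start, end = info["data_offsets"]
        rows.append((name, info["dtype"], info["shape"], end - start))
    return rows


if __name__ == "__main__":
    # Path taken from the example listing above; adjust to your --output_dir.
    header = read_safetensors_header("./ll2_7b_fp8/rank0.safetensors")
    for name, dtype, shape, nbytes in summarize_tensors(header):
        print(f"{name}: {dtype} {shape} ({nbytes} bytes)")
```

Reading only the header keeps the check fast even for multi-gigabyte files such as the one in the listing.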
