First, you will need a GPTQ model that satisfies the following conditions (a quick way to check an existing checkpoint is sketched after the list):

- `bits=4`
- `group_size=-1` or `group_size=128`
- `desc_act=False`
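
If you already have a GPTQ checkpoint and want to verify it quickly, the quantization parameters are recorded in the `quantize_config.json` that AutoGPTQ writes alongside the weights. Here is a minimal sketch, assuming that standard layout; the checkpoint path is a placeholder:

```python
# Minimal compatibility check; assumes the standard quantize_config.json
# written by AutoGPTQ. The checkpoint path below is a placeholder.
import json
from pathlib import Path

config = json.loads(Path("/path/to/gptq/model/quantize_config.json").read_text())

assert config["bits"] == 4, f"expected bits=4, got {config['bits']}"
assert config["group_size"] in (-1, 128), f"unsupported group_size {config['group_size']}"
assert config["desc_act"] is False, "desc_act must be False"
print("Checkpoint meets the Marlin conversion requirements.")
```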
If your model does not meet the requirements above, then run the following script to convert an FP16 model to the appropriate GPTQ format:
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "mistralai/Mistral-7B-Instruct-v0.2"
quantized_model_dir = "/path/to/output"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# Calibration data for the GPTQ algorithm; a single short example keeps this
# minimal, but more representative samples give better quantization quality.
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

# These settings match the Marlin requirements listed above.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)

# Load the FP16 model, quantize it, and save the GPTQ checkpoint.
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir, use_safetensors=True)

# Optional sanity check: reload the quantized model on the GPU.
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
```
Replace `pretrained_model_dir` and `quantized_model_dir` with the appropriate paths to your base model and output directory. Save the script, and run it like this:
```sh
CUDA_VISIBLE_DEVICES=0 python quantize.py
```
You may need to install the AutoGPTQ library first, via `pip install auto-gptq`.
Once you have your compatible GPTQ model, follow the steps below to convert it to Marlin format.
You will need to clone and install the Marlin repository:
```sh
git clone https://github.com/IST-DASLab/marlin && cd marlin
pip install -e .
```
Then simply run the following in this directory:
```sh
python convert.py --model-id /path/to/gptq/model --save-path /path/to/output/marlin
```
That should be all you need to do. Launch Aphrodite with `--model` pointed at the Marlin checkpoint, and you're done. Happy prompting.
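
For reference, a launch might look like the sketch below. The entry point shown is Aphrodite's OpenAI-compatible API server module as of this writing; treat both it and the path as assumptions to adapt to your setup:

```sh
# Assumed entry point and placeholder path; adjust for your installation.
python -m aphrodite.endpoints.openai.api_server \
    --model /path/to/output/marlin
```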