PygmalionAI's large-scale inference engine
pygmalion.chat

It is designed to serve as the inference endpoint for the PygmalionAI website, and to allow serving the Pygmalion models to a large number of users with blazing fast speeds (thanks to vLLM's Paged Attention).

Breathing Life into Language

Aphrodite is the official backend engine for PygmalionAI. It is designed to serve as the inference endpoint for the PygmalionAI website, and to allow serving the Pygmalion models to a large number of users with blazing fast speeds (thanks to FasterTransformer and vLLM).

Aphrodite builds upon and integrates the exceptional work from various projects.

Features

Continuous Batching
Efficient K/V management with PagedAttention
Optimized CUDA kernels for improved inference
Quantization support via AWQ and GPTQ
Distributed inference
A variety of sampling methods (top-a, tail-free sampling, repetition penalty)

Quickstart

pip install aphrodite-engine

python -m aphrodite.endpoints.api_server_kobold --model PygmalionAI/pygmalion-2-7b

This will create a KoboldAI-compatible API server that can be accessed on port 2242 of localhost. You can plug the API into any UI that supports Kobold, such as SillyTavern.
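As a rough illustration of talking to this server from your own code, the sketch below builds a generation request using only the Python standard library. The `/api/v1/generate` route and the `prompt` and `max_length` fields follow the usual KoboldAI API convention and are assumptions here, as are the helper names; check the endpoint's own documentation before relying on them.

```python
import json
import urllib.request

# Assumption: the Quickstart server is listening on localhost:2242.
BASE_URL = "http://localhost:2242"


def build_generate_request(prompt, max_length=64, base_url=BASE_URL):
    """Build (but do not send) a KoboldAI-style generation request.

    The route and field names are assumed from the common KoboldAI API
    convention, not taken from Aphrodite's documentation.
    """
    payload = {"prompt": prompt, "max_length": max_length}
    return urllib.request.Request(
        url=f"{base_url}/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the decoded JSON response."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `generate("Once upon a time")` would return whatever JSON the endpoint produces. Building the request separately from sending it makes the payload easy to inspect without a live server.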

Performance

Speeds vary with different GPUs, model sizes, quantization schemes, batch sizes, etc. Here are some baseline benchmarks conducted by sending requests of varying lengths to the provided API server.

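| Model | Quantization | GPU | Request Rate | Throughput (req/s) | Avg Latency (s) |
|-------|--------------|----------|--------------|--------------------|-----------------|
| 7B | None | RTX 3090 | 19 | 2.66 | 18.38 |
| 7B | AWQ | RTX 3090 | 12 | 3.08 | 32.47 |
| 7B | GPTQ | RTX 3090 | 12 | 2.01 | 49.78 |
| 13B | AWQ | RTX 3090 | 5 | 1.77 | 26.77 |
| 13B | GPTQ | RTX 3090 | 5 | 1.10 | 39.80 |
| 20B | AWQ | RTX 3090 | 3 | 0.94 | 39.07 |
| 20B | GPTQ | RTX 3090 | 3 | 0.58 | 75.54 |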

Benchmarks with other GPUs will be added soon.

Requirements

Operating System: Linux (or WSL for Windows)
Python: at least 3.8
CUDA 11.8 (recommended, supports 11.0-11.8)

Supported GPUs

Any NVIDIA GPU with a compute capability of 6.0 or higher. Refer to this page for a full list of CUDA GPUs:

https://developer.nvidia.com/cuda-gpus

Or, you can manually find out your GPU's Compute Capability by opening a Python interpreter and running:

>>> import torch    # if you don't have `torch` installed, run `pip install torch` first
>>> print(torch.cuda.get_device_capability())

This should print something like (7, 5), which indicates a compute capability (CC) of 7.5.

If you do not meet the minimum CC, you will not be able to run Aphrodite. At the moment, a compute capability of 7.5 or higher is required for the AWQ quantization scheme; you can use GPTQ if your GPU does not support AWQ.
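The thresholds above can be turned into a small check. This is a minimal sketch, not part of Aphrodite: `supported_schemes` is a hypothetical helper, and it assumes that unquantized and GPTQ models work on any GPU that meets the 6.0 minimum.

```python
# Thresholds taken from the text above.
MIN_CAPABILITY = (6, 0)  # minimum needed to run Aphrodite at all
AWQ_CAPABILITY = (7, 5)  # minimum needed for AWQ quantization


def supported_schemes(capability):
    """Return the options available for a (major, minor) compute capability.

    Hypothetical helper: assumes unquantized and GPTQ models run on any
    GPU that meets the minimum capability.
    """
    if capability < MIN_CAPABILITY:
        return []
    schemes = ["none", "gptq"]
    if capability >= AWQ_CAPABILITY:
        schemes.append("awq")
    return schemes
```

If `torch` is installed, passing `torch.cuda.get_device_capability()` to `supported_schemes` checks the current GPU. Tuples compare element by element, so `(7, 5) >= (7, 5)` and `(8, 0) >= (7, 5)` both hold.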

Setting up the environment

If you run into any problems, please refer to the Common Issues section, or open an Issue if you can't find the answer there.

Aphrodite requires a slightly specialized environment to run, as the latest CUDA versions are currently not supported. You can use Conda to easily configure your environment. If you're on Windows, make sure you have WSL2 installed. You can do this by opening Windows PowerShell and running:

wsl --install

Aphrodite provides an easy-to-use install script, which sets up a suitable environment for installing the pip package or building from source.

The requirements are git, wget, bzip2, and tar, all of which are available on the majority of Linux distributions, including WSL.

git clone https://github.com/PygmalionAI/aphrodite-engine && cd aphrodite-engine

Then you can simply run:

./runtime.sh python -m aphrodite.endpoints.api_server_kobold --help

The ./runtime.sh prefix needs to be prepended to every command you run that involves Aphrodite, as it launches your commands within the created environment. If you prefer not to do that, you can run ./runtime.sh by itself to enter the environment and execute commands as normal.

To update the engine, run git pull and then ./update-runtime.sh to update the environment.

Usage

Aphrodite Engine provides 3 API endpoint types:

KoboldAI

python -m aphrodite.endpoints.api_server_kobold --model PygmalionAI/pygmalion-2-7b

Text Generation WebUI

python -m aphrodite.endpoints.api_server_ooba --model PygmalionAI/pygmalion-2-7b

OpenAI

python -m aphrodite.endpoints.openai.api_server --model PygmalionAI/pygmalion-2-7b

Please refer to each endpoint's documentation on how to query them. Generally, they all work with SillyTavern.

To run a quantized model, use the --quantization flag with either gptq or awq and the --dtype float16 flag. Make sure your model is in AWQ/GPTQ format and not GGUF. Run with only the --help flag for a full list of arguments.
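For scripted launches, the flag combination described above can be assembled programmatically. The following is a sketch only: `build_launch_command` is a hypothetical helper, and it uses just the `--model`, `--quantization`, and `--dtype` flags mentioned in this README.

```python
import sys


def build_launch_command(model, quantization=None,
                         endpoint="aphrodite.endpoints.api_server_kobold"):
    """Build the argument list for launching an Aphrodite API server.

    Hypothetical helper: only flags named in this README are used.
    """
    if quantization not in (None, "gptq", "awq"):
        raise ValueError("quantization must be 'gptq', 'awq', or None")
    command = [sys.executable, "-m", endpoint, "--model", model]
    if quantization is not None:
        # Quantized models are run with float16, as described above.
        command += ["--quantization", quantization, "--dtype", "float16"]
    return command
```

The resulting list can be passed to `subprocess.run`. Rejecting unknown quantization values up front gives a clearer error than a failed server start, for example when a GGUF model is passed by mistake.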

For the full list of sampling parameters, please refer to SamplingParams:

https://github.com/PygmalionAI/aphrodite-engine/blob/ab1ac578bafa922a6c7e323986bd320615311dad/aphrodite/common/sampling_params.py#L24-L88

Common Issues

`The detected CUDA version (12.1) mismatches the version that was used to compile PyTorch (11.8). Please make sure to use the same CUDA versions.`

This is normally due to your environment referring to the global installation of CUDA and not the one in your current env. Run which nvcc and note down the output. For example, if your output is /home/anon/miniconda3/envs/aphrodite/bin/nvcc, run this command:

export CUDA_HOME=/home/anon/miniconda3/envs/aphrodite

Then run the installation command again.
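The mapping from the `which nvcc` output to `CUDA_HOME` is simply "strip the trailing `bin/nvcc`". A minimal sketch of that step, with `cuda_home_from_nvcc` as a hypothetical helper name:

```python
from pathlib import PurePosixPath


def cuda_home_from_nvcc(nvcc_path):
    """Derive CUDA_HOME from the path printed by `which nvcc`."""
    path = PurePosixPath(nvcc_path.strip())
    if path.name != "nvcc" or path.parent.name != "bin":
        raise ValueError(f"unexpected nvcc location: {nvcc_path!r}")
    # <CUDA_HOME>/bin/nvcc -> <CUDA_HOME>
    return str(path.parent.parent)
```

For the example path above, this returns `/home/anon/miniconda3/envs/aphrodite`, which is the value to export as `CUDA_HOME`.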

Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error.

You've run out of swap space! Please pass the --swap-space flag followed by the amount of swap (in GB) to allocate. Make sure you leave enough RAM for the model loading process.

ncclInternalError: Internal check failed.
Last error:
No NVML device handle. Skipping nvlink detection.

This happens if you're doing tensor parallelism (multi-GPU) on NVLinked NVIDIA GPUs and they don't support P2P. Please run this command before running the server:

export NCCL_P2P_DISABLE=1

Alternatively, you can prepend NCCL_P2P_DISABLE=1 to your server launch command.
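If you start the server from a Python script rather than a shell, the same variable can be set on the child process only. This is a sketch under that assumption; `env_with_p2p_disabled` is a hypothetical helper.

```python
import os


def env_with_p2p_disabled(base_env=None):
    """Return a copy of the environment with NCCL_P2P_DISABLE=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_DISABLE"] = "1"
    return env
```

Passing the result as `env=` to `subprocess.run` affects only the launched server and leaves the parent process's environment unchanged.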

Notes

By design, Aphrodite takes up 90% of your GPU's VRAM. If you're not serving an LLM at scale, you may want to limit the amount of memory it takes up. You can do this by launching the server with the --gpu-memory-utilization flag, for example --gpu-memory-utilization 0.6 (0.6 means 60%).
You can view the full list of commands by running python -m aphrodite.endpoints.api_server_ooba --help.
Context length extension via RoPE scaling is supported for Llama models. Edit the model's config.json to include the following values:
```
"rope_scaling": { "factor": 2.0, "type": "dynamic"},
```
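The config.json edit above can also be scripted. The sketch below is one way to do it, assuming the model's config.json is a plain JSON object; both function names are hypothetical.

```python
import json


def with_dynamic_rope_scaling(config, factor=2.0):
    """Return a copy of a config dict with dynamic RoPE scaling enabled."""
    updated = dict(config)
    updated["rope_scaling"] = {"factor": factor, "type": "dynamic"}
    return updated


def enable_rope_scaling_in_file(config_path, factor=2.0):
    """Rewrite a model's config.json in place with the rope_scaling entry."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    config = with_dynamic_rope_scaling(config, factor)
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return config
```

Keeping the dictionary update separate from the file I/O makes it easy to preview the change before overwriting the file.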

Acknowledgements

Aphrodite Engine would not have been possible without the phenomenal work of other open-source projects. Credits go to:

Contributing

Everyone is welcome to contribute. You can support the project by opening Pull Requests for new features, fixes, or general UX improvements.
