PygmalionAI's large-scale inference engine
pygmalion.chat

It is designed to serve as the inference endpoint for the PygmalionAI website, and to allow serving the Pygmalion models to a large number of users with blazing fast speeds (thanks to vLLM's Paged Attention).


Breathing Life into Language

aphrodite

Aphrodite is the official backend engine for PygmalionAI, serving Pygmalion models to a large number of users at high speed thanks to vLLM's PagedAttention.

Aphrodite builds upon and integrates the exceptional work of several open-source projects.

Please note that Aphrodite is currently in active development and not yet fully functional.

Features

  • Continuous Batching
  • Efficient K/V management with PagedAttention
  • Optimized CUDA kernels for improved inference
  • Distributed inference
  • Multiple decoding algorithms (e.g. parallel sampling, beam search)

Requirements

  • Operating System: Linux (or WSL for Windows)
  • Python: at least 3.8
  • CUDA: 11.7 recommended (versions 11.0-11.8 supported)
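As a quick sanity check for the Python requirement (the CUDA toolkit version is easiest to verify with nvcc --version), you can compare the interpreter's version tuple against the minimum:

```python
import sys

# Aphrodite requires Python 3.8 or newer; version_info tuples
# compare element-wise, so (3, 9) >= (3, 8) is True.
major, minor = sys.version_info[:2]
print(f"Python {major}.{minor}: {'OK' if (major, minor) >= (3, 8) else 'too old'}")
```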

Supported GPUs

Basically, anything with a compute capability of 7.0 or higher. Here's a full list of supported consumer GPUs:

GPU       CC    GPU        CC    GPU       CC
2060      7.5   2070       7.5   2080      7.5
2080 Ti   7.5   Titan RTX  7.5   1650 Ti   7.5
3060      8.6   3060 Ti    8.6   3070      8.6
3070 Ti   8.6   3080       8.6   3080 Ti   8.6
3090      8.6   3090 Ti    8.6   4070 Ti   8.9
4080      8.9   4090       8.9

* CC: Compute Capability

Most datacenter/workstation GPUs are supported, so long as they have a compute capability of 7.0 or higher.

If you're unsure, you can find out by opening a Python interpreter and running:

>>> import torch
>>> print(torch.cuda.get_device_capability())

This should print something like (7, 5), which indicates a CC of 7.5.

If your GPU is not listed here or you do not meet the minimum CC, you will not be able to run Aphrodite.
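As a sketch of that check, a small helper (the name meets_minimum_cc is illustrative, not part of Aphrodite) can compare the tuple returned by torch.cuda.get_device_capability() against the 7.0 minimum:

```python
def meets_minimum_cc(cc, minimum=(7, 0)):
    """Return True if a (major, minor) compute capability tuple meets the minimum."""
    return tuple(cc) >= minimum

# (7, 5) is e.g. an RTX 2080; (6, 1) is e.g. a GTX 1080, which is too old.
print(meets_minimum_cc((7, 5)))  # True
print(meets_minimum_cc((6, 1)))  # False
```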

Setting up the environment

Aphrodite requires a slightly specialized environment, as the latest CUDA and GCC versions are not supported. You can use Conda to configure the environment easily.

Install miniconda3

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash ./Miniconda3*

You can follow the on-screen instructions, though you may want to set the installation directory to somewhere with a large empty storage space.

You can either source your shell script (. ~/.bashrc or . ~/.zshrc) or restart your terminal instance to begin using conda.

Configuring the env for Aphrodite-engine

$ conda config --set auto_activate_base false
$ conda create -n aphrodite python=3.9
$ conda activate aphrodite
$ conda install -c conda-forge cudatoolkit-dev gcc=11.3 gxx=11.3

The last command will take a long time, depending on your internet speed.

Whenever you want to launch Aphrodite later on, make sure you run conda activate aphrodite first. The other steps outlined above are one-time only.

Installation

  • Clone the repository:

    git clone https://github.com/PygmalionAI/aphrodite-engine && cd aphrodite-engine
    
  • Install the package:

    pip install -e .
    

    If you receive any import errors here, try running pip install -r requirements.txt first.

If you receive an error for CUDA version mismatch, run which nvcc and note down the output. For example, if your output is /home/anon/miniconda3/envs/aphrodite/bin/nvcc, run this command:

$ export CUDA_HOME=/home/anon/miniconda3/envs/aphrodite

Then run the installation command again.
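Assuming nvcc lives at <CUDA_HOME>/bin/nvcc, the same export can also be derived programmatically; this is a sketch (the helper name cuda_home_from_nvcc is illustrative, not part of Aphrodite):

```python
import os
import shutil

def cuda_home_from_nvcc(nvcc_path):
    """Strip the trailing bin/nvcc components to recover the CUDA root."""
    return os.path.dirname(os.path.dirname(nvcc_path))

# Equivalent to the manual `export CUDA_HOME=...` step above.
nvcc = shutil.which("nvcc")
if nvcc:
    os.environ["CUDA_HOME"] = cuda_home_from_nvcc(nvcc)
    print(os.environ["CUDA_HOME"])
```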

Example usage

Inference with LLM

  from aphrodite import LLM, SamplingParams

  prompts = [
    "What is a man? A",
    "The sun is a wondrous body, like a magnificent",
    "All flesh is grass and all the comeliness thereof",
  ]
  sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

  llm = LLM(model="EleutherAI/pythia-70m")        # you can also use a local directory path
  outputs = llm.generate(prompts, sampling_params)
  for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Continuous inference with API

$ python -m aphrodite.endpoints.openai.api_server --model EleutherAI/pythia-70m
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "EleutherAI/pythia-70m",
        "prompt": "What is a man? A",
        "max_tokens": 512,
        "n": 2048,
        "temperature": 0.8
    }'

For the full list of request parameters, see OpenAI Completions API reference.
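The same request can be issued from Python with only the standard library; this sketch assumes the API server from the step above is running on localhost:8000:

```python
import json
import urllib.error
import urllib.request

# The same completion request as the curl example above.
payload = {
    "model": "EleutherAI/pythia-70m",
    "prompt": "What is a man? A",
    "max_tokens": 512,
    "n": 2048,
    "temperature": 0.8,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req) as resp:
        for choice in json.load(resp)["choices"]:
            print(choice["text"])
except urllib.error.URLError:
    print("Is the API server running on localhost:8000?")
```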

Contributing

We accept PRs! There are likely a few typos or other errors we've failed to catch, so please let us know via an issue or a pull request.