Aphrodite allows loading LoRA adapters on top of supported LLMs.
We use the method proposed in S-LoRA: Serving Thousands of Concurrent LoRA Adapters to serve LoRA adapters efficiently to thousands of users.
We can load LoRA adapters directly from Hugging Face or from disk. They're served on a per-request basis.
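The examples below refer to the adapter by a local path, sql_lora_path. If the adapter lives on Hugging Face, one way to obtain that path is huggingface_hub's snapshot_download, shown here with the same test adapter used throughout this page:

from huggingface_hub import snapshot_download

# Download the LoRA adapter and keep the local path for the LoRARequest below
sql_lora_path = snapshot_download(repo_id="alpindale/l2-lora-test")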
from aphrodite import LLM, SamplingParams
from aphrodite.lora.request import LoRARequest # [!code highlight]

llm = LLM(model="meta-llama/Llama-2-7b-hf",
          enable_lora=True) # [!code highlight]
With this, we've launched the model with LoRA support enabled. We can now submit our prompts and call llm.generate() with the lora_request parameter. The first parameter of LoRARequest is a human-identifiable name, the second is a globally unique ID for the adapter, and the third is the path to the LoRA adapter.
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    stop=["[/assistant]"]
)

prompts = [
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]

outputs = llm.generate( # [!code highlight]
    prompts, # [!code highlight]
    sampling_params, # [!code highlight]
    lora_request=LoRARequest("alpindale/l2-lora-test", 1, sql_lora_path) # [!code highlight]
)
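llm.generate() returns one output object per prompt; each carries the original prompt and the generated completions. A quick way to inspect the results, assuming the standard output structure shared with vLLM:

# Print the generated SQL for each prompt
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\nGenerated: {generated_text!r}\n")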
LoRA adapters can be served with our OpenAI-compatible API server. To do so, specify --lora-modules {name}={path} {name}={path} ... in the CLI.
aphrodite run meta-llama/Llama-2-7b-hf \
  --enable-lora \
  --lora-modules l2-lora-test=alpindale/l2-lora-test
The equivalent YAML config:
model: meta-llama/Llama-2-7b-hf
lora_modules:
  - l2-lora-test=alpindale/l2-lora-test
The server accepts all LoRA configuration parameters (max_loras, max_lora_rank, max_cpu_loras), which will apply to all subsequent requests.
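For instance, assuming these parameters map to dashed CLI flags in the usual way (the values below are illustrative, not defaults):

aphrodite run meta-llama/Llama-2-7b-hf \
  --enable-lora \
  --lora-modules l2-lora-test=alpindale/l2-lora-test \
  --max-loras 4 \
  --max-lora-rank 64 \
  --max-cpu-loras 16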
Upon querying the /v1/models endpoint, you should see the LoRA adapter listed alongside the base model.
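One way to check, assuming the server is listening on port 2242 as in the completion example below:

curl http://localhost:2242/v1/models | jq

The response should look something like: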
{
    "object": "list",
    "data": [
        {
            "id": "meta-llama/Llama-2-7b-hf",
            "object": "model",
            ...
        },
        {
            "id": "l2-lora-test",
            "object": "model",
            ...
        }
    ]
}
Requests can specify the LoRA adapter as if it were any other model via the model request parameter. The requests will be processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and potentially with other LoRA adapter requests if they were provided and max_loras is set high enough).
Example curl request:
curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "l2-lora-test",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }' | jq
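The same request can also be issued with the OpenAI Python client pointed at the local server (the placeholder API key assumes the server was started without one configured):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:2242/v1", api_key="EMPTY")

completion = client.completions.create(
    model="l2-lora-test",  # the LoRA adapter name registered via --lora-modules
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)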