Merge branch 'main' of https://github.com/PygmalionAI/aphrodite-engine into new_samplers

50h100a 1 year ago
parent
commit
b1bbec5625

+ 19 - 14
README.md

@@ -35,8 +35,9 @@ Aphrodite builds upon and integrates the exceptional work from various projects,
 ## Quickstart
 
 ```sh
-$ pip install aphrodite-engine
-$ python -m aphrodite.endpoints.api_server_ooba --model PygmalionAI/pygmalion-2-7b
+pip install aphrodite-engine
+
+python -m aphrodite.endpoints.api_server_ooba --model PygmalionAI/pygmalion-2-7b
 ```
 
 ## Requirements
@@ -64,13 +65,17 @@ If you do not meet the minimum CC, you will not be able to run Aphrodite.
 ## Setting up the environment
 **If you run into any problems, please refer to the [Common Issues](#common-issues) section, or open an [Issue](https://github.com/PygmalionAI/aphrodite-engine/issues) if you can't find the answer there.**
 
-Aphrodite will require a slightly specialized environment to run, as the latest CUDA and GCC versions are not supported. You can use Conda to easily configure your environment.
+Aphrodite will require a slightly specialized environment to run, as the latest CUDA and GCC versions are not supported. You can use Conda to easily configure your environment. If you're on Windows, make sure you have [WSL2](https://learn.microsoft.com/en-us/windows/wsl/install) installed. You can do this by opening Windows PowerShell and running:
+```sh
+wsl --install
+```
 
 ### Install miniconda3
 
 ```sh
-$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
-$ bash ./Miniconda3*
+wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
+
+bash ./Miniconda3*
 ```
 You can follow the on-screen instructions, though you may want to set the installation directory to somewhere with a large empty storage space.
 
@@ -78,16 +83,16 @@ You can either source your shell script (`. ~/.bashrc` or `. ~/.zshrc`) or resta
 
 ### Configuring the env for Aphrodite-engine
 ```sh
-$ conda config --set auto_activate_base false
-$ conda create -n aphrodite python=3.10
-$ conda activate aphrodite
-$ conda install -c "nvidia/label/cuda-11.8.0" cuda
+conda config --set auto_activate_base false
+conda create -n aphrodite python=3.10
+conda activate aphrodite
+conda install -c "nvidia/label/cuda-11.8.0" cuda
 ```
 
 ## Installation
 
 ```sh
-$ pip install aphrodite-engine
+pip install aphrodite-engine
 ```
 
 ### Install from source
@@ -106,7 +111,7 @@ $ pip install aphrodite-engine
 You can spawn a [text-generation-webui](https://github.com/oobabooga/text-generation-webui)-compatible API server to use with [SillyTavern](https://github.com/SillyTavern/SillyTavern):
 
 ```sh
-$ python -m aphrodite.endpoints.api_server_ooba --model PygmalionAI/pygmalion-2-13b --max-model-len 4096 --max-num-batched-tokens 4096
+python -m aphrodite.endpoints.api_server_ooba --model PygmalionAI/pygmalion-2-13b --max-model-len 4096 --max-num-batched-tokens 4096
 ```
 
 This will create a server which runs on port `8000` of your machine. You can navigate to SillyTavern's API menu, select TextGen WebUI, and set the API Type to Aphrodite. The default API key is `EMPTY`, but you can change it as necessary. Use `http://localhost:8000/api` as the API URL.
@@ -116,7 +121,7 @@ To run a quantized model, use the `--quantization` flag with either `gptq` or `a
 To manually query the API, run:
 
 ```sh
-$ curl -X POST "http://localhost:8000/api/v1/generate" \
+curl -X POST "http://localhost:8000/api/v1/generate" \
 -H "Content-Type: application/json" \
 -H "x-api-key: EMPTY" \
 -d '{
@@ -134,7 +139,7 @@ https://github.com/PygmalionAI/aphrodite-engine/blob/99657d444bc2bab5e4293e9ee96
 ### OpenAI-compatible server
 An OpenAI-compatible server is also provided. You can launch the server with:
 ```sh
-$ python -m aphrodite.endpoints.openai.api_server --model PygmalionAI/pygmalion-2-13b
+python -m aphrodite.endpoints.openai.api_server --model PygmalionAI/pygmalion-2-13b
 ```
 
 You can query the server the same as any other OpenAI Completion/Chat Completion endpoint, though without an API key.
@@ -145,7 +150,7 @@ You can query the server the same as any other OpenAI Completion/Chat Completion
 
 This is normally due to your environment referring to the global installation of CUDA and not the one in your current env. Run `which nvcc` and note down the output. For example, if your output is `/home/anon/miniconda3/envs/aphrodite/bin/nvcc`, run this command:
 ```sh
-$ export CUDA_HOME=/home/anon/miniconda3/envs/aphrodite
+export CUDA_HOME=/home/anon/miniconda3/envs/aphrodite
 ```
 
 Then run the installation command again.
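
As a side note on the OpenAI-compatible server described above: the snippet below is a minimal sketch of querying it from Python. The `/v1/completions` path and response shape follow the standard OpenAI Completions API, and the port is assumed to be the server default; neither detail is spelled out in this diff.

```python
import requests

# Hedged example: endpoint path, port, and payload are assumptions based on
# the standard OpenAI Completions API, not taken from this diff.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "PygmalionAI/pygmalion-2-13b",
        "prompt": "<|system|>Enter chat mode.<|user|>Hello!<|model|>",
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["text"])
```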

+ 1 - 1
aphrodite/modeling/layers/sampler.py

@@ -325,7 +325,7 @@ def _apply_tfs(
     z = torch.tensor(tfss, dtype=logits.dtype, device=logits.device)
     logits_sort, logits_idx = logits.sort(dim=-1, descending=True)
     d2 = logits_sort.softmax(dim=-1).diff().diff().abs()
-    normalized_d2 = d2 / torch.sum(d2, dim=-1)
+    normalized_d2 = d2 / torch.sum(d2, dim=-1, keepdim=True)
     curvature_cdf = torch.cumsum(normalized_d2, dim=-1)
 
     tfs_mask = curvature_cdf > z.unsqueeze(dim=-1)
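
The one-line change above adds `keepdim=True` so the per-row sum of second differences keeps shape `[batch, 1]` and broadcasts correctly against `d2` when normalizing; without it, the sum collapses to shape `[batch]` and the division misaligns (or fails) for batched requests. A minimal sketch with assumed shapes, not taken from the diff:

```python
import torch

d2 = torch.rand(2, 5)  # e.g. [num_seqs, vocab - 2] second differences

row_sums = torch.sum(d2, dim=-1)                   # shape [2]; broadcasting against
                                                   # [2, 5] misaligns or errors
row_sums_kd = torch.sum(d2, dim=-1, keepdim=True)  # shape [2, 1]; divides each row
                                                   # by its own sum

normalized_d2 = d2 / row_sums_kd
print(normalized_d2.sum(dim=-1))  # tensor([1., 1.]): each row is a proper distribution
```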

+ 5 - 0
aphrodite/modeling/models/llama.py

@@ -93,6 +93,7 @@ class LlamaAttention(nn.Module):
         num_heads: int,
         num_kv_heads: int,
         rope_theta: float = 10000,
+        rope_scaling: Optional[Dict[str, Any]] = None,
         max_position_embeddings: int = 8192,
         quant_config: Optional[QuantizationConfig] = None,
     ) -> None:
@@ -110,6 +111,7 @@ class LlamaAttention(nn.Module):
         self.kv_size = self.num_kv_heads * self.head_dim
         self.scaling = self.head_dim**-0.5
         self.rope_theta = rope_theta
+        self.rope_scaling = rope_scaling
         self.max_position_embeddings = max_position_embeddings
 
         self.qkv_proj = ParallelLinear.column(
@@ -134,6 +136,7 @@ class LlamaAttention(nn.Module):
             self.head_dim,
             self.scaling,
             base=self.rope_theta,
+            rope_scaling=self.rope_scaling,
             max_position=self.max_position_embeddings,
             rotary_dim=self.head_dim,
             num_kv_heads=self.num_kv_heads)
@@ -166,6 +169,7 @@ class LlamaDecoderLayer(nn.Module):
         self.hidden_size = config.hidden_size
         # Requires transformers > 4.32.0
         rope_theta = getattr(config, "rope_theta", 10000)
+        rope_scaling = getattr(config, "rope_scaling", None)
         max_position_embeddings = getattr(config, "max_position_embeddings",
                                           8192)
         self.self_attn = LlamaAttention(
@@ -173,6 +177,7 @@ class LlamaDecoderLayer(nn.Module):
             num_heads=config.num_attention_heads,
             num_kv_heads=config.num_key_value_heads,
             rope_theta=rope_theta,
+            rope_scaling=rope_scaling,
             max_position_embeddings=max_position_embeddings,
             quant_config=quant_config,
         )
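
The additions above thread the model config's `rope_scaling` setting from `LlamaDecoderLayer` into `LlamaAttention` and on to the rotary embedding. As a rough illustration of where that value comes from (assuming the Hugging Face config convention of a `{"type": ..., "factor": ...}` dict and reusing a model name from the README; neither detail is asserted by the diff):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("PygmalionAI/pygmalion-2-13b")

# Mirrors the getattr calls added in LlamaDecoderLayer above.
rope_theta = getattr(config, "rope_theta", 10000)
rope_scaling = getattr(config, "rope_scaling", None)  # e.g. {"type": "linear", "factor": 2.0}
print(rope_theta, rope_scaling)
```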

+ 51 - 0
examples/aphrodite_engine_example.py

@@ -0,0 +1,51 @@
+import argparse
+
+from aphrodite import EngineArgs, AphroditeEngine, SamplingParams
+
+
+def main(args: argparse.Namespace):
+    # Parse the CLI argument and initialize the engine.
+    engine_args = EngineArgs.from_cli_args(args)
+    engine = AphroditeEngine.from_engine_args(engine_args)
+
+    # Test the following prompts.
+    test_prompts = [
+        ("<|system|>Enter chat mode.<|user|>Hello!<|model|>",
+         SamplingParams(temperature=0.0)),
+        ("<|system|>Enter RP mode.<|model|>Hello!<|user|>What are you doing?<|model|>",
+         SamplingParams(temperature=0.8, top_k=5, presence_penalty=0.2)),
+        ("<|system|>Enter chat mode.<|user|>What is the meaning of life?<|model|>",
+         SamplingParams(n=2,
+                        best_of=5,
+                        temperature=0.8,
+                        top_p=0.95,
+                        frequency_penalty=0.1)),
+        ("<|system|>Enter QA mode.<|user|>What is a man?<|model|>A miserable",
+         SamplingParams(n=3, best_of=3, use_beam_search=True,
+                        temperature=0.0)),
+    ]
+
+    # Run the engine by calling `engine.step()` manually.
+    request_id = 0
+    while True:
+        # To test continuous batching, we add one request at each step.
+        if test_prompts:
+            prompt, sampling_params = test_prompts.pop(0)
+            engine.add_request(str(request_id), prompt, sampling_params)
+            request_id += 1
+
+        request_outputs = engine.step()
+        for request_output in request_outputs:
+            if request_output.finished:
+                print(request_output)
+
+        if not (engine.has_unfinished_requests() or test_prompts):
+            break
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(
+        description='Demo on using the AphroditeEngine class directly')
+    parser = EngineArgs.add_cli_args(parser)
+    args = parser.parse_args()
+    main(args)

+ 32 - 49
examples/api_client.py

@@ -1,69 +1,52 @@
 import argparse
 import json
-from typing import Iterable, List
-import requests
 
-def clear_line(n: int = 1) -> None:
-    LINE_UP = '\033[1A'
-    LINE_CLEAR = '\x1b[2K'
-    for _ in range(n):
-        print(LINE_UP, end=LINE_CLEAR, flush=True)
+import gradio as gr
+import requests
 
 
-def post_http_request(prompt: str, api_url: str, n: int = 1,
-                       stream: bool = False) -> requests.Response:
-    headers = {"User-Agent": "Test Client"}
+def http_bot(prompt):
+    headers = {"User-Agent": "Aphrodite Client"}
     pload = {
         "prompt": prompt,
-        "n": n,
-        "use_beam_search": True,
-        "temperature": 0.0,
-        "max_tokens": 28,
-        "stream": stream,
+        "stream": True,
+        "max_tokens": 128,
     }
-    response = requests.post(api_url, headers=headers, json=pload, stream=True)
-    return response
-
-
-def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
-    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False, delimiter=b"\0"):
+    response = requests.post(args.model_url,
+                             headers=headers,
+                             json=pload,
+                             stream=True)
+
+    for chunk in response.iter_lines(chunk_size=8192,
+                                     decode_unicode=False,
+                                     delimiter=b"\0"):
         if chunk:
             data = json.loads(chunk.decode("utf-8"))
-            output = data["text"]
+            output = data["text"][0]
             yield output
 
 
-def get_response(response: requests.Response) -> List[str]:
-    data = json.loads(response.content)
-    output = data["text"]
-    return output
+def build_demo():
+    with gr.Blocks() as demo:
+        gr.Markdown("# Aphrodite text completion demo\n")
+        inputbox = gr.Textbox(label="Input",
+                              placeholder="Enter text and press ENTER")
+        outputbox = gr.Textbox(label="Output",
+                               placeholder="Generated result from the model")
+        inputbox.submit(http_bot, [inputbox], [outputbox])
+    return demo
 
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
     parser.add_argument("--host", type=str, default="localhost")
-    parser.add_argument("--port", type=int, default=8000)
-    parser.add_argument("--n", type=int, default=4)
-    parser.add_argument("--prompt", type=str, default="What is a man? A")
-    parser.add_argument("--stream", action="store_true")
+    parser.add_argument("--port", type=int, default=8001)
+    parser.add_argument("--model-url",
+                        type=str,
+                        default="http://localhost:8000/generate")
     args = parser.parse_args()
-    prompt = args.prompt
-    api_url = f"http://{args.host}:{args.port}/generate"
-    n = args.n
-    stream = args.stream
-
-    print(f"Prompt: {prompt!r}\n", flush=True)
-    response = post_http_request(prompt, api_url, n, stream)
 
-    if stream:
-        num_printed_lines = 0
-        for h in get_streaming_response(response):
-            clear_line(num_printed_lines)
-            num_printed_lines = 0
-            for i, line in enumerate(h):
-                num_printed_lines += 1
-                print(f"Beam candidate {i}: {line!r}", flush=True)
-    else:
-        output = get_response(response)
-        for i, line in enumerate(output):
-            print(f"Beam candidate {i}: {line!r}", flush=True)
+    demo = build_demo()
+    demo.queue(concurrency_count=100).launch(server_name=args.host,
+                                             server_port=args.port,
+                                             share=True)

+ 22 - 0
examples/offline_inference.py

@@ -0,0 +1,22 @@
+from aphrodite import LLM, SamplingParams
+
+# Sample prompts.
+prompts = [
+    "<|system|>Enter chat mode.<|user|>Hello!<|model|>",
+    "<|system|>Enter RP mode.<|model|>Hello!<|user|>What are you doing?<|model|>",
+    "<|system|>Enter chat mode.<|user|>What is the meaning of life?<|model|>",
+    "<|system|>Enter QA mode.<|user|>What is a man?<|model|>A miserable",
+]
+# Create a sampling params object.
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+# Create an LLM.
+llm = LLM(model="PygmalionAI/pygmalion-2-7b") # pass additional arguments here, such as `quantization`
+# Generate texts from the prompts. The output is a list of RequestOutput objects
+# that contain the prompt, generated text, and other information.
+outputs = llm.generate(prompts, sampling_params)
+# Print the outputs.
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")