# Sample configuration file for Aphrodite Engine
# You can launch the engine using a provided config file by running
# `aphrodite yaml config.yaml` in the CLI
# You can run `aphrodite run -h` to see the full list of options
# that you can pass to the engine.
# Uncomment and modify the following lines to configure the engine

# The basic options. You will usually need to specify these
basic_args:
  # Your model name. Can be a local path or Hugging Face model ID
  - model:
  # If you want a custom model name for the API, specify it here
  - served_model_name:
  # Whether or not to launch the Kobold API server. Used for hosting
  # on Kobold Horde. Takes a boolean value (true/false)
  - launch_kobold_api:
  # The maximum sequence length/context window for the model
  # You can leave this blank to use the default value (recommended)
  - max_model_len:
  # The tensor parallelism degree. Set this to the number of GPUs you have
  # Keep in mind that for **quantized** models, this will typically only work
  # with values of 1, 2, 4, or 8.
  - tensor_parallel_size:
  # The pipeline parallelism degree. This is similar to tensor parallel,
  # but splits the layers across GPUs rather than the tensors. Only use this
  # if you're doing multi-node, or need 3, 5, 6, or 7 GPUs for quantized models.
  - pipeline_parallel_size:
  # The data type to use for KV cache. You can set it to 'fp8' to reduce
  # memory usage for large contexts.
  - kv_cache_dtype:
  # Enable chunking the prefill tokens. This greatly reduces memory usage
  # at high contexts, but it is mutually exclusive with kv_cache_dtype=fp8
  # Takes a boolean value (true/false)
  - enable_chunked_prefill:
  # By default, Aphrodite Engine reserves 90% of VRAM for every GPU it's using.
  # Pass a value between 0 and 1 (e.g. 0.95 for 95%) to increase or decrease this.
  - gpu_memory_utilization:
  # If your model doesn't fit on the GPU, use this. It takes values in GiB.
  # e.g., if you pass `10`, it'll virtually add 10 GiB of VRAM to your GPU.
  # Not recommended because CPU offloading is generally slow.
  - cpu_offload_gb:
  # This is essentially the maximum batch size. It's set to `256` by default.
  # You can lower this to use less memory, but it doesn't affect things that much,
  # unless `enforce_eager` is enabled.
  - max_num_seqs:
  # Whether to enforce eager mode, which disables CUDA graphs. By default,
  # CUDA graphs are disabled. Pass `false` here to enable them, and leave
  # blank or pass `true` to keep them disabled.
  - enforce_eager:
  # The load format to use. You can usually leave this blank.
  # If you want to use bitsandbytes on-the-fly quantization,
  # pass `bitsandbytes`, along with `quantization=bitsandbytes`
  # in the category below.
  - load_format:
  # Whether or not to enable prefix caching. This will cache
  # previous prompts so that they're not recomputed. Helps
  # with large prompts.
  - enable_prefix_caching:
  # Whether or not to trust remote code in the repository. Needed
  # for some models that have custom code.
  - trust_remote_code:
  # The download directory if the `model` is a Hugging Face ID.
  - download_dir:
  # The data type to use for the model. Can be `auto`, `float16`, `bfloat16`,
  # or `float32`. Defaults to `auto`, which will use fp16 for fp32 and fp16
  # models, and bf16 for bf16 models.
  - dtype:
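
# Example (a sketch, not a recommendation): a two-GPU setup with an FP8 KV
# cache. The model path, served name, and every value are illustrative
# assumptions. `enable_chunked_prefill` is left unset because the comments
# above describe it as mutually exclusive with kv_cache_dtype=fp8.
# basic_args:
#   - model: /path/to/your/model
#   - served_model_name: my-model
#   - tensor_parallel_size: 2
#   - kv_cache_dtype: fp8
#   - gpu_memory_utilization: 0.95
#   - enable_prefix_caching: true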

# Quantization options.
quantization_args:
  # The quantization type to use. You don't usually need to pass this,
  # as the engine will figure out the quant from the model itself.
  # You may need to use this if you want to perform online quantization,
  # i.e., quantizing a 16-bit model on-the-fly.
  # To use FP8 (only supported by Ampere and newer GPUs), pass `fp8`.
  # To use bitsandbytes, pass `bitsandbytes`.
  - quantization:
  # Path to the JSON file containing the KV cache scaling factors.
  # This should generally be supplied when KV cache dtype is FP8.
  # Otherwise, KV cache scaling factors default to 1.0, which
  # may cause accuracy issues. FP8_E5M2 (without scaling) is
  # only supported on CUDA versions greater than 11.8. On ROCm,
  # FP8_E4M3 is used instead.
  # For most use cases, you can leave this blank. If you want to
  # generate scales for your model, look at the `examples/fp8` directory.
  - quantization_param_path:
  # The number of floating point bits to use for deepspeed_fp
  # on-the-fly quantization. Only pass this if you've set
  # quantization to `deepspeedfp`. Takes 4, 6, 8, or 12.
  - deepspeed_fp_bits:
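
# Example (a sketch): on-the-fly bitsandbytes quantization. Per the comments
# on `load_format` and `quantization`, both keys are set together, each in
# its own category. The pairing below is an illustration, not a tested recipe.
# basic_args:
#   - load_format: bitsandbytes
# quantization_args:
#   - quantization: bitsandbytes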

# The API-specific options. These are decoupled from the engine.
api_args:
  # The API key to use for the server. Leave blank to disable
  # API key authentication.
  - api_keys:
  # The local path or HTTP address of the chat template to use.
  # This will override the model's existing chat template, if
  # it has one.
  - chat_template:
  # When max_logprobs is specified, represents single tokens as
  # strings of the form `token_ids:{token_id}` so that tokens
  # that are not JSON-encodable can be identified.
  - return_tokens_as_token_ids:
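
# Example (a sketch): requiring an API key and overriding the chat template.
# The key and the template path are hypothetical placeholders.
# api_args:
#   - api_keys: my-secret-key
#   - chat_template: /path/to/chat_template.jinja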

# These are the options for speculative decoding. Spec Decoding
# is a way to speed up inference by loading a smaller model
# and letting it do the predictions, and your main model
# will only verify its outputs. The outputs will match
# 1:1 with your main model.
# We currently support the following speculative decoding algorithms:
# Draft Model, Ngram Prompt Lookup, MLPSpeculator, and Medusa.
speculative_args:
  # Use the V2 block manager. Mandatory for speculative decoding.
  # Takes a boolean value (true/false)
  - use_v2_block_manager:
  # The speculative model to use. Can take either a Hugging Face ID
  # or a local path. You can also pass "[ngram]" to use ngram prompt
  # lookup decoding without needing a draft model.
  - speculative_model:
  # The number of tokens for the speculative model to predict.
  # Spec decoding can generate multiple tokens in a single forward
  # pass to speed up inference. Don't set this too high; a good
  # value is between 3 and 10, depending on model size.
  - num_speculative_tokens:
  # The tensor parallel size to use for the speculative model.
  # Usually, you want this set to 1.
  - speculative_draft_tensor_parallel_size:
  # The maximum window size for ngram prompt lookup
  # This needs to be set if you're using ngram prompt lookup
  - ngram_prompt_lookup_max:
  # The minimum window size for ngram prompt lookup
  - ngram_prompt_lookup_min:
  # Disable speculative decoding if the number of queued
  # requests is larger than this value. This is useful
  # to prevent speculative decoding from using too much
  # compute.
  - speculative_disable_by_batch_size:
  # The acceptance method to use for speculative decoding.
  # Can be either `rejection_sampler` or `typical_acceptance_sampler`.
  # The default is `rejection_sampler`.
  # Rejection sampler does not allow changing the acceptance rate
  # of draft tokens. More accurate but slower.
  # Typical acceptance sampler allows changing the acceptance rate
  # of draft tokens. Less accurate but faster.
  - spec_decoding_acceptance_method:
  # The lower bound threshold for the posterior probability
  # of a token to be accepted. Only set this if you're using
  # the typical acceptance sampler. Defaults to 0.09.
  - typical_acceptance_sampler_posterior_threshold:
  # A scaling factor for the entropy-based threshold for token
  # acceptance in the typical acceptance sampler. Only set this
  # if you're using the typical acceptance sampler. Defaults to
  # sqrt of typical_acceptance_sampler_posterior_threshold, i.e. 0.3.
  - typical_acceptance_sampler_posterior_alpha:
  # Whether to disable logprobs during speculative decoding.
  # If true, token log probabilities are not returned. If false,
  # log probabilities are returned according to the settings
  # in SamplingParams. Defaults to true.
  # Disabling logprobs (setting this to true) speeds up inference
  # during speculative decoding by skipping log probability
  # calculation in proposal and target sampling.
  - disable_logprobs_during_spec_decoding:
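
# Example (a sketch): ngram prompt lookup decoding, which needs no draft
# model. The token count and window sizes are illustrative assumptions.
# "[ngram]" is quoted because an unquoted [ngram] is a YAML list, not a string.
# speculative_args:
#   - use_v2_block_manager: true
#   - speculative_model: "[ngram]"
#   - num_speculative_tokens: 5
#   - ngram_prompt_lookup_max: 4
#   - ngram_prompt_lookup_min: 1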

# The config options for LoRA adapters.
# Each adapter is treated as a separate model in the API server,
# and your requests will need to be sent to the specific model.
lora_args:
  # Whether or not to enable handling LoRA adapters.
  # Takes a boolean value (true/false)
  - enable_lora:
  # The LoRA adapters to use for the API server.
  # You can specify multiple adapters here.
  - lora_modules:
    # Change the name of the adapter to something more descriptive
    # e.g. ` - my_sql_lora: /path/to/my_sql_lora`
    - lora1:
    - lora2:
  # The maximum number of LoRA adapters in a single batch.
  - max_loras:
  # The maximum rank of the LoRA adapters. We currently support
  # up to 64.
  - max_lora_rank:
  # The maximum size of extra vocabulary that can be present
  # in a LoRA adapter (added to the base model vocab)
  - lora_extra_vocab_size:
  # The data type for the LoRA adapter.
  # Can take "auto", "float16", "bfloat16", and "float32"
  - lora_dtype:
  # The maximum number of LoRA adapters to store in CPU memory.
  # This number must be greater than or equal to max_num_seqs.
  # Defaults to max_num_seqs.
  - max_cpu_loras:
  # Specify multiple scaling factors (which can be different from base
  # model scaling factor) to allow for multiple LoRA adapters trained
  # with those scaling factors to be used at the same time.
  # If not specified, only adapters trained with the base model scaling
  # factor are allowed.
  - long_lora_scaling_factors:
  # By default, only half of the LoRA computation is sharded with tensor
  # parallelism. Enabling this will use the fully sharded layers. At high
  # sequence length, max rank, or tensor parallel size, this is likely faster.
  - fully_sharded_loras:
  # The name or path of the QLoRA adapter to use.
  - qlora_adapter_name_or_path:
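
# Example (a sketch): serving two LoRA adapters. The adapter names, paths,
# and limits are hypothetical. Requests would select an adapter by its name,
# since each one is exposed as a separate model.
# lora_args:
#   - enable_lora: true
#   - lora_modules:
#     - my_sql_lora: /path/to/my_sql_lora
#     - my_chat_lora: /path/to/my_chat_lora
#   - max_loras: 2
#   - max_lora_rank: 32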

# The config options for the Soft Prompt adapters.
# Soft prompts are a way to tune prompts for a specific task
# and load them at a request-level.
soft_prompt_args:
  # Whether or not to enable handling Soft Prompt adapters.
  # Takes a boolean value (true/false)
  - enable_prompt_adapter:
  # The Soft Prompt adapters to use for the API server.
  # You can specify multiple adapters here.
  - prompt_adapters:
    # Change the name of the adapter to something more descriptive
    # e.g. ` - my_sql_prompt: /path/to/my_sql_prompt`
    - prompt1:
    - prompt2:
  # The maximum number of Soft Prompt adapters in a single batch.
  - max_prompt_adapters:
  # The maximum number of PromptAdapter tokens.
  - max_prompt_adapter_token:
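
# Example (a sketch): a single Soft Prompt adapter. The adapter name and
# path are hypothetical placeholders, and the limit is an illustrative value.
# soft_prompt_args:
#   - enable_prompt_adapter: true
#   - prompt_adapters:
#     - my_sql_prompt: /path/to/my_sql_prompt
#   - max_prompt_adapters: 1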

# These are advanced options. You usually don't need to modify these.
advanced_args:
  # The backend to use for distributed inference. Can be either `ray`
  # or `mp` (multiprocessing). Defaults to `mp` for single-node,
  # `ray` for multi-node.
  # Note that specifying a custom backend by passing a custom class
  # is intended for expert use only. The API may change without notice.
  - distributed_executor_backend:
  # The tokenizer to use. Defaults to the model's tokenizer.
  - tokenizer:
  # The model revision to use if pulling from HF. Defaults to main.
  - revision:
  # The revision for the remote code in the model repository.
  - code_revision:
  # The revision for the tokenizer.
  - tokenizer_revision:
  # The maximum number of tokens to be captured by CUDA graphs.
  # This is set to 8192 by default. If your prompt exceeds this
  # threshold, it'll fall back to eager execution.
  - max_seq_len_to_capture:
  # RoPE scaling config in JSON format.
  # For example, `{"type": "dynamic", "factor": 2.0}`
  - rope_scaling:
  # The RoPE theta value. Use with `rope_scaling`. In some cases,
  # changing the RoPE theta improves performance of the scaled
  # model.
  - rope_theta:
  # Extra config for the model loader.
  # This will be passed to the model loader corresponding
  # to the chosen load_format. This should be a JSON string that
  # will be parsed into a dictionary.
  - model_loader_extra_config:
  # Whether to skip tokenizer and detokenizer initialization.
  - skip_tokenizer_init:
  # The size of tokenizer pool to use for asynchronous tokenization.
  # If 0, will use synchronous tokenization.
  - tokenizer_pool_size:
  # The type of tokenizer pool to use for asynchronous tokenization.
  # Ignored if tokenizer_pool_size is 0.
  # Note that specifying a tokenizer pool by passing a custom class
  # is intended for expert use only. The API may change without notice.
  - tokenizer_pool_type:
  # The extra config for tokenizer pool. This should be a JSON string
  # that will be parsed into a dictionary. Ignored if tokenizer_pool_size
  # is 0.
  - tokenizer_pool_extra_config:
  # The maximum log probabilities to return in the API. Defaults to 10.
  - max_logprobs:
  # The device to use for model execution. You usually don't
  # need to modify this.
  # We support `auto`, `cuda`, `neuron`, `cpu`, `openvino`, `tpu`, and `xpu`.
  - device:
  # The pattern(s) to ignore when loading the model.
  # Defaults to `original/**/*` to avoid repeated loading
  # of llama's checkpoints.
  - ignore_patterns:
  # If specified, use nsight to profile ray workers.
  - ray_workers_use_nsight:
  # If specified, disable the custom all-reduce kernels.
  # They're enabled by default for GPUs with P2P support.
  - disable_custom_all_reduce:
  # The preemption mode to use for the scheduler. If `recompute`,
  # the engine performs preemption by block recomputation. If `swap`,
  # the engine performs preemption by block swapping.
  - preemption_mode:
  # If specified, ignore GPU profiling result and use this
  # number of GPU blocks. Only used for testing.
  - num_gpu_blocks_override:
  # The CPU swap space size (GiB) per GPU. Not related to CPU offloading.
  - swap_space:
  # Whether to disable sliding window.
  - disable_sliding_window:
  # The token block size. Takes 8, 16, or 32.
  - block_size:
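
# Example (a sketch): a multi-node run that preempts by swapping blocks to
# CPU memory. All values are illustrative assumptions, not tuned defaults.
# advanced_args:
#   - distributed_executor_backend: ray
#   - preemption_mode: swap
#   - swap_space: 4
#   - block_size: 16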