Check out our demo video here!
Unseen speakers few-shot fine-tuning demo:
https://github.com/RVC-Boss/GPT-SoVITS/assets/129054828/05bee1fa-bdd8-4d85-9350-80c060ab47fb
Users in China can try the full functionality online with the AutoDL Cloud Docker image: https://www.codewithgpu.com/i/RVC-Boss/GPT-SoVITS/GPT-SoVITS-Official
Zero-shot TTS: Input a 5-second vocal sample and experience instant text-to-speech conversion.
Few-shot TTS: Fine-tune the model with just 1 minute of training data for improved voice similarity and realism.
Cross-lingual Support: Inference in languages different from the training dataset, currently supporting English, Japanese, and Chinese.
WebUI Tools: Integrated tools include voice accompaniment separation, automatic training set segmentation, Chinese ASR, and text labeling, assisting beginners in creating training datasets and GPT/SoVITS models.
If you are a Windows user (tested on Windows 10 and later), you can install directly via the prezip: download it, unzip it, and double-click go-webui.bat to start GPT-SoVITS-WebUI.
Note: numba==0.56.4 requires Python < 3.11.
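The version constraint in the note above can be checked before installing. The following is a minimal sketch; the helper name `python_ok_for_numba` is hypothetical and not part of this repository.

```python
import sys

# Upper bound taken from the note: numba==0.56.4 needs Python below 3.11.
MAX_EXCLUSIVE = (3, 11)


def python_ok_for_numba(version=None):
    """Return True if the (major, minor) version is below 3.11.

    `version` is any sequence starting with (major, minor); it defaults
    to the running interpreter's version.
    """
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) < MAX_EXCLUSIVE
```

For example, `python_ok_for_numba((3, 9))` is true for the Python 3.9 environment created below, while `python_ok_for_numba((3, 11))` is false.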
conda create -n GPTSoVits python=3.9
conda activate GPTSoVits
bash install.sh
pip install -r requirements.txt
conda install ffmpeg
sudo apt install ffmpeg
sudo apt install libsox-dev
conda install -c conda-forge 'ffmpeg<7'
brew install ffmpeg
Download and place ffmpeg.exe and ffprobe.exe in the GPT-SoVITS root.
Download pretrained models from GPT-SoVITS Models and place them in GPT_SoVITS/pretrained_models.
For UVR5 (vocal/accompaniment separation and reverberation removal), additionally download models from UVR5 Weights and place them in tools/uvr5/uvr5_weights.
Users in China can download these two models by opening the links below and clicking "Download a copy".
For Chinese ASR, additionally download models from Damo ASR Model, Damo VAD Model, and Damo Punc Model and place them in tools/damo_asr/models.
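A quick way to confirm that the downloads ended up in the three directories named above is sketched here. The function name `missing_model_dirs` is hypothetical, and treating a directory as "populated" when it holds at least one non-hidden entry is an assumption; it does not verify that the right model files are present.

```python
from pathlib import Path

# Directories named in the instructions above, relative to the repo root.
MODEL_DIRS = [
    "GPT_SoVITS/pretrained_models",
    "tools/uvr5/uvr5_weights",
    "tools/damo_asr/models",
]


def missing_model_dirs(repo_root="."):
    """Return the model directories that are absent or look empty.

    Assumption: hidden entries (names starting with ".") are ignored,
    so a directory holding only a placeholder file counts as empty.
    """
    root = Path(repo_root)
    missing = []
    for rel in MODEL_DIRS:
        d = root / rel
        has_files = d.is_dir() and any(
            not p.name.startswith(".") for p in d.iterdir()
        )
        if not has_files:
            missing.append(rel)
    return missing
```

Calling `missing_model_dirs()` from the GPT-SoVITS root returns an empty list once every directory contains at least one file. The UVR5 and Damo entries only matter if you use those optional tools.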
If you are a Mac user, make sure you meet the following conditions for training and inference on the GPU:
xcode-select --install
Other Macs can do inference with CPU only.
Then install by using the following commands:
conda create -n GPTSoVits python=3.9
conda activate GPTSoVits
pip install -r requirements.txt
pip uninstall torch torchaudio
pip3 install --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
docker compose -f "docker-compose.yaml" up -d
As above, modify the corresponding parameters based on your actual situation, then run the following command:
docker run --rm -it --gpus=all --env=is_half=False --volume=G:\GPT-SoVITS-DockerTest\output:/workspace/output --volume=G:\GPT-SoVITS-DockerTest\logs:/workspace/logs --volume=G:\GPT-SoVITS-DockerTest\SoVITS_weights:/workspace/SoVITS_weights --workdir=/workspace -p 9880:9880 -p 9871:9871 -p 9872:9872 -p 9873:9873 -p 9874:9874 --shm-size="16G" -d breakstring/gpt-sovits:xxxxx
The TTS annotation .list file format:
vocal_path|speaker_name|language|text
Language dictionary:
Example:
D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.
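The four-field format above can be read with a few lines of Python. This is an illustrative sketch only: `ListEntry` and `parse_list_line` are hypothetical names, and splitting on the first three `|` characters (so that any later `|` stays in the text field) is an assumption about how such lines should be handled, not the project's own parser.

```python
from typing import NamedTuple


class ListEntry(NamedTuple):
    vocal_path: str
    speaker_name: str
    language: str
    text: str


def parse_list_line(line: str) -> ListEntry:
    """Split one annotation line into its four fields."""
    # maxsplit=3 keeps any further "|" characters inside the text field.
    parts = line.rstrip("\r\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}")
    return ListEntry(*parts)


# Raw string so the Windows backslashes are not treated as escapes.
example = parse_list_line(r"D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.")
```

Here `example.language` is `"en"` and `example.text` is `"I like playing Genshin."`.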
[ ] High Priority:
[ ] Features:
Use the command line to open the WebUI for UVR5
python tools/uvr5/webui.py "<infer_device>" <is_half> <webui_port_uvr5>
If you can't open a browser, use the format below for UVR processing. This example uses mdxnet for audio processing:
python mdxnet.py --model --input_root --output_vocal --output_ins --agg_level --format --device --is_half_precision
To segment the dataset audio from the command line:
python audio_slicer.py \
--input_path "<path_to_original_audio_file_or_directory>" \
--output_root "<directory_where_subdivided_audio_clips_will_be_saved>" \
--threshold <volume_threshold> \
--min_length <minimum_duration_of_each_subclip> \
--min_interval <shortest_time_gap_between_adjacent_subclips> \
--hop_size <step_size_for_computing_volume_curve>
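When slicing many recordings, it can be convenient to assemble the same command from Python. The sketch below only builds the argument list, using the script name and flags shown above; `build_slicer_command` is a hypothetical helper, and the numeric values in the usage example are arbitrary placeholders rather than recommended settings.

```python
import shlex


def build_slicer_command(input_path, output_root, threshold,
                         min_length, min_interval, hop_size):
    """Return the audio_slicer.py invocation as an argument list."""
    return [
        "python", "audio_slicer.py",
        "--input_path", str(input_path),
        "--output_root", str(output_root),
        "--threshold", str(threshold),
        "--min_length", str(min_length),
        "--min_interval", str(min_interval),
        "--hop_size", str(hop_size),
    ]


# Placeholder paths and values, for illustration only.
cmd = build_slicer_command("raw_audio", "sliced_audio", -34, 4000, 300, 10)
# shlex.join gives a copy-pasteable POSIX shell line for inspection.
printable = shlex.join(cmd)
```

Passing the list to `subprocess.run(cmd, check=True)` would run the slicer without going through a shell, which avoids quoting problems with paths that contain spaces.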
To run ASR on the dataset from the command line (Chinese only):
python tools/damo_asr/cmd-asr.py "<Path to the directory containing input audio files>"
For languages other than Chinese, ASR is performed with Faster_Whisper.
(There is no progress bar, and processing may take a while depending on GPU performance.)
python ./tools/damo_asr/WhisperASR.py -i <input> -o <output> -f <file_name.list> -l <language>
A custom save path for the list file is supported.
Special thanks to the following projects and contributors: