An AI-Powered Speech Processing Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Enhancement, Separation, and Target Speaker Extraction, etc.

modelscope/ClearerVoice-Studio

ClearerVoice-Studio is an open-source, AI-powered speech processing toolkit designed for researchers, developers, and end-users. It supports speech enhancement, speech separation, speech super-resolution, target speaker extraction, and more. The toolkit provides state-of-the-art pre-trained models, along with training and inference scripts, all accessible from this repository.

👉🏻HuggingFace Demo👈🏻 | 👉🏻ModelScope Demo👈🏻 | 👉🏻SpeechScore Demo👈🏻


Please support this community project by leaving your ⭐ on our GitHub!

Remember to click the star ⭐ in the upper-right corner to support us. Your support is our greatest motivation for updating the models!

News 🔥

  • Upcoming: More tasks will be added to ClearVoice.
  • [2025.1] The ClearVoice demo is ready to try on both HuggingFace and ModelScope. Note that the HuggingFace demo has a limited GPU quota, while ModelScope offers a larger one.
  • [2025.1] ClearVoice now offers speech super-resolution, also known as bandwidth extension. This feature improves the perceptual quality of speech by converting low-resolution audio (with an effective sampling rate of at least 16,000 Hz) into high-resolution audio with a sampling rate of 48,000 Hz. The full upscaled LJSpeech-1.1-48kHz dataset can be downloaded from HuggingFace and ModelScope.
  • [2025.1] ClearVoice now supports more audio formats, including "wav", "aac", "ac3", "aiff", "flac", "m4a", "mp3", "ogg", "opus", "wma", "webm", etc. It also supports both mono and stereo channels with 16-bit or 32-bit precision. A recent version of ffmpeg is required for the audio codecs.
  • [2024.12] Pre-trained models uploaded to ModelScope. Users can now download the models from either ModelScope or HuggingFace.
  • [2024.11] Our FRCRN speech denoiser has been used over 3.0 million times on ModelScope
  • [2024.11] Our MossFormer speech separator has been used over 2.5 million times on ModelScope
  • [2024.11] Release of this repository
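
For the audio format support mentioned in the [2025.1] entry above, a small pre-flight check can be useful before running inference. The sketch below is illustrative only and is not part of ClearVoice: the helper names `is_listed_format` and `ffmpeg_available` are hypothetical, and the extension set only covers the formats named in the entry (which ends with "etc.", so it is not exhaustive).

```python
import shutil
from pathlib import Path

# Extensions named in the news entry above. The entry ends with "etc.",
# so treat this set as a lower bound, not the full list ClearVoice accepts.
LISTED_EXTENSIONS = {
    "wav", "aac", "ac3", "aiff", "flac", "m4a",
    "mp3", "ogg", "opus", "wma", "webm",
}


def is_listed_format(path):
    """Return True if the file extension is one of the listed formats."""
    extension = Path(path).suffix.lower().lstrip(".")
    return extension in LISTED_EXTENSIONS


def ffmpeg_available():
    """Return True if an `ffmpeg` executable is found on the PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    for name in ["speech.wav", "Interview.MP3", "notes.txt"]:
        print(name, "->", is_listed_format(name))
    print("ffmpeg on PATH:", ffmpeg_available())
```

This only checks that ffmpeg is present, not that it is recent enough, so it is a first sanity check rather than a guarantee that every codec will decode.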

🌟 Why Choose ClearerVoice-Studio?

  • Pre-Trained Models: Includes cutting-edge pre-trained models, fine-tuned on extensive, high-quality datasets. No need to start from scratch!
  • Ease of Use: Designed for seamless integration with your projects, offering a simple yet flexible interface for inference and training.
  • Comprehensive Features: Combines advanced algorithms for multiple speech processing tasks in one platform.
  • Community-Driven: Built for researchers, developers, and enthusiasts to collaborate and innovate together.

Contents of this repository

This repository is organized into three main components: ClearVoice, Train, and SpeechScore.

1. ClearVoice [Readme][文档]

ClearVoice offers a user-friendly solution for speech processing tasks such as speech denoising, separation, super-resolution, audio-visual target speaker extraction, and more. It is designed as a unified inference platform that leverages pre-trained models (e.g., FRCRN, MossFormer), all trained on extensive datasets. If you're looking for a tool to improve speech quality, ClearVoice is a good place to start. Simply click on ClearVoice and follow our detailed instructions to get started.

2. Train

For advanced researchers and developers, we provide model fine-tuning and training scripts for all the tasks offered in ClearVoice and more:

  • Task 1: Speech enhancement (16kHz & 48kHz)
  • Task 2: Speech separation (8kHz & 16kHz)
  • Task 3: Speech super-resolution (48kHz) (coming soon)
  • Task 4: Target speaker extraction
    • Sub-Task 1: Audio-only Speaker Extraction Conditioned on a Reference Speech (8kHz)
    • Sub-Task 2: Audio-visual Speaker Extraction Conditioned on Face (Lip) Recording (16kHz)
    • Sub-Task 3: Audio-visual Speaker Extraction Conditioned on Body Gestures (16kHz)
    • Sub-Task 4: Neuro-steered Speaker Extraction Conditioned on EEG Signals (16kHz)

Contributors are welcome to add more model architectures and tasks!

3. SpeechScore [Readme][文档]

SpeechScore is a speech quality assessment toolkit. We include it here to evaluate the performance of different models. SpeechScore includes many popular speech metrics:

  • Signal-to-Noise Ratio (SNR)
  • Perceptual Evaluation of Speech Quality (PESQ)
  • Short-Time Objective Intelligibility (STOI)
  • Deep Noise Suppression Mean Opinion Score (DNSMOS)
  • Scale-Invariant Signal-to-Distortion Ratio (SI-SDR)
  • and many more quality benchmarks
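
To make two of the listed metrics concrete, here is a minimal pure-Python sketch of SNR and SI-SDR using their standard textbook definitions. It is not SpeechScore's implementation: the function names `snr_db` and `si_sdr_db` are hypothetical, and removing the mean before computing SI-SDR is a common convention assumed here, not something this README specifies.

```python
import math


def snr_db(reference, estimate):
    """SNR in dB: energy of the reference over energy of the error."""
    signal_energy = sum(r * r for r in reference)
    noise_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise_energy == 0.0:
        return math.inf  # estimate matches the reference exactly
    return 10.0 * math.log10(signal_energy / noise_energy)


def si_sdr_db(reference, estimate):
    """SI-SDR in dB: like SDR, but insensitive to rescaling the estimate."""
    # Assumed convention: remove the mean of both signals first.
    ref_mean = sum(reference) / len(reference)
    est_mean = sum(estimate) / len(estimate)
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]

    ref_energy = sum(r * r for r in ref)
    if ref_energy == 0.0:
        raise ValueError("reference signal has zero energy")

    # Project the estimate onto the reference to get the scaled target.
    scale = sum(r * e for r, e in zip(ref, est)) / ref_energy
    target = [scale * r for r in ref]

    target_energy = sum(t * t for t in target)
    residual_energy = sum((e - t) ** 2 for e, t in zip(est, target))
    if residual_energy == 0.0:
        return math.inf  # estimate is a scaled copy of the reference
    return 10.0 * math.log10(target_energy / residual_energy)


if __name__ == "__main__":
    clean = [math.sin(0.1 * n) for n in range(200)]
    noisy = [c + 0.05 * math.cos(1.7 * n) for n, c in enumerate(clean)]
    print("SNR (dB):   ", snr_db(clean, noisy))
    print("SI-SDR (dB):", si_sdr_db(clean, noisy))
```

Both functions expect two equal-length sequences of samples. Halving the amplitude of the estimate changes the SNR but leaves the SI-SDR unchanged, which is why SI-SDR is often preferred when the output gain of a model is arbitrary.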

Contact

If you have any comments or questions about ClearerVoice-Studio, feel free to raise an issue in this repository or contact us directly at:

  • email: {shengkui.zhao, zexu.pan}@alibaba-inc.com

Alternatively, you are welcome to join our DingTalk and WeChat groups to share and discuss algorithms, technology, and user experience feedback. You may scan the following QR codes to join our official chat groups.

[QR codes: ClearVoice in DingTalk | ClearVoice in WeChat]

Friend Links

Check out some awesome GitHub repositories from the Speech Lab of the Institute for Intelligent Computing, Alibaba Group.


Acknowledgements

ClearerVoice-Studio contains third-party components and code modified from several open-source repositories, including SpeechBrain, ESPnet, and TalkNet-ASD.
