
Commit a705139

Release Emilia-Large (#398)
* Release Emilia-Large
* Update README.md
1 parent b7eb09e commit a705139

File tree

2 files changed: +30 −53 lines

README.md (+1)
@@ -34,6 +34,7 @@
 In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent quality measurements across generation tasks. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building **large-scale datasets** for speech synthesis.
 
 ## 🚀 News
+- **2025/02/24**: *The Emilia-Large dataset, featuring over 200,000 hours of data, is now available!!!* Emilia-Large combines the original 101k-hour Emilia dataset (licensed under `CC BY-NC 4.0`) with the brand-new 114k-hour **Emilia-YODAS dataset** (licensed under `CC BY 4.0`). Download at [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia-Dataset). Check details at [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2501.15907).
 - **2025/01/30**: We release the [Amphion v0.2 Technical Report](https://arxiv.org/abs/2501.15442), which provides a comprehensive overview of the Amphion updates in 2024. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2501.15442)
 - **2025/01/23**: [MaskGCT](https://arxiv.org/abs/2409.00750) and [Vevo](https://openreview.net/pdf?id=anQDiQZhDP) got accepted by ICLR 2025! 🎉
 - **2024/12/22**: We release the reproduction of **Vevo**, a zero-shot voice imitation framework with controllable timbre and style. Vevo can be applied to a series of speech generation tasks, including VC, TTS, AC, and more. The released pre-trained models are trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieve SOTA zero-shot VC performance. [![arXiv](https://img.shields.io/badge/OpenReview-Paper-COLOR.svg)](https://openreview.net/pdf?id=anQDiQZhDP) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/Vevo) [![WebPage](https://img.shields.io/badge/WebPage-Demo-red)](https://versavoice.github.io/) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/vc/vevo/README.md)

preprocessors/Emilia/README.md (+29 −53)
@@ -1,83 +1,53 @@
-# Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation
+# Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
 
 [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2407.05361) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia-Dataset) [![OpenDataLab](https://img.shields.io/badge/OpenDataLab-Dataset-blue)](https://opendatalab.com/Amphion/Emilia) [![GitHub](https://img.shields.io/badge/GitHub-Repo-green)](https://github.com/open-mmlab/Amphion/tree/main/preprocessors/Emilia) [![demo](https://img.shields.io/badge/WebPage-Demo-red)](https://emilia-dataset.github.io/Emilia-Demo-Page/)
 
 This is the official repository 👑 for the **Emilia** dataset and the source code for the **Emilia-Pipe** speech data preprocessing pipeline.
 
 <div align="center"><img width="500px" src="https://github.com/user-attachments/assets/b1c1a1f8-3149-4f96-8eb4-af470152a9b7" /></div>
 
 ## News 🔥
+- **2025/02/24**: *The Emilia-Large dataset, featuring over 200,000 hours of data, is now available!!!* Emilia-Large combines the original 101k-hour Emilia dataset (licensed under `CC BY-NC 4.0`) with the brand-new 114k-hour **Emilia-YODAS dataset** (licensed under `CC BY 4.0`)!!!
+- **2025/01/27**: We release the extended version of Emilia's paper on [arXiv](https://arxiv.org/abs/2501.15907)! More experiments and more insights!
+- **2024/12/04**: We present Emilia at [IEEE SLT 2024](https://2024.ieeeslt.org/)!
 - **2024/09/01**: [Emilia](https://arxiv.org/abs/2407.05361) got accepted by IEEE SLT 2024! 🤗
 - **2024/08/28**: Welcome to join Amphion's [Discord channel](https://discord.com/invite/drhW7ajqAG) to stay connected and engage with our community!
 - **2024/08/27**: *The Emilia dataset is now publicly available!* Discover the most extensive and diverse speech generation dataset with 101k hours of in-the-wild speech data now at [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia-Dataset) or [![OpenDataLab](https://img.shields.io/badge/OpenDataLab-Dataset-blue)](https://opendatalab.com/Amphion/Emilia)! 👑👑👑
 - **2024/07/08**: Our preprint [paper](https://arxiv.org/abs/2407.05361) is now available! 🔥🔥🔥
 - **2024/07/03**: We welcome everyone to check our [homepage](https://emilia-dataset.github.io/Emilia-Demo-Page/) for a brief introduction to the Emilia dataset and our demos!
 - **2024/07/01**: We release Emilia and Emilia-Pipe! We welcome everyone to explore them on our [GitHub](https://github.com/open-mmlab/Amphion/tree/main/preprocessors/Emilia)! 🎉🎉🎉
 
-## Emilia Overview ⭐️
-The **Emilia** dataset is a comprehensive, multilingual dataset with the following features:
-- containing over *101k* hours of speech data;
+
+## Emilia-Large Overview ⭐️
+
+The **Emilia-Large** dataset is a comprehensive, multilingual dataset with the following features:
+- containing over *101k* hours of speech data in *Emilia* and over *114k* hours in *Emilia-YODAS*;
 - covering six different languages: *English (En), Chinese (Zh), German (De), French (Fr), Japanese (Ja), and Korean (Ko)*;
 - containing diverse speech data with *various speaking styles* from diverse video platforms and podcasts on the Internet, covering various content genres such as talk shows, interviews, debates, sports commentary, and audiobooks.
 
 The table below provides the duration statistics for each language in the dataset.
 
-| Language | Duration (hours) |
-|:--------:|:----------------:|
-| English  | 46,828           |
-| Chinese  | 49,922           |
-| German   | 1,590            |
-| French   | 1,381            |
-| Japanese | 1,715            |
-| Korean   | 217              |
+| Language  | Emilia Duration (hours) | Emilia-YODAS Duration (hours) | Total Duration (hours) |
+|:---------:|:-----------------------:|:-----------------------------:|:----------------------:|
+| English   | 46.8k                   | 92.2k                         | 139.0k                 |
+| Chinese   | 49.9k                   | 0.3k                          | 50.3k                  |
+| German    | 1.6k                    | 5.6k                          | 7.2k                   |
+| French    | 1.4k                    | 7.4k                          | 8.8k                   |
+| Japanese  | 1.7k                    | 1.1k                          | 2.8k                   |
+| Korean    | 0.2k                    | 7.3k                          | 7.5k                   |
+| **Total** | **101.7k**              | **113.9k**                    | **215.6k**             |
 
 
 The **Emilia-Pipe** is the first open-source preprocessing pipeline designed to transform raw, in-the-wild speech data into high-quality training data with annotations for speech generation. This pipeline can process one hour of raw audio into model-ready data in just a few minutes, requiring only the raw speech data.
 
-Detailed description for the Emilia and Emilia-Pipe could be found in our [paper](https://arxiv.org/abs/2407.05361).
+Detailed descriptions of Emilia and Emilia-Pipe can be found in our [paper](https://arxiv.org/abs/2407.05361) and its [extended version](https://arxiv.org/abs/2501.15907).
 
 ## Emilia Dataset Usage 📖
-The Emilia dataset is now publicly available at [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia-Dataset)! Users in mainland China can also download Emilia from [![OpenDataLab](https://img.shields.io/badge/OpenDataLab-Dataset-blue)](https://opendatalab.com/Amphion/Emilia)!
-
-- To download from HuggingFace, you must first gain access to the dataset by completing the request form and accepting the terms of access. Please note that due to HuggingFace's file size limit of 50 GB, the `EN/EN_B00008.tar.gz` file has been split into `EN/EN_B00008.tar.gz.0` and `EN/EN_B00008.tar.gz.1`. Before extracting the files, you will need to run the following command to combine the parts: `cat EN/EN_B00008.tar.gz.* > EN/EN_B00008.tar.gz`
-
-- To download from OpenDataLab (i.e., OpenXLab), please follow the guidance [here](https://speechteam.feishu.cn/wiki/PC8Ew5igviqBiJkElMJcJxNonJc) to gain access.
-
-**ENJOY USING EMILIA!!!** 🔥
+Emilia and Emilia-YODAS are publicly available at [HuggingFace](https://huggingface.co/datasets/amphion/Emilia-Dataset). Please check the README on HuggingFace for usage guidelines.
 
 If you wish to re-build Emilia from scratch, you may download the raw audio files from the [provided URL list](https://huggingface.co/datasets/amphion/Emilia) and use our open-source [Emilia-Pipe](https://github.com/open-mmlab/Amphion/tree/main/preprocessors/Emilia) preprocessing pipeline to preprocess the raw data. Additionally, users can easily use Emilia-Pipe to preprocess their own raw speech data for custom needs. By open-sourcing the Emilia-Pipe code, we aim to enable the speech community to collaborate on large-scale speech generation research.
 
 *Please note that Emilia does not own the copyright to the audio files; the copyright remains with the original owners of the videos or audio. Users are permitted to use this dataset only for non-commercial purposes under the CC BY-NC-4.0 license.*
 
-## Emilia Dataset Structure ⛪️
-The Emilia dataset is structured as follows:
-
-Structure example:
-```
-|-- openemilia_all.tar.gz (all .JSONL files are gzipped with directory structure in this file)
-|-- EN (114 batches)
-|   |-- EN_B00000.jsonl
-|   |-- EN_B00000 (= EN_B00000.tar.gz)
-|   |   |-- EN_B00000_S00000
-|   |   |   `-- mp3
-|   |   |       |-- EN_B00000_S00000_W000000.mp3
-|   |   |       `-- EN_B00000_S00000_W000001.mp3
-|   |   |-- ...
-|   |-- ...
-|   |-- EN_B00113.jsonl
-|   `-- EN_B00113
-|-- ZH (92 batches)
-|-- DE (9 batches)
-|-- FR (10 batches)
-|-- JA (7 batches)
-|-- KO (4 batches)
-
-```
-JSONL files example:
-```
-{"id": "EN_B00000_S00000_W000000", "wav": "EN_B00000/EN_B00000_S00000/mp3/EN_B00000_S00000_W000000.mp3", "text": " You can help my mother and you- No. You didn't leave a bad situation back home to get caught up in another one here. What happened to you, Los Angeles?", "duration": 6.264, "speaker": "EN_B00000_S00000", "language": "en", "dnsmos": 3.2927}
-{"id": "EN_B00000_S00000_W000001", "wav": "EN_B00000/EN_B00000_S00000/mp3/EN_B00000_S00000_W000001.mp3", "text": " Honda's gone, 20 squads done. X is gonna split us up and put us on different squads. The team's come and go, but 20 squad, can't believe it's ending.", "duration": 8.031, "speaker": "EN_B00000_S00000", "language": "en", "dnsmos": 3.0442}
-```
-
 
 ## Emilia-Pipe Overview 👀
 The Emilia-Pipe includes the following major steps:
@@ -213,18 +183,24 @@ We acknowledge the wonderful work by these excellent developers!
 ## Reference 📖
 If you use the Emilia dataset or the Emilia-Pipe pipeline, please cite the following papers:
 ```bibtex
+@inproceedings{emilialarge,
+  author={He, Haorui and Shang, Zengqiang and Wang, Chaoren and Li, Xuyuan and Gu, Yicheng and Hua, Hua and Liu, Liwei and Yang, Chen and Li, Jiaqi and Shi, Peiyang and Wang, Yuancheng and Chen, Kai and Zhang, Pengyuan and Wu, Zhizheng},
+  title={Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation},
+  booktitle={arXiv:2501.15907},
+  year={2025}
+}
+
 @inproceedings{emilia,
   author={He, Haorui and Shang, Zengqiang and Wang, Chaoren and Li, Xuyuan and Gu, Yicheng and Hua, Hua and Liu, Liwei and Yang, Chen and Li, Jiaqi and Shi, Peiyang and Wang, Yuancheng and Chen, Kai and Zhang, Pengyuan and Wu, Zhizheng},
   title={Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation},
   booktitle={Proc.~of SLT},
   year={2024}
 }
-```
-```bibtex
+
 @inproceedings{amphion,
   author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
   title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
-  booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
+  booktitle={Proc.~of SLT},
   year={2024}
 }
 ```
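
For context on the usage section changed above: a minimal sketch of streaming the dataset from HuggingFace, assuming the Hugging Face `datasets` library is installed and access to `amphion/Emilia-Dataset` has been granted. The `split="train"` argument is an assumption, not something this commit confirms; consult the dataset card on HuggingFace for the actual configuration and split names.

```python
# Minimal sketch: stream Emilia from HuggingFace with the `datasets` library.
# Assumes `pip install datasets` and granted access to amphion/Emilia-Dataset;
# the "train" split name is an assumption -- check the dataset card.
from datasets import load_dataset

ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)

for sample in ds:
    # Each record carries metadata such as id, text, duration, speaker,
    # language, and dnsmos (per the JSONL example shown in the diff above).
    print(sample)
    break
```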
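The removed dataset-structure section still documents the per-utterance JSONL metadata schema (fields `id`, `wav`, `text`, `duration`, `speaker`, `language`, `dnsmos`). As a sketch of how such metadata can be consumed, the snippet below filters records from a batch JSONL file by DNSMOS score; the file path and the 3.0 threshold are illustrative assumptions, not values from the repository.

```python
import json

# Filter per-utterance metadata from a batch JSONL file (e.g. EN_B00000.jsonl)
# by DNSMOS score. Field names follow the JSONL example in the removed
# structure section; the 3.0 threshold is purely illustrative.
def load_high_quality(jsonl_path: str, min_dnsmos: float = 3.0) -> list[dict]:
    keep = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            meta = json.loads(line)
            if meta["dnsmos"] >= min_dnsmos:
                keep.append(meta)
    return keep

# Example (hypothetical path): records = load_high_quality("EN/EN_B00000.jsonl")
```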
