[Help]: How to Obtain Emilia-YODAS 114k Hours Raw URLs for Processing with emilia-pipe #402

Open
zhuangweiji opened this issue Feb 28, 2025 · 1 comment

Comments

@zhuangweiji

Problem Overview

I am working with the Emilia dataset and have been using the Emilia 101k hours dataset. Recently, I noticed that the dataset has been expanded with an additional 114k hours of data under the Emilia-YODAS section. I would like to obtain the raw URLs for the Emilia-YODAS data so that I can process it using emilia-pipe. However, I haven't been able to find a direct way to retrieve these URLs.

Steps Taken

Checked the Hugging Face dataset page for any listed download URLs.

Used huggingface_hub.snapshot_download(repo_id="amphion/Emilia-Dataset", allow_patterns=["Emilia-YODAS/*"]) to fetch the new data, but this does not provide direct access to raw URLs.

Explored the dataset structure and metadata to find potential references to source URLs but couldn't locate them.

Searched previous issues and discussions for information related to extracting original dataset URLs but did not find a solution.
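For reference, the download step above can be sketched as follows. This is a minimal sketch, not the maintainers' recommended method. It assumes the `huggingface_hub` package is installed and that you are authenticated if the repository requires it. The helper names `build_download_kwargs` and `fetch_subset` are hypothetical. One detail worth noting: `snapshot_download` treats `repo_id` as a model repository unless `repo_type="dataset"` is passed, so the sketch sets it explicitly.

```python
from typing import Any


def build_download_kwargs(repo_id: str, subset: str) -> dict[str, Any]:
    """Build keyword arguments for huggingface_hub.snapshot_download.

    Hypothetical helper: restricts the download to one top-level folder
    of a dataset repository on the Hugging Face Hub.
    """
    return {
        "repo_id": repo_id,
        # Dataset repos need this; the default repo type is "model".
        "repo_type": "dataset",
        # Only fetch files under the requested subset folder.
        "allow_patterns": [f"{subset}/*"],
    }


def fetch_subset(repo_id: str, subset: str) -> str:
    """Download the subset and return the local snapshot directory."""
    # Imported lazily so the kwargs helper works without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_download_kwargs(repo_id, subset))


if __name__ == "__main__":
    # Downloads the processed Emilia-YODAS files (large); this yields
    # the processed data, not the original source URLs.
    print(fetch_subset("amphion/Emilia-Dataset", "Emilia-YODAS"))
```

This only retrieves the processed files hosted in the dataset repository; it does not expose where the original audio came from.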

Expected Outcome

I would like to:

Obtain the raw URLs of the Emilia-YODAS 114k hours dataset.

Use these URLs to feed data into emilia-pipe for further processing.

Understand if there is a recommended way to extract or generate these URLs from the Hugging Face dataset.

Screenshots

N/A

Environment Information

N/A

Additional Context

If there is an existing way to extract the URLs or if they are stored in a metadata file, please let me know. Any guidance on accessing this data efficiently would be greatly appreciated!

@HarryHe11
Collaborator

Hi, thank you so much for your attention to our work! Please refer to the original Yodas dataset for the raw data and meta information: https://huggingface.co/datasets/espnet/yodas2
