[Bug] Index out of range issue with voice cloning #4179

Open
qqwjq1981 opened this issue Apr 1, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@qqwjq1981
Describe the bug

I frequently get an "index out of range" error when cloning a voice from a speaker sample and a text script. The failure does not appear to depend on the number of tokens: a text with more tokens may succeed while a text with fewer tokens fails.

Successful Text: Nature has a delicate and complex structure that allows thousands of creatures to maintain a delicate balance.
Tokenized length: 60
Tokens: [259, 467, 1375, 18, 2, 1221, 2, 14, 2, 636, 91, 186, 2, 53, 2, 884, 25, 169, 2, 32, 1951, 2766, 861, 2, 73, 2, 14, 84, 69, 32, 2, 40, 206, 43, 2864, 2, 58, 2, 814, 18, 14, 1375, 61, 2, 51, 2, 845, 33, 137, 2, 14, 2, 636, 91, 186, 2, 15, 1821, 3263, 9]
Failed Text: Life is so beautiful
Tokenized length: 13
Tokens: [259, 25, 140, 18, 2, 54, 2, 123, 2, 67, 847, 140, 167]

Code:
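from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")

# full_text, speaker_wav_path, target_language, output_audio_path and
# speed_tts are set elsewhere in the application.
tts.tts_to_file(
    text=full_text,
    speaker_wav=speaker_wav_path,
    language=target_language,
    file_path=output_audio_path,
    speed=speed_tts,
    split_sentences=True,
)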

Error message:
ERROR:main:❌ Error during voice cloning:
ERROR:main:Traceback (most recent call last):
  File "/home/user/app/app.py", line 485, in generate_voiceover_clone
    tts.tts_to_file(
  File "/usr/local/lib/python3.10/site-packages/TTS/api.py", line 334, in tts_to_file
    wav = self.tts(
  File "/usr/local/lib/python3.10/site-packages/TTS/api.py", line 276, in tts
    wav = self.synthesizer.tts(
  File "/usr/local/lib/python3.10/site-packages/TTS/utils/synthesizer.py", line 386, in tts
    outputs = self.tts_model.synthesize(
  File "/usr/local/lib/python3.10/site-packages/TTS/tts/models/xtts.py", line 419, in synthesize
    return self.full_inference(text, speaker_wav, language, **settings)
  File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/TTS/tts/models/xtts.py", line 488, in full_inference
    return self.inference(
  File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/TTS/tts/models/xtts.py", line 541, in inference
    gpt_codes = self.gpt.generate(
  File "/usr/local/lib/python3.10/site-packages/TTS/tts/layers/xtts/gpt.py", line 590, in generate
    gen = self.gpt_inference.generate(
  File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1575, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2697, in _sample
    outputs = self(
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/TTS/tts/layers/xtts/gpt_inference.py", line 94, in forward
    emb = emb + self.pos_embedding.get_fixed_embedding(
  File "/usr/local/lib/python3.10/site-packages/TTS/tts/layers/xtts/gpt.py", line 40, in get_fixed_embedding
    return self.emb(torch.tensor([ind], device=dev)).unsqueeze(0)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/usr/local/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
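
Editor's note: the traceback passes through `_sample` (sampled generation) and fails in a positional-embedding lookup, so the index that goes out of range is a position, not a text token. One possible reading, which is an assumption and not confirmed anywhere in this thread, is that a sampled run occasionally fails to stop and runs past the position table; that would fit the observation that short texts can fail while longer ones succeed. Under that assumption, a stopgap is to retry the call when it raises IndexError. The helper `call_with_retries` below is hypothetical and not part of the TTS library; it is a minimal sketch, not a fix for the underlying problem.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], attempts: int = 3) -> T:
    """Call fn() up to `attempts` times, retrying only when it raises IndexError.

    Hypothetical helper for illustration; any other exception propagates
    immediately, and the last IndexError is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[IndexError] = None
    for _ in range(attempts):
        try:
            return fn()
        except IndexError as err:
            last_error = err
    assert last_error is not None
    raise last_error
```

With the variables from the report, the call could be wrapped as `call_with_retries(lambda: tts.tts_to_file(text=full_text, speaker_wav=speaker_wav_path, language=target_language, file_path=output_audio_path))`. Retrying only makes sense if the failure really is nondeterministic; if the same input fails on every attempt, the helper simply re-raises the error.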

To Reproduce

Code:
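from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")

# full_text, speaker_wav_path, target_language, output_audio_path and
# speed_tts are set elsewhere in the application.
tts.tts_to_file(
    text=full_text,
    speaker_wav=speaker_wav_path,
    language=target_language,
    file_path=output_audio_path,
    speed=speed_tts,
    split_sentences=True,
)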

Expected behavior

No response

Logs

Environment

# Coqui TTS (XTTS v2)
TTS==0.22.0
torch==2.1.0  # Or the version best suited for your GPU/CPU
CPU

Additional context

No response

qqwjq1981 added the bug label Apr 1, 2025
@eginhard
Contributor

eginhard commented Apr 2, 2025

Can you try our fork (available via pip install coqui-tts)? This repo is not maintained anymore.

The following code runs fine for me there:

from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="Life is so beautiful",
    speaker="Uta Obando",
    language="en",
)
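
Editor's note: the original package (TTS, pinned as TTS==0.22.0 in the report) and the fork (coqui-tts) are both imported with `from TTS.api import TTS`, so the import line alone does not show which one is installed. A minimal sketch for checking, using only the standard library; `installed_version` is a hypothetical helper name, and the two distribution names are the ones mentioned in this thread:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names taken from this thread: the original package and the fork.
for name in ("TTS", "coqui-tts"):
    print(f"{name}: {installed_version(name)}")
```

If both names report a version, the two packages may be shadowing each other; uninstalling the original before installing the fork is probably the safer order.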
