Batch + VAD issue caused by merge_segments #1270

vkras · 2025-03-25T17:09:17Z

I was investigating an issue where batch mode misses some phrases at the beginning of the audio. The missing phrases came back in non-batch mode with the VAD filter enabled.

It comes down to the merge_segments function. I'm not 100% sure what it was designed to do (probably to merge segments located close to each other, as long as the result still fits within max_speech_duration_s, which is set to chunk_size in batch mode).

But there are several problems with it:

- Padding by vad_options.speech_pad_ms seems excessive because the same padding is already applied by running the VAD filter itself, so this edge_padding basically doubles it.
- More importantly, the code merges segments as soon as there are 3 exceeding max_speech_duration_s, needlessly and even when they are not close to each other. This basically invalidates and ignores the VAD results. As soon as there are 2 segments within max_speech_duration_s, it leaves the segments unchanged. Here are some examples (I added debug statements for batch mode to transcribe.py):

-- Good example, several first segments are left as they are --

2025-03-25 12:51:38 - DEBUG - VAD filter kept the following audio segments: **[00:00.752 -> 00:03.280], [00:05.040 -> 00:31.872],** [00:31.872 -> 00:36.592], [00:38.096 -> 00:48.112], [00:50.416 -> 01:11.856], [01:14.736 -> 01:33.296], [01:34.672 -> 01:42.768], [01:44.048 -> 01:45.696]
2025-03-25 12:51:38 - DEBUG - Clip segments: **[00:00.752 -> 00:03.280], [00:05.040 -> 00:31.872],** [00:31.872 -> 00:48.112], [00:50.416 -> 01:11.856], [01:14.736 -> 01:42.768], [01:44.048 -> 01:45.696]

-- Bad example, first 3 segments are merged into 1, effectively ignoring VAD results even though there is 1.5 sec between 1st and 2nd and 5 sec between 2nd and 3rd --

2025-03-25 12:47:48 - DEBUG - VAD filter kept the following audio segments: **[00:01.968 -> 00:04.560], [00:06.064 -> 00:19.120], [00:24.400 -> 00:31.184],** [00:32.656 -> 00:54.256], [00:57.936 -> 01:01.104], [01:14.416 -> 01:33.136], [01:57.968 -> 02:11.536], [02:17.328 -> 02:45.904], [02:47.248 -> 02:48.768]
2025-03-25 12:47:48 - DEBUG - Clip segments: **[00:01.968 -> 00:31.184],** [00:32.656 -> 01:01.104], [01:14.416 -> 01:33.136], [01:57.968 -> 02:11.536], [02:17.328 -> 02:45.904], [02:47.248 -> 02:48.768]
2025-03-25 12:47:48 - INFO - VAD filter removed 00:48.720 of audio
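
Both logs above are consistent with a greedy merge that only looks at the start-to-end span of the clip being built. The sketch below is a simplified model, not the library's code: the name `merge_by_span` is hypothetical, padding is ignored, times are in milliseconds, and the 30-second limit is an assumption about chunk_size.

```python
def merge_by_span(segments, max_span_ms=30_000):
    """Greedily extend a clip while its start-to-end span stays within the limit."""
    if not segments:
        return []
    merged = []
    clip_start, clip_end = segments[0]
    for start, end in segments[1:]:
        # Close the current clip once adding this segment would exceed the span limit.
        if end - clip_start > max_span_ms:
            merged.append((clip_start, clip_end))
            clip_start = start
        clip_end = end
    merged.append((clip_start, clip_end))
    return merged


# VAD segments copied from the two logs above, converted to milliseconds.
good = [(752, 3_280), (5_040, 31_872), (31_872, 36_592), (38_096, 48_112),
        (50_416, 71_856), (74_736, 93_296), (94_672, 102_768), (104_048, 105_696)]
bad = [(1_968, 4_560), (6_064, 19_120), (24_400, 31_184), (32_656, 54_256),
       (57_936, 61_104), (74_416, 93_136), (117_968, 131_536),
       (137_328, 165_904), (167_248, 168_768)]

good_clips = merge_by_span(good)
bad_clips = merge_by_span(bad)
print(good_clips)
print(bad_clips)
```

With these inputs the sketch reproduces both "Clip segments" lines. In the bad example the first three VAD segments span 29.216 s (00:01.968 to 00:31.184), so they collapse into one clip, silence included.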

Do we even need this function? I tried replacing it with a dummy function that returns its input right away, and that seemed to solve my issue with the missing transcription. That's because in this case the same effective audio is processed; the only difference is that batch mode limits segment size with chunk_size and non-batch mode does not.

Update: dropping the function fixed one place but broke another. In all cases, the non-batch method with VAD produced a correct transcript.


MahmoudAshraf97 · 2025-03-26T11:57:18Z

Hi, and thanks for the investigation. I'm aware that the VAD filter in batch mode is not actually filtering anything; it's only used for segmentation. The correct way to do it is to merge only the speech segments and later correct the timestamps to account for the removed silence, which is the approach currently used in non-batch mode. I already experimented with implementing this and found that it made no difference in WER, but it should be implemented nonetheless because my testing was not extensive. I'm willing to review and accept any PRs that implement this.
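
A minimal sketch of the timestamp correction described here, assuming the speech chunks are concatenated with the silence dropped. The helper names are hypothetical and are not faster-whisper's API; times are in milliseconds, and the chunks are the first three VAD segments of the bad example.

```python
from bisect import bisect_right


def chunk_offsets(chunks):
    """Start position of each speech chunk on the silence-free timeline."""
    offsets, total = [], 0
    for start, end in chunks:
        offsets.append(total)
        total += end - start
    return offsets


def to_original_ms(t, chunks, offsets):
    """Map a time on the silence-free timeline back to the original audio."""
    # Find the last chunk whose silence-free start is <= t.
    i = max(bisect_right(offsets, t) - 1, 0)
    return chunks[i][0] + (t - offsets[i])


chunks = [(1_968, 4_560), (6_064, 19_120), (24_400, 31_184)]
offsets = chunk_offsets(chunks)
print(offsets)
print(to_original_ms(3_000, chunks, offsets))
```

A timestamp produced by the model at 3.000 s on the silence-free audio falls 408 ms into the second chunk, so it maps back to 6.472 s in the original recording.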

Purfview · 2025-03-26T15:09:00Z

> But there are several problems with it..

IMO, that's expected behaviour.

> I'm not 100% what it was designed to do..

It's to cut audio to chunks.

> Do we even need this function?

Without it, performance can be much slower, which would negate the whole idea behind batched mode.
But leaving segments unmerged can have a positive effect on some things, for example on the multilingual option.

> In All cases non-batch method with VAD produced correct transcript.

It's expected that batched mode produces more errors; you sacrifice a bit of transcription quality for speed.

vkras · 2025-03-26T20:17:33Z

First of all, I appreciate you looking into this.

> It's to cut audio to chunks.

Not really: the audio is cut into chunks by the first call, which runs VAD with max_speech_duration_s=chunk_length:
active_segments = get_speech_timestamps(audio, vad_parameters)

I think I understand now: merge_segments is designed to merge smaller chunks after VAD to create a smaller number of larger clips (closer to 30 sec) for more efficient parallel processing.

So in a perfect world, two changes would improve quality and performance:

- Silence would be dropped from the snippets before they are sent for processing.
- merge_segments should look at the total speech duration of the segments instead of their start/stop span. That would make it possible to merge, into 30-second clips, even segments that are not merged now.
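
The second change can be sketched as a greedy grouping by summed speech duration. This is an illustration only: `merge_by_speech` is a hypothetical name, the 30-second limit is assumed, times are in milliseconds, and the input is the VAD output from the good example earlier in the thread.

```python
def merge_by_speech(segments, max_speech_ms=30_000):
    """Group segments while their summed speech duration stays within the limit."""
    groups, current, total = [], [], 0
    for start, end in segments:
        duration = end - start
        # Start a new group once this segment would push speech time over the limit.
        if current and total + duration > max_speech_ms:
            groups.append(current)
            current, total = [], 0
        current.append((start, end))
        total += duration
    if current:
        groups.append(current)
    return groups


vad_segments = [(752, 3_280), (5_040, 31_872), (31_872, 36_592), (38_096, 48_112),
                (50_416, 71_856), (74_736, 93_296), (94_672, 102_768), (104_048, 105_696)]

groups = merge_by_speech(vad_segments)
group_sizes = [len(g) for g in groups]
print(group_sizes)
```

On this input the grouping yields 4 groups, compared with the 6 clips in the "Clip segments" log line, because the first two segments hold 29.36 s of speech even though they span more than 30 s. Each group keeps its original (start, end) pairs, so timestamps can still be mapped back after the silence is removed.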

Purfview · 2025-03-26T20:32:08Z

> Not really, the audio is cut into

I meant it as a whole; the whole point of it is to get chunks. That's why you can't disable VAD in batched mode.

> So in perfect world, 2 changes would improve quality and performance:

I'm sure it would decrease the quality of the timestamps.
