
Potentially better alternative for summing final audio #74

Open
WrongProtocol opened this issue Feb 21, 2025 · 2 comments

Comments

@WrongProtocol

Team,

While working on a version of YuE that incorporates Flow-Matching, it was helpful to draw out a flow chart of the current implementation. Through this exercise and listening to each of the output wav files, I noticed something peculiar.

Currently, to create the final output, the system takes the combined reconstruction from codec_model and sums it (in the post-process) with the combined output of the vocoder. That seemed odd.

  1. The instrumental from the vocoder appears to be lower quality than the instrumental decoded from stage2.
  2. While Vocos does an amazing job with vocals, it seems to perform poorly on instrumentals.
  3. As expected, the vocal from Vocos is superior to the stage2-decoded vocal.

The current post-process combines two already-combined outputs: it attempts to retain the low-frequency information from the stage2-decoded reconstruction and sum it with the high-frequency information from the Vocos reconstruction. However, this creates a 'dirty' end result with a lot of unnecessary artifacts, because two combined audio files are being summed.

I propose the final output be the Instrumental stem from stage2 decoding, summed with the vocal stem from Vocos. The attached image shows this in a clear drawing.

Image
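To make the proposal concrete, here is a minimal sketch of the suggested mix, assuming mono float stems loaded as NumPy arrays in [-1, 1]; the function name is illustrative and not from the YuE codebase:

```python
import numpy as np

def mix_stems(stage2_inst: np.ndarray, vocos_vocal: np.ndarray) -> np.ndarray:
    """Sum the stage2-decoded instrumental stem with the Vocos vocal stem.

    Trims both stems to the shorter length, then peak-normalizes the sum
    if it would clip. Inputs are mono float arrays in [-1, 1].
    """
    n = min(len(stage2_inst), len(vocos_vocal))
    mix = stage2_inst[:n] + vocos_vocal[:n]
    peak = np.max(np.abs(mix))
    if peak > 1.0:
        mix = mix / peak  # simple peak normalization to avoid clipping
    return mix
```

The key difference from the current post-process is that each stem comes from the decoder that handles it best, so nothing is summed twice.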

Thank you for all of your hard work on this project. I have gladly dedicated many nights to learning from your work. It is much appreciated.

Thanks,
Carmine (aka WrongProtocol)

@a43992899 (Collaborator)

Thanks! We will test this out soon!

@WrongProtocol (Author) commented Feb 22, 2025

Upon further experimentation, I was able to achieve the best overall quality by doing the following:

  1. high-pass the vocoder's itrack.mp3 at 4000 Hz
  2. sum it with the stage2 decoder's instrumental to create a final_inst
  3. sum final_inst with the vocoder's vtrack

That retains the high-end energy of the vocoder's instrumental without introducing all of the ugly low-frequency artifacts the vocoder was adding, and without the noisiness of doubling up two different combined mixes.
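The three steps above can be sketched as follows, assuming mono float stems as NumPy arrays and a known sample rate; the Butterworth high-pass stands in for whatever filter you prefer, and the function name is illustrative:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def blend_instrumentals(stage2_inst: np.ndarray,
                        vocoder_inst: np.ndarray,
                        sr: int = 44100,
                        cutoff_hz: float = 4000.0) -> np.ndarray:
    """Keep only the vocoder instrumental's content above ~4 kHz and add it
    to the stage2-decoded instrumental, which supplies the low end."""
    # 4th-order high-pass, applied forward-backward for zero phase shift
    sos = butter(4, cutoff_hz, btype="highpass", fs=sr, output="sos")
    high_freq = sosfiltfilt(sos, vocoder_inst)
    n = min(len(stage2_inst), len(high_freq))
    return stage2_inst[:n] + high_freq[:n]  # final_inst; add vtrack afterward
```

Summing the result with the vocal stem then yields the final mix.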

It could benefit from a little compression or RMS matching (as in the existing post-process) to "glue" the instrumental and vocal together.
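For the RMS-matching part, a minimal sketch (not the project's actual post-process; function names are illustrative) would scale one stem so its average level matches the other before summing:

```python
import numpy as np

def rms(x: np.ndarray) -> float:
    """Root-mean-square level of a signal."""
    return float(np.sqrt(np.mean(np.square(x))))

def match_rms(signal: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Scale `signal` so its RMS equals the RMS of `reference`."""
    level = rms(signal)
    if level == 0.0:
        return signal  # silent input: nothing to scale
    return signal * (rms(reference) / level)
```

A compressor on the summed bus would do a similar job dynamically, but static RMS matching is the simpler first step.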

It's just a suggestion. This has been a fun project to explore!
