
Potentially better alternative for summing final audio #74

Open
WrongProtocol opened this issue Feb 21, 2025 · 2 comments

Comments

@WrongProtocol

Team,

While working on a version of YuE that incorporates Flow-Matching, it was helpful to draw out a flow chart of the current implementation. Through this exercise and listening to each of the output wav files, I noticed something peculiar.

Currently, to create the final output, the system takes the combined reconstruction from codec_model and sums it (in the post-process) with the combined output of the vocoder. That seemed odd.

  1. The instrumental from the vocoder appears to be lower quality than the instrumental decoded from stage2.
  2. While Vocos does an amazing job with vocals, it seems to perform poorly on instrumentals.
  3. As expected, the vocal from Vocos is superior to the stage2-decoded vocal.

The current post-process combines two already-combined outputs: it attempts to retain the low-frequency information from the stage2-decoded reconstruction and sum it with the high-frequency information from the Vocos reconstruction. However, this creates a 'dirty' end result with a lot of unnecessary artifacts, because two combined audio files are being summed.

I propose the final output be the Instrumental stem from stage2 decoding, summed with the vocal stem from Vocos. The attached image shows this in a clear drawing.

Image
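To make the proposal concrete, here is a minimal sketch of the suggested mix, assuming mono float stems loaded as NumPy arrays in [-1, 1]; the function name is illustrative and not from the YuE codebase:

```python
import numpy as np

def mix_stems(stage2_inst: np.ndarray, vocos_vocal: np.ndarray) -> np.ndarray:
    """Sum the stage2-decoded instrumental stem with the Vocos vocal stem.

    Trims both stems to the shorter length, then peak-normalizes the sum
    if it would clip. Inputs are mono float arrays in [-1, 1].
    """
    n = min(len(stage2_inst), len(vocos_vocal))
    mix = stage2_inst[:n] + vocos_vocal[:n]
    peak = np.max(np.abs(mix))
    if peak > 1.0:
        mix = mix / peak  # simple peak normalization to avoid clipping
    return mix
```

The key difference from the current post-process is that each stem comes from the decoder that handles it best, so nothing is summed twice.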

Thank you for all of your hard work on this project. I have gladly dedicated many nights to learning from your work. It is much appreciated.

Thanks,
Carmine (aka WrongProtocol)

@a43992899 (Collaborator)

Thanks! We will test this out soon!

@WrongProtocol (Author) commented Feb 22, 2025

Upon further experimentation, I was able to achieve the best overall quality by doing the following:

  1. high-pass the vocoder's itrack.mp3 at 4000 Hz
  2. sum it with the stage2 decoder's instrumental to create a final_inst
  3. sum final_inst with the vocoder's vtrack

That retains the high-end energy of the vocoder's instrumental without introducing all of the ugly low-frequency artifacts the vocoder was adding, and without the noisiness of doubling up two different combined mixes.
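The three steps above can be sketched as follows, assuming mono float stems as NumPy arrays and a known sample rate; the Butterworth high-pass stands in for whatever filter you prefer, and the function name is illustrative:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def blend_instrumentals(stage2_inst: np.ndarray,
                        vocoder_inst: np.ndarray,
                        sr: int = 44100,
                        cutoff_hz: float = 4000.0) -> np.ndarray:
    """Keep only the vocoder instrumental's content above ~4 kHz and add it
    to the stage2-decoded instrumental, which supplies the low end."""
    # 4th-order high-pass, applied forward-backward for zero phase shift
    sos = butter(4, cutoff_hz, btype="highpass", fs=sr, output="sos")
    high_freq = sosfiltfilt(sos, vocoder_inst)
    n = min(len(stage2_inst), len(high_freq))
    return stage2_inst[:n] + high_freq[:n]  # final_inst; add vtrack afterward
```

Summing the result with the vocal stem then yields the final mix.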

It could benefit from a little compression or RMS matching (as in the existing post-process) to "glue" the instrumental and vocal together.
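For the RMS-matching part, a minimal sketch (not the project's actual post-process; function names are illustrative) would scale one stem so its average level matches the other before summing:

```python
import numpy as np

def rms(x: np.ndarray) -> float:
    """Root-mean-square level of a signal."""
    return float(np.sqrt(np.mean(np.square(x))))

def match_rms(signal: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Scale `signal` so its RMS equals the RMS of `reference`."""
    level = rms(signal)
    if level == 0.0:
        return signal  # silent input: nothing to scale
    return signal * (rms(reference) / level)
```

A compressor on the summed bus would do a similar job dynamically, but static RMS matching is the simpler first step.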

It's just a suggestion. This has been a fun project to explore!
