Team,

While working on a version of YuE that incorporates Flow-Matching, it was helpful to draw a flow chart of the current implementation. Through this exercise, and by listening to each of the output wav files, I noticed something peculiar.
Currently, to create the final output, the system takes the combined reconstruction from codec_model and sums it (via the post-process) with the combined output of the vocoder. That seemed odd.
The instrumental from the vocoder appears to be of lower quality than the instrumental coming out of the stage2 decoding.
While Vocos does an amazing job with vocals, it seems to perform poorly on instrumentals.
As expected, the vocal from Vocos is superior to the vocal from the stage2 decoding.
The current post-process combines two already-combined outputs: it attempts to retain the low-frequency information from the stage2-decoded reconstruction and sum it with the high-frequency information from the Vocos reconstruction. However, this creates a 'dirty' end result with a lot of unnecessary artifacts, because two complete mixes are being summed.
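For reference, that band-splitting sum looks roughly like the sketch below. This is a minimal illustration of the idea, not the repo's actual post-process; the cutoff frequency, filter order, and sample rate are placeholder assumptions, and `mix_stage2` / `mix_vocos` are hypothetical mono arrays holding the two combined reconstructions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_split_sum(mix_stage2, mix_vocos, sr=44100, cutoff=1000, order=4):
    """Keep the lows from the stage2 reconstruction and the highs from
    the Vocos reconstruction, then sum them (the current approach).
    Cutoff, order, and sample rate are illustrative placeholders."""
    lo = butter(order, cutoff, btype="lowpass", fs=sr, output="sos")
    hi = butter(order, cutoff, btype="highpass", fs=sr, output="sos")
    low_part = sosfiltfilt(lo, mix_stage2)   # low-freq info from stage2
    high_part = sosfiltfilt(hi, mix_vocos)   # high-freq info from Vocos
    # Summing two complete mixes like this is what introduces the artifacts.
    n = min(len(low_part), len(high_part))
    return low_part[:n] + high_part[:n]
```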
I propose that the final output be the instrumental stem from the stage2 decoding, summed with the vocal stem from Vocos. The attached image shows this in a clear drawing.
Thank you for all of your hard work on this project. I have gladly dedicated many nights to learning from your work. It is much appreciated.
Thanks,
Carmine (aka WrongProtocol)
Upon further experimentation, I was able to achieve the best overall quality by doing the following:
1. High-pass the vocoder's itrack.mp3 at 4 kHz.
2. Sum it with the stage2 decoder's instrumental to create final_inst.
3. Sum final_inst with the vocoder's vtrack.
That retains the high-end energy of the vocoder's instrumental without introducing the ugly low-frequency artifacts the vocoder was adding, and without the noisiness of doubling up two different combined mixes.
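As a sketch of those three steps (assuming mono float arrays at a shared sample rate; the file paths, 4th-order Butterworth, and loading details are my assumptions, not part of the recipe itself):

```python
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

sr = 44100  # assumed common sample rate for all three stems

# Hypothetical paths; itrack/vtrack are the vocoder outputs named above.
itrack, _ = sf.read("itrack.wav")        # vocoder instrumental
vtrack, _ = sf.read("vtrack.wav")        # vocoder vocal
inst_s2, _ = sf.read("stage2_inst.wav")  # stage2-decoded instrumental

# 1. High-pass the vocoder instrumental at 4 kHz (filter order assumed).
hp = butter(4, 4000, btype="highpass", fs=sr, output="sos")
itrack_hp = sosfiltfilt(hp, itrack)

# 2. Sum with the stage2 instrumental to form final_inst.
n = min(len(itrack_hp), len(inst_s2), len(vtrack))
final_inst = itrack_hp[:n] + inst_s2[:n]

# 3. Sum final_inst with the vocoder vocal.
final_mix = final_inst + vtrack[:n]
sf.write("final_mix.wav", final_mix, sr)
```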
The result could benefit from a little compression or RMS matching (as in the existing post-process) to "glue" the instrumental and vocal together.
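A simple RMS-matching helper (a sketch of the general technique, not the project's existing implementation) could look like:

```python
import numpy as np

def match_rms(signal, reference, eps=1e-12):
    """Scale `signal` so its RMS level matches that of `reference`,
    which helps 'glue' the instrumental and vocal together."""
    rms_sig = np.sqrt(np.mean(signal ** 2)) + eps
    rms_ref = np.sqrt(np.mean(reference ** 2))
    return signal * (rms_ref / rms_sig)
```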
It's just a suggestion. This has been a fun project to explore!