[Question] whisper vs. ort-wasm-simd-threaded.wasm #161
Comments
I believe this is due to how default HF Spaces are hosted (which blocks usage of SharedArrayBuffer). Here's another thread discussing this: microsoft/onnxruntime#9681 (comment). I would be interested in seeing what performance benefits we could get, though. cc'ing @fs-eire @josephrocca for some help too.
Good spot! Simply adding that will indeed invoke ort-wasm-simd-threaded.wasm. However, I do not see multiple workers spawned (as I would expect), nor any performance improvements.
@jozefchutka So, to confirm, it loaded the ort-wasm-simd-threaded.wasm file? If that is true, then can you check the value of `numThreads`?

Regarding Hugging Face Spaces, I've opened an issue here for COEP/COOP header support: huggingface/huggingface_hub#1525. In the meantime you can use the service worker hack on Spaces, mentioned here: https://github.com/orgs/community/discussions/13309#discussioncomment-3844940
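A quick way to check whether the threaded path is even possible is to look at cross-origin isolation in the page's console. This is just an illustrative sketch using standard browser APIs, not something taken from the demos in this thread:

```js
// The threaded WASM build needs SharedArrayBuffer, which browsers only expose
// when the page is cross-origin isolated (i.e. COOP/COEP headers are served).
console.log("crossOriginIsolated:", self.crossOriginIsolated);
console.log("SharedArrayBuffer available:", typeof SharedArrayBuffer !== "undefined");
console.log("hardwareConcurrency:", navigator.hardwareConcurrency);
```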
If multi-thread features are available, ort-web will spawn [CPU-core-number / 2] (up to 4) threads by default, if `numThreads` is not explicitly set.
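In other words, the default thread count seems to work out to something like the following (a sketch based on the description above; the exact rounding is an assumption):

```js
// Half the logical cores, capped at 4. This only applies when SharedArrayBuffer
// (i.e. cross-origin isolation) is actually available to the page.
const cores = navigator.hardwareConcurrency || 1;
const defaultNumThreads = Math.min(4, Math.ceil(cores / 2));
console.log("expected default numThreads:", defaultNumThreads);
```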
@josephrocca, @fs-eire the following is printed. I have also tried to explicitly set `numThreads` to 4, but with the same result. Something interesting to mention:
For onnxruntime-web, it's:

```js
import { env } from 'onnxruntime-web';
env.wasm.numThreads = 4;
```

For transformers.js, I believe it is exposed in a different way.
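Based on the demo code later in this thread, that different path in transformers.js appears to be the `env.backends.onnx` object (a hedged sketch, not an authoritative API reference):

```js
import { env } from "@xenova/transformers";

// transformers.js exposes the onnxruntime-web environment under env.backends.onnx,
// so the WASM thread count is set there.
env.backends.onnx.wasm.numThreads = 4;
```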
I believe @jozefchutka was doing it correctly, i.e., setting `env.backends.onnx.wasm.numThreads = 4` (the transformers.js counterpart of `env.wasm.numThreads` in onnxruntime-web).
I've been able to do some more testing on this and I am not seeing any performance improvements either... 🤔
The code is here: this code runs only once, when trying to create the first inference session.
Right, so we should be seeing a performance improvement simply by having loaded ort-wasm-simd-threaded.wasm?
Yes. If you see that it is loaded but no worker threads are spawned, that is likely to be a bug.
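For context, a minimal raw onnxruntime-web snippet that would hit that one-time initialization path looks something like this (a sketch; the model path is a placeholder, not a file from this thread):

```js
import * as ort from "onnxruntime-web";

ort.env.wasm.numThreads = 4;

// Per the comment above, the WASM backend (and its worker pool) is initialized
// the first time an inference session is created.
const session = await ort.InferenceSession.create("model.onnx", {
  executionProviders: ["wasm"],
});
console.log("inputs:", session.inputNames, "outputs:", session.outputNames);
```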
Yes, this is something mentioned above by @jozefchutka.

transformers.js does not do anything extra when it comes to threading, so I do believe this is an issue with onnxruntime-web. Please let me know if there's anything I can do to help debug/test.
@xenova Unless I misunderstand, you or @jozefchutka might need to provide a minimal example here. I don't see the problem of worker threads appearing too late (i.e. after inference) in this fairly minimal demo, for example: https://josephrocca.github.io/openai-clip-js/onnx-image-demo.html That's using the latest ORT Web version, and has multithreading enabled.

Edit: Oooh, unless this is something that specifically occurs when ORT Web is loaded from within a web worker? I haven't tested that yet, since I've just been using the main thread.

[0] Just a heads-up: for some reason I had to manually refresh the page the first time I loaded it just now - the service worker that adds the COOP/COEP headers didn't refresh automatically like it's supposed to.
If you use the `ort.env.wasm.proxy` flag, the proxy worker will be spawned immediately. This is a different worker from the workers created for multithreaded computation.
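As a small illustration of that flag in raw onnxruntime-web (the transformers.js demos later in this thread set the equivalent `env.backends.onnx.wasm.proxy`); just a sketch:

```js
import * as ort from "onnxruntime-web";

// Spawns a dedicated proxy worker that hosts the WASM runtime, so session
// creation and inference run off the calling thread. This proxy worker is
// separate from the pool of compute threads controlled by numThreads.
ort.env.wasm.proxy = true;
ort.env.wasm.numThreads = 4;
```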
Should we see performance improvements even if the batch size is 1? Could you maybe explain how work is divided among threads, @fs-eire?

Regarding a demo, @jozefchutka would you mind sharing the code you were referring to above? My testing was done inside my whisper-web application, which is quite large and has a lot of bloat around it.
Here is a demo.

worker.js:

```js
import { env, pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/[email protected]/dist/transformers.min.js";

env.allowLocalModels = false;
//env.backends.onnx.wasm.numThreads = 4;

const file = "tos.pcm";
const model = "Xenova/whisper-base.en";
const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model);

const t0 = performance.now();
const result = await pipe(buffer, {
  chunk_length_s: 30,
  stride_length_s: 5,
  return_timestamps: true});
for(let {text, timestamp} of result.chunks)
  console.log(`${timestamp[0]} -> ${timestamp[1]} ${text}`);
console.log(performance.now() - t0);
```

demo.html:

```html
<script>
new Worker("worker.js", {type:"module"});
</script>
```

A script to generate the .pcm file:

```sh
ffmpeg -i tos.mp4 -filter_complex "[0:1]aformat=channel_layouts=mono,aresample=16000[aout]" -map "[aout]" -c:a pcm_f32le -f data tos.pcm
```

Changing the value of `numThreads` (by uncommenting that line) makes no difference.
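One thing worth noting when reproducing this locally: the threaded build only activates when the page is cross-origin isolated, so demo.html has to be served with COOP/COEP headers. A minimal Node sketch (assumes Node 18+; file names match the demo above; cross-origin resources such as the CDN and the model host must also allow CORS under `require-corp`):

```js
// serve.js - tiny static server that adds the headers needed for SharedArrayBuffer
import { createServer } from "node:http";
import { readFile } from "node:fs/promises";
import { extname } from "node:path";

const types = { ".html": "text/html", ".js": "text/javascript", ".pcm": "application/octet-stream" };

createServer(async (req, res) => {
  const path = "." + (req.url === "/" ? "/demo.html" : req.url.split("?")[0]);
  try {
    const body = await readFile(path);
    res.writeHead(200, {
      "Content-Type": types[extname(path)] ?? "application/octet-stream",
      "Cross-Origin-Opener-Policy": "same-origin",
      "Cross-Origin-Embedder-Policy": "require-corp",
    });
    res.end(body);
  } catch {
    res.writeHead(404);
    res.end("not found");
  }
}).listen(8080, () => console.log("serving on http://localhost:8080"));
```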
@fs-eire Any updates on this maybe? 😅 Is there perhaps an issue with spawning workers from a worker itself? Here's a 60-second audio file for testing, if you need it: ted_60.wav
@jozefchutka Can you maybe test with @josephrocca's previous suggestion?
@xenova, I observe no difference in performance, and no extra threads/workers running, when tested with `proxy = true`.
@jozefchutka Did you try not using a worker.js file, and just keeping all the transformers.js logic in the UI thread (but still using proxy=true)?
This is a version without my worker, test.html:

```html
<script type="module">
import { env, pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/[email protected]/dist/transformers.min.js";

env.allowLocalModels = false;
env.backends.onnx.wasm.proxy = true;

const file = "tos.pcm";
const model = "Xenova/whisper-base.en";
const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model);

const t0 = performance.now();
const result = await pipe(buffer, {
  chunk_length_s: 30,
  stride_length_s: 5,
  return_timestamps: true});
for(let {text, timestamp} of result.chunks)
  console.log(`${timestamp[0]} -> ${timestamp[1]} ${text}`);
console.log(performance.now() - t0);
</script>
```

With this script, I can see 4 workers opened; however, it seems to hang and never produces a result.
Are you sure it's not just downloading the model? Can you check your network tab? I'll test this, though.
I've done a bit of benchmarking and there does not seem to be any speedup when using threads. URL: https://xenova-whisper-testing.hf.space/. It consistently takes 3.8 seconds. I do see the threads spawn, though. Also, using the proxy just freezes everything after spawning 6 threads. @jozefchutka am I missing something? Is this also what you see?
@xenova that's the same as what I have observed.
I just tried this with a simple app and it works fine for me.
Do you see speedups too? 👀

@guschmue It does seem to load this file when running this demo, but there are no performance improvements (all runs take about 3.7 seconds). I am still using v1.14.0, so if something changed since then, I can update and check.
While looking into https://cdn.jsdelivr.net/npm/@xenova/[email protected]/dist/transformers.js I can see a reference to ort-wasm-simd-threaded.wasm, however that one never seems to be loaded for whisper/automatic-speech-recognition ( https://huggingface.co/spaces/Xenova/whisper-web ), while it always uses ort-wasm-simd.wasm. I wonder if there is a way to enable or enforce threaded wasm and so improve transcription speed?