Brainstorming Features #2
Web Worker Performance
MISLEADING RESULTS - the benchmark was all wrong so nothing in here made sense. A new benchmark is in the making and I will share results once it's completed. A work in progress can be tested live. ~~It's amazing this library is ~10X slower on the same thread to perform a structured clone but up to 100X faster in complex~~
To whoever is following: an issue has been updated with these details in the Chromium repository, because I find it absurd it's so slow compared to other browsers, and slower than my own user-land library at creating a buffer view of any JS-serializable type.
I'm not sure how you created the benchmarks you've shared above, but my own benchmarking (in Chrome) indicates that using this library with
@jorroll your test doesn't seem to be using the transfer option. I found the original benchmark too clever and a bit hard to understand; I am thinking of refactoring it to double-check the results. Btw, the numbers on first load are similar to Andrea's numbers, but subsequent refreshes show a smaller, albeit considerable, gap.
It is. See the attached screenshot.
Also, strange. I didn't change the stackblitz example at all, but now it appears to be working again 🤷. Hopefully you can open up the link and view the console logs to see the results.
I kind of hardcoded an alternative benchmark, trying to avoid unnecessary things. Encoding/decoding on both sides, it seems like structured clone is faster, or we may be missing something.

worker2.html

<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<script type="module">
import { decode, encode } from "../src/index.js";
import { data } from "./data.js";
const worker = new Worker(new URL("./worker2.js", import.meta.url), {
type: "module",
});
let promise;
worker.addEventListener("message", (event) => {
const { type, data } = event.data;
switch (type) {
case "structured-clone": {
promise.resolve(data);
break;
}
case "buffered-clone": {
promise.resolve(decode(data));
break;
}
case "buffered-clone-echo": {
promise.resolve(decode(data));
break;
}
case "buffered-clone-recursion-all": {
promise.resolve(decode(data, { recursion: "all" }));
break;
}
case "buffered-clone-recursion-some": {
promise.resolve(decode(data, { recursion: "some" }));
break;
}
}
});
function get(type, data) {
promise = Promise.withResolvers();
switch (type) {
case "structured-clone": {
worker.postMessage({ type, data });
break;
}
case "buffered-clone": {
worker.postMessage({ type, data: encode(data) });
break;
}
case "buffered-clone-echo": {
worker.postMessage({ type, data: encode(data) });
break;
}
case "buffered-clone-recursion-all": {
worker.postMessage({
type,
data: encode(data, { recursion: "all" }),
});
break;
}
case "buffered-clone-recursion-some": {
worker.postMessage({
type,
data: encode(data, { recursion: "some" }),
});
break;
}
}
return promise.promise;
}
async function test(type, data) {
// warm
for (let i = 1; i < 5; i++) {
await get(type, { 0: [data] });
}
// time
const iterations = 10;
const start = performance.now();
for (let i = 0; i < iterations; i++) {
await get(type, { 0: [data] });
}
const total = performance.now() - start;
console.log(
type + " - total",
+total.toFixed(2),
"avg",
+(total / iterations).toFixed(2),
);
}
async function main() {
await test("structured-clone", data);
await test("buffered-clone", data);
await test("buffered-clone-echo", data);
await test("buffered-clone-recursion-all", data);
await test("buffered-clone-recursion-some", data);
console.log("--- with alternative data ---");
await test("structured-clone", dataAlternative);
await test("buffered-clone", dataAlternative);
await test("buffered-clone-echo", dataAlternative);
await test("buffered-clone-recursion-all", dataAlternative);
await test("buffered-clone-recursion-some", dataAlternative);
}
main().catch((error) => console.error("error running tests", error));
var dataAlternative = {
_id: "5973782bdb9a930533b05cb2",
isActive: true,
balance: "$1,446.35",
age: 32,
eyeColor: "green",
name: "Lsadafe Kegwewgewg",
gender: "gwgwegweg",
company: "Gewwewgewg",
email: "[email protected]",
phone: "+0 (000) 000-0000",
friends: [
{
id: 0,
name: "Casaasdn Ssadsad",
},
{
id: 1,
name: "Fsdasdasd Msasd",
},
{
id: 2,
name: "Csadf M3434g43g",
},
],
favoriteFruit: "banana",
};
</script>
</head>
</html>

worker2.js

import { encode, decode } from "../src/index.js";
self.addEventListener("message", (event) => {
const { data, type } = event.data;
switch (type) {
case "structured-clone": {
self.postMessage({ type, data });
break;
}
case "buffered-clone": {
const encodedData = encode(decode(data));
self.postMessage({ type, data: encodedData }, [encodedData.buffer]);
break;
}
case "buffered-clone-echo": {
self.postMessage({ type, data }, [data.buffer]);
break;
}
case "buffered-clone-recursion-all": {
const encodedData = encode(decode(data, { recursion: "all" }), {
recursion: "all",
});
self.postMessage({ type, data: encodedData }, [encodedData.buffer]);
break;
}
case "buffered-clone-recursion-some": {
const encodedData = encode(decode(data, { recursion: "some" }), {
recursion: "some",
});
self.postMessage({ type, data: encodedData }, [encodedData.buffer]);
break;
}
default: {
throw new Error(`unknown message type ${type}`);
}
}
});
Ok, I think I know what the issue is … basically the first round trip adds the Worker initialization time + network. I swear I've inverted the order at some point and still seen similar results, but it makes sense: the benchmark is wrong. It's 2am here, but tomorrow I will create a better benchmark that:
This way all things should be closer to a real-world use case and the network time won't affect either case … will post results right after, hoping there will still be better perf via buffered clone; otherwise I'll go back to the drawing board to see if there's any extra room for better perf. Thanks everyone for chiming in and double-checking the results: appreciated, and happy to get this story right 🙏
The test I shared on stackblitz factors this in. Before each test run, a single round-trip request is made and awaited before beginning benching.

async function test(type: 'query' | 'encoded-query') {
// First execute a query to warm up the javascript interpreter
await get(type);
const allStart = performance.now();
let iterations = 0;
for (let i = 1; i < 5; i++) {
iterations++;
const start = performance.now();
const data = await get(type);
const total = performance.now() - start;
// console.log(`get ${type} took ${total}ms`, data);
}
const allTotal = performance.now() - allStart;
console.log(
`${type} tests took ${allTotal}ms; avg of ${allTotal / iterations}ms`
);
}
@jorroll I didn't check 'cause I was on my phone, but that was indeed an obvious mistake of mine in the bench … I rushed to conclusions; it was my fault. I'll create a way better benchmark that covers all the use cases I'm interested in. Thanks 🙏
Ah gotcha. I thought you were referring to the benchmark I shared when you said "the first round trip adds the Worker initialization time + network".
@jorroll and others, a new benchmark is on its way, but I can confirm the previous one was just plain wrong. Current results suggest that while buffered-clone is in fact not faster than structured-clone for a single roundtrip (this is without decoding; it's just encoding once per benchmark, transferring it to the worker, and transferring it back), in a triple worker roundtrip scenario the buffered-clone approach is indeed ideal, because it doesn't sum up every single intermediate step. That scenario covers a shared worker that distributes from a leading main worker: the worker in the leading main thread encodes once, transfers the result to main, which transfers it to the shared worker, which transfers it to the other tab that asked for the results. Right now, for that specific use case, I'd call it a win, as that was the initial discussion, but missing from the benchmark are all the other use cases I am interested in:
I will update this issue once I have concrete results.
can you expand the triple worker roundtrip scenario a bit? from main thread to main, to worker, to a different tab? before you said shared worker? it's a bit dense. just want to make sure I understand the scenario 🙂
OK, everything I wanted to test is currently live and summarized as such (or visit that page and open devtools to see your own results):

- ROUNDTRIP
- SIMPLE SERIALIZATION
- RECURSIVE SERIALIZATION
- COMPLEX SERIALIZATION
- DECODE COMPLEX DATA
My personal conclusions around this library and benefits or "not there yet" performance are the following:
Ultimately, there are two final things I'd like to explore:
@serapath it's a long discussion mentioned in the F.A.Q. section of this library: DallasHoff/sqlocal#39 (comment). TL;DR: there are circumstances where things can work only out of a dynamic Worker (not shared; one Worker per tab/window). In such a scenario, if you want to share the same persistent data across multiple tabs/windows, the graph looks like this:
The scenario sounds convoluted (it is!), but it's actually the only way SQLite on top of synchronous OPFS (Origin Private File System) operations can deliver its scaling value: instead of loading the whole DB in RAM in a Shared Worker, you just open it like SQLite would on a regular File System, using only the RAM needed for any query, keeping data integrity safe and sound in case your browser crashes or you kill it or whatnot. In this scenario there are always a minimum of 3 roundtrips needed for the data, unless one uses a SharedArrayBuffer; but the latter requires awkward, inconsistent headers across all major browsers, while the former just works. In that scenario, the Worker on the leading tab creates the buffer, transfers it to the leading tab, which transfers it to the Shared Worker, which transfers it to the requesting port … and only there is the data decoded. This is the reason this library exists, but I was hoping to make it more of an official thing if the performance was there … which is not (yet) the case. I hope I've answered your question.
This is incorrect. The first tab opened in the browser is elected the leader tab and spawns a dedicated worker as well as a shared worker. The shared worker requests a port to the dedicated worker from the leader tab. The leader tab forwards this request to the dedicated worker, which creates a new message channel and transfers one of the ports back to the shared worker. Now the shared worker can communicate directly with the dedicated worker (bypassing the leader tab). Now say a second tab is opened. The second tab sends a request to the shared worker asking for a port to the dedicated worker. The shared worker forwards this request to the dedicated worker, which creates a new message channel and transfers one of the ports back to the requesting tab. The second tab can now communicate directly with the dedicated worker. Etc. So after initialization, each tab can communicate directly with the dedicated worker.
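For anyone following along, the port-forwarding handshake described above boils down to the MessageChannel API: one side creates a channel and transfers one port to the other side, after which the two ends talk directly. A minimal single-script sketch (both "sides" are inlined here for brevity; in the real setup they are the dedicated worker and the shared worker):

```javascript
// Sketch of the port-forwarding handshake described above, in plain
// MessageChannel terms. In the real setup the two ends are the dedicated
// worker and the shared worker; here both live in one script for brevity.
const { port1, port2 } = new MessageChannel();

// Dedicated-worker side: it creates the channel and would send port2
// through the leader tab, e.g. leaderPort.postMessage({ port: port2 }, [port2]).
port1.addEventListener("message", (event) => {
  port1.postMessage(`echo:${event.data}`); // replies directly, leader bypassed
});
port1.start();

// Shared-worker side: once it holds port2, it talks to the dedicated
// worker without any further hop through the leader tab.
port2.addEventListener("message", (event) => {
  console.log(event.data); // "echo:ping"
  port1.close(); // closing the ports lets a standalone script exit cleanly
  port2.close();
});
port2.start();
port2.postMessage("ping");
```

Once the transferred port is in place, every subsequent message skips the leader tab entirely, which is the whole point of the handshake.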
The problem with Shared Workers is that android chromium (over 50% of web usage) doesn't support them. SharedArrayBuffer also (apparently) isn't supported inside Shared Workers in chromium. Therefore it seems like any solution for this stuff needs to avoid shared workers and instead just use normal dedicated workers, which can use web locks and, hopefully, SharedArrayBuffer to do efficient, leader-elected stuff across tabs. I really do think SharedArrayBuffer is the key here: you seemingly just pass a reference once to each worker/main thread via a message, and they can all share the same memory space without any of the cloning, copying, messaging dance. I shared a lot of excellent links about them in this comment in the parent thread of this effort. Some of them go into detail about how to use them with web locks, mutexes, atomics etc. to avoid issues that can arise from multiple concurrent writers. The issue with them, as Andrea mentioned, is the cross-origin isolation headers. But surely that's not truly insurmountable? Moreover, it doesn't seem like there are good/performant alternatives. It can behave more like a performance polyfill: if people can't set the necessary headers, then it could fall back to whatever dance is more widely supported.
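As a rough illustration of the SharedArrayBuffer idea above: one memory region is passed by reference once, and Atomics coordinate writer and reader without any cloning or copying. This is a sketch assuming the page is cross-origin isolated; both sides are inlined in one script:

```javascript
// Minimal sketch of the SharedArrayBuffer + Atomics coordination idea:
// one memory region shared by reference, with an atomic flag so a reader
// knows when the writer has published its payload.
const sab = new SharedArrayBuffer(8);
const shared = new Int32Array(sab); // [0] = ready flag, [1] = payload

// Writer side (e.g. a worker): publish the payload, then flip the flag.
Atomics.store(shared, 1, 42);
Atomics.store(shared, 0, 1);
Atomics.notify(shared, 0); // wakes any thread blocked in Atomics.wait

// Reader side: no clone, no copy; it reads the very same memory.
if (Atomics.load(shared, 0) === 1) {
  console.log(Atomics.load(shared, 1)); // 42
}
```

In a real worker setup, `Atomics.wait` on slot 0 gives the synchronous blocking behavior the thread discusses; the locks/mutex patterns linked above build on exactly these primitives.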
I see ... then it was less convoluted, thanks to the MessageChannel API? That's interesting: so there's no benefit in having buffers transferable over multiple workers? I thought that was the whole point of the other discussion 🤔
Not sure what you are talking about; my daily work is based on SAB and it's very well supported in Chrome/ium.
Yeah, we are on the same page … the way SAB works, though, is through buffers … that's the reason I want to add a benchmark that does the "potentially recursive data, stringified via one lib that handles it, then converted into a buffer that populates the SAB, so that the receiving part can decode that buffer and move on" dance, which is what I also work with daily on PyScript to allow synchronous interactions in Python through Workers. This library's goal was to tackle that ugly dance: make it officially a buffer, fill the SAB, make it easy-peasy to decode on the receiver part. But bear with me, benchmarks around this are not there yet (they will be).
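That "stringify, convert to a buffer, fill the SAB, decode on the receiver" dance can be sketched with JSON and TextEncoder standing in for buffered-clone's encode/decode; the helper names and the 4-byte length header here are made up for illustration:

```javascript
// Sketch: stringify, turn into bytes, fill the SharedArrayBuffer, decode
// a copy on the receiving side. JSON + TextEncoder stand in for
// buffered-clone's encode/decode; fillSab/readSab are made-up helpers.
function fillSab(sab, value) {
  const bytes = new TextEncoder().encode(JSON.stringify(value));
  new Int32Array(sab, 0, 1)[0] = bytes.length; // 4-byte length header
  new Uint8Array(sab, 4).set(bytes);           // payload after the header
}

function readSab(sab) {
  const length = new Int32Array(sab, 0, 1)[0];
  // .slice() copies out of the SAB; some engines won't let TextDecoder
  // read SAB-backed views directly.
  const bytes = new Uint8Array(sab, 4, length).slice();
  return JSON.parse(new TextDecoder().decode(bytes));
}

const sab = new SharedArrayBuffer(1024);
fillSab(sab, { query: "SELECT 1", rows: [[1]] });
console.log(readSab(sab).query); // "SELECT 1"
```

A binary encoder that writes straight into the SAB would skip the string intermediate entirely, which is the improvement the library is after.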
Exactly … again, this is the reason this effort exists! We already fall back to just ArrayBuffer via sabayon, and I have already explored the ability to have SAB polyfilled (related MR here), but the moment somebody mentioned SABs are not a 1:1 thing but should be available everywhere is the moment I took a step backward, as that'd be an extremely convoluted polyfill we don't actually need on PyScript (they pay me for this stuff, I can stretch it a bit, but there are other priorities too). I've recently pushed a branch that uses
I was saying that SAB apparently doesn't work in Shared Workers in chromium - I think that was brought up in the previous sqlite discussion. Regardless, shared workers really aren't viable until android chromium supports them. Anyway, this is all beyond my capabilities to make any real contributions; I'm mostly just following along. I hope you can figure something out that works well!
@nickchomey correct, and I have a polyfill for it, but it doesn't work with MessageChannel, so the shortcut explained before (where the leading tab creates ports that communicate directly) can't be used, and the fallback is the triple harakiri I've explained. The reason the poly can't work easily is:
If SharedWorker had better support, things would be different, but while it needs less problematic headers, it cannot be delivered as part of a library via CDN: its file must be part of the assets within the domain or it won't bootstrap. It feels like all these wonderful new APIs are practically useless in a cross-env/browser scenario, which is a pity 😢
to whom it might concern, I have a branch that in NodeJS goes down to a 3X slowdown (when it's hot) compared to structuredClone, but I am afraid I've exhausted the amount of possible patterns to use to make it any better. The latest optionally uses a resizable ArrayBuffer that grows while encoding, but that's deadly slow compared to an empty array. Latest MR is here: #5
edit … and then again, because structured-clone fails with that recursive data, the benchmark was misleading, and results are currently very similar on my machine, with structured-clone on average slightly faster … this is driving me crazy! (I think I need to fix structured-clone first, then eventually compare stuff.)

To whom it might concern: of course I've kept testing and debugging flame graphs, and the current state is reasonably good; but, most importantly, this module is now practically perfect for SharedArrayBuffer operations, as demoed live in this test page. When it's hot, it's closer to 1ms than 2ms or 3ms, and the comparison VS structured-clone/json can be found here: https://github.com/WebReflection/buffered-clone/blob/main/test/sab/index.js#L32-L81

Key Differences
Current State

While I still believe that if we had this module natively, performance concerns would just fade away, I am super excited to have finally found a great replacement for the other polyfill in coincident and other projects based on SAB and/or structured-clone/json in general, because it shows that when fewer operations over memory are desired, it's a winning solution:
At this point, what's missing:
That's the update, thank you for reading, happy to answer questions, if any.
FYI structured-clone is now fixed with both ArrayBuffer and DataView ability; the only "gotcha" in there is that buffers are not really recursive, rather duplicated, but as that's a common expectation I didn't spend much time optimizing that aspect … yet. I've also further improved buffered-clone performance in the making of that fix, plus the latest SPECIFICATIONS anyone is welcome to help with, at least around the current Encode part … decoding coming soon!
we are here with regard to the standards approach that ignored a need for a
Sending transferable data across threads via
I made an off-discussion comment that
I added that I thought using transferable objects could be a big performance win, and then clarified that this optimization would need to be added to the sqlite library.
Honestly, all the discussion that resulted from my remark about
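For reference, the transfer semantics at the center of that remark can be observed without spinning up a worker, since structuredClone accepts the same transfer list as postMessage:

```javascript
// Transferring an ArrayBuffer moves the memory instead of copying it:
// the source is detached (byteLength becomes 0) and the clone owns the
// bytes. postMessage(value, [buffer]) behaves the same way across threads.
const source = new Uint8Array([1, 2, 3]).buffer;
const moved = structuredClone(source, { transfer: [source] });

console.log(source.byteLength); // 0 (detached on the sending side)
console.log(moved.byteLength);  // 3 (same bytes, zero-copy)
```

This is why encoding once into a single buffer and transferring it through each hop avoids re-cloning the payload at every intermediate thread.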
This project's idea is to define that buffer in a way that is more general purpose and understood or implemented across multiple PLs: similar to JSON, faster than BSON, and with fewer shenanigans than JSON (recursive, more compact for repeated rows, and so on). My current tests see it as fast as, or faster than, structuredClone (cold run over deeply nested big data), but while almost everything has been spec'd, I'm not happy about the number conversion (as string) … it works fine but it feels wrong and is not fully portable across PLs. Thanks for all the sharing though; even if this might never be the answer for SQLite, it certainly is the way forward for me and my libraries once all the fine-tuning lands and numbers are a real representation of … numbers: fast and always correct.
You're probably aware, but there are a number of libraries which technically provide this functionality (e.g. MessagePack, Google's Protocol Buffers). This being said, existing libraries focus on passing data from a server to a client or between servers: scenarios where minimizing payload size is a priority. I'm not aware of an existing data structure that is performance-optimized for transferring arbitrary data across browser threads, though.
this works also for servers, but it's easy on the client side too … I know it didn't (quite) exist while I was investigating, before creating the json branch out of the structured-clone polyfill, but I had never thought before about filling a SharedArrayBuffer directly and decoding at the speed of light on the other side, and I think buffered-clone has potential. For comparison, this lib is compatible with anything JSON compatible, which is … everything. It doesn't require specialized syntax or utilities, it works with all JS types out of the box, it's super simple in its decoding steps, and it's recursive out of the box, no special extensions needed … it really is like structuredClone but as a buffer, although it could be more widely usable, e.g. for Rust or WASM exchanges.
closing this as
Now that this module is fully covered and production ready, I’d like to write down thoughts around its application and how this can improve further to better tackle all use cases.
Recursion
Update: it's in and it works wonderfully.
It's wonderful and super fast, but it inevitably retains a lot in memory. This is not really an issue in JS (those references will be kept in memory anyway as long as the returned reference exists), but it makes lower-level serialization (WASM, C, Zig) more convoluted for little benefit when:

In other libraries recursion is only for non-primitives, but here there is a chance to fine-tune its capabilities:
The question is whether the implementation should be bitwise based, so that grouping cases or excluding them can be done ad hoc, or generic enough, which feels more appropriate.
Once generic: should the default be more relaxed (some), or does it matter, since at decoding time, unless it is none, it either works or throws if none is expected?
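The bitwise-based option could look something like the following sketch; the flag names and grouping are hypothetical, not the library's API:

```javascript
// Hypothetical bitwise recursion flags: each bit opts one value family
// into reference tracking, so groups can be composed or excluded ad hoc.
const TRACK_NONE    = 0;
const TRACK_OBJECTS = 1 << 0; // objects/arrays: needed for true cycles
const TRACK_STRINGS = 1 << 1; // also deduplicate repeated strings
const TRACK_ALL     = TRACK_OBJECTS | TRACK_STRINGS;

function shouldTrack(value, flags) {
  if (typeof value === "string") return (flags & TRACK_STRINGS) !== 0;
  if (value !== null && typeof value === "object")
    return (flags & TRACK_OBJECTS) !== 0;
  return false; // primitives like numbers are cheaper to re-encode
}

console.log(shouldTrack("abc", TRACK_OBJECTS)); // false
console.log(shouldTrack({}, TRACK_OBJECTS));    // true
```

A generic some/all/none option could then map onto preset flag combinations, keeping the bitwise machinery internal.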
Hooks while encoding
Having a toBufferedClone symbol feels like the right thing to do, but should I think about a way to also decode the returned value later? This could be a method passed as an option to both encoding and decoding, where the different data could be transformed in and back.
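One possible shape for that hook pair, modeled on toJSON; the symbol registration and the reviver option are assumptions, not the library's actual API:

```javascript
// Hypothetical toBufferedClone hook paired with a decode-side reviver.
const toBufferedClone = Symbol.for("toBufferedClone");

class Point {
  constructor(x, y) { this.x = x; this.y = y; }
  // encode side: collapse the instance to a tagged plain object
  [toBufferedClone]() { return { $point: [this.x, this.y] }; }
}

// Encoding side would prefer the hook when a value exposes one.
function toSerializable(value) {
  return typeof value?.[toBufferedClone] === "function"
    ? value[toBufferedClone]()
    : value;
}

// Decode side: the option mentioned above could be a reviver like this.
function revive(value) {
  return value && typeof value === "object" && "$point" in value
    ? new Point(...value.$point)
    : value;
}

const plain = toSerializable(new Point(1, 2)); // plain, encodable object
const back = revive(plain);                    // a Point instance again
```

Passing the reviver as a decode option keeps the symmetry: the symbol transforms on the way in, the reviver transforms on the way back.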
Performance
Update ... conclusions
Need to test:
~~it's up to 100X faster on~~ … and it is currently as fast on a single pass, once hot, but X times faster on multiple passes of the same data around postMessage over complex data.