
Brainstorming Features #2

Closed
WebReflection opened this issue Jan 20, 2025 · 31 comments

Comments

@WebReflection

WebReflection commented Jan 20, 2025

Now that this module is fully covered and production ready, I’d like to write down thoughts around its application and how this can improve further to better tackle all use cases.

Recursion


Update: it's in, and it works wonderfully.


It's wonderful and super fast, but it inevitably retains a lot in memory. That's not really an issue in JS, since those references are kept alive anyway as long as the returned reference exists, but it makes lower-level serialization (WASM, C, Zig) more convoluted for little benefit when:

  • data is created out of SQLite. It's true that string repetition is common (each row key), but it's also true that there could be thousands of different strings or numbers as values … is it worth tracking all of them in memory?
  • data is streamed. With a pre-defined static buffer it's all good, but with data streamed on demand, which is potentially huge, the amount of RAM needed to track recursion might be huge and unfriendly to constrained devices (MicroPython, to name a use case). The same goes for decoding such data: if the stream is huge, there is no way to free memory while decoding each part of it until the stream ends.
  • recursion is neither expected nor desired for complex data. With JSON that's not possible, but on the other hand, this is the reason flatted has millions of downloads per week.

In other libraries recursion covers only non-primitives, but here there is a chance to fine-tune its capabilities:

  • all would mean recursion for anything, as it is now, except booleans, null and empty strings
  • some could mean only non-primitive data
  • none could mean no recursion expected; it would eventually throw for exceeding the max call stack, because if anything is retained in memory to track non-primitives there is little gain … worth it?

The question is whether the implementation should be bitwise-based, so that grouping or excluding cases can be done ad hoc, or generic enough, which feels more appropriate.

Once generic: should the default be the more relaxed some, or does it even matter, since at decoding time, unless none is expected, it either works or throws?
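A minimal sketch of how the three modes could gate the seen-values cache while encoding (the `recursion` option values come from the tests later in this thread, but the tracker below is purely illustrative, not the library's implementation):

```javascript
// Sketch: how "all" / "some" / "none" could gate recursion tracking.
// The tracker maps each value to the index of its first occurrence so a
// back-reference can be emitted instead of re-encoding the value.
function makeTracker(recursion) {
  const seen = new Map(); // value -> index of first occurrence
  return (value, index) => {
    if (recursion === 'none') return -1; // never track, never deduplicate
    const isPrimitive = value === null || typeof value !== 'object';
    // "some" tracks only non-primitive data
    if (recursion === 'some' && isPrimitive) return -1;
    // "all" tracks everything except booleans, null and empty strings,
    // which are cheaper to re-encode than to reference
    if (typeof value === 'boolean' || value === null || value === '') return -1;
    if (seen.has(value)) return seen.get(value); // emit a back-reference
    seen.set(value, index);
    return -1;
  };
}
```

With "all", a repeated string yields a back-reference on its second occurrence; with "some" only objects do; with "none" nothing is ever deduplicated, which is where circular data would blow the stack.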

Hooks while encoding

Having a toBufferedClone symbol feels like the right thing to do, but should I also think about a way to decode the returned value later?

This could be a method passed as an option to both encoding and decoding, where the different data could be transformed there and back.
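A sketch of how the hook pair could look, using JSON as a stand-in for the real binary encoder (the `toBufferedClone` symbol is from the proposal above; the `reviver` option name and everything else here is hypothetical):

```javascript
// Proposed hook pair: a symbol consulted while encoding, an option
// consulted while decoding. JSON stands in for the real binary format.
const toBufferedClone = Symbol.for('toBufferedClone');

function encode(value) {
  const plain = typeof value?.[toBufferedClone] === 'function'
    ? value[toBufferedClone]() // let the instance pick its serializable form
    : value;
  return new TextEncoder().encode(JSON.stringify(plain));
}

function decode(bytes, { reviver } = {}) {
  const plain = JSON.parse(new TextDecoder().decode(bytes));
  return reviver ? reviver(plain) : plain; // transform back on the way out
}

class Temperature {
  constructor(celsius) { this.celsius = celsius; }
  [toBufferedClone]() { return { $temp: this.celsius }; }
}

const roundtrip = decode(encode(new Temperature(21)), {
  reviver: (p) => (p && p.$temp != null ? new Temperature(p.$temp) : p),
});
```

The symmetry matters: without the decode-side hook, the encoded form would come back as a plain object and the class identity would be lost.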

Performance


Update ... conclusions


Need to test:

  • raw JSON comparison - update ✅ done ... and it currently sucks
  • postMessage - update ✅ done: it's up to 100X faster on postMessage over complex data ... it is currently as fast on a single pass (once hot), but X times faster when passing the same data around multiple times
  • is it worth writing a native Node addon too, to see how much performance would improve if this were a native API?
@WebReflection

WebReflection commented Jan 20, 2025

Web Worker Performance

Well, I'll be damned ... when I heard "structuredClone is very slow across workers" I wasn't expecting it to be this much slower compared to bufferedClone ... here are some benchmarks you can test live (open the console to see results)


MISLEADING RESULTS - the benchmark was all wrong so nothing in here made sense.

A new benchmark is in the making and I will share results once it's completed.

A work in progress can be tested live


~~It's amazing this library is ~10X slower on the same thread to perform a structured clone but up to 100X faster in complex postMessage based applications 🥳~~


edit conclusions

@WebReflection

WebReflection commented Jan 20, 2025

OK, for correctness I'd like to also add WebKit and Firefox results ... yeah, we're down to 2X up to 10X there, but it's clear something is really weird in the postMessage and structuredClone combination: it's super fast inline, yet deadly slow for anything else.


MISLEADING RESULTS - the benchmark was all wrong so nothing in here made sense.

A new benchmark is in the making and I will share results once it's completed.

A work in progress can be tested live


edit conclusions

@WebReflection

To whoever is following: an issue has been updated with these details in the Chromium repository, because I find it absurd that it's so slow compared to other browsers, and even slower than my own user-land library that creates a buffer view of any JS-serializable type.

@jorroll

jorroll commented Jan 20, 2025

I'm not sure how you created the benchmarks you've shared above, but my own benchmarking (in Chrome) indicates that using this library with postMessage is slower than postMessage without using this library @WebReflection.

Image

@titoBouzout

@jorroll your test doesn't seem to be using the transfer option on the postMessage.

I found the original benchmark too clever and a bit hard to understand; I'm considering refactoring it to double-check the results. BTW, the numbers on first load are similar to Andrea's, but subsequent refreshes show a smaller, albeit considerable, gap.

Image

@jorroll

jorroll commented Jan 20, 2025

@jorroll your test doesn't seem to be using the transfer option on the postMessage.

It is. See the worker.ts file for "encoded-query". Note that the test only encodes the response from the worker to the main thread; I didn't bother encoding the request from the main thread to the worker. This shouldn't affect the conclusion, though.

Image

@jorroll

jorroll commented Jan 20, 2025

Also, strange. I didn't change the stackblitz example at all, but now it appears to be working again 🤷. Hopefully you can open up the link and view the console logs to see the results.

@titoBouzout

I kind of hardcoded an alternative benchmark, trying to avoid unnecessary things. Encoding/decoding on both sides, it seems like structured clone is faster, or we may be missing something.

Image

worker2.html

<!doctype html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <script type="module">
      import { decode, encode } from "../src/index.js";
      import { data } from "./data.js";

      const worker = new Worker(new URL("./worker2.js", import.meta.url), {
        type: "module",
      });

      let promise;

      worker.addEventListener("message", (event) => {
        const { type, data } = event.data;

        switch (type) {
          case "structured-clone": {
            promise.resolve(data);
            break;
          }
          case "buffered-clone": {
            promise.resolve(decode(data));
            break;
          }
          case "buffered-clone-echo": {
            promise.resolve(decode(data));
            break;
          }
          case "buffered-clone-recursion-all": {
            promise.resolve(decode(data, { recursion: "all" }));
            break;
          }
          case "buffered-clone-recursion-some": {
            promise.resolve(decode(data, { recursion: "some" }));
            break;
          }
        }
      });

      function get(type, data) {
        promise = Promise.withResolvers();
        switch (type) {
          case "structured-clone": {
            worker.postMessage({ type, data });
            break;
          }
          case "buffered-clone": {
            worker.postMessage({ type, data: encode(data) });
            break;
          }
          case "buffered-clone-echo": {
            worker.postMessage({ type, data: encode(data) });
            break;
          }
          case "buffered-clone-recursion-all": {
            worker.postMessage({
              type,
              data: encode(data, { recursion: "all" }),
            });
            break;
          }
          case "buffered-clone-recursion-some": {
            worker.postMessage({
              type,
              data: encode(data, { recursion: "some" }),
            });
            break;
          }
        }
        return promise.promise;
      }

      async function test(type, data) {
        // warm
        for (let i = 1; i < 5; i++) {
          await get(type, { 0: [data] });
        }

        // time
        const iterations = 10;

        const start = performance.now();

        for (let i = 0; i < iterations; i++) {
          await get(type, { 0: [data] });
        }

        const total = performance.now() - start;

        console.log(
          type + " - total",
          +total.toFixed(2),
          "avg",
          +(total / iterations).toFixed(2),
        );
      }

      async function main() {
        await test("structured-clone", data);
        await test("buffered-clone", data);
        await test("buffered-clone-echo", data);
        await test("buffered-clone-recursion-all", data);
        await test("buffered-clone-recursion-some", data);

        console.log("--- with alternative data ---");
        await test("structured-clone", dataAlternative);
        await test("buffered-clone", dataAlternative);
        await test("buffered-clone-echo", dataAlternative);
        await test("buffered-clone-recursion-all", dataAlternative);
        await test("buffered-clone-recursion-some", dataAlternative);
      }

      main().catch((error) => console.error("error running tests", error));

      var dataAlternative = {
        _id: "5973782bdb9a930533b05cb2",
        isActive: true,
        balance: "$1,446.35",
        age: 32,
        eyeColor: "green",
        name: "Lsadafe Kegwewgewg",
        gender: "gwgwegweg",
        company: "Gewwewgewg",
        email: "[email protected]",
        phone: "+0 (000) 000-0000",
        friends: [
          {
            id: 0,
            name: "Casaasdn Ssadsad",
          },
          {
            id: 1,
            name: "Fsdasdasd Msasd",
          },
          {
            id: 2,
            name: "Csadf M3434g43g",
          },
        ],
        favoriteFruit: "banana",
      };
    </script>
  </head>
</html>

worker2.js

import { encode, decode } from "../src/index.js";

self.addEventListener("message", (event) => {
  const { data, type } = event.data;

  switch (type) {
    case "structured-clone": {
      self.postMessage({ type, data });
      break;
    }
    case "buffered-clone": {
      const encodedData = encode(decode(data));

      self.postMessage({ type, data: encodedData }, [encodedData.buffer]);
      break;
    }
    case "buffered-clone-echo": {
      self.postMessage({ type, data }, [data.buffer]);
      break;
    }
    case "buffered-clone-recursion-all": {
      const encodedData = encode(decode(data, { recursion: "all" }), {
        recursion: "all",
      });

      self.postMessage({ type, data: encodedData }, [encodedData.buffer]);
      break;
    }
    case "buffered-clone-recursion-some": {
      const encodedData = encode(decode(data, { recursion: "some" }), {
        recursion: "some",
      });

      self.postMessage({ type, data: encodedData }, [encodedData.buffer]);
      break;
    }
    default: {
      throw new Error(`unknown message type ${type}`);
    }
  }
});

@WebReflection

Ok, I think I know what the issue is … basically the first roundtrip adds the Worker initialization time + network. I swear I had inverted the order at some point and still saw similar results, but it makes sense: the benchmark is wrong.

Now it’s 2am here but tomorrow I will create a better benchmark that:

  • creates a worker and waits for its first interaction
  • posts data as is, changes it somewhere on the other side, posts it back and validates the changes
  • encodes data and transfers it, decodes and changes it somewhere, encodes and transfers it back
  • does the same per method twice, or up to 5 times, to see if there's any cold VS hot perf improvement

This way everything should be closer to a real-world use case and the network time won't affect either case … will post results right after, hoping there will still be better perf via buffered clone; otherwise I'll go back to the drawing board to see if there's any extra room for better perf.

Thanks everyone for chiming in and double checking results: appreciated and happy to get this story right 🙏

@jorroll

jorroll commented Jan 21, 2025

Ok, I think I know what’s the issue … basically the first round trip adds the Worker initialization time + network.

The test I shared on stackblitz factors this in. Before each test run, a single round-trip request is made and awaited before benching begins.

async function test(type: 'query' | 'encoded-query') {
  // First execute a query to warm up the javascript interpreter
  await get(type);

  const allStart = performance.now();
  let iterations = 0;

  for (let i = 1; i < 5; i++) {
    iterations++;
    const start = performance.now();
    const data = await get(type);
    const total = performance.now() - start;
    // console.log(`get ${type} took ${total}ms`, data);
  }

  const allTotal = performance.now() - allStart;

  console.log(
    `${type} tests took ${allTotal}ms; avg of ${allTotal / iterations}ms`
  );
}

@WebReflection

@jorroll I didn't check 'cause I was on my phone, but that was indeed an obvious mistake of mine in the bench … I rushed to conclusions; it was my fault. I'll create a much better benchmark that covers all the use cases I'm interested in, thanks 🙏

@jorroll

jorroll commented Jan 21, 2025

@jorroll I didn't check 'cause I was on my phone, but that was indeed an obvious mistake of mine in the bench … I rushed to conclusions; it was my fault. I'll create a much better benchmark that covers all the use cases I'm interested in, thanks 🙏

Ah gotcha. I thought you were referring to the benchmark I shared when you said "the first round trip adds the Worker initialization time + network".

@WebReflection

WebReflection commented Jan 21, 2025

@jorroll and others, a new benchmark is on its way but I can confirm the previous one was just plain wrong.

Current results suggest that while buffered-clone is in fact not faster than structured-clone for a single roundtrip (this is without decoding, it's just encoding once per each benchmark and transfer it to the worker and transfer it back) you can see that in a triple worker roundtrip scenario, which covers a shared worker that distributes from a leading main worker use case, where the worker in the leading main thread would encode once, transfer results to main, which transfers results to worker, which transfers results to the other tab that asked for results, the buffered-clone approach is indeed ideal because it doesn't sum up every single intermediate postMessage dance.

Right now, for that specific use case, I'd call it a win, as that was the initial discussion. Still missing from the benchmark are all the other use cases I'm interested in:

  • JSON stringify over big results VS encode (the flatted case)
  • StructuredClone Polyfill Extras stringify operation (the coincident use case)

I will update this issue once I have concrete results.


edit conclusions

@serapath

Can you expand on the triple worker roundtrip scenario a bit?

From main thread to main, to worker, to a different tab?

Before, you said shared worker?

It's a bit dense; I just want to make sure I understand the scenario 🙂

@WebReflection

OK, everything I wanted to test is currently live and summarized as follows (or visit that page and open devtools to see your own results):


ROUNDTRIP

  • Structured Clone
    • cold run : 0.553955078125 ms
    • hot run 1: 0.18505859375 ms
  • Structured Clone: double
    • cold run : 0.623779296875 ms
    • hot run 1: 0.31689453125 ms
  • Structured Clone: triple
    • cold run : 0.73486328125 ms
    • hot run 1: 0.508056640625 ms
  • Buffered Clone
    • cold run : 1.360107421875 ms
    • hot run 1: 0.7900390625 ms
    • hot run 2: 0.419921875 ms
    • hot run 3: 0.79296875 ms
    • hot run 4: 0.4912109375 ms
    • hot run 5: 0.324951171875 ms
  • Buffered Clone: double
    • cold run : 0.369873046875 ms
    • hot run 1: 0.2890625 ms
    • hot run 2: 0.39794921875 ms
    • hot run 3: 0.251953125 ms
    • hot run 4: 0.22900390625 ms
    • hot run 5: 0.337890625 ms
  • Buffered Clone: triple
    • cold run : 0.799072265625 ms
    • hot run 1: 0.365966796875 ms
    • hot run 2: 0.35498046875 ms
    • hot run 3: 0.412177734375 ms
    • hot run 4: 0.399833984375 ms

SIMPLE SERIALIZATION

  • JSON
    • cold run : 0.84521484375 ms
    • hot run 1: 0.366943359375 ms
  • Flatted
    • cold run : 2.656005859375 ms
    • hot run 1: 2.071044921875 ms
    • hot run 2: 1.372802734375 ms
    • hot run 3: 1.31298828125 ms
    • hot run 4: 1.10400390625 ms
    • hot run 5: 1.18896484375 ms
  • ungap structured-clone/json
    • cold run : 3.31884765625 ms
    • hot run 1: 3.23486328125 ms
    • hot run 2: 2.455078125 ms
    • hot run 3: 1.863037109375 ms
    • hot run 4: 1.0009765625 ms
    • hot run 5: 1.556884765625 ms
  • Buffered Clone
    • cold run : 6.451904296875 ms
    • hot run 1: 4.571044921875 ms
    • hot run 2: 3.81298828125 ms
    • hot run 3: 4.069091796875 ms
    • hot run 4: 3.962890625 ms
    • hot run 5: 4.031005859375 ms

RECURSIVE SERIALIZATION

  • Flatted
    • cold run : 2.719970703125 ms
    • hot run 1: 1.631103515625 ms
    • hot run 2: 1.5322265625 ms
    • hot run 3: 1.677978515625 ms
  • ungap structured-clone/json
    • cold run : 3.06591796875 ms
    • hot run 1: 3.2939453125 ms
  • Buffered Clone
    • cold run : 5.31103515625 ms
    • hot run 1: 4.10888671875 ms

COMPLEX SERIALIZATION

  • Buffered Clone
    • cold run : 1.950927734375 ms
    • hot run 1: 0.817138671875 ms
    • hot run 2: 0.752197265625 ms
    • hot run 3: 0.739013671875 ms

DECODE COMPLEX DATA

  • Buffered Clone
    • cold run : 1.0771484375 ms
    • hot run 1: 0.510009765625 ms
    • hot run 2: 0.3310546875 ms
    • hot run 3: 0.39892578125 ms

My personal conclusions about this library, its benefits, and its "not there yet" performance are the following:

  • multiple-worker roundtrips for buffers do benefit from skipping the structured clone algorithm, but the encoding, done in JS, inevitably adds precious ms to the equation. Once that encoding is done, though, transferring the data from one worker to another (or to many) seems nearly free
  • both encode and decode here use TextEncoder and TextDecoder for all strings, and number, bigint and date values are encoded as strings anyway to never lose precision, but those primitives are knowingly slow. I wonder if plugging in a faster UTF-16 to UTF-8 library would help, and whether having this API native (or in WASM) would, or should, make performance better. Right now I don't feel like replacing the flatted or ungap structured-clone/json dependencies with buffered-clone, but my journey isn't over until everything that could make it faster has been tried (although I'm lowering the priority on this)
  • because decode is not too bad, I'd love to see others use this library's specification to produce streamable buffers out of SQLite and/or any other DB or REST API, so that recursion, compactness (this library creates smaller artifacts out of the box) and performance will be there and convenient
  • this benchmark, plus 100% code coverage of this module through an extremely nested recursive piece of data, revealed a potential bug in @ungap/structured-clone, which is incapable of performing Complex Data serialization. That needs investigation, because that API is already widely used and adopted, also by my team (PyScript), and coincident optionally uses it to orchestrate a lot of stuff ... bummer!
  • this library's specification + algorithm is rock solid ... I want the SPECIFICATIONS.md file to be completed so that others might explore a similar project over native implementations that would surely perform better
  • I wonder if it's worth asking TC39 to consider a BufferedClone primitive with both encode and decode methods as a better, faster, more portable version of the current structuredClone API, which does not allow partial serialization and deserialization, and that's an ugly limitation of that API (but I've asked; no interest in adding that so far) ... with this, interoperability between JS and any PL that supports JSON would be a piece of cake, so that WASM as well as cross-platform buffers would become a possibility

Ultimately, there are two final things I'd like to explore:

  • using encode and passing the resulting buffer into a SharedArrayBuffer, as opposed to what coincident does right now, which is ungap.stringify(data), then passing that string as an Int32Array-compatible view over the SAB, and then ungap.parse(thatStringFromBuffer) on the other end; I still believe there is some improvement to be found in this dance
  • writing a native decode-only implementation to see how much better or worse it would perform compared to hot-running JS logic

@WebReflection

WebReflection commented Jan 21, 2025

@serapath it's a long discussion mentioned in the F.A.Q. section of this library: DallasHoff/sqlocal#39 (comment)

TL;DR there are circumstances where things can work only out of a dedicated Worker (not shared; one Worker per tab/window). In such a scenario, if you want to share the same persistent data across multiple tabs/windows, the graph looks like this:

  • main leading tab, AKA the first tab that opened that domain/page
    • it creates a Worker to access otherwise impossible-to-use APIs (blocking stuff that won't work on the main thread)
    • it creates a SharedWorker with its leading port so that future tabs or windows can connect to that shared worker and communicate their non-leading port to it, so that …
  • the Shared Worker orchestrates all ports: when any port (tab/window) asks to execute that special API/feature only the leading tab has, it forwards the request to the leading tab (main thread), which forwards it to its Worker, which answers back to the leading tab, which forwards the result to the Shared Worker, which forwards it back to the requesting non-leading port
  • if the source of the request was a Worker that wants to make that query look and feel synchronous, through Atomics and a SharedArrayBuffer, the dance spans 4 steps, as the port needs to fill the SAB (whenever that happens, for whatever reason) before data can be decoded

The scenario sounds convoluted (it is!), but it's actually the only way SQLite on top of synchronous OPFS (Origin Private File System) operations can deliver its scaling value: instead of loading the whole DB into RAM in a Shared Worker, you just open it like SQLite would on a regular file system, using only the RAM needed for any query, keeping data integrity safe and sound in case your browser crashes, you kill it, or whatnot.

In this scenario there is always a minimum of 3 roundtrips needed for the data, unless one uses a SharedArrayBuffer; but the latter requires awkward, inconsistent headers across all major browsers, while the former just works.

In that scenario, the Worker on the leading tab creates the buffer, transfers it to the leading tab, which transfers it to the Shared Worker, which transfers it to the requesting port … and only there is the data decoded.
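The Atomics + SharedArrayBuffer leg of that dance can be sketched roughly like this (illustrative only: the buffer layout, helper names and sizes are made up, and real code would run the two halves on different threads):

```javascript
// Requester parks on a flag; responder fills the buffer and notifies.
const sab = new SharedArrayBuffer(8 + 1024);
const flag = new Int32Array(sab, 0, 2);   // [0] = ready flag, [1] = byte length
const payload = new Uint8Array(sab, 8);

function respond(text) {
  // responder side (e.g. the leading tab's Worker)
  const bytes = new TextEncoder().encode(text);
  payload.set(bytes);
  Atomics.store(flag, 1, bytes.length);
  Atomics.store(flag, 0, 1);
  Atomics.notify(flag, 0); // wake any thread parked in Atomics.wait
}

function receiveSync() {
  // requester side: blocks until the flag flips away from 0
  // (returns immediately with 'not-equal' if it already has)
  Atomics.wait(flag, 0, 0, 100);
  const length = Atomics.load(flag, 1);
  // TextDecoder refuses SharedArrayBuffer-backed views, hence the slice() copy
  return new TextDecoder().decode(payload.slice(0, length));
}
```

Note that Atomics.wait is only allowed off the main thread in browsers, which is exactly why this synchronous flavor lives in Workers.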

This is the reason this library exists, but I was hoping to make it more of an official thing if the performance was there … which is not (yet) the case.

I hope I've answered your question.

@jorroll

jorroll commented Jan 21, 2025

  • the Shared Worker orchestrates all ports: when any port (tab/window) asks to execute that special API/feature only the leading tab has, it forwards the request to the leading tab (main thread), which forwards it to its Worker, which answers back to the leading tab, which forwards the result to the Shared Worker, which forwards it back to the requesting non-leading port

This is incorrect. The first tab opened in the browser is elected the leader tab and spawns a dedicated worker as well as a shared worker. The handshake then goes:

  • the shared worker requests a port to the dedicated worker from the leader tab
  • the leader tab forwards this request to the dedicated worker, which creates a new message channel and transfers one of its ports back to the shared worker; now the shared worker can communicate directly with the dedicated worker (bypassing the leader tab)
  • when a second tab is opened, it sends a request to the shared worker asking for a port to the dedicated worker
  • the shared worker forwards this request to the dedicated worker, which creates a new message channel and transfers one of its ports back to the requesting tab; the second tab can now communicate directly with the dedicated worker

Etc. So after initialization, each tab can communicate directly with the dedicated worker.

@nickchomey

nickchomey commented Jan 21, 2025

The problem with Shared Workers is that Android Chromium (over 50% of web usage) doesn't support them. Apparently SharedArrayBuffer also doesn't work inside them in Chromium.

Therefore it seems like any solution for this stuff needs to avoid shared workers and instead just use normal dedicated workers, which can use Web Locks and, hopefully, SharedArrayBuffer to do efficient, leader-elected stuff across tabs.

I really do think sharedarraybuffer is the key here - you seemingly just pass a reference once to each worker/main thread via message, and they can all share the same memory space without any of the cloning, copying, messaging dance.

I shared a lot of excellent links about them in this comment in the parent thread of this effort
DallasHoff/sqlocal#39 (comment)

Some of them go into details about how to use them with web locks, mutex, atomics etc to avoid issues that can arise from multiple concurrent writers

The issue with them, as Andrea mentioned, is the cross-origin isolation headers. But surely that's not truly insurmountable? Moreover, there don't seem to be good/performant alternatives. It could behave more like a performance polyfill: if people can't set the necessary headers, it could fall back to whatever dance is more widely supported.

@WebReflection

@jorroll

Now the shared worker can communicate directly with the dedicated worker (bypassing the leader tab)

I see ... so it was less convoluted than I thought, thanks to the MessageChannel API? That's interesting; so there's no benefit in having buffers transferable across multiple workers? I thought that was the whole point of the other discussion 🤔

@WebReflection

@nickchomey

They also (apparently) don't support sharedarraybuffer in chromium.

Not sure what you're talking about; my daily work is based on SAB and it's very well supported in Chrome/Chromium.

I really do think sharedarraybuffer is the key here - you seemingly just pass a reference once to each worker/main thread via message, and they can all share the same memory space without any of the cloning, copying, messaging dance.

Yeah, we are on the same page ... the way SAB works, though, is through buffers. That's the reason I want to add a benchmark that does exactly that: "potentially recursive data stringified via one lib that handles it, then converted into a buffer that populates the SAB, so that the receiving part can decode that buffer and move on", which is also what I work with daily on PyScript to allow synchronous interactions in Python through Workers.

This library's goal was to tackle that ugly dance: make it officially a buffer, fill the SAB, and make it easy-peasy to decode on the receiving end. But bear with me, benchmarks around this are not there yet (they will be).

The issue with them, as Andrea mentioned, is the cors headers. But surely that's not truly insurmountable? Moreover, it doesn't seem like there's good/performant alternatives. It can behave more like a performance polyfill - if people can't set the necessary headers, then it could fall back to whatever dance is more widely supported.

Exactly ... again, that's the reason this effort exists! We already fall back to plain ArrayBuffer via sabayon, and I have already explored the ability to have SAB polyfilled (related MR here), but the moment somebody mentioned that SABs are not a 1:1 thing and should be available everywhere is the moment I took a step back, as that would be an extremely convoluted polyfill we don't actually need in PyScript (they pay me for this stuff; I can stretch it a bit, but there are other priorities too).

I've recently pushed a branch that uses subarray instead of slice when possible and doesn't "double push" data all over the place, but no matter what I do, perf is only slightly better (I see more 0.2 than 0.3 now on the bench), not good enough ... if anyone knows even the least-known trick to make my code faster, one that actually has a 0.1+ impact on that benchmark page, please do help, thank you 🙏

@nickchomey

I was saying that SAB apparently doesn't work in Shared Workers in Chromium; I think that was brought up in the previous SQLite discussion. Regardless, shared workers really aren't viable until Android Chromium supports them.

Anyway, this is all beyond my capabilities to make any real contributions. I'm mostly just following along. I hope you can figure something out that works well!

@WebReflection

@nickchomey correct, and I have a polyfill for it, but it doesn't work with MessageChannel, so the shortcut explained before, where the leading tab creates ports that communicate directly, can't be used, and the fallback is that triple harakiri I've explained.

The reason the poly can't work easily is:

  • the current poly tackles only 1:1 relations; SAB across multiple tabs was never considered
  • the last fallback, based on Service Worker, handles multiple tabs, but it uses Workers to bootstrap, not shared workers … adding MessageChannel to the mix is challenging and would probably be at least 2 times, up to X times, slower

If SharedWorker had better support, things would be different, but while it needs less problematic headers, it cannot be delivered as part of a library via CDN: its file must be part of the assets within the domain or it won't bootstrap.

It feels like all these wonderful new APIs are practically useless in a cross-env/browser scenario, which is a pity 😢

@WebReflection

WebReflection commented Jan 22, 2025

To whom it may concern: I have a branch that in NodeJS goes down to a 3X slowdown (when hot) compared to structuredClone, but I'm afraid I've exhausted the possible patterns to make it any better.

The latest optionally uses a resizable ArrayBuffer that grows while encoding, but that's deadly slow compared to an empty array [] incremented via i++, where i is always its current length, then returning a new Uint8Array(thatArray) at the end; that is faster by all means than a properly memory-managed growable buffer. I think I might rest my case around performance, because I don't think JS itself can deliver more than that. At least I can confirm the whole logic is rock solid and "as blazing fast as it can get" in encoding. I'm not excluding the possibility of being greedier on RAM and adjusting at the end, or growing greedily on demand and adjusting at the end, but honestly none of this would be needed in C, Rust or Zig, and I'm keen to eventually explore AssemblyScript's ability to let me create a WASM module and see how that performs instead.

Latest MR is here: #5

@WebReflection

WebReflection commented Jan 23, 2025

edit ... then again, because structured-clone fails with that recursive data, the benchmark was misleading: results are currently very similar on my machine, with structured-clone on average slightly faster ... this is driving me crazy! (I think I need to fix structured-clone first, then eventually compare stuff)


To whom it might concern, of course I've kept testing and debugging flame graphs and the current state is reasonably good but, most importantly, this module is now practically perfect for SharedArrayBuffer operations, as demoed live in this test page

[screenshot: benchmark timings from the SAB test page]

When it's hot, it's closer to 1ms than 2ms or 3ms, and the comparison vs structured-clone/json can be found here: https://github.com/WebReflection/buffered-clone/blob/main/test/sab/index.js#L32-L81

Key Differences

  • the current structured-clone/json is faster than buffered-clone in most benchmarks, but
    • it fails with the complex data structure, so it's not as robust as I thought it was
    • it produces a string that needs to be encoded before that buffer can be passed along to the SAB
    • once the SAB is notified, the receiver needs to decode that thing and then parse it ... this is a 4-step procedure
    • it needs some shenanigans around the length or the text decoding won't work ... TextDecoder also doesn't directly work with a SAB, so there are possible errors in the making
  • the current buffered-clone encodes, grows the SAB, directly sets its content and notifies such SAB ... once received, it directly decodes the SAB and, because the total length is already embedded in the protocol/specification, it never fails, even if there are extra 0 bytes at the end
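To make the 4-step procedure concrete, here is a sketch of the string round-trip over a SharedArrayBuffer (the layout is my assumption, not the libraries' actual code): a 4-byte length prefix is the "shenanigan" needed because the decoder must know where the JSON ends among trailing zero bytes, and the bytes must be copied out of the SAB because TextDecoder refuses SAB-backed views.

```javascript
// Sketch of the structured-clone/json 4-step round-trip over a SAB
// (layout is an assumption for illustration).
const sab = new SharedArrayBuffer(1024);
const length = new Int32Array(sab, 0, 1);
const bytes = new Uint8Array(sab, 4);

// sender: stringify -> encode -> copy into the SAB -> store the length
const encoded = new TextEncoder().encode(JSON.stringify({ hello: 'world' }));
bytes.set(encoded);
Atomics.store(length, 0, encoded.length);
// Atomics.notify(length, 0); // would wake a worker waiting on the SAB

// receiver: load length -> copy OUT of the SAB (TextDecoder can't read
// SAB-backed views directly) -> decode -> parse
const copy = bytes.slice(0, Atomics.load(length, 0));
const parsed = JSON.parse(new TextDecoder().decode(copy));
console.log(parsed.hello); // world
```

buffered-clone's claim is that it collapses this into encode → set → notify on one side and a single decode on the other, since the length lives inside the format itself.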

Current State

While I still believe that if we had this module natively, performance concerns would just fade away, I am super excited to have finally found a great replacement for the other polyfill in coincident and other projects based on SAB and/or structured-clone/json in general, because it shows that when fewer operations over memory are desired, it's a winning solution:

  • recursion
  • complex types
  • direct set for SAB use cases
  • tons of benchmarks around (and btw, BSON is part of the worker benchmark now: it's slower, plus it's incompatible with recursion out of the box)

What's still missing at this point:

  • stream-ability, to both encode while streaming and decode while reading streams
  • as JSON has toJSON and BSON has toBSON, I want to find a compromise to have a toBufferedClone symbol around, because I think we will benefit from it in our PyScript project ('cause structured-clone also has a toJSON feature we use in PyScript)
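A sketch of how such a hook could look (the symbol name and lookup are assumptions of mine, mirroring how JSON.stringify honors toJSON):

```javascript
// Hypothetical toBufferedClone hook: an encoder would call the method,
// if present, to obtain a plain serializable value before encoding.
const toBufferedClone = Symbol.for('buffered-clone');

class Temperature {
  constructor(celsius) { this.celsius = celsius; }
  [toBufferedClone]() {
    return { unit: 'C', value: this.celsius };
  }
}

// assumed pre-encoding step inside the encoder
function toEncodable(value) {
  return typeof value?.[toBufferedClone] === 'function'
    ? value[toBufferedClone]()
    : value;
}

console.log(toEncodable(new Temperature(21))); // { unit: 'C', value: 21 }
```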

That's the update, thank you for reading, happy to answer questions, if any.

@WebReflection

WebReflection commented Jan 23, 2025

FYI structured-clone is now fixed, with both ArrayBuffer and DataView abilities. The only "gotcha" in there is that buffers are not really recursive, rather duplicated, but as that's a common expectation I didn't spend much time optimizing that aspect ... yet I've also further improved buffered-clone performance in the making of that fix, plus there are the latest SPECIFICATIONS anyone is welcome to help with, at least around the current Encode part ... decoding coming soon!

@WebReflection

WebReflection commented Jan 23, 2025

here's where we are with regard to the standards approach that ignored the need for a toJSON-like feature:
whatwg/html#7428 (comment)

@jorroll

jorroll commented Jan 25, 2025

@WebReflection

Now the shared worker can communicate directly with the dedicated worker (bypassing the leader tab)

I see ... then it was less convoluted, thanks to the MessageChannel API? That's interesting, so there's no benefit in having buffers transferable over multiple workers? I thought that was the whole point of the other discussion 🤔


That's interesting, so there's no benefit in having buffers transferable over multiple workers?

Sending transferable data across threads via postMessage is faster than sending that data through postMessage's structuredClone. That's the potential benefit.

I thought that was the whole point of the other discussion 🤔

  1. The other issue is a discussion of updating the SQLocal library to support the OPFS-SAH VFS option.
    • The SQLocal maintainer remarked that "My initial idea for using the SAH VFS was to have sqlite-wasm run in a SharedWorker ..." but that this wouldn't work for several reasons (one of which is that coincident, which apparently SQLocal depends on, doesn't run in a SharedWorker).
    • This is probably the comment that misled you @WebReflection. Unfortunately I think this comment was made because the SQLocal maintainer didn't understand how implementing the OPFS-SAH VFS would work (e.g. they didn't understand that sqlite still runs in a dedicated worker when using the OPFS-SAH VFS). The maintainer also implies that a SharedWorker is needed to implement the OPFS-SAH VFS, but in reality you can also use a ServiceWorker instead (though a library might not want to take that approach for other reasons).
  2. In the other thread I called out that every tab has direct access to the sqlite dedicated worker after initialization. Though I didn't describe how this works in as much detail as I did above.

    Because of this, if you want multi-tab support you need to elect a leader tab to host this dedicated worker. This leader tab is responsible for providing MessagePorts to the dedicated worker to other tabs (so that every tab has direct access to the dedicated worker).
    ...
    the only thing a shared worker is used for is getting a MessagePort handle to the dedicated worker hosted by the leader tab. I.e. it’s only used when the leader tab changes or a new tab is initialized. See the wa-SQLite repo for examples.

I made an off-discussion comment that structuredClone can be a source of slowness when sending 1000s of messages across threads (someone else seemed to make a similar remark).

  • every tab is the same distance from the dedicated worker. But postMessage is super super slow in the browser so, unfortunately (and I can’t believe this is true, but we see it in production), round-tripping with the server can be faster than accessing the local SQLite database. This can also happen on cheap devices because their SSDs are slow. See Notion's blog post on adopting SQLite-wasm. For our case, we see postMessage slow down and cause issues with high message volume (1000s of msgs) / high payload size. Serializing data between threads is a synchronous, blocking operation.

I added that I thought using transferable objects could be a big performance win, and then clarified that this optimization would need to be added to the sqlite library.

  • If I read that correctly, we’re saying that postMessage (and its structuredClone operation) is slower than putting a JSON into a buffer and transfer it, right?

    Not quite. "putting JSON [a string] into a buffer and transfer it" adds an encoding and decoding step that would (probably definitely) be slower than just passing it normally via structuredClone [1]. But in the case of sqlite, I don't expect that the data is starting out as javascript objects. It's starting out as some sort of internal SQLite data structure before being transformed into a javascript object and returned. So in this context, if sqlite transformed it to some sort of ArrayBuffer which was returned instead, possibly that process would take a similar amount of time so you could eliminate the structuredClone cost without adding a JSON encoding cost. For the receiver you would be adding a decoding step, but seems possible that you'd still come out ahead. Obviously this is just speculation on my part.

    • [1]: Here is a blog post that benchmarks structuredClone vs JSON serialization (the conclusion is that encoding + decoding is generally slower than just using postMessage + structuredClone). One limitation of that blog post is that it doesn't look at the number of messages that are being passed. My perception is that the number of messages also has an impact.

Honestly, all the discussion that resulted from my remark about postMessage performance was largely off-topic to the other thread and wasn't something I expected.

@WebReflection

But in the case of sqlite, I don't expect that the data is starting out as javascript objects. It's starting out as some sort of internal SQLite data structure before being transformed into a javascript object and returned. So in this context, if sqlite transformed it to some sort of ArrayBuffer which was returned instead, possibly that process would take a similar amount of time so you could eliminate the structuredClone cost without adding a JSON encoding cost. For the receiver you would be adding a decoding step, but seems possible that you'd still come out ahead.

This project's idea is to define that buffer in a way that is more general purpose and understood or implemented across multiple PLs: similar to JSON, faster than BSON and with fewer shenanigans than JSON (recursive, more compact for repeated rows and so on).

My current tests see it as fast as, or faster than, structuredClone (cold run over deeply nested big data) but, while almost everything has been spec'd, I’m not happy about the number conversion (as string) … it works fine but it feels wrong and is not fully portable across PLs.

Thanks for all the sharing though; even if this might never be the answer for SQLite, it certainly is the way forward for me and my libraries, once all the fine-tuning lands and numbers are a real representation of … numbers: fast and always correct
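For what it's worth, a portable alternative to string-encoded numbers (my assumption of the direction hinted at above, not buffered-clone's actual format) is writing IEEE-754 float64 bytes directly, which any PL with a double type can read back losslessly:

```javascript
// Encode a JS number as 8 little-endian IEEE-754 bytes via DataView;
// the round-trip is exact because JS numbers ARE float64 values.
function encodeNumber(n) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, n, true); // true = little-endian
  return new Uint8Array(view.buffer);
}

function decodeNumber(bytes) {
  return new DataView(bytes.buffer).getFloat64(0, true);
}

console.log(decodeNumber(encodeNumber(123.456))); // 123.456
```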

@jorroll

jorroll commented Jan 26, 2025

This project idea is to define that buffer in a way that is more general purpose and understood or implemented across multiple PLs, similar to JSON, faster than BSON and with less shenanigans than JSON (recursive, more compact for repeated rows and so on).

You're probably aware, but there are a number of libraries which technically provide this functionality (e.g. MessagePack, Google's Protocol Buffers). That being said, existing libraries focus on passing data from a server to a client or between servers: scenarios where minimizing payload size is a priority. I'm not aware of an existing data structure which is performance-optimized for transferring arbitrary data across browser threads, though.

@WebReflection

I'm not aware of an existing data structure which is performance optimized for transferring arbitrary data across browser threads though.

this works for servers too, but it's easy on the client side as well ... I know it didn't (quite) exist while I was investigating before creating the json branch out of the structured-clone polyfill, but I had never thought before about filling a SharedArrayBuffer directly and decoding at light speed on the other side, and I think buffered-clone has potential.

For comparison, this lib is compatible with anything JSON compatible, which is ... everything. It doesn't require specialized syntax or utilities, it works with all JS types out of the box, it's super simple in its decoding steps, and it's recursive out of the box, no special extensions needed ... it really is like structuredClone but as a buffer, although it could be more widely usable, e.g. for Rust or WASM exchanges.
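As a toy illustration of out-of-the-box recursion (names and layout are mine, not the actual wire format): a value already seen is encoded as an index rather than re-serialized, which is also why JSON.stringify gives up on the same input.

```javascript
// Toy sketch of recursion-by-index: each non-primitive gets a slot the
// first time it's seen; later occurrences emit a { ref } to that slot.
function flatten(value, seen, out) {
  if (seen.has(value)) return seen.get(value);
  const index = out.length;
  seen.set(value, index);
  out.push(null); // reserve the slot before recursing
  out[index] = Array.isArray(value)
    ? value.map(v => (typeof v === 'object' && v !== null)
        ? { ref: flatten(v, seen, out) }
        : v)
    : value;
  return index;
}

const arr = [1];
arr.push(arr); // self-reference: JSON.stringify(arr) would throw here
const out = [];
flatten(arr, new Map(), out);
console.log(JSON.stringify(out)); // [[1,{"ref":0}]]
```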

@WebReflection WebReflection mentioned this issue Jan 28, 2025
@WebReflection

closing this, as 0.5 landed with tons of changes, but there are remaining questions/features to implement or discuss: #9
