Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix tagged template literal with unicode #15047

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

pfgithub
Copy link
Contributor

@pfgithub pfgithub commented Nov 8, 2024

What does this PR do?

Fixes #8745, fixes #15492, fixes #15929, fixes #16763, fixes #18115

console.log(String.raw`æ™`); // -> "æ™" instead of "\u00E6\u2122"
console.log(/æ™/.source); // -> "æ™" instead of "\u00E6\u2122"

Changes ascii_only to prefers_ascii. It will try to emit mostly ascii, but if a non-ascii character is encountered in a tagged template, it will emit it. Updates execution to scan the file to see if it contains any non-ascii and if it does, load it as utf-8 instead.

This should be benchmarked to see what the performance cost is.

TODO:

  • Store the file as utf-16 in the runtime transpiler cache (caches large transpiled files)
  • Store the file as utf-16 in the standalone module graph (for bun --compile)
  • Figure out what to do about a comment containing a non-ascii characters. Should they mark the whole file as non-ascii, or should the non-ascii characters be removed from the output, or should the bytes be passed to JavascriptCore as latin-1 if it's only a comment? A non-ascii comment is probably much more likely than a non-ascii character within a tagged template literal.

Alternate solution is to add a polyfill in js_parser.zig or do some injection into jsc's parsing logic:

let _0;
function _taggedTemplateLiteral(e, t = e.slice(0)) {
  return Object.freeze(Object.defineProperties(e, { raw: { value: Object.freeze(t) } }));
}

myfn`abc${def}ghi`
// =>
myfn(_0 ??= _taggedTemplateLiteral(["abc", "ghi"]), def)
// or (without helper):
myfn(_0 ??= Object.freeze(Object.defineProperties(_0 = ["abc", "ghi"], { raw: { value: Object.freeze(_0) } }));

Benchmarks

  • b61002945fceed12b28cfbdedafee778/tmp hyperfine "bun-latest a.js" "bun a.js" "bun-latest no_unicode.js" "bun no_unicode.js" --warmup=5
  • A 478mb // @bun file is slowed down by 6ms (out of 1000ms) (about a 0.6% slowdown)
  • Adding a single unicode character to a // @bun file slows it down by ~1.21x because it has to convert utf-8 to utf-16
    before jsc can parse the file

@robobun
Copy link

robobun commented Nov 8, 2024

Updated 8:22 PM PT - Feb 25th, 2025

@pfgithub, your commit eb119223b251a54c553828f9793fae5a72de8907 passed in Build #12275! 🎉


🧪   try this PR locally:

bunx bun-pr 15047

@pfgithub pfgithub changed the title Fix tagged template literal Fix tagged template literal with unicode Nov 8, 2024
@pfgithub pfgithub marked this pull request as ready for review November 8, 2024 20:14
@pfgithub pfgithub marked this pull request as draft November 9, 2024 02:08
@pfgithub pfgithub force-pushed the pfg/fix-tagged-template-literal branch from ea67f97 to 66eae13 Compare November 12, 2024 00:03
@cirospaciari
Copy link
Member

Maybe related: #17657

@pfgithub pfgithub force-pushed the pfg/fix-tagged-template-literal branch from 7f7102e to eb11922 Compare February 26, 2025 02:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment