wasm-opt -Oz takes an inordinate amount of time #7319

Open

wingo opened this issue Feb 24, 2025 · 7 comments
@wingo (Contributor) commented Feb 24, 2025

Firstly: yay, thank you for fixing the control-flow values issue in the parser! Binaryen can now work on Hoot's binaries. Thank you thank you!

I noticed a performance bug that you may be interested in, for -Oz.

| optimization level | bytes   | real time | user time |
|--------------------|---------|-----------|-----------|
| original           | 5485027 | -         | -         |
| -O0                | 5835468 | 0.74s     | 1.18s     |
| -O1                | 4810648 | 2.96s     | 65s       |
| -O2                | 4664753 | 5.34s     | 104s      |
| -Os                | 4378817 | 806s      | 1081s     |
| -Oz                | 4278988 | 804s      | 1103s     |

This is a 32-logical-CPU system. As you can see, -Os / -Oz don't parallelize very well, and take a bit too long to produce useful results.

I can provide the test file, should that be of interest, though GitHub doesn't seem to want to attach it. Enabled features are --enable-bulk-memory --enable-multivalue --enable-reference-types --enable-gc --enable-tail-call --enable-exception-handling.

@wingo (Contributor, Author) commented Feb 24, 2025

Ah, I could gzip it:
test3.wasm.gz
All rights disclaimed on that file, and I make no claim that it is good code ;)

@kripken (Member) commented Feb 24, 2025

Interesting... this is slow mainly on simplify-locals and code-folding. The former is a pass I've never seen slow before... profiling shows the issue there is effect computation as it tries to reorder locals.

There is no single function with this slowdown, but many that each take a second or so. I looked at $4690 for example. It has almost 1,000 locals (and an equal number of scratch locals that I believe are from our IR builder, handling tuples or stacky code). The code then looks like this:

            (local.set $9
             (block (result (ref eq))
              (local.set $scratch_1567
               (ref.as_non_null
                (table.get $2
                 (i32.add
                  (local.get $2)
                  (i32.const 5)
                 )
                )
               )
              )
              (local.set $10
               (block (result (ref eq))
                (local.set $scratch_1566
                 (ref.as_non_null
                  (table.get $2
                   (i32.add
                    (local.get $2)
                    (i32.const 6)
                   )
                  )
                 )
                )
                (local.set $11
                 (block (result (ref eq))
                  (local.set $scratch_1565
                   (ref.as_non_null
                    (table.get $2
                     (i32.add
                      (local.get $2)
                      (i32.const 7)
                     )
                    )
                   )
                  )
                    [...]
                  )
                  (local.get $scratch_1565)
                 )
                )
                (local.get $scratch_1566)
               )
              )
              (local.get $scratch_1567)
             )

So it is doing these table gets, casting them, and setting them in scratch locals, as the code gets more and more nested. Then it copies the scratch locals back into the original locals.

After that, there is some reasonable-looking code with calls and struct sets. Then lots of table.sets, seemingly around 700 (one per local?).

The optimizer can't remove a table.set of a table.get of the same value because we have calls here. But perhaps the language can optimize this at a higher level?
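To make that concrete, the shape is roughly this (a hand-written sketch, not code from the test file; $x and $foo are made-up names):

(local.set $x
 (table.get $2
  (i32.const 5)   ;; read the old value from the table
 )
)
(call $foo)       ;; this call might write to table $2...
(table.set $2
 (i32.const 5)
 (local.get $x)   ;; ...so writing back the old value is observable,
)                 ;; and the set cannot simply be removed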

(perhaps this code is what you mean by "I make no claim that it is good code ;)" 😄 )

We do have a pass that un-nests such code, merge-blocks. Unfortunately it normally runs after simplify-locals. I am not sure if we can reorder it, or whether we want to add another early run of that pass, since that would have downsides on other code. But you can get the benefits manually: running

wasm-opt --remove-unused-names --merge-blocks -Os

makes that pass 4x faster than plain -Os.
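For reference, after un-nesting, the nested pattern above becomes straight-line code, roughly like this (a sketch of the shape, not actual pass output):

(local.set $scratch_1567
 (ref.as_non_null
  (table.get $2
   (i32.add
    (local.get $2)
    (i32.const 5)
   )
  )
 )
)
(local.set $scratch_1566
 (ref.as_non_null
  (table.get $2
   (i32.add
    (local.get $2)
    (i32.const 6)
   )
  )
 )
)
;; [...] and only at the end, the copies into the named locals:
(local.set $10
 (local.get $scratch_1566)
)
(local.set $9
 (local.get $scratch_1567)
)

With the nesting gone, each expression simplify-locals considers is shallow, which is presumably why the effect computation gets so much cheaper.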

However, this does not help with --code-folding, so I will look into that too.

@kripken (Member) commented Feb 24, 2025

--code-folding may be getting stuck on a single function, $374. It looks like

(func $374 (param $0 i32) (result funcref)
 (block $block12186
  (block $block12185
    ..
      br_table
    ..
    (return
     (ref.func $804)
    )
   )
   (return
    (ref.func $804)
   )
  )

12,186 nested blocks with a br_table that leads to returns, many of them (as in the two I pasted) identical, so code-folding sees a lot of work it could do, and gets right to it...

Could this function be replaced with a call_indirect on a table with the same ref.funcs?

@kripken (Member) commented Feb 24, 2025

Oh, and you can skip code-folding for now using --skip-pass=code-folding. On my four-core machine the speed is then much better (actually faster than naive -O1), using

wasm-opt test3.wasm -all --remove-unused-names --merge-blocks -Os --skip-pass=code-folding

@wingo (Contributor, Author) commented Feb 26, 2025

Hey thanks for taking a look! This comment is just for info, no actionable items here.

> There is no single function with this slowdown, but many that each take a second or so. I looked at $4690 for example. It has almost 1,000 locals (and an equal number of scratch locals that I believe are from our IR builder, handling tuples or stacky code). The code then looks like this:

> (local.set $9
>  (block (result (ref eq))
>   (local.set $scratch_1567
>    (ref.as_non_null
>     (table.get $2
>      (i32.add
>       (local.get $2)
>       (i32.const 5)
>      )
>     )
>    )
>   )
>   [...]
>   (local.get $scratch_1567)
>  )
> )

Hmm, I think the block and scratch locals are probably an artifact of stack desugaring; but yes, there can be a ridiculous number of locals in these functions. Basically the hoot compiler backend ends up emitting operations against named SSA-flavored locals, which each get a wasm local. We should re-stackify most of them away. What's more, when there is a non-tail call, we save all live (needed) locals to an explicit stack, and restore them to locals after the call returns. Things could be better on our side ;)
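For a tiny made-up example of what I mean by re-stackifying (hypothetical names, nothing from our actual output):

;; today: one wasm local per SSA name
(local.set $t0
 (i32.add
  (local.get $a)
  (local.get $b)
 )
)
(local.set $t1
 (i32.mul
  (local.get $t0)
  (i32.const 2)
 )
)
(call $use
 (local.get $t1)
)

;; after re-stackification: the temporaries disappear entirely
(call $use
 (i32.mul
  (i32.add
   (local.get $a)
   (local.get $b)
  )
  (i32.const 2)
 )
)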

> After that, there is some reasonable-looking code with calls and struct sets. Then lots of table.sets, seemingly around 700 (one per local?).

> The optimizer can't remove a table.set of a table.get of the same value because we have calls here. But perhaps the language can optimize this at a higher level?

This would probably be the saving of live vars before another call... we need to improve the code we generate here. It's part of the CPS conversion: you have a function F that makes nested calls to G and H. Say F has 1000 variables, and 800 are live at G. So you save 800 values, and restore 800 values after G returns. Then you call H immediately, and assume for simplicity that the same 800 values are live after H returns. So then you have to save the same 800 values again, for no reason!

In practice this is quadratic, because the number of live variables depends on the size of your function, and the number of function calls also depends on the size of your function. We only realized this recently and will need to take a different tack :)
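Schematically, each non-tail call ends up bracketed like this ($stack, $sp, $v0, $G, $H are made-up names, just to show the shape):

;; spill every live local before calling G (~800 table.sets)...
(table.set $stack
 (local.get $sp)
 (local.get $v0)
)
;; [...]
(call $G)
;; ...reload them all afterwards (~800 table.gets)...
(local.set $v0
 (ref.as_non_null
  (table.get $stack
   (local.get $sp)
  )
 )
)
;; [...]
;; ...and then immediately spill the same values again for H:
(table.set $stack
 (local.get $sp)
 (local.get $v0)
)
;; [...]
(call $H)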

> (perhaps this code is what you mean by "I make no claim that it is good code ;)" 😄 )

:-)

@wingo (Contributor, Author) commented Feb 26, 2025

> --code-folding may be getting stuck on a single function, $374. It looks like

> (func $374 (param $0 i32) (result funcref)
>  (block $block12186
>   (block $block12185
>     ..
>       br_table
>     ..
>     (return
>      (ref.func $804)
>     )
>    )
>    (return
>     (ref.func $804)
>    )
>   )

> 12,186 nested blocks with a br_table that leads to returns, many of them (as in the two I pasted) identical, so code-folding sees a lot of work it could do, and gets right to it...

I think they are not actually the same? There is only one (ref.func $804) AFAICS. There is (ref.func $8040), (ref.func $8041), etc...

> Could this function be replaced with a call_indirect on a table with the same ref.funcs?

Hmm, well, sorta. You could use a table and an element section. We don't actually call these functions, though; this function is actually a side table, only meant to be invoked if there is an error: it is exported and allows the embedder to enumerate the functions in the compilation unit, so that it can build a funcref -> i32 id map. (I would really love to be able to do this from wasm itself but alas, funcrefs cannot be compared or hashed, and you can't attach any other side-band data to a funcref!) Once you have the id, you can look up a function's name, its source location, etc. This is used to print Scheme-level backtraces. But, perhaps a lazily instantiated table would work just as well.

But then you have the opposite case: mapping an i32 ID to, say, a source location. There you don't want to instantiate a table of all the answers and then do table.get: you just want to instantiate the one answer, and br_table seems not unreasonable for that; and there you will have some duplication. In any case I suspect an algorithmic problem in binaryen here; the code seems not unreasonable to me.
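Concretely, the shape is something like this (hypothetical, with i32 constants standing in for the per-id answers):

(func $id-to-answer (param $id i32) (result i32)
 (block $default
  (block $case1
   (block $case0
    (br_table $case0 $case1 $default
     (local.get $id)
    )
   )
   ;; id == 0: materialize only this answer
   (return
    (i32.const 100)
   )
  )
  ;; id == 1
  (return
   (i32.const 200)
  )
 )
 ;; out of range
 (i32.const 0)
)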

@kripken (Member) commented Feb 26, 2025

Thanks for the details about the codegen here, very interesting!

> I think they are not actually the same?

Oh, sorry, I was reading the partially-optimized code: basically the code as the code-folding pass saw it. At that point, some functions had already been deduplicated. Here is how the end of the function looks at that time:

          (return
           (ref.func $621)
          )
         )
         (return
          (ref.func $619)
         )
        )
        (return
         (ref.func $621)
        )
       )
       (return
        (ref.func $619)
       )
      )
      (return
       (ref.func $621)
      )
     )
     (return
      (ref.func $829)
     )
    )
    (return
     (ref.func $829)
    )
   )
   (return
    (ref.func $829)
   )
  )  
  (ref.null nofunc)
 )     
 (func $377 (type $0) (param $0 i32) (param $1 (ref eq)) (param $2 (ref eq)) (param $3 (ref eq))

> We don't actually call these functions

Oh, I see, yes, this returns the ref.funcs...

But I don't follow this:

> There you don't want to instantiate a table of all the answers and then do table.get: you just want to instantiate the one answer, and br_table seems not unreasonable for that

What do you mean by "instantiate"? When I compare the existing code to this:

(table $lookup funcref (elem $10 $20 $30 $40 ...))

(func $374 (param $0 i32) (result funcref)
  (table.get $lookup
    (local.get $0)
  )
)

then the table.get approach seems better in size. The table is large, but it is smaller than the br_table, and more efficient to store as well.

> In any case I suspect an algorithmic problem in binaryen here

Yeah, I agree, that pass seems to end up doing quadratic comparisons. I'll look into it more when I have time.
