[Tracking] 1M TPS benchmarks #13130

Open · 15 of 33 tasks
Trisfald opened this issue Mar 14, 2025 · 5 comments
@Trisfald
Contributor

Trisfald commented Mar 14, 2025

Tracking issue for all tasks related to benchmarks for the 1 million TPS initiative.

Benchmarks

State generation

Create the genesis, adjust node configuration, and build a suitable initial database state (a sketch of uniform account generation follows the list below).

  • Tool to generate uniform state locally, for 50 shards
    • Minimal state (a few accounts): synth-bm from CRT
    • Large state
  • Tool to generate uniform state in forknet, for 50 shards
    • Minimal state (a few accounts)
    • Large state
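
A minimal sketch (not the actual tooling) of how uniform accounts could be laid out across 50 shards when building genesis records. The account naming scheme, balances, and record shape are illustrative assumptions, not the exact genesis schema:

```python
# Sketch: spread synthetic accounts uniformly over 50 alphabetical shard ranges.
# Assumes the shard layout splits the account-id space by boundary accounts;
# names, balances, and the record format below are illustrative only.
import json
import string

NUM_SHARDS = 50
ACCOUNTS_PER_SHARD = 1_000  # "large state" would use a much bigger number

def account_name(shard: int, index: int) -> str:
    # Prefix with a shard-specific letter pair so accounts sort into distinct ranges.
    prefix = string.ascii_lowercase[shard // 26] + string.ascii_lowercase[shard % 26]
    return f"{prefix}_user_{index:06d}.test"

records = []
for shard in range(NUM_SHARDS):
    for i in range(ACCOUNTS_PER_SHARD):
        records.append({
            "Account": {
                "account_id": account_name(shard, i),
                "account": {"amount": str(10**24), "locked": "0", "storage_usage": 100},
            }
        })

with open("records.json", "w") as f:
    json.dump(records, f)
```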

Traffic generation

Generate transactions to stress the network (a sketch of intra-shard vs. cross-shard pair selection follows the list below).

  • Native token transfers
    • Evaluate existing tools made by CRT
    • Tool to generate intra-shard traffic locally
    • Tool to generate cross-shard traffic locally
    • Tool to generate intra-shard traffic in forknet
    • Tool to generate cross-shard traffic in forknet
  • Fungible token transfers:
    TODO
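
A rough sketch of the pair-selection logic behind intra-shard vs. cross-shard transfer traffic; `accounts_by_shard` and `send_transfer` are hypothetical placeholders for whatever the traffic-generation tool exposes:

```python
# Sketch: choose sender/receiver pairs so transfers stay within one shard
# (intra-shard) or cross shard boundaries (cross-shard). `accounts_by_shard`
# and `send_transfer` are placeholders, not real tool APIs.
import random

def make_pairs(accounts_by_shard: dict[int, list[str]], cross_shard: bool):
    shards = sorted(accounts_by_shard)
    while True:
        src_shard = random.choice(shards)
        if cross_shard:
            dst_shard = random.choice([s for s in shards if s != src_shard])
        else:
            dst_shard = src_shard
        yield (random.choice(accounts_by_shard[src_shard]),
               random.choice(accounts_by_shard[dst_shard]))

def run(accounts_by_shard, send_transfer, tps: int, duration_s: int, cross_shard: bool):
    # Rate limiting and nonce management are omitted for brevity.
    pairs = make_pairs(accounts_by_shard, cross_shard)
    for _ in range(tps * duration_s):
        sender, receiver = next(pairs)
        send_transfer(sender, receiver, amount=1)
```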

Benchmark setup

  • Automation to run local benchmarks
    • Support for native token transfers
    • Support for fungible token transfers
  • Automation to run multi-node benchmarks (forknet)
    • Support for native token transfers
    • Support for fungible token transfers

Benchmark runs

  • Native token transfers
    • Intra-shard traffic
      • Single node benchmark
      • Multi node local benchmark
      • Forknet benchmark
    • Cross-shard traffic
      • Single node benchmark
      • Multi node local benchmark
      • Forknet benchmark

Issues found

  • High priority

    • [Low number of shards] Client actor bottleneck: TPS is suboptimal due to the single-threaded design of the client actor. See also ClientActor ingress saturates at ~5K TPS #12963.
    • [High number of shards] Chunk production / endorsement bottleneck: chain TPS is limited by chunk misses, which snowball and keep increasing in number.
    • RPC nodes are unable to keep up with the network because they track all shards. In a real network they also need to sustain the extra load of accepting user transactions. Even without sending any transaction to the RPC node, and with memtries, I have observed TX application speeds of 15k-8k TPS, while the chain can go above 40k TPS.
  • Medium priority

    • The bigger the state, the lower the TPS. This affects single-shard performance. It is expected, but perhaps not to this extent:
      1k accounts -> 4k TPS
      100k accounts -> 2.7k TPS
    • The chain breaks at relatively low TPS due to exponentially growing block times, caused by chunk misses stemming from a lack of endorsements. This is a side effect of the client actor bottleneck: endorsements are not processed in time. It does not happen if proper gas limits are in place.
  • Low priority

    • Sending transactions directly to chunk validators that don't track all shards doesn't work due to timeouts. Workaround: using RPC nodes with memtries can help a little. For bigger load tests, transactions must be sent through alternative means (transaction injection).
@Trisfald Trisfald self-assigned this Mar 14, 2025
@Trisfald
Contributor Author

Results for native token transfers: intra-shard traffic, single node

Hardware: GCP n2d-standard-16
Binary: master 06 Feb 2025

1 CP, 1 shard 4000 TPS
             CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   12.10    0.00    3.23    0.21    0.00    0.48    0.00    0.00    0.00   83.98

1 CP, 5 shards 3989 TPS
             CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   14.53    0.00    4.03    0.33    0.00    1.79    0.00    0.00    0.00   79.32

1 CP, 10 shards 3806 TPS
             CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   17.60    0.00    4.56    0.22    0.00    2.75    0.00    0.00    0.00   74.86

1 CP, 50 shards 2243 TPS
             CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   15.01    0.00    3.94    0.37    0.00    1.73    0.00    0.00    0.00   78.96

This is the best I could do; any higher TPS breaks the chain because of repeated chunk misses, which lead to extremely high block times.

PR to reproduce the benchmark, including all needed setup and configuration: #12918
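
For reference, a minimal sketch of how an average TPS figure like the ones above can be derived from the node's processed-transaction counter. The metric name and the `/metrics` endpoint on the RPC port are assumptions (based on the TRANSACTION_PROCESSED_SUCCESSFULLY_TOTAL metric mentioned further down); adjust to the actual exporter:

```python
# Sketch: estimate average TPS by sampling the transaction-processed counter twice.
# Metric name and endpoint are assumptions; adjust to the real exporter.
import re
import time
import urllib.request

METRIC = "near_transaction_processed_successfully_total"

def read_counter(url: str = "http://localhost:3030/metrics") -> float:
    body = urllib.request.urlopen(url).read().decode()
    match = re.search(rf"^{METRIC}\s+([0-9.e+]+)$", body, re.MULTILINE)
    return float(match.group(1)) if match else 0.0

def measure_tps(window_s: int = 60) -> float:
    start = read_counter()
    time.sleep(window_s)
    return (read_counter() - start) / window_s

print(f"average TPS over the window: {measure_tps():.0f}")
```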

@Trisfald
Contributor Author

Results for native token transfers: intra-shard traffic, multi-node localnet

Hardware: GCP n2d-standard-16
Binary: master 06 Feb 2025

3 CP + 1 RPC, 10 shards 2733 TPS
             CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   25.53    0.00    5.16   19.60    0.00    1.96    0.00    0.00    0.00   47.75

5 CP + 1 RPC, 10 shards 2346 TPS
             CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   27.06    0.00    5.99   21.55    0.00    1.49    0.00    0.00    0.00   43.92

Performance is worse than on a single node because of the I/O bottleneck caused by multiple RocksDB instances on the same disk.

This setup is too different from how a real node operates, and it doesn't help reproduce any real performance-related production issue. For these reasons, I won't investigate localnet TPS further.

@Trisfald
Contributor Author

Results for native token transfers: intra-shard traffic, forknet (5 CP / 5 Shards)

Hardware: GCP n2d-standard-16
Binary: master 01 Mar 2025

Draft PR with the scripts I used to run the benchmark.

Outcome

I was able to sustain 4k TPS for a long time, with no issue in the chain.

The peak was about 4.4k TPS.

In this scenario, the first bottleneck I found is the RPC node used to send transactions.
The RPC node is overloaded because of two factors:

  • It tracks all shards. Even with memtries, it has trouble keeping up with the chain.
  • It has to receive, pre-process, and forward user transactions. This is a non-negligible effort that steals compute resources from chunk application.

Observations

The RPC load follows the pattern highlighted in this extract from the logs:

2025-03-04T15:49:02.978465Z  INFO stats: #    4805 9GMnTfecis43iCdbRQiRPay3pvG51fmKGdDcXEoY8kDQ 5 validators 5 peers ⬇ 29.8 kB/s ⬆ 27.4 kB/s 1.00 bps 0 gas/s CPU: 2%, Mem: 401 MB
2025-03-04T15:49:12.980250Z  INFO stats: #    4814 7jjgFg3ikHqWogV8LoqreDMpx68cFi1w9nYBXq992vhk 5 validators 5 peers ⬇ 29.4 kB/s ⬆ 27.0 kB/s 0.90 bps 0 gas/s CPU: 2%, Mem: 400 MB
2025-03-04T15:49:22.980479Z  INFO stats: #    4823 9nNPHmZqnSWEJZNMPMg4gdtUjghUZU4VfgLeFih3Saij 5 validators 5 peers ⬇ 29.3 kB/s ⬆ 52.1 kB/s 0.90 bps 0 gas/s CPU: 20%, Mem: 406 MB
2025-03-04T15:49:32.980985Z  INFO stats: #    4831 FF5HHaormf43vPscqMsPm4acCGJG6RM5VUmojrewdpPp 5 validators 5 peers ⬇ 1.37 MB/s ⬆ 325 kB/s 0.80 bps 1.84 Pgas/s CPU: 250%, Mem: 816 MB
2025-03-04T15:49:42.980799Z  INFO stats: #    4840 GAH5j46U4tPX4NdJW918DQA8XiUJNk4mYjXDYfbaZ11D 5 validators 5 peers ⬇ 2.60 MB/s ⬆ 498 kB/s 0.90 bps 2.96 Pgas/s CPU: 355%, Mem: 1.08 GB
2025-03-04T15:49:52.980770Z  INFO stats: #    4849 3SEtYbPDKYcX9SiMdpCMEspvsJpgVj1kgeWaqDuWGszS 5 validators 5 peers ⬇ 4.65 MB/s ⬆ 791 kB/s 0.90 bps 3.01 Pgas/s CPU: 369%, Mem: 860 MB
2025-03-04T15:50:02.980831Z  INFO stats: #    4858 7jprVdLDsVo5R4T8xknZsfJdvTwRiriMdJGtCwJ3s21v 5 validators 5 peers ⬇ 6.70 MB/s ⬆ 1.08 MB/s 0.90 bps 3.08 Pgas/s CPU: 353%, Mem: 1.02 GB
2025-03-04T15:50:12.980851Z  INFO stats: #    4866 FBG9hhM388XxFRuk9RELgfcS1hZdpCw6nmtdTDrB1FFJ 5 validators 5 peers ⬇ 8.85 MB/s ⬆ 1.37 MB/s 0.80 bps 2.81 Pgas/s CPU: 390%, Mem: 1.38 GB
2025-03-04T15:50:22.982118Z  INFO stats: #    4874 4kmqqfJe4rmtP7xAXwiSGnZGtRgSS25X1rkcE9FXDKTa 5 validators 5 peers ⬇ 10.8 MB/s ⬆ 1.66 MB/s 0.80 bps 2.65 Pgas/s CPU: 398%, Mem: 1.16 GB
2025-03-04T15:50:32.981468Z  INFO stats: #    4883 HPhVYrbzw9SctejdaWyFdjeT9HbuHCa9ZA4whBog6Pi9 5 validators 5 peers ⬇ 12.0 MB/s ⬆ 1.75 MB/s 0.90 bps 3.04 Pgas/s CPU: 487%, Mem: 1.55 GB
2025-03-04T15:50:42.982420Z  INFO stats: #    4893 Bkmoea5DJP91rWBkBKG15tGK6g8Nt3A33XZDAJ4Tqyg5 5 validators 5 peers ⬇ 12.3 MB/s ⬆ 1.78 MB/s 1.00 bps 3.25 Pgas/s CPU: 416%, Mem: 1.71 GB
2025-03-04T15:50:53.053446Z  INFO stats: #    4901 Downloading blocks 25.00% (3 left; at 4901) 5 peers ⬇ 12.2 MB/s ⬆ 1.78 MB/s 0.79 bps 2.65 Pgas/s CPU: 432%, Mem: 1.61 GB
2025-03-04T15:51:03.112567Z  INFO stats: #    4912 3rxPQZRMiNmQ6pFz3itkNJUp1LXgLXxFHDztoaNAhUTR 5 validators 5 peers ⬇ 12.1 MB/s ⬆ 1.78 MB/s 1.09 bps 3.52 Pgas/s CPU: 410%, Mem: 2.09 GB
2025-03-04T15:51:13.112629Z  INFO stats: #    4920 54NUSzRmoEEkB2WGGiaWhmiijCdAU3ueHu8EtPs2VXL5 5 validators 5 peers ⬇ 12.0 MB/s ⬆ 1.79 MB/s 0.80 bps 2.61 Pgas/s CPU: 634%, Mem: 1.93 GB
2025-03-04T15:51:23.151430Z  INFO stats: #    4929 Downloading blocks 80.00% (2 left; at 4929) 5 peers ⬇ 11.9 MB/s ⬆ 1.79 MB/s 0.90 bps 2.80 Pgas/s CPU: 417%, Mem: 2.15 GB

At the start there's no load. Once I start sending transactions, CPU and network traffic increase a lot. The node is unable to keep up and goes into block catch-up. While transactions are being sent, the max Pgas/s is about 3.

When user transactions stop, the node has more room to process blocks and max Pgas/s reaches 5.

2025-03-04T16:06:16.535412Z  INFO stats: #    5714 Downloading blocks 72.41% (40 left; at 5714) 5 peers ⬇ 8.11 MB/s ⬆ 1.62 MB/s 1.70 bps 5.14 Pgas/s CPU: 545%, Mem: 5.82 GB
2025-03-04T16:06:26.816436Z  INFO stats: #    5732 Downloading blocks 79.35% (32 left; at 5732) 5 peers ⬇ 10.7 MB/s ⬆ 1.32 MB/s 1.75 bps 5.34 Pgas/s CPU: 428%, Mem: 5.93 GB
2025-03-04T16:06:36.817623Z  INFO stats: #    5774 4nBuST2JcVjEwB2LLw4NEtjBkSgScr2eLuBZv9HvKmSW 5 validators 5 peers ⬇ 13.0 MB/s ⬆ 1.02 MB/s 4.20 bps 5.41 Pgas/s CPU: 558%, Mem: 5.83 GB
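
A small sketch for extracting block height, Pgas/s, and CPU from the `stats:` lines quoted above, useful for plotting the load pattern over time. The regex targets the exact format shown here (lines without a Pgas/s reading are skipped) and may need adjusting for other neard versions:

```python
# Sketch: parse neard "stats:" log lines (piped on stdin) into height, Pgas/s, CPU%.
# Lines without a Pgas/s figure (idle node) are skipped.
import re
import sys

PATTERN = re.compile(
    r"#\s+(?P<height>\d+).*?"
    r"(?P<gas>[\d.]+)\s+Pgas/s.*?"
    r"CPU:\s+(?P<cpu>\d+)%"
)

for line in sys.stdin:
    m = PATTERN.search(line)
    if m:
        print(m.group("height"), m.group("gas"), m.group("cpu"))
```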

Conclusions

It might be possible to squeeze some more TPS by applying client optimizations from CRT (estimated 20%-40%) and by having multiple RPC nodes to spread the transaction load (estimated 30%-50%).

Even with this, we'll hit a new bottleneck at 8-10k TPS, which is almost entirely independent of the number of shards.

I think we must take steps to scale transaction submission horizontally, and later also the reading of transaction results. Ideas:

  • Shard RPC nodes
  • Allow transactions to be sent to the ChunkProducer that tracks the shard: not user-friendly, but it could work
    • Either by properly fixing the RPC interface on CPs, or by injecting transactions at a lower level. However, the second solution is only a temporary workaround.

@Trisfald
Contributor Author

Results for native token transfers: intra-shard traffic, forknet, no RPC (5 CP / 5 Shards)

Hardware: GCP n2d-standard-16 and n2d-standard-8
Binary: master 10 Mar 2025

Outcome

TPS with n2d-standard-16 machines: 12k
TPS with n2d-standard-8 machines: 11.5k

Grafana link to one benchmark run.

Observations

  • With the default configuration, every Chunk Producer validates witnesses from all other shards, which potentially slows down the network.
  • The computation of metrics such as TRANSACTION_PROCESSED_SUCCESSFULLY_TOTAL takes into account transaction processing from both proper chunk application and state witness validation. RPC nodes always apply all transactions.
  • Using a less powerful machine did not degrade TPS significantly.
  • I used a 600ms block time, as it gave me a 10% TPS increase over the default 1.3s (a config sketch follows this list).
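
A minimal sketch of how the 600ms block time could be applied to each node's config.json, assuming the key is `consensus.min_block_production_delay` with a `{secs, nanos}` value as in nearcore's config; verify against the binary version in use:

```python
# Sketch: patch a node's config.json to use a 600 ms block time.
# The consensus.min_block_production_delay key and the ~/.near home dir are assumptions.
import json
from pathlib import Path

def set_block_time(config_path: Path, millis: int) -> None:
    config = json.loads(config_path.read_text())
    config.setdefault("consensus", {})["min_block_production_delay"] = {
        "secs": millis // 1000,
        "nanos": (millis % 1000) * 1_000_000,
    }
    config_path.write_text(json.dumps(config, indent=2))

set_block_time(Path.home() / ".near" / "config.json", 600)
```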

@Trisfald
Contributor Author

Results for native token transfers: intra-shard traffic, forknet, no RPC (10 CP / 10 Shards)

Hardware: GCP n2d-standard-8
Binary: multiple versions of master ~13 March

Config

  • Block time: 600ms
  • Unlimited setup (no state witness limits, no bandwidth scheduler, no gas limit, no congestion control)

Outcome

Validator mandates   Client perf optimization   TPS
68                   no                         17k
3                    no                         20k
2                    no                         20.8k
2                    yes                        30k

Grafana link to one benchmark run.

Observation

  • The RPC node lags behind a lot; it can't follow the chain at all.
