
[Bug] 'headscale' commands unusable under load #2491


Open

arduino43 opened this issue Mar 19, 2025 · 1 comment
Labels
bug, performance

Comments


arduino43 commented Mar 19, 2025

Is this a support request?

  • This is not a support request

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I've noticed that Headscale becomes nearly unusable with 300+ clients; I now have 554 after switching to a more powerful system, and it is completely maxed out. All clients have identical hardware specs and run Debian.

Headscale server (dedicated):
CPU: AMD EPYC 7313
Memory: 128GB
Network: 5Gbps
Headscale version: v0.25.1

1.) Running the headscale CLI results in "Cannot get nodes: context deadline exceeded" 9 times out of 10. The server sits at an average of 45% CPU usage with no traffic; the connected clients are for management only, and only a few MB per day are passed to each client.

2.) After 300+ nodes, pinging becomes finicky and about half of the nodes stop responding. They respond again once systems are removed.

3.) I have one very simple ACL that allows admins access to all nodes, and only one node is an admin (see the policy sketch below).

4.) The config is basic; the only addition is "node_update_check_interval: 90s" in an attempt to minimize load, but I'm not seeing much difference with it enabled or disabled.

I did see a few existing issues regarding CPU usage, but most were resolved by updates. I realize this is a large number of clients; however, with no traffic passing I was expecting much lower load and no intermittent issues.
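For reference, the ACL is of roughly this shape; the group and user names below are illustrative placeholders, not the exact policy from this deployment (headscale accepts HuJSON, so the comment is valid):

```json
{
  // One admin group; its members may reach every node on every port.
  "groups": {
    "group:admin": ["admin"]
  },
  "acls": [
    {
      "action": "accept",
      "src": ["group:admin"],
      "dst": ["*:*"]
    }
  ]
}
```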

Expected Behavior

System runs without issue

Steps To Reproduce

1.) Add clients to the server; after 300+ the system stops functioning correctly.

Environment

- OS: Debian 12
- Headscale version: v0.25.1
- Tailscale version: 1.80.3

Runtime environment

  • Headscale is behind a (reverse) proxy
  • Headscale runs in a container

Debug information

Node

arduino43 added the bug label Mar 19, 2025
@kradalby
Collaborator

After 300+ nodes, pinging becomes finicky and results in 1/2 of the nodes not responding. They do respond once systems are removed

Headscale just isn't made for this; throwing more hardware at the problem only works up to a certain point.

After some discussions on Discord, I wrote up "Scaling / How many clients does Headscale support?".

But well, if you say 300 is the limit, then my example with 1000 might be too much.

Running headscale cli results in "Cannot get nodes: context deadline exceeded" 9/10 times. The server is sitting avg 45% CPU usage with no traffic; the clients connected are for management only, only a few Mb per day is passed to each client.

I'll try to break this up:

Cannot get nodes: context deadline exceeded: The server is probably quite busy, and the CLI hits some lock it has to wait for, which takes longer than the gRPC timeout. It does not look like we expose an option to configure that, but a longer timeout might give you an answer "eventually". PRs to make it configurable are welcome.
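To illustrate where that error comes from: the CLI issues gRPC calls to the server with a client-side deadline, and once it elapses the call fails with "context deadline exceeded". A minimal sketch of the pattern, assuming a hypothetical configurable timeout (the value, address, and option are assumptions, not headscale's actual code):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Hypothetical configurable value; today the CLI uses a fixed deadline,
	// which is what a PR could expose as a config option.
	cliTimeout := 30 * time.Second

	conn, err := grpc.Dial("127.0.0.1:50443",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Any RPC issued with this context (e.g. the "list nodes" call behind
	// `headscale nodes list`) returns "context deadline exceeded" if the
	// server holds a lock longer than cliTimeout.
	ctx, cancel := context.WithTimeout(context.Background(), cliTimeout)
	defer cancel()
	_ = ctx // the generated headscale v1 client call would go here

	fmt.Println("RPC would run with a", cliTimeout, "deadline")
}
```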

The server is sitting at an average of 45% CPU usage with no traffic: Traffic isn't really relevant, since it should go node to node. The server might be spinning on some continuous small change that needs to go out to the clients. CPU usage isn't really something you can associate with the internal state of the app; it might be stuck on a lock or similar.

Only a few MB per day are passed to each client: Not that relevant, since the traffic goes directly between the clients.

node_update_check_interval

This option does not exist anymore.

I have 1 very simple ACL to allow admins access to all nodes, and only one node is admin.

There are no particular optimisations for ACLs, so it should not matter too much. Surprisingly, if we do start adding them, simpler policies might turn out worse for performance, but that's something we can only say in the future.

I would say this isn't so much a bug as "not a feature", at least not yet.
