
[Bug] 'headscale' commands unusable under load #2491


Open

arduino43 opened this issue Mar 19, 2025 · 1 comment
Labels
bug, performance

Comments


arduino43 commented Mar 19, 2025

Is this a support request?

  • This is not a support request

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I've noticed that Headscale becomes nearly unusable with 300+ clients; I now have 554 after switching to a more powerful system, and it is completely maxed out. All clients have identical hardware specs and run Debian.

Headscale server (dedicated):
CPU: AMD EPYC 7313
Memory: 128GB
Network: 5Gbps
Headscale version: v0.25.1

1.) Running the headscale CLI results in "Cannot get nodes: context deadline exceeded" 9 times out of 10. The server sits at an average of 45% CPU usage with no traffic; the connected clients are for management only, and only a few MB per day are passed to each client.

2.) After 300+ nodes, pinging becomes finicky and about half of the nodes stop responding. They respond again once systems are removed.

3.) I have one very simple ACL that allows admins access to all nodes, and only one node is an admin (see the policy sketch below).

4.) The config is basic; the only addition is "node_update_check_interval: 90s" in an attempt to minimize load, but I'm not seeing much difference with it enabled or disabled.

I did see a few existing issues regarding CPU usage, but most were resolved by updates. I realize this is a large number of clients; however, with no traffic passing I was expecting much lower load and no intermittent issues.
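For reference, the ACL is of roughly this shape; the group and user names below are illustrative placeholders, not the exact policy from this deployment (headscale accepts HuJSON, so the comment is valid):

```json
{
  // One admin group; its members may reach every node on every port.
  "groups": {
    "group:admin": ["admin"]
  },
  "acls": [
    {
      "action": "accept",
      "src": ["group:admin"],
      "dst": ["*:*"]
    }
  ]
}
```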

Expected Behavior

System runs without issue

Steps To Reproduce

1.) Add clients to the server; after 300+ the system stops functioning correctly.

Environment

- OS: Debian 12
- Headscale version: v0.25.1
- Tailscale version: 1.80.3

Runtime environment

  • Headscale is behind a (reverse) proxy
  • Headscale runs in a container

Debug information

Node

arduino43 added the bug label Mar 19, 2025
@kradalby
Collaborator

After 300+ nodes, pinging becomes finicky and results in 1/2 of the nodes not responding. They do respond once systems are removed

Headscale just isn't made for this; throwing more hardware at the problem only works up to a certain point.

After some discussions on Discord, I wrote up "Scaling / How many clients does Headscale support?".

But well, if you say 300 is the limit, then my example with 1000 might be too much.

Running headscale cli results in "Cannot get nodes: context deadline exceeded" 9/10 times. The server is sitting avg 45% CPU usage with no traffic; the clients connected are for management only, only a few Mb per day is passed to each client.

I'll try to break this up:

Cannot get nodes: context deadline exceeded: The server is probably quite busy, and the CLI hits some lock it has to wait for, which takes longer than the gRPC timeout. It does not look like we expose an option to configure that, but a longer timeout might give you an answer "eventually". PRs to make it configurable are welcome.
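To illustrate where that error comes from: the CLI issues gRPC calls to the server with a client-side deadline, and once it elapses the call fails with "context deadline exceeded". A minimal sketch of the pattern, assuming a hypothetical configurable timeout (the value, address, and option are assumptions, not headscale's actual code):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Hypothetical configurable value; today the CLI uses a fixed deadline,
	// which is what a PR could expose as a config option.
	cliTimeout := 30 * time.Second

	conn, err := grpc.Dial("127.0.0.1:50443",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Any RPC issued with this context (e.g. the "list nodes" call behind
	// `headscale nodes list`) returns "context deadline exceeded" if the
	// server holds a lock longer than cliTimeout.
	ctx, cancel := context.WithTimeout(context.Background(), cliTimeout)
	defer cancel()
	_ = ctx // the generated headscale v1 client call would go here

	fmt.Println("RPC would run with a", cliTimeout, "deadline")
}
```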

The server is sitting at an average of 45% CPU usage with no traffic: Traffic isn't really relevant, since it should go node to node. The server might be spinning on some continuous small change that needs to go out to the clients. CPU usage isn't really something you can associate with the internal state of the app; it might be stuck on a lock or similar.

Only a few MB per day are passed to each client: Not that relevant, since the traffic goes directly between the clients.

node_update_check_interval

This option does not exist anymore.

I have 1 very simple ACL to allow admins access to all nodes, and only one node is admin.

There are no particular optimisations for ACLs, so it should not matter too much. Surprisingly, if we do start adding them, simpler policies might turn out worse for performance, but that's something we can only say in the future.

I would say this isn't so much a bug as "not a feature", at least not yet.
