Current Behavior
I've noticed that Headscale becomes nearly unusable beyond roughly 300 clients; I have 554 after switching to a more powerful system, and it is now completely maxed out. All clients have identical hardware specs and run Debian.
Headscale server (dedicated)
CPU: AMD EPYC 7313
Memory: 128GB
Network: 5Gbps
Headscale version: v0.25.1
1.) Running the headscale CLI results in "Cannot get nodes: context deadline exceeded" 9 out of 10 times. The server sits at an average of 45% CPU usage with no traffic; the connected clients are for management only, and only a few Mb per day are passed to each client.
2.) After 300+ nodes, pinging becomes flaky and roughly half of the nodes stop responding. They respond again once some systems are removed.
3.) I have one very simple ACL that allows admins access to all nodes, and only one node is an admin.
4.) The config is basic; the only addition is "node_update_check_interval: 90s" in an attempt to minimize load, but I'm not seeing much difference with it enabled or disabled.
I did see a few existing issues regarding CPU usage, but most were resolved by updates. I realize this is a large number of clients; however, with no traffic passing I was expecting much lower load and fewer intermittent issues.
Expected Behavior
System runs without issue
Steps To Reproduce
1.) Add clients to the server; after 300+ the system stops functioning correctly.
But well, if you say 300 is the limit, then my example with 1000 might be too much.
Running the headscale CLI results in "Cannot get nodes: context deadline exceeded" 9 out of 10 times. The server sits at an average of 45% CPU usage with no traffic; the connected clients are for management only, and only a few Mb per day are passed to each client.
I'll try to break this up:
Cannot get nodes: context deadline exceeded: The server is probably quite busy, and the CLI is hitting a lock it has to wait on for longer than the gRPC timeout allows. It does not look like we expose an option to configure that timeout, but a longer one might give you an answer eventually. PRs are welcome to make it configurable.
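To make the deadline mechanics concrete, here is a minimal, self-contained Go sketch of how a caller-side context deadline produces exactly that error when the server is slow to answer. This is not Headscale's CLI code; slowNodeStore, ListNodes, and fetchNodes are made-up names for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowNodeStore simulates a server-side handler that is stuck behind a
// busy lock and only answers after `delay`. Purely illustrative; this is
// not Headscale code.
type slowNodeStore struct{ delay time.Duration }

func (s slowNodeStore) ListNodes(ctx context.Context) ([]string, error) {
	select {
	case <-time.After(s.delay): // the server eventually responds
		return []string{"node-1", "node-2"}, nil
	case <-ctx.Done(): // the caller's deadline fired first
		return nil, ctx.Err()
	}
}

// fetchNodes bounds the call with a deadline, the same mechanism that
// surfaces as "context deadline exceeded" on the caller's side.
func fetchNodes(store slowNodeStore, timeout time.Duration) ([]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return store.ListNodes(ctx)
}

func main() {
	store := slowNodeStore{delay: 2 * time.Second}

	// A short deadline fails the same way the CLI does against a busy server.
	if _, err := fetchNodes(store, 500*time.Millisecond); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("cannot get nodes:", err)
	}

	// A longer deadline lets the same call finish.
	if nodes, err := fetchNodes(store, 5*time.Second); err == nil {
		fmt.Println("got nodes:", nodes)
	}
}
```

The server never "fails" in this sketch; it simply takes longer than the caller is willing to wait, which is why a configurable (or simply longer) timeout could still return a result eventually.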
The server is sitting at an average of 45% CPU usage with no traffic: Traffic isn't really relevant, since data should go node to node. The server might be spinning on some continuous small change that needs to be pushed to the clients. CPU usage isn't something you can directly associate with the internal state of the app; it might be stuck on a lock or similar (see the lock-contention sketch after this breakdown).
only a few Mb per day is passed to each client: Not that relevant, since the traffic goes directly between the clients.
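As a rough illustration of the lock-contention point above (not Headscale's actual architecture), the sketch below shows how a single mutex around a node table lets a background update fan-out starve a management request as the node count grows; coordinator, pushUpdates, and listNodes are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// coordinator is an illustrative stand-in for a control server that
// guards its node table with one mutex. It only demonstrates how lock
// contention can make an otherwise idle-looking server slow to answer
// management requests.
type coordinator struct {
	mu    sync.Mutex
	nodes []string
}

// pushUpdates simulates fanning out a small state change to every
// connected client while holding the lock.
func (c *coordinator) pushUpdates() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for range c.nodes {
		time.Sleep(2 * time.Millisecond) // per-client work done under the lock
	}
}

// listNodes is what a management request has to wait for: it cannot
// proceed until the update fan-out releases the lock.
func (c *coordinator) listNodes() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make([]string, len(c.nodes))
	copy(out, c.nodes)
	return out
}

func main() {
	c := &coordinator{}
	for i := 0; i < 500; i++ {
		c.nodes = append(c.nodes, fmt.Sprintf("node-%d", i))
	}

	go func() {
		for { // some small change keeps churning in the background
			c.pushUpdates()
		}
	}()

	time.Sleep(10 * time.Millisecond) // let the fan-out grab the lock first
	start := time.Now()
	_ = c.listNodes()
	fmt.Printf("listNodes waited %v behind the update fan-out\n", time.Since(start))
}
```

With 500 illustrative nodes, each fan-out pass holds the lock for roughly a second, so a management call that queues behind it can easily exceed a short gRPC deadline even though no user data moves through the server; this kind of contention scales with node count, not with traffic.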
node_update_check_interval
This option does not exist anymore.
I have 1 very simple ACL to allow admins access to all nodes, and only one node is admin.
There are no particular optimisations for ACLs, so it should not matter too much. Surprisingly, if we do start adding them, a simpler policy might turn out to be worse for performance, but that is something we can only say in the future.
I would say this isn't so much a bug as "not a feature", at least not yet.