Add zeebe overview dashboard #9067
The dashboard shows a general overview of the Zeebe system. It shows metrics like:
* Partition topology and health
* Current processing, exporting, incoming requests, and backpressure
* Resource consumption (CPU, memory, and disk)
* Process instance execution latency
Thanks @Zelldon for taking the initiative. It is high time we have something like this. Here are my comments.
Since this is supposed to be a small dashboard, we can remove the collapsible section.
Remove the following panels. All of them can stay in the detailed panels.
- Current Events - I don't think this is interesting for first-level support.
- Exporting per partition - Same here.
- Number of log segments - PVC disk usage already gives some indication of this value, so this is only needed for further investigation.
- Processing Queue Size - I guess a first-level support engineer cannot make sense of this value.
- Process Instance Execution Time - Optional. I'm not sure it is useful to have it in the overview. The latencies depend on the process, so it is probably not very useful as a first-level health check.
Rename "PVC Disk Usage" to "(Zeebe) Brokers Disk Usage".
For "Request handled by Gateway", I would use a query similar to the one in "grpc requests handled per sec" in the gRPC panel. I think it is useful to know how many requests succeed and how many fail.
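A minimal Python sketch of what such a split query computes, assuming the gateway exposes a cumulative request counter labeled by gRPC status code (this metric and label layout is hypothetical, not necessarily Zeebe's actual metric names). It takes two samples of the counter an interval apart, turns the deltas into per-second rates, and buckets them into success (status `OK`) and failure (any other code), similar in spirit to a PromQL `rate()` grouped by status code, minus counter-reset handling.

```python
from collections import defaultdict


def request_rates_by_outcome(before, after, interval_seconds):
    """Split the per-second request rate into "success" and "failure".

    `before` and `after` map a gRPC status code (e.g. "OK") to the value of a
    cumulative request counter, sampled `interval_seconds` apart. The metric
    and label layout is hypothetical; counter resets are not handled.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    rates = defaultdict(float)
    for code, value in after.items():
        # Requests that finished with this status code during the interval.
        delta = value - before.get(code, 0.0)
        # gRPC reports successful calls with status OK; any other code counts as a failure.
        outcome = "success" if code == "OK" else "failure"
        rates[outcome] += delta / interval_seconds
    return dict(rates)


# Example: two samples taken 60 seconds apart.
print(request_rates_by_outcome(
    {"OK": 1000.0, "RESOURCE_EXHAUSTED": 10.0},
    {"OK": 1600.0, "RESOURCE_EXHAUSTED": 70.0},
    60.0,
))  # {'success': 10.0, 'failure': 1.0}
```

Grouping by status code instead of summing all requests is what lets the panel show failed requests next to successful ones rather than one combined number.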
Apply review hints and remove panels like:
* current events
* number of segments
* processing queue
* process latency

Kept exporters, since I think it makes sense to see them. Reorder panels and remove the row.
Thanks @deepthidevaki for the review :) I was unsure about the row, but I removed it now.
Yeah, I thought it might be interesting for internal load and to see what is processed, but maybe the processing panel is enough for now. We can add it back later if we see the need.
I will keep it. I think this is important since you can verify whether anything is currently being exported, which is useful for detecting issues such as Operate not showing data.
Yeah, that was probably my own curiosity :D Agreed, I removed it.
Probably right. I removed it.
I removed it as well :) See the new panels:
[Screenshot: new dashboard panels]
❤️ 🚀
As a follow-up, we may have to update the troubleshooting guide to link to this dashboard. It would also be good to test it during the next Gameday, to see whether the information in this dashboard is useful and sufficient for an overview.
Yep, totally agree :) First I need to get this into cloud, and updating the troubleshooting guide is also on my to-do list :D 👍
bors r+
9067: Add zeebe overview dashboard r=Zelldon a=Zelldon
Co-authored-by: Christopher Zell
🚀
Late but still wanted to comment 😄
Build failed.
bors r+
Build failed.
Description
This feature was already part of several discussions and plans, but we never had time for it. It came up again in the most recent gameday (https://confluence.camunda.com/display/ZEEBE/2022-03-23+Game+Day) and in #8998: people like support engineers should have a dashboard that shows a good overview without overwhelming them with too much detail.
Yesterday at the airport I had some time to play around with it and put together a dashboard with the panels I would normally check to detect issues.
It shows metrics like:
* Partition topology and health
* Current processing, exporting, incoming requests, and backpressure
* Resource consumption (CPU, memory, and disk)
* Process instance execution latency
Tbh I'm a bit proud of the first table, since I was finally able to combine the partition topology (roles) and healthiness in one table 💪
You can take a look here: http://34.77.165.228/d/NzsO1mUnk/zeebe-overview?orgId=1&refresh=10s&var-DS_PROMETHEUS=Prometheus&var-namespace=All&var-pod=All&var-partition=All&from=now-15m&to=now
Related issues
I think this covers #8998, but we can discuss it, @deepthidevaki, if you think we need more here.
Definition of Done
Not all items need to be done depending on the issue and the pull request.
Code changes:
backport stable/1.3 (add this label to the PR; in case that fails, you need to create backports manually).
Testing:
Documentation: