Adding EC troubleshooting page #3057
base: main
✅ Deploy Preview for replicated-docs ready!
✅ Deploy Preview for replicated-docs-upgrade ready!

For more information, see [NVIDIA GPU Operator](/vendor/embedded-using#nvidia-gpu-operator) in _Using Embedded Cluster_.

### Calico networking issues
Pulled this from https://community.replicated.com/t/troubleshooting-calico-networking-issues/1458
This was posted back in November. Not sure how relevant this still is (some of it sounded like it might have been taken care of by preflight checks)
@nvanthao, it looks like you were the author of the Community article about troubleshooting Calico networking issues (I used your article to create this troubleshooting info).
When you get the time, would you be able to take a look at this "Calico networking issues" section and let me know if the information here is still accurate and up-to-date? Thank you!
This LGTM. Only one piece of feedback: in the "Incorrect kernel parameter values" section, this check is now part of the EC host preflights. I wonder if we should include the command below as part of the troubleshooting process.
sudo ./APP_SLUG install run-preflights --license license.yaml
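
If the preflight command suggested above were added to the troubleshooting steps, it could be wrapped with a couple of sanity checks. The following is only a sketch: `run_preflights` is a hypothetical helper name, and the app slug and license path are placeholders. The subcommand and `--license` flag are taken from the comment above.

```bash
#!/usr/bin/env bash
# Sketch: re-run Embedded Cluster host preflights as a troubleshooting step.
# run_preflights is a hypothetical helper; the slug and license path are placeholders.

run_preflights() {
  local app_slug="$1" license_file="$2"

  # The installer binary is expected in the current directory.
  if [ ! -x "./$app_slug" ]; then
    echo "installer ./$app_slug not found or not executable" >&2
    return 1
  fi

  # The license file must exist before preflights can run.
  if [ ! -f "$license_file" ]; then
    echo "license file $license_file not found" >&2
    return 1
  fi

  # Same command as in the review comment, with the placeholders filled in.
  sudo "./$app_slug" install run-preflights --license "$license_file"
}

# Usage (on the host where the installer was downloaded):
#   run_preflights APP_SLUG license.yaml
```

Checking for the binary and license first keeps a failed host preflight easy to tell apart from a simple path mistake.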
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Troubleshooting Embedded Cluster
new topic with sections on:
- Generate support bundles
- New View logs (EC logs, EC Operator pod logs, k0s logs)
- Access the cluster
- New Troubleshoot errors (error- or issue-specific troubleshooting steps; we can iterate on this to add specific advice over time)

To view installation logs for Embedded Cluster:

1. SSH onto a controller node.
SSH onto a controller node.
I put "controller node" for all the procedures on viewing different types of logs. I could see there being use cases for wanting to see logs on worker nodes, but thought controller node was a good default. Not really sure if this is important/worth calling out

### View Embedded Cluster Operator Pod Logs

The Embedded Cluster Operator is used for reporting purposes as well as some cleanup operations. The `embedded-cluster-operator` pod contains logs that can be useful for troubleshooting, including the reconciliation loop for Embedded Cluster installation.
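
As a rough illustration of pulling those logs, the sketch below lists pods and prints recent log lines for any whose name contains `embedded-cluster-operator`. It assumes `kubectl` is already configured for the cluster (for example, from the Embedded Cluster shell) and that the operator runs in a namespace named `embedded-cluster`; the namespace and the `operator_logs` helper name are assumptions, not taken from this page.

```bash
#!/usr/bin/env bash
# Sketch: print recent logs for Embedded Cluster Operator pods.
# Assumption: the operator namespace is "embedded-cluster"; pass another
# namespace as the first argument if your installation differs.

operator_logs() {
  local namespace="${1:-embedded-cluster}"
  local pod

  # "-o name" prints one "pod/<name>" entry per line.
  kubectl get pods -n "$namespace" -o name |
    grep 'embedded-cluster-operator' |
    while read -r pod; do
      echo "== $pod =="
      # Limit output to the most recent 200 lines per pod.
      kubectl logs -n "$namespace" "$pod" --tail=200
    done
}

# Usage:
#   operator_logs                # default namespace (assumed)
#   operator_logs my-namespace   # explicit namespace
```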
Ada mentioned operator pod logs can be helpful: https://replicated.slack.com/archives/C036BTS7JCE/p1741596421942829?thread_ts=1741379898.921919&cid=C036BTS7JCE
I don't know how much reconciliation the operator does now. I would have Salah read this page to be sure it's right too.

This section provides troubleshooting advice for common errors.

### Installation failure when NVIDIA GPU Operator is included as Helm extension
Pulled from the note in the NVIDIA operator section: https://docs.replicated.com/vendor/embedded-using#nvidia-gpu-operator
1. Type `exit` or **Ctrl + D** to exit the shell.
:::note
If you encounter a typical workflow where your customers have to use the Embedded Cluster shell, reach out to Alex Parker at [email protected]. These workflows might be candidates for additional Admin Console functionality.
:::
^ just pulled this intro content out of the partial so that it doesn't have to show up in the Troubleshooting topic as well

<SupportBundleIntro/>

<EmbeddedClusterSupportBundle/>
^ Removed this section from the Using Embedded Cluster topic. This info would now appear:
- In the new Troubleshooting EC topic
- Generating Support Bundles for EC in the Preflights/SB section of the docs
@@ -22,7 +20,7 @@ To generate a support bundle:

 Where `APP_SLUG` is the unique slug for the application.

-### For Versions Earlier Than 1.17.0
+### Generate a Bundle For Versions Earlier Than 1.17.0
We're probably not far enough out yet, but we could take this out at some point in the not-too-distant future.
```
journalctl -u k0scontroller
```
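
When the full k0s log is too noisy, the same command can be narrowed by time and priority. This is a sketch only: `k0s_recent_errors` is a hypothetical helper name, and the default one-hour window and `err` priority are illustrative choices rather than values from this documentation. The flags used (`-u`, `--since`, `-p`, `--no-pager`) are standard `journalctl` options.

```bash
#!/usr/bin/env bash
# Sketch: show recent error-level entries from the k0s controller unit.
# The unit name comes from the command above; the defaults are illustrative.

k0s_recent_errors() {
  local since="${1:-1 hour ago}"

  # -u selects the systemd unit, --since bounds the time range,
  # -p err keeps entries at priority "err" or more severe,
  # --no-pager writes straight to stdout so output can be piped or saved.
  journalctl -u k0scontroller --since "$since" -p err --no-pager
}

# Usage (on a controller node; may require sudo depending on host setup):
#   k0s_recent_errors
#   k0s_recent_errors "2 hours ago" > k0s-errors.log
```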

## Access the Cluster
Is it weird that you've run this command several times above, and then we mention it down here? No strong opinion, just asking.
Yeah good question, I didn't mind repeating it because I was working with the assumption that people wouldn't necessarily be reading through this topic in order and end-to-end, but instead jumping around to different procedures based on what they need.
(And then if we do end up removing the procedures on generating a bundle in earlier versions & viewing EC Operator pod logs, I think this will actually be the only section that shows how to access the cluster)
Co-authored-by: Alex Parker <[email protected]>
New Troubleshooting Embedded Cluster topic: https://deploy-preview-3057--replicated-docs.netlify.app/vendor/embedded-troubleshooting