Adding EC troubleshooting page #3057

Open: wants to merge 19 commits into base: main

@paigecalvert commented Feb 17, 2025

@replicated-ci added the type::docs (Improvements or additions to documentation) and type::feature labels Feb 17, 2025
netlify bot commented Feb 17, 2025

Deploy Preview for replicated-docs ready!

Name Link
🔨 Latest commit 6ad68f9
🔍 Latest deploy log https://app.netlify.com/sites/replicated-docs/deploys/67dc4fafa2395f0008cfb38f
😎 Deploy Preview https://deploy-preview-3057--replicated-docs.netlify.app

netlify bot commented Feb 17, 2025

Deploy Preview for replicated-docs-upgrade ready!

Name Link
🔨 Latest commit 6ad68f9
🔍 Latest deploy log https://app.netlify.com/sites/replicated-docs-upgrade/deploys/67dc4faf8fd6980008a46763
😎 Deploy Preview https://deploy-preview-3057--replicated-docs-upgrade.netlify.app

For more information, see [NVIDIA GPU Operator](/vendor/embedded-using#nvidia-gpu-operator) in _Using Embedded Cluster_.

### Calico networking issues
Contributor Author

Pulled this from https://community.replicated.com/t/troubleshooting-calico-networking-issues/1458

This was posted back in November. Not sure how relevant this still is (some of it sounded like it might have been taken care of by preflight checks)

Contributor Author

@nvanthao, it looks like you were the author of the Community article about troubleshooting Calico networking issues (I used your article to create this troubleshooting info).

When you get the time, would you be able to take a look at this "Calico networking issues" section and let me know if the information here is still accurate and up-to-date? Thank you!

Member

This LGTM. One piece of feedback on the "Incorrect kernel parameter values" section: this check is now part of the EC host preflights. I wonder if we should include the command below as part of the troubleshooting process.

`sudo ./APP_SLUG install run-preflights --license license.yaml`
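
As a rough illustration of how that preflight step could be slotted in before the Calico-specific checks, here is a minimal sketch. The `my-app` default slug, the `RUN` switch, and the `preflight_cmd` helper are hypothetical names introduced for the example, and it assumes the command exits non-zero when a host preflight fails, which nothing in this thread confirms.

```shell
#!/usr/bin/env bash
# Sketch only: print (or optionally run) the host preflight command before
# deeper troubleshooting. APP_SLUG and LICENSE_FILE defaults are placeholders.
APP_SLUG="${APP_SLUG:-my-app}"               # hypothetical application slug
LICENSE_FILE="${LICENSE_FILE:-license.yaml}" # license path used in the command above

# Echo the exact command so it can be reviewed before anything runs with sudo.
preflight_cmd() {
  echo "sudo ./${APP_SLUG} install run-preflights --license ${LICENSE_FILE}"
}

if [ "${RUN:-0}" = "1" ]; then
  # Assumption: the command exits non-zero when a host preflight fails.
  if sudo "./${APP_SLUG}" install run-preflights --license "${LICENSE_FILE}"; then
    echo "Host preflights passed."
  else
    echo "Host preflights reported problems; resolve them before continuing."
  fi
else
  preflight_cmd
fi
```

By default the script only prints the command; setting `RUN=1` executes it on the node.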

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Troubleshooting Embedded Cluster
Contributor Author

New topic with sections on:

  • Generate support bundles
  • (New) View logs (EC logs, EC Operator pod logs, k0s logs)
  • Access the cluster
  • (New) Troubleshoot errors (error- and issue-specific troubleshooting steps; we can iterate on this to add specific advice over time)


To view installation logs for Embedded Cluster:

1. SSH onto a controller node.
Contributor Author

SSH onto a controller node.

I put "controller node" for all the procedures on viewing different types of logs. I could see there being use cases for wanting to see logs on worker nodes, but thought controller node was a good default. Not really sure if this is important/worth calling out


### View Embedded Cluster Operator Pod Logs

The Embedded Cluster Operator is used for reporting purposes as well as some clean up operations. The `embedded-cluster-operator` pod contains logs that can be useful for troubleshooting, including the reconciliation loop for Embedded Cluster installation.
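
A minimal sketch of how the operator logs might be pulled once you have a shell with cluster access. The `embedded-cluster` namespace, the `deploy/embedded-cluster-operator` target, and the `operator_logs_cmd` helper are assumptions made for illustration; check them against the actual installation before relying on them.

```shell
#!/usr/bin/env bash
# Sketch only: build a kubectl command for the Embedded Cluster Operator logs.
# Both defaults below are assumptions, not values confirmed in this thread.
EC_NAMESPACE="${EC_NAMESPACE:-embedded-cluster}"
EC_OPERATOR="${EC_OPERATOR:-deploy/embedded-cluster-operator}"

# Print the command; the optional first argument limits how many lines are shown.
operator_logs_cmd() {
  echo "kubectl logs ${EC_OPERATOR} -n ${EC_NAMESPACE} --tail=${1:-200}"
}

operator_logs_cmd
```

The script only prints the command, so it can be copied into the cluster shell on a controller node and adjusted there.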

Member

I don't know how much reconciliation the operator does now. I would have Salah read this page to be sure it's right too.


This section provides troubleshooting advice for common errors.

### Installation failure when NVIDIA GPU Operator is included as Helm extension
Contributor Author

Pulled from the note in the NVIDIA operator section: https://docs.replicated.com/vendor/embedded-using#nvidia-gpu-operator

1. Type `exit` or **Ctrl + D** to exit the shell.
:::note
If you encounter a typical workflow where your customers have to use the Embedded Cluster shell, reach out to Alex Parker at [email protected]. These workflows might be candidates for additional Admin Console functionality.
:::
Contributor Author

^ just pulled this intro content out of the partial so that it doesn't have to show up in the Troubleshooting topic as well


<SupportBundleIntro/>

<EmbeddedClusterSupportBundle/>
Contributor Author

^ Removed this section from the Using Embedded Cluster topic. This info would now appear:

@paigecalvert paigecalvert marked this pull request as ready for review March 10, 2025 20:58
@paigecalvert paigecalvert requested a review from a team as a code owner March 10, 2025 20:58
@@ -22,7 +20,7 @@ To generate a support bundle:

Where `APP_SLUG` is the unique slug for the application.

-### For Versions Earlier Than 1.17.0
+### Generate a Bundle For Versions Earlier Than 1.17.0
Member

we're probably not far out enough yet, but we could take this out at some point in the not too distant future


```
journalctl -u k0scontroller
```
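
Since worker nodes came up earlier in this thread, here is a small sketch of how the same `journalctl` call could be parameterized by node role and bounded in time. The `k0sworker` unit name is an assumption based on the default k0s service names, and the `k0s_unit` and `k0s_logs_cmd` helpers are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: pick the k0s systemd unit for a node role and print a
# time-bounded journalctl command for it.

# Map a node role to its systemd unit name (k0sworker is an assumption).
k0s_unit() {
  case "$1" in
    controller) echo "k0scontroller" ;;
    worker)     echo "k0sworker" ;;
    *)          echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Print the journalctl command; the second argument is passed to --since.
k0s_logs_cmd() {
  local unit
  unit="$(k0s_unit "$1")" || return 1
  echo "journalctl -u ${unit} --since \"${2:-1 hour ago}\" --no-pager"
}

k0s_logs_cmd controller
```

Limiting the query with `--since` keeps the output manageable on long-running nodes, and `--no-pager` makes it easier to redirect into a file.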

## Access the Cluster
Member

is it weird that you've run this command several times above, and then we mention it down here? no strong opinion, just asking

Contributor Author

Yeah good question, I didn't mind repeating it because I was working with the assumption that people wouldn't necessarily be reading through this topic in order and end-to-end, but instead jumping around to different procedures based on what they need.

(And then if we do end up removing the procedures on generating a bundle in earlier versions & viewing EC Operator pod logs, I think this will actually be the only section that shows how to access the cluster)

4 participants