Adding EC troubleshooting page #3057

Merged · 23 commits · Mar 25, 2025

Changes from 10 commits
45 changes: 45 additions & 0 deletions docs/partials/embedded-cluster/_shell-command.mdx
@@ -0,0 +1,45 @@
To access the cluster and use other included binaries:

1. SSH onto a controller node.

:::note
You cannot run the `shell` command on worker nodes.
:::

1. Use the Embedded Cluster shell command to start a shell with access to the cluster:

```bash
sudo ./APP_SLUG shell
```
Where `APP_SLUG` is the unique slug for the application.

The output looks similar to the following:
```
__4___
_ \ \ \ \ Welcome to APP_SLUG debug shell.
<'\ /_/_/_/ This terminal is now configured to access your cluster.
((____!___/) Type 'exit' (or CTRL+d) to exit.
\0\0\0\0\/ Happy hacking.
~~~~~~~~~~~
root@alex-ec-2:/home/alex# export KUBECONFIG="/var/lib/embedded-cluster/k0s/pki/admin.conf"
root@alex-ec-2:/home/alex# export PATH="$PATH:/var/lib/embedded-cluster/bin"
root@alex-ec-2:/home/alex# source <(kubectl completion bash)
root@alex-ec-2:/home/alex# source /etc/bash_completion
```

The appropriate kubeconfig is exported, and the location of useful binaries like kubectl and Replicated’s preflight and support-bundle plugins is added to PATH.

1. Use the available binaries as needed.

**Example**:

```bash
kubectl version
```
```
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1+k0s
```

1. Type `exit` or **Ctrl + D** to exit the shell.
5 changes: 3 additions & 2 deletions docs/partials/support-bundles/_ec-support-bundle-intro.mdx
@@ -1,5 +1,6 @@
Embedded Cluster includes a default support bundle spec that collects both host- and cluster-level information.
Embedded Cluster includes a default support bundle spec that collects both host- and cluster-level information:

The host-level information is useful for troubleshooting failures related to host configuration like DNS, networking, or storage problems. Cluster-level information includes details about the components provided by Replicated, such as the Admin Console and Embedded Cluster operator that manage install and upgrade operations. If the cluster has not installed successfully and cluster-level information is not available, then it is excluded from the bundle.
* The host-level information is useful for troubleshooting failures related to host configuration like DNS, networking, or storage problems.
* Cluster-level information includes details about the components provided by Replicated, such as the Admin Console and Embedded Cluster Operator that manage install and upgrade operations. If the cluster has not installed successfully and cluster-level information is not available, then it is excluded from the bundle.

In addition to the host- and cluster-level details provided by the default Embedded Cluster spec, support bundles generated for Embedded Cluster installations also include app-level details provided by any custom support bundle specs that you included in the application release.
6 changes: 2 additions & 4 deletions docs/partials/support-bundles/_generate-bundle-ec.mdx
@@ -1,6 +1,4 @@
There are different steps to generate a support bundle depending on the version of Embedded Cluster installed.

### For Versions 1.17.0 and Later
### Generate a Bundle For Versions 1.17.0 and Later

For Embedded Cluster 1.17.0 and later, you can run the Embedded Cluster `support-bundle` command to generate a support bundle.

@@ -22,7 +20,7 @@ To generate a support bundle:

Where `APP_SLUG` is the unique slug for the application.

### For Versions Earlier Than 1.17.0
### Generate a Bundle For Versions Earlier Than 1.17.0

**Member:** we're probably not far out enough yet, but we could take this out at some point in the not too distant future

For Embedded Cluster versions earlier than 1.17.0, you can generate a support bundle from the shell using the kubectl support-bundle plugin.

222 changes: 222 additions & 0 deletions docs/vendor/embedded-troubleshooting.mdx
@@ -0,0 +1,222 @@
import SupportBundleIntro from "../partials/support-bundles/_ec-support-bundle-intro.mdx"
import EmbeddedClusterSupportBundle from "../partials/support-bundles/_generate-bundle-ec.mdx"
import ShellCommand from "../partials/embedded-cluster/_shell-command.mdx"
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Troubleshooting Embedded Cluster

**Contributor Author:** new topic with sections on:

* Generate support bundles
* View logs (EC logs, EC Operator pod logs, k0s logs)
* Access the cluster
* Troubleshoot errors (error/issue-specific troubleshooting steps; can iterate on this to add specific advice over time)

This topic provides information about troubleshooting Replicated Embedded Cluster installations. For more information about Embedded Cluster, including built-in extensions and architecture, see [Embedded Cluster Overview](/vendor/embedded-overview).

## Troubleshoot with Support Bundles

This section includes information about how to collect support bundles for Embedded Cluster installations. For more information about support bundles, see [About Preflight Checks and Support Bundles](/vendor/preflight-support-bundle-about).

### About the Default Embedded Cluster Support Bundle Spec

<SupportBundleIntro/>

<EmbeddedClusterSupportBundle/>

## View Logs

You can view logs for both Embedded Cluster and the systemd k0s service running in the cluster to help troubleshoot Embedded Cluster deployments.

### View Installation Logs for Embedded Cluster

To view installation logs for Embedded Cluster:

1. SSH onto a controller node.

**Contributor Author:** (re "SSH onto a controller node") I put "controller node" for all the procedures on viewing different types of logs. I could see there being use cases for wanting to see logs on worker nodes, but thought controller node was a good default. Not really sure if this is important/worth calling out.

1. Navigate to `/var/log/embedded-cluster` and open the `.log` file to view logs.
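
The log directory can accumulate multiple `.log` files over time. A small helper like the following can surface the newest one (a sketch; the `latest_log` name is hypothetical, and the path comes from the step above):

```shell
# Hypothetical helper: print the path of the newest .log file in a directory.
latest_log() {
  ls -t "${1:?usage: latest_log DIR}"/*.log 2>/dev/null | head -n 1
}

# On a controller node, assuming the default log location:
#   less "$(latest_log /var/log/embedded-cluster)"
```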

### View Embedded Cluster Operator Pod Logs

The Embedded Cluster Operator is used for reporting purposes as well as some cleanup operations. The `embedded-cluster-operator` pod contains logs that can be useful for troubleshooting, including logs from the reconciliation loop for Embedded Cluster installation.

**Member:** I don't know how much reconciliation the operator does now. I would have Salah read this page to be sure it's right too.

To view Embedded Cluster Operator pod logs:

1. SSH onto a controller node.

1. Use the Embedded Cluster shell command to start a shell with access to the cluster:

```bash
sudo ./APP_SLUG shell
```
Where `APP_SLUG` is the unique slug for the application.

1. Get the name of the Embedded Cluster Operator pod from the `embedded-cluster` namespace:

```bash
kubectl get pod -n embedded-cluster
```
The output contains the name of the Operator pod.

1. Use the `kubectl logs` command to view logs for the Operator pod. For more information about this command, including supported options to filter logs, see [kubectl logs](https://kubernetes.io/docs/reference/kubectl/generated/kubectl_logs) in the Kubernetes documentation.

**Example:**

```bash
kubectl logs embedded-operator-123456abcd -n embedded-cluster --limit-bytes=1000
```

### View k0s Logs

You can use the journalctl command line tool to access logs for systemd services, including k0s.

To use journalctl to view k0s logs:

1. SSH onto a controller node.

1. Use journalctl to view logs for the k0s systemd service that was deployed by Embedded Cluster.

**Example:**

```bash
journalctl -u k0scontroller
```

## Access the Cluster

**Member:** is it weird that you've run this command several times above, and then we mention it down here? no strong opinion, just asking

**Contributor Author:** Yeah good question, I didn't mind repeating it because I was working with the assumption that people wouldn't necessarily be reading through this topic in order and end-to-end, but instead jumping around to different procedures based on what they need.

(And then if we do end up removing the procedures on generating a bundle in earlier versions & viewing EC Operator pod logs, I think this will actually be the only section that shows how to access the cluster)

When troubleshooting, it can be useful to inspect cluster resources and view logs using the kubectl command line tool. For additional suggestions related to troubleshooting applications, see [Troubleshooting Applications](https://kubernetes.io/docs/tasks/debug/debug-application/) in the Kubernetes documentation.

<ShellCommand/>

## Troubleshoot Errors

This section provides troubleshooting advice for common errors.

### Installation failure when NVIDIA GPU Operator is included as Helm extension

**Contributor Author:** Pulled from the note in the NVIDIA operator section: https://docs.replicated.com/vendor/embedded-using#nvidia-gpu-operator

#### Symptom

A release that includes the NVIDIA GPU Operator as a Helm extension fails to install.

#### Cause

If there are any containerd services on the host, the NVIDIA GPU Operator will generate an invalid containerd config, causing the installation to fail.

#### Solution

Remove any existing containerd services that are running on the host (such as those deployed by Docker) before attempting to install the release with Embedded Cluster.

For more information, see [NVIDIA GPU Operator](/vendor/embedded-using#nvidia-gpu-operator) in _Using Embedded Cluster_.

### Calico networking issues

#### Symptom

Symptoms of Calico networking issues can include:

* The pod is stuck in a CrashLoopBackOff state with failed health checks:

```
Warning Unhealthy 6h51m (x3 over 6h52m) kubelet Liveness probe failed: Get "http://<ip:port>/readyz": dial tcp <ip:port>: connect: no route to host
Warning Unhealthy 6h51m (x19 over 6h52m) kubelet Readiness probe failed: Get "http://<ip:port>/readyz": dial tcp <ip:port>: connect: no route to host
....
Unhealthy pod/registry-dc699cbcf-pkkbr Readiness probe failed: Get "https://<ip:port>/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Unhealthy pod/registry-dc699cbcf-pkkbr Liveness probe failed: Get "https://<ip:port>/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
...
```

* The pod log contains an I/O timeout:

```
server APIs: config.k8ssandra.io/v1beta1: Get \"https://***HIDDEN***:443/apis/config.k8ssandra.io/v1beta1\": dial tcp ***HIDDEN***:443: i/o timeout"}
```

#### Cause

Reasons can include:

* Pod CIDR and service CIDR overlap with the host network CIDR.

* Incorrect kernel parameter values.

* VXLAN traffic getting dropped. By default, Calico uses VXLAN as the overlay networking protocol, with Always mode. This mode encapsulates all pod-to-pod traffic in VXLAN packets. If the VXLAN packets are filtered by the network for some reason, pods will not be able to communicate with each other.

#### Solution

<Tabs>
<TabItem value="overlap" label="Pod CIDR and service CIDR overlap with the host network CIDR" default>
To troubleshoot pod CIDR and service CIDR overlap with the host network CIDR:

1. Run the following command to verify the pod and service CIDR:
```
cat /etc/k0s/k0s.yaml | grep -i cidr
podCIDR: 10.244.0.0/17
serviceCIDR: 10.244.128.0/17
```
The default pod CIDR is 10.244.0.0/16 and service CIDR is 10.96.0.0/12.

1. View the host routes, excluding Calico interfaces, and ensure there are no overlapping CIDRs:
```
ip route | grep -v cali
default via 10.152.0.1 dev ens4 proto dhcp src 10.152.0.4 metric 100
10.152.0.1 dev ens4 proto dhcp scope link src 10.152.0.4 metric 100
blackhole 10.244.101.192/26 proto 80
169.254.169.254 via 10.152.0.1 dev ens4 proto dhcp src 10.152.0.4 metric 100
```

1. Reset and reboot the installation:

```bash
sudo ./APP_SLUG reset
```
Where `APP_SLUG` is the unique slug for the application.

For more information, see [Reset a Node](/vendor/embedded-using#reset-a-node) in _Using Embedded Cluster_.

1. Reinstall the application with different CIDRs using the `--cidr` flag:

```bash
sudo ./APP_SLUG install --license license.yaml --cidr 172.16.136.0/16
```

For more information, see [Embedded Cluster Install Options](/reference/embedded-cluster-install).
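
The CIDR comparison above can also be done programmatically. The following is a minimal sketch (the `ip_to_int` and `cidr_overlap` helpers are hypothetical; it assumes IPv4 and well-formed `a.b.c.d/len` input):

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Print "overlap" if two CIDR blocks share any addresses, otherwise "ok".
# Two blocks overlap exactly when they agree on the first min(len1, len2) bits.
cidr_overlap() {
  i1=$(ip_to_int "${1%/*}"); len1="${1#*/}"
  i2=$(ip_to_int "${2%/*}"); len2="${2#*/}"
  min=$(( len1 < len2 ? len1 : len2 ))
  mask=$(( min == 0 ? 0 : (0xFFFFFFFF << (32 - min)) & 0xFFFFFFFF ))
  if [ $(( i1 & mask )) -eq $(( i2 & mask )) ]; then echo overlap; else echo ok; fi
}

cidr_overlap 10.244.0.0/17 10.0.0.0/8       # pod CIDR inside a 10.0.0.0/8 host network
cidr_overlap 10.244.0.0/17 192.168.0.0/16   # disjoint ranges
```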
</TabItem>
<TabItem value="kernel" label="Incorrect kernel parameter values">
To troubleshoot incorrect kernel parameter values:

1. Use sysctl to verify that these kernel parameters are set to the following values:

```bash
net.ipv4.conf.default.arp_filter = 0
net.ipv4.conf.default.arp_ignore = 0
net.ipv4.ip_forward = 1
```

1. If the values are not set correctly, run the following commands as root to set them and persist them across reboots:

```bash
sysctl -w net.ipv4.conf.default.arp_filter=0
sysctl -w net.ipv4.conf.default.arp_ignore=0
sysctl -w net.ipv4.ip_forward=1

echo "net.ipv4.conf.default.arp_filter=0" >> /etc/sysctl.conf
echo "net.ipv4.conf.default.arp_ignore=0" >> /etc/sysctl.conf
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf

sysctl -p
```

1. Reset and reboot the installation:

```bash
sudo ./APP_SLUG reset
```
Where `APP_SLUG` is the unique slug for the application.

For more information, see [Reset a Node](/vendor/embedded-using#reset-a-node) in _Using Embedded Cluster_.

1. Re-run the installation.
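
The kernel parameter check above can be scripted. The following is a minimal sketch (the `check_sysctl` helper is hypothetical; it only reads values, so it does not require root):

```shell
# Compare a kernel parameter against the value Calico expects.
check_sysctl() {
  key="$1"; want="$2"
  got="$(sysctl -n "$key" 2>/dev/null)" || got="(unreadable)"
  if [ "$got" = "$want" ]; then
    echo "$key = $got (ok)"
  else
    echo "$key = $got (expected $want)"
  fi
}

check_sysctl net.ipv4.conf.default.arp_filter 0
check_sysctl net.ipv4.conf.default.arp_ignore 0
check_sysctl net.ipv4.ip_forward 1
```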
</TabItem>
<TabItem value="vxlan" label="VXLAN traffic dropped">

As a temporary troubleshooting measure, set the VXLAN mode to `CrossSubnet` and see if the issue persists. With this mode, only traffic between pods on different subnets is encapsulated with VXLAN.

```bash
kubectl patch ippool default-ipv4-ippool --type=merge -p '{"spec": {"vxlanMode": "CrossSubnet"}}'
```

If this resolves the connectivity issues, there is likely an underlying network configuration problem with VXLAN traffic that should be addressed.
</TabItem>
</Tabs>
64 changes: 6 additions & 58 deletions docs/vendor/embedded-using.mdx
@@ -1,7 +1,6 @@
import UpdateOverview from "../partials/embedded-cluster/_update-overview.mdx"
import SupportBundleIntro from "../partials/support-bundles/_ec-support-bundle-intro.mdx"
import EmbeddedClusterSupportBundle from "../partials/support-bundles/_generate-bundle-ec.mdx"
import EcConfig from "../partials/embedded-cluster/_ec-config.mdx"
import ShellCommand from "../partials/embedded-cluster/_shell-command.mdx"

# Using Embedded Cluster

@@ -153,58 +152,13 @@ For more information about updating, see [Performing Updates with Embedded Clust

## Access the Cluster

With Embedded Cluster, end-users are rarely supposed to need to use the CLI. Typical workflows, like updating the application and the cluster, are driven through the Admin Console.
With Embedded Cluster, end users rarely need to use the CLI. Typical workflows, like updating the application and the cluster, can be done through the Admin Console. Nonetheless, there are times when vendors or their customers need to use the CLI for development or troubleshooting.

Nonetheless, there are times when vendors or their customers need to use the CLI for development or troubleshooting.

To access the cluster and use other included binaries:

1. SSH onto a controller node.

1. Use the Embedded Cluster shell command to start a shell with access to the cluster:

```
sudo ./APP_SLUG shell
```

The output looks similar to the following:
```
__4___
_ \ \ \ \ Welcome to APP_SLUG debug shell.
<'\ /_/_/_/ This terminal is now configured to access your cluster.
((____!___/) Type 'exit' (or CTRL+d) to exit.
\0\0\0\0\/ Happy hacking.
~~~~~~~~~~~
root@alex-ec-2:/home/alex# export KUBECONFIG="/var/lib/embedded-cluster/k0s/pki/admin.conf"
root@alex-ec-2:/home/alex# export PATH="$PATH:/var/lib/embedded-cluster/bin"
root@alex-ec-2:/home/alex# source <(kubectl completion bash)
root@alex-ec-2:/home/alex# source /etc/bash_completion
```

The appropriate kubeconfig is exported, and the location of useful binaries like kubectl and Replicated’s preflight and support-bundle plugins is added to PATH.

:::note
You cannot run the `shell` command on worker nodes.
:::

1. Use the available binaries as needed.

**Example**:

```bash
kubectl version
```
```
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1+k0s
```

1. Type `exit` or **Ctrl + D** to exit the shell.
:::note
If you encounter a typical workflow where your customers have to use the Embedded Cluster shell, reach out to Alex Parker at [email protected]. These workflows might be candidates for additional Admin Console functionality.
:::

**Contributor Author:** ^ just pulled this intro content out of the partial so that it doesn't have to show up in the Troubleshooting topic as well

:::note
If you encounter a typical workflow where your customers have to use the Embedded Cluster shell, reach out to Alex Parker at [email protected]. These workflows might be candidates for additional Admin Console functionality.
:::
<ShellCommand/>

## Reset a Node

@@ -268,9 +222,3 @@ When the containerd options are configured as shown above, the NVIDIA GPU Operat
:::note
If you include the NVIDIA GPU Operator as a Helm extension, remove any existing containerd services that are running on the host (such as those deployed by Docker) before attempting to install the release with Embedded Cluster. If there are any containerd services on the host, the NVIDIA GPU Operator will generate an invalid containerd config, causing the installation to fail.
:::

## Troubleshoot with Support Bundles

<SupportBundleIntro/>

<EmbeddedClusterSupportBundle/>

**Contributor Author:** ^ Removed this section from the Using Embedded Cluster topic. This info would now appear:

1 change: 1 addition & 0 deletions sidebars.js
@@ -245,6 +245,7 @@ const sidebars = {
},
'enterprise/embedded-manage-nodes',
'enterprise/updating-embedded',
'vendor/embedded-troubleshooting',
'enterprise/embedded-tls-certs',
'vendor/embedded-disaster-recovery',
],