import NodeAgentMemLimit from "../partials/snapshots/_node-agent-mem-limit.mdx"
When a snapshot fails, a support bundle is collected and stored automatically. Because the bundle is a point-in-time collection of all logs and system state at the time of the failed snapshot, it is a good place to start when reviewing the logs.
If Velero is crashing and not starting, some common causes are:
You see the following error message from Velero when trying to configure a snapshot.
time="2020-04-10T14:22:24Z" level=info msg="Checking existence of namespace" logSource="pkg/cmd/server/server.go:337" namespace=velero
time="2020-04-10T14:22:24Z" level=info msg="Namespace exists" logSource="pkg/cmd/server/server.go:343" namespace=velero
time="2020-04-10T14:22:27Z" level=info msg="Checking existence of Velero custom resource definitions" logSource="pkg/cmd/server/server.go:372"
time="2020-04-10T14:22:31Z" level=info msg="All Velero custom resource definitions exist" logSource="pkg/cmd/server/server.go:406"
time="2020-04-10T14:22:31Z" level=info msg="Checking that all backup storage locations are valid" logSource="pkg/cmd/server/server.go:413"
An error occurred: some backup storage locations are invalid: backup store for location "default" is invalid: rpc error: code = Unknown desc = NoSuchBucket: The specified bucket does not exist
status code: 404, request id: BEFAE2B9B05A2DCF, host id: YdlejsorQrn667ziO6Xr6gzwKJJ3jpZzZBMwwMIMpWj18Phfii6Za+dQ4AgfzRcxavQXYcgxRJI=
If the cloud access credentials are invalid or do not have access to the location in the configuration, Velero will crash loop. The Velero logs are included in the support bundle, and the error message looks like the example above.
Replicated recommends that you validate the access key ID and secret access key, or the service account JSON.
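For example, the following is a minimal sketch of how you might verify the credentials that Velero is using and confirm access to the bucket. It assumes an S3-compatible destination, the default `cloud-credentials` secret in the `velero` namespace, and that the AWS CLI is available; adjust the secret name, bucket name, and tooling for your storage provider.

```bash
# Print the credentials file that Velero mounts from its secret
kubectl -n velero get secret cloud-credentials -o jsonpath='{.data.cloud}' | base64 -d

# Review the configured backup storage location, including the bucket name
kubectl -n velero get backupstoragelocation default -o yaml

# Test that the credentials can list the bucket (placeholder bucket name)
aws s3 ls s3://YOUR_VELERO_BUCKET/
```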
You see the following error message when Velero is starting:
time="2020-04-10T14:12:42Z" level=info msg="Checking existence of namespace" logSource="pkg/cmd/server/server.go:337" namespace=velero
time="2020-04-10T14:12:42Z" level=info msg="Namespace exists" logSource="pkg/cmd/server/server.go:343" namespace=velero
time="2020-04-10T14:12:44Z" level=info msg="Checking existence of Velero custom resource definitions" logSource="pkg/cmd/server/server.go:372"
time="2020-04-10T14:12:44Z" level=info msg="All Velero custom resource definitions exist" logSource="pkg/cmd/server/server.go:406"
time="2020-04-10T14:12:44Z" level=info msg="Checking that all backup storage locations are valid" logSource="pkg/cmd/server/server.go:413"
An error occurred: some backup storage locations are invalid: backup store for location "default" is invalid: Backup store contains invalid top-level directories: [other-directory]
This error occurs when Velero attempts to start while it is configured to use a reconfigured or re-used bucket. The bucket that Velero uses cannot contain any other data; if it does, Velero crashes.
Configure Velero to use a bucket that does not contain other data.
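As a sketch, assuming an S3 destination and the AWS CLI, you can list the top-level contents of the bucket to confirm that it contains only directories created by Velero (for example, `backups/`):

```bash
# List the top-level contents of the bucket (placeholder bucket name);
# any unrelated top-level directories trigger the "invalid top-level directories" error
aws s3 ls s3://YOUR_VELERO_BUCKET/
```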
If the node-agent Pod is crashing and not starting, some common causes are:
You see the following error in the node-agent logs.
time="2023-11-16T21:29:44Z" level=info msg="Starting metric server for node agent at address []" logSource="pkg/cmd/cli/nodeagent/server.go:229"
time="2023-11-16T21:29:44Z" level=fatal msg="Failed to start metric server for node agent at []: listen tcp :80: bind: permission denied" logSource="pkg/cmd/cli/nodeagent/server.go:236"
This is a result of a known issue in Velero 1.12.0 and 1.12.1 where the port is not set correctly when starting the metrics server. This causes the metrics server to fail to start with a permission denied
error in environments that do not run MinIO and have Host Path, NFS, or internal storage destinations configured. When the metrics server fails to start, the node-agent Pod crashes. For more information about this issue, see the GitHub issue details.
Replicated recommends that you either upgrade to Velero 1.12.2 or later, or downgrade to a version earlier than 1.12.0.
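To check which Velero version is deployed before deciding whether to upgrade or downgrade, you can inspect the image tag and review the node-agent logs. This assumes the default installation in the `velero` namespace:

```bash
# Print the Velero image tag, which indicates the installed version
kubectl -n velero get deployment velero \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Review recent node-agent logs for the fatal metrics server error
kubectl -n velero logs daemonset/node-agent --tail=50
```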
You see a backup error that includes a timeout message when attempting to create a snapshot. For example:
Error backing up item
timed out after 12h0m0s
This error message appears when the node-agent (restic) Pod operation timeout limit is reached. In Velero v1.4.2 and later, the default timeout is 240 minutes.
Restic is an open-source backup tool. Velero integrates with Restic to provide a solution for backing up and restoring Kubernetes volumes. For more information about the Velero Restic integration, see File System Backup in the Velero documentation.
Use the kubectl Kubernetes command-line tool to patch the Velero deployment to increase the timeout:
Velero version 1.10 and later:
kubectl patch deployment velero -n velero --type json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--fs-backup-timeout=TIMEOUT_LIMIT"}]'
Velero versions earlier than 1.10:
kubectl patch deployment velero -n velero --type json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--restic-timeout=TIMEOUT_LIMIT"}]'
Replace `TIMEOUT_LIMIT` with a length of time for the node-agent (restic) Pod operation timeout in hours, minutes, and seconds. Use the format `0h0m0s`. For example, `48h30m0s`.
:::note
The timeout value reverts back to the default value if you rerun the `velero install` command.
:::
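For example, the following sketch sets the timeout to 48 hours and 30 minutes on Velero 1.10 or later, and then confirms that the argument was added to the Velero container:

```bash
# Set the file system backup timeout to 48h30m0s (Velero 1.10 and later)
kubectl patch deployment velero -n velero --type json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--fs-backup-timeout=48h30m0s"}]'

# Verify that the new argument is present
kubectl -n velero get deployment velero -o jsonpath='{.spec.template.spec.containers[0].args}'
```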
The node-agent (restic) Pod is killed by the Linux kernel Out Of Memory (OOM) killer, or snapshots fail with errors similar to:
pod volume backup failed: ... signal: killed
Velero sets default limits for the velero Pod and the node-agent (restic) Pod during installation. There is a known issue with Restic that causes high memory usage, which can result in failures during snapshot creation when the Pod reaches the memory limit.
For more information, see the Restic backup — OOM-killed on raspberry pi after backing up another computer to same repo issue in the restic GitHub repository.
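To avoid hitting the memory limit, increase the default memory limit for the node-agent (restic) Pod:

<NodeAgentMemLimit/>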
You see the following error in Velero logs:
Error backing up item...Warning: at least one source file could not be read
This error occurs when files change between Restic's initial scan of the volume and the backup of that volume to the Restic store.
To resolve this issue, do one of the following:
- Use hooks to export data to an EmptyDir volume and include that in the backup instead of the primary PVC volume. See Configuring Backup and Restore Hooks for Snapshots.
- Freeze the file system to ensure all pending disk I/O operations have completed prior to taking a snapshot, as shown in the sketch below. For more information, see Hook Example with fsfreeze in the Velero documentation.
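For the second option, the following is a minimal sketch that uses Velero backup hook annotations to run fsfreeze around the backup. The namespace, Pod name, container name, and mount path are placeholders; adjust them for your application and confirm that fsfreeze is available in the target container.

```bash
# Freeze the volume's file system before the backup and unfreeze it afterward
# (NAMESPACE, POD_NAME, CONTAINER_NAME, and /var/lib/data are placeholders)
kubectl -n NAMESPACE annotate pod POD_NAME \
  pre.hook.backup.velero.io/container=CONTAINER_NAME \
  pre.hook.backup.velero.io/command='["/sbin/fsfreeze", "--freeze", "/var/lib/data"]' \
  post.hook.backup.velero.io/container=CONTAINER_NAME \
  post.hook.backup.velero.io/command='["/sbin/fsfreeze", "--unfreeze", "/var/lib/data"]'
```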
In the Replicated KOTS Admin Console, you see an Application failed to restore error message that indicates the port number for a static NodePort is already in use. For example:
There is a known issue in Kubernetes versions earlier than version 1.19 where using a static NodePort for services can collide in multi-primary high availability setups when recreating the services. For more information about this known issue, see kubernetes/kubernetes#85894.
This issue is fixed in Kubernetes version 1.19. To resolve this issue, upgrade to Kubernetes version 1.19 or later.
For more information about the fix, see kubernetes/kubernetes#89937.
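You can confirm the cluster version before and after the upgrade:

```bash
# Print the client and server (cluster) Kubernetes versions
kubectl version
```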
In the Admin Console, you see at least one volume restore progress bar frozen at 0%. Example Admin Console display:
You can confirm this is the same issue by running `kubectl get pods -n <application namespace>`, and you should see at least one Pod stuck in initialization:
NAME READY STATUS RESTARTS AGE
example-mysql-0 0/1 Init:0/2 0 4m15s #<- the offending pod
example-nginx-77b878b4f-zwv2h 3/3 Running 0 4m15s
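To see why the Pod is stuck, inspect its init containers and events. In this situation, the Velero restore helper init container (typically named restic-wait or restore-wait, depending on the Velero version) waits indefinitely. The namespace and Pod name below are placeholders:

```bash
# Show the stuck Pod's init containers and recent events
kubectl -n APP_NAMESPACE describe pod example-mysql-0

# List the names of the Pod's init containers
kubectl -n APP_NAMESPACE get pod example-mysql-0 \
  -o jsonpath='{.spec.initContainers[*].name}'
```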
Replicated has seen this issue with Velero version 1.5.4 and opened vmware-tanzu/velero#3686 with the project to investigate the root cause. However, Replicated has not experienced this issue with Velero 1.6.0 or later.
Upgrade Velero to 1.9.0. You can upgrade using Replicated kURL. Or, to follow the Velero upgrade instructions, see Upgrading to Velero 1.9 in the Velero documentation.
In the Admin Console, when the partial snapshot restore completes, you see warnings indicating that Endpoint resources were not restored:
The resource restore priority was changed in Velero 1.10.3 and 1.11.0, which leads to this warning when restoring Endpoint resources. For more information about this issue, see the issue details in GitHub.
These warnings do not necessarily mean that the restore itself failed. The Endpoints resources likely exist because Kubernetes creates them when the related Service resources are restored. However, to prevent these warnings, use Velero version 1.11.1 or later.