
Pods stuck in Initialization #3686

Open

DanStough opened this issue Apr 16, 2021 · 1 comment
Labels
Needs info (Waiting for information) · Restic (Relates to the restic integration) · Reviewed Q2 2021

Comments

@DanStough

What steps did you take and what happened:

  • Ran velero backup create <name> -l <my-selector>, including a MySQL pod with ~200 MB of data, using restic to perform the volume snapshot (the full command sequence is sketched below).
  • The backup completes successfully.
  • Deleted the MySQL StatefulSet with kubectl.
  • Ran velero restore create --from-backup <name>.
  • The MySQL pod is stuck indefinitely on the first init container (restic-wait). The restic-wait logs show it never finds the restore ID on the filesystem.
  • Inspecting the podvolumerestore resource shows that it is InProgress and has a StartTime. The Progress field is empty.
  • When I enable debug logs for restic on this node (it's a single-node cluster), I can see that this PVR is never enqueued by the restic controller and the restic command is never executed.
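
For reference, this is roughly the sequence of commands involved (a condensed sketch; <name>, <my-selector>, and the StatefulSet/pod names are placeholders, and the volume is assumed to have been opted into restic via the backup.velero.io/backup-volumes pod annotation, since defaultVolumesToRestic is false):

# Back up everything matching the label selector, using restic for volumes:
velero backup create <name> -l <my-selector>
velero backup describe <name>            # reports phase: Completed

# Remove the workload, then restore it from the backup:
kubectl delete statefulset <mysql-statefulset>
velero restore create --from-backup <name>

# The restored pod never gets past the restic-wait init container:
kubectl get pods
kubectl logs <mysql-pod> -c restic-wait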

What did you expect to happen:
Volume restore completes successfully.

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  annotations:
    kots.io/app-id: 1rFqFP77TgpqjhLvubxmjeTBYbA
    kots.io/app-sequence: "0"
    kots.io/snapshot-requested: "2021-04-16T14:09:44Z"
    kots.io/snapshot-trigger: manual
    preserve: me
    velero.io/source-cluster-k8s-gitversion: v1.19.9
    velero.io/source-cluster-k8s-major-version: "1"
    velero.io/source-cluster-k8s-minor-version: "19"
  creationTimestamp: "2021-04-16T14:09:44Z"
  generateName: qakots-
  generation: 6
  labels:
    velero.io/storage-location: default
  managedFields:
  - apiVersion: velero.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kots.io/app-id: {}
          f:kots.io/app-sequence: {}
          f:kots.io/snapshot-requested: {}
          f:kots.io/snapshot-trigger: {}
          f:preserve: {}
        f:generateName: {}
      f:spec:
        .: {}
        f:excludedNamespaces: {}
        f:hooks: {}
        f:includeClusterResources: {}
        f:includedNamespaces: {}
        f:labelSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:app: {}
            f:kots.io/app-slug: {}
        f:storageLocation: {}
        f:ttl: {}
      f:status: {}
    manager: kotsadm
    operation: Update
    time: "2021-04-16T14:09:44Z"
  - apiVersion: velero.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:velero.io/source-cluster-k8s-gitversion: {}
          f:velero.io/source-cluster-k8s-major-version: {}
          f:velero.io/source-cluster-k8s-minor-version: {}
        f:labels:
          .: {}
          f:velero.io/storage-location: {}
      f:spec:
        f:defaultVolumesToRestic: {}
        f:hooks:
          f:resources: {}
      f:status:
        f:completionTimestamp: {}
        f:expiration: {}
        f:formatVersion: {}
        f:phase: {}
        f:progress:
          .: {}
          f:itemsBackedUp: {}
          f:totalItems: {}
        f:startTimestamp: {}
        f:version: {}
    manager: velero-server
    operation: Update
    time: "2021-04-16T14:09:58Z"
  name: qakots-kn98v
  namespace: velero
  resourceVersion: "4445"
  selfLink: /apis/velero.io/v1/namespaces/velero/backups/qakots-kn98v
  uid: 93df323f-b053-469a-a55f-e5911c1e2c88
spec:
  defaultVolumesToRestic: false
  excludedNamespaces:
  - exclude-me
  hooks:
    resources:
    - includedResources:
      - pods
      labelSelector:
        matchLabels:
          app: example
          component: nginx
      name: test-hook
      post:
      - exec:
          command:
          - /bin/uname
          - -a
          container: nginx
          onError: Fail
          timeout: 30s
      pre:
      - exec:
          command:
          - /bin/uname
          - -a
          container: nginx
          onError: Fail
          timeout: 30s
  includeClusterResources: true
  includedNamespaces:
  - qakots-additional-namespace
  - default
  - qakots-additional-namespace
  labelSelector:
    matchLabels:
      app: example
      kots.io/app-slug: qakots
  storageLocation: default
  ttl: 720h0m0s
status:
  completionTimestamp: "2021-04-16T14:09:58Z"
  expiration: "2021-05-16T14:09:44Z"
  formatVersion: 1.1.0
  phase: Completed
  progress:
    itemsBackedUp: 18
    totalItems: 18
  startTimestamp: "2021-04-16T14:09:44Z"
  version: 1
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  creationTimestamp: "2021-04-16T14:11:40Z"
  generation: 2
  managedFields:
  - apiVersion: velero.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:backupName: {}
        f:hooks: {}
        f:includeClusterResources: {}
        f:restorePVs: {}
      f:status: {}
    manager: kotsadm
    operation: Update
    time: "2021-04-16T14:11:40Z"
  - apiVersion: velero.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:excludedResources: {}
      f:status:
        f:phase: {}
        f:startTimestamp: {}
    manager: velero-server
    operation: Update
    time: "2021-04-16T14:11:41Z"
  name: qakots-kn98v
  namespace: velero
  resourceVersion: "4895"
  selfLink: /apis/velero.io/v1/namespaces/velero/restores/qakots-kn98v
  uid: d8134b22-2e74-4e2e-a1dc-8b5d0403b73f
spec:
  backupName: qakots-kn98v
  excludedResources:
  - nodes
  - events
  - events.events.k8s.io
  - backups.velero.io
  - restores.velero.io
  - resticrepositories.velero.io
  hooks: {}
  includeClusterResources: true
  restorePVs: true
status:
  phase: InProgress
  startTimestamp: "2021-04-16T14:11:41Z"
  • velero restore logs <restorename>
    Logs for restore "qakots-kn98v" are not available until it's finished processing. Please wait until the restore has a phase of Completed or Failed and try again.
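
Since the restore never leaves InProgress, the CLI refuses to show its logs. As a workaround, the phases of the individual PodVolumeRestores can be watched directly (the column paths follow the status fields visible in the describe output further below):

kubectl get podvolumerestores -n velero \
  -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,STARTED:.status.startTimestamp'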

Anything else you would like to add:

  • I’ve been looking through the source, and I’m not sure how the PVR gets into InProgress without the restic command ever running; I don’t see it executed for this volume restore.
  • All PVs are bound; the reclaim policy is Delete.
  • kubectl describe podvolumerestore -n velero qakots-kn98v-wjl56
Name:         qakots-kn98v-wjl56
Namespace:    velero
Labels:       velero.io/pod-uid=7ef583c0-426d-4085-9aac-78819ecd067d
              velero.io/restore-name=qakots-kn98v
              velero.io/restore-uid=d8134b22-2e74-4e2e-a1dc-8b5d0403b73f
Annotations:  <none>
API Version:  velero.io/v1
Kind:         PodVolumeRestore
Metadata:
  Creation Timestamp:  2021-04-16T14:11:41Z
  Generate Name:       qakots-kn98v-
  Generation:          3
  Managed Fields:
    API Version:  velero.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName:
        f:labels:
          .:
          f:velero.io/pod-uid:
          f:velero.io/restore-name:
          f:velero.io/restore-uid:
        f:ownerReferences:
          .:
          k:{"uid":"d8134b22-2e74-4e2e-a1dc-8b5d0403b73f"}:
            .:
            f:apiVersion:
            f:controller:
            f:kind:
            f:name:
            f:uid:
      f:spec:
        .:
        f:backupStorageLocation:
        f:pod:
          .:
          f:kind:
          f:name:
          f:namespace:
          f:uid:
        f:repoIdentifier:
        f:snapshotID:
        f:volume:
      f:status:
        .:
        f:progress:
    Manager:      velero-server
    Operation:    Update
    Time:         2021-04-16T14:11:41Z
    API Version:  velero.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:phase:
        f:progress:
          f:totalBytes:
        f:startTimestamp:
    Manager:    restic-server
    Operation:  Update
    Time:       2021-04-16T14:11:46Z
  Owner References:
    API Version:     velero.io/v1
    Controller:      true
    Kind:            Restore
    Name:            qakots-kn98v
    UID:             d8134b22-2e74-4e2e-a1dc-8b5d0403b73f
  Resource Version:  4991
  Self Link:         /apis/velero.io/v1/namespaces/velero/podvolumerestores/qakots-kn98v-wjl56
  UID:               3f6a2a83-6510-4f51-b01f-7107163cdfe4
Spec:
  Backup Storage Location:  default
  Pod:
    Kind:           Pod
    Name:           example-mysql-0
    Namespace:      default
    UID:            7ef583c0-426d-4085-9aac-78819ecd067d
  Repo Identifier:  s3:http://rook-ceph-rgw-rook-ceph-store.rook-ceph/velero/restic/default
  Snapshot ID:      3c5577c3
  Volume:           datadir
Status:
  Phase:  InProgress
  Progress:
    Total Bytes:    120755174
  Start Timestamp:  2021-04-16T14:11:45Z
Events:             <none>
  • Restic Logs:
time="2021-04-16T14:12:02Z" level=info msg="Setting log-level to INFO"
time="2021-04-16T14:12:02Z" level=info msg="Starting Velero restic server v1.5.4 (525705bceb8895b9da2cf2a1d1a79b99d74723cb)" logSource="pkg/cmd/cli/restic/server.go:83"
2021-04-16T14:12:03.022Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": ":8080"}
time="2021-04-16T14:12:03Z" level=info msg="Starting controllers" logSource="pkg/cmd/cli/restic/server.go:208"
time="2021-04-16T14:12:03Z" level=info msg="Controllers starting..." logSource="pkg/cmd/cli/restic/server.go:250"
time="2021-04-16T14:12:03Z" level=info msg="Starting metric server for restic at address [:8085]" logSource="pkg/cmd/cli/restic/server.go:199"
time="2021-04-16T14:12:03Z" level=info msg="Starting controller" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:76"
time="2021-04-16T14:12:03Z" level=info msg="Waiting for caches to sync" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:81"
time="2021-04-16T14:12:03Z" level=info msg="Starting controller" controller=pod-volume-restore logSource="pkg/controller/generic_controller.go:76"
time="2021-04-16T14:12:03Z" level=info msg="Waiting for caches to sync" controller=pod-volume-restore logSource="pkg/controller/generic_controller.go:81"
2021-04-16T14:12:03.076Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
time="2021-04-16T14:12:03Z" level=info msg="Caches are synced" controller=pod-volume-restore logSource="pkg/controller/generic_controller.go:85"
time="2021-04-16T14:12:03Z" level=info msg="Restore starting" controller=pod-volume-restore logSource="pkg/controller/pod_volume_restore_controller.go:280" name=qakots-kn98v-lq7w4 namespace=velero restore=velero/qakots-kn98v
time="2021-04-16T14:12:03Z" level=info msg="Caches are synced" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:85"
time="2021-04-16T14:12:05Z" level=info msg="Restore completed" controller=pod-volume-restore logSource="pkg/controller/pod_volume_restore_controller.go:345" name=qakots-kn98v-lq7w4 namespace=velero restore=velero/qakots-kn98v

Environment:

  • Velero version (use velero version):
Client:
	Version: v1.5.4
	Git commit: 525705bceb8895b9da2cf2a1d1a79b99d74723cb
Server:
	Version: v1.5.4
  • Velero features (use velero client config get features):
    features: <NOT SET>
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.9", GitCommit:"9dd794e454ac32d97cde41ae10be801ae98f75df", GitTreeState:"clean", BuildDate:"2021-03-18T01:09:28Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.9", GitCommit:"9dd794e454ac32d97cde41ae10be801ae98f75df", GitTreeState:"clean", BuildDate:"2021-03-18T01:00:06Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version: kubeadm - v1.19.7
  • Cloud provider or hardware configuration: GCP n1-standard-4 (4cpu and 15GB ram)
  • OS (e.g. from /etc/os-release): Ubuntu-18 LTS

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@zubron
Contributor

zubron commented Apr 19, 2021

Hi @DanStough - Sorry you're having issues with this. Out of curiosity, did the restic pod restart at all? I ask because the start time of the PodVolumeRestore (PVR) seems to precede the timestamp of the first line in the restic pod logs. I don't think the current Velero and Velero restic servers handle items that are in progress after a restart. It's possible that the restic pod started to process the PVR but was then restarted, and now the PVR is "in progress" but is not actually being processed by the new pod.
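
One quick way to check is the RESTARTS column for the restic pods (assuming the daemonset pods carry the default name=restic label from the Velero install):

kubectl get pods -n velero -l name=restic

# Or pull the restart counts explicitly:
kubectl get pods -n velero \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'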

@zubron added the Needs info (Waiting for information) and Restic (Relates to the restic integration) labels on Apr 19, 2021