| 1 | +--- |
| 2 | +title: Machine/Instance lifecycle |
| 3 | +authors: |
| 4 | + - @enxebre |
| 5 | +reviewers: |
| 6 | + - @derekwaynecarr |
| 7 | + - @michaelgugino |
| 8 | + - @bison |
| 9 | + - @mrunalp |
| 10 | +approvers: |
| 11 | + - @derekwaynecarr |
| 12 | + - @michaelgugino |
| 13 | + - @bison |
| 14 | + - @mrunalp |
| 15 | + |
| 16 | +creation-date: 2019-09-09 |
| 17 | +last-updated: 2019-09-09 |
| 18 | +status: implementable |
| 19 | +see-also: |
| 20 | +replaces: |
| 21 | +superseded-by: |
| 22 | +--- |
| 23 | + |
| 24 | +# Machine/Instance lifecycle |
| 25 | + |
| 26 | +## Release Signoff Checklist |
| 27 | + |
| 28 | +- [x] Enhancement is `implementable` |
| 29 | +- [ ] Design details are appropriately documented from clear requirements |
| 30 | +- [ ] Test plan is defined |
| 31 | +- [ ] Graduation criteria for dev preview, tech preview, GA |
| 32 | +- [ ] User-facing documentation is created in [openshift/docs] |
| 33 | + |
| 34 | + |
| 35 | +## Summary |
| 36 | + |
| 37 | +Enable unified semantics across providers to represent, as phases, the lifecycle of instances backed by machine resources. |
| 38 | + |
| 39 | +## Motivation |
| 40 | + |
| 41 | +Provide a user experience for the machine API that is as similar as possible across providers. |
| 42 | + |
| 43 | +### Goals |
| 44 | + |
| 45 | +- Provide semantics to convey the lifecycle of machines in a unified manner across any provider. |
| 46 | + |
| 47 | +- Prevent actuators from creating more than one instance for a machine resource during its lifetime. |
| 48 | + |
| 49 | +### Non-Goals |
| 50 | + |
| 51 | +- Introduce breaking changes in the interface for actuators. |
| 52 | + |
| 53 | +## Proposal |
| 54 | + |
| 55 | +- As a user I want to understand at a glance where a machine is in its lifecycle, regardless of the cloud. |
| 56 | + |
| 57 | +- As a platform I want to provide a user experience that is as similar as possible across providers and hide provider-specific details. |
| 58 | + |
| 59 | +- As a dev I want to enforce the machine API invariants and reduce the surface for arbitrary provider-specific decisions. |
| 60 | + |
| 61 | +This enhancement proposes using unified phases across all providers to convey the lifecycle of machines by leveraging the [existing API](https://github.com/openshift/cluster-api/blob/openshift-4.2-cluster-api-0.1.0/pkg/apis/machine/v1beta1/machine_types.go#L174). |
| 62 | + |
| 63 | +### Implementation Details |
| 64 | + |
| 65 | +#### Phases |
| 66 | + |
| 67 | +The phase of a Machine is a simple, high-level summary of where the machine is in its lifecycle, in the same vein as [pods](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase) and [cluster API Upstream](https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20190610-machine-states-preboot-bootstrapping.md). |
| 68 | + |
| 69 | +MachineSets or other higher-level controllers may use the reported phase to weigh which machines to remove when scaling down, or to ignore failed machines when satisfying the replica count. |
| 70 | + |
| 71 | +The phase will be exposed through additionalPrinterColumns so that kubectl users can understand the current state of the world for machines at a glance. |
| 72 | + |
| 73 | +The machine controller will set the phase according to the following criteria, based on its interaction with the actuator interface: |
| 74 | + |
| 75 | +##### Provisioning |
| 76 | + |
| 77 | +- Exists() is False. |
| 78 | + |
| 79 | +- Machine has **no** providerID/address. |
| 80 | + |
| 81 | +##### Provisioned |
| 82 | + |
| 83 | +- Exists() is True. |
| 84 | + |
| 85 | +- Machine has **no** status.nodeRef. |
| 86 | + |
| 87 | +##### Running |
| 88 | + |
| 89 | +- Exists() is True. |
| 90 | + |
| 91 | +- Machine has status.nodeRef. |
| 92 | + |
| 93 | +##### Deleting |
| 94 | + |
| 95 | +- Machine has a DeletionTimestamp. |
| 96 | + |
| 97 | +##### Failed |
| 98 | + |
| 99 | +- Create() returns a permanentError type, or Exists() is False and the machine already has a providerID/address. |
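The phase criteria above can be sketched as a single decision function. This is only an illustrative sketch, not the machine controller's code: the `machineView` struct, its field names, and `phaseFor` are hypothetical, and the order in which the conditions are checked (deletion first, then failure) is an assumption, since the proposal does not state a precedence.

```go
package main

// Phase names as listed in the proposal.
const (
	PhaseProvisioning = "Provisioning"
	PhaseProvisioned  = "Provisioned"
	PhaseRunning      = "Running"
	PhaseDeleting     = "Deleting"
	PhaseFailed       = "Failed"
)

// machineView is a hypothetical, flattened view of the signals the
// machine controller would look at. It is not a type from the real API.
type machineView struct {
	Exists               bool // result of the actuator's Exists() call
	HasProviderID        bool // machine has a providerID/address
	HasNodeRef           bool // machine has status.nodeRef
	HasDeletionTimestamp bool // machine has a DeletionTimestamp
	PermanentCreateError bool // Create() returned a permanentError type
}

// phaseFor maps the signals to a phase. The check order is an assumption.
func phaseFor(m machineView) string {
	switch {
	case m.HasDeletionTimestamp:
		return PhaseDeleting
	case m.PermanentCreateError:
		return PhaseFailed
	case !m.Exists && m.HasProviderID:
		// The instance was created once and has since disappeared.
		return PhaseFailed
	case !m.Exists:
		return PhaseProvisioning
	case m.HasNodeRef:
		return PhaseRunning
	default:
		// The instance exists but no node has been linked yet.
		return PhaseProvisioned
	}
}
```

Treating "no instance but a providerID is recorded" as Failed rather than Provisioning is what keeps an actuator from creating a second instance for the same machine resource.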
| 100 | + |
| 101 | +#### Actuator invariants |
| 102 | + |
| 103 | +- Once the nodeRef is set, it must never be mutated/deleted. |
| 104 | + |
| 105 | +- Once the providerID is set, it must never be mutated/deleted. |
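One way to picture these two invariants is a check that compares the previously observed values with the ones about to be written. The sketch below is hypothetical: `machineIdentity` and `checkIdentityUnchanged` are illustrative names rather than part of the machine API, and the proposal does not say where such a check would live.

```go
package main

import "fmt"

// machineIdentity is a hypothetical pair of the two fields covered by the
// invariants. An empty string means "not set yet".
type machineIdentity struct {
	ProviderID string
	NodeRef    string
}

// checkIdentityUnchanged returns an error if a field that was already set
// is changed or cleared. Setting a field for the first time is allowed.
func checkIdentityUnchanged(old, updated machineIdentity) error {
	if old.ProviderID != "" && updated.ProviderID != old.ProviderID {
		return fmt.Errorf("providerID must not change once set: %q -> %q",
			old.ProviderID, updated.ProviderID)
	}
	if old.NodeRef != "" && updated.NodeRef != old.NodeRef {
		return fmt.Errorf("nodeRef must not change once set: %q -> %q",
			old.NodeRef, updated.NodeRef)
	}
	return nil
}
```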
| 106 | + |
| 107 | +##### Create() |
| 108 | + |
| 109 | +- It must set providerID and IP addresses. |
| 110 | + |
| 111 | +- It must return a permanentError type for known unrecoverable errors, e.g. invalid cloud input or insufficient quota. |
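The proposal does not define the shape of the permanentError type. As a minimal sketch, assuming a plain Go error type that callers detect with the standard library's `errors.As`, it could look like the following. The names `permanentError`, `isPermanent`, and `createInstance`, as well as the `quotaLeft` and `apiReachable` parameters, are made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// permanentError is a hypothetical error type marking failures that
// retrying Create() cannot fix.
type permanentError struct {
	reason string
}

func (e *permanentError) Error() string {
	return "permanent error: " + e.reason
}

// isPermanent reports whether err is, or wraps, a *permanentError.
func isPermanent(err error) bool {
	var p *permanentError
	return errors.As(err, &p)
}

// createInstance stands in for an actuator's Create(). Insufficient quota
// is reported as permanent, an unreachable cloud API as transient.
func createInstance(quotaLeft int, apiReachable bool) error {
	if !apiReachable {
		return fmt.Errorf("create instance: cloud API unreachable")
	}
	if quotaLeft <= 0 {
		return fmt.Errorf("create instance: %w",
			&permanentError{reason: "insufficient quota"})
	}
	return nil
}
```

With a type like this, the controller could move the machine to the Failed phase only when `isPermanent` is true and simply requeue on any other error.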
| 112 | + |
| 113 | +##### Update() |
| 114 | + |
| 115 | +- It must not modify cloud infrastructure. |
| 116 | + |
| 117 | +- It should reconcile machine.Status with cloud values. |
| 118 | + |
| 119 | +#### Errors |
| 120 | + |
| 121 | +- A machine entering a "Failed" phase should set the machine.Status.ErrorMessage. |
| 122 | + |
| 123 | +- The machine.Status.ErrorMessage might be bubbled up to the machineSet.Status.ErrorMessage for easier visibility. |
| 124 | + |
| 125 | + |
| 126 | +#### [Implementation example](https://github.com/enxebre/cluster-api-provider-azure/commit/ef9a0dd68918eb6ca4af50765b07bcae4d309aaf) |
| 127 | + |
| 128 | +### Risks and Mitigations |
| 129 | + |
| 130 | +To avoid disrupting existing actuators, this proposal intentionally keeps the change surface small by not breaking the create()/exists()/update() actuator interface. |
| 131 | + |
| 132 | +Once this is settled, we might consider simplifying the interface. |
| 133 | + |
| 134 | +## Design Details |
| 135 | + |
| 136 | +### Test Plan |
| 137 | + |
| 138 | +Changes will be test driven via the current e2e machine API suite https://github.com/openshift/cluster-api-actuator-pkg/tree/8250b456dec7b2fb06c591738518de1265e84a2c/pkg/e2e. |
| 139 | + |
| 140 | +### Graduation Criteria |
| 141 | + |
| 142 | +### Upgrade / Downgrade Strategy |
| 143 | + |
| 144 | +Operators must revendor the latest openshift/cluster-api version. The actuator machine controller image is set in the machine-api-operator repo https://github.com/openshift/machine-api-operator/blob/474e14e4965a8c5e6788417c851ccc7fad1acb3a/install/0000_30_machine-api-operator_01_images.configmap.yaml, so upgrades will be driven by the CVO, which will fetch the right image version as usual. |
| 145 | + |
| 146 | + |
| 147 | +### Version Skew Strategy |
| 148 | + |
| 149 | +## Implementation History |
| 150 | + |
| 151 | +## Drawbacks |
| 152 | + |
| 153 | +## Alternatives |
| 154 | + |
| 155 | +## Infrastructure Needed |