
add pod initialization logic in diag and follow up some minor reporting changes. #4690

Merged: tejal29 merged 13 commits into GoogleContainerTools:master from pod_init_better on Aug 28, 2020

Conversation

@tejal29 (Contributor) commented Aug 18, 2020

Fixes #4668, #4670

Description: In this PR:

Functionality Changes

  1. Fail the StatusCheck phase fast once a deployment fails.
  2. Since we now fail/quit the status check phase on the first failure, I have removed the pending count from the failed log line.
     All the pending deployments are going to fail anyway, so "[x/total deployment(s) still pending]" only adds noise.

     Before:
      - test:deployment/dep8 failed. [8/10 deployment(s) still pending] Error: context deadline expired.
      - test:deployment/dep9 failed. [9/10 deployment(s) still pending] Error: context cancelled
      - test:deployment/dep10 failed. Error: context cancelled

     After:
      - test:deployment/dep8 failed. Error: context deadline expired.
  3. Finally, and most important (a minimal sketch of these two intervals follows this list):
     a. Change the default polling interval from 100 ms to 1 second.
        • This means the Skaffold status check now polls pod and deployment status every 1 second.
        Pros:
        • Fewer updates, since a pod can move from ContainerCreating through several states quickly.
        Cons:
        • If a pod fails within milliseconds, Skaffold will only report it on the CLI and via the Event API up to 1 second later.
          This is not noticeable in the Skaffold CLI UI, but users relying on the Event API to fail fast will see a lag.
          TODO: Create an issue to document StatusCheck lags/SLO on skaffold.dev.
     b. Change the default reporting interval for new changes from 500 milliseconds to 5 seconds.
        • This means the Skaffold status check reports new updates for a deployment or pod every 5 seconds (Periodically provide deployment state-change feedback #4670).
        • If a resource status is failed or succeeded, it is reported immediately.
        • All other updates may not be reported on the CLI or via the Event API for up to 5 seconds, depending on when the update happened within that 5-second report window.
        • Consumers of the Event API might see a delay. However, remember that these are intermediate, recoverable error states, e.g. "kubectl connection error" or "cluster disk pressure error".
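
To make the two intervals above concrete, here is a minimal Go sketch of a 1-second poll loop with 5-second throttled reporting. This is not the actual Skaffold code; the names (pollInterval, reportInterval, pollResource, check) are illustrative.

```go
// Package statuscheck is an illustrative sketch only.
package statuscheck

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical defaults mirroring the values described above; the real
// Skaffold constant names may differ.
const (
	pollInterval   = 1 * time.Second // how often pod/deployment status is polled
	reportInterval = 5 * time.Second // how often non-terminal updates are surfaced
)

// pollResource polls a resource every pollInterval. Terminal states
// (failed/succeeded) are reported immediately; intermediate states are
// reported at most once per reportInterval.
func pollResource(ctx context.Context, check func() (status string, done bool)) {
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()

	lastReport := time.Now()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			status, done := check()
			if done {
				fmt.Println("final status:", status) // success/failure surfaces right away
				return
			}
			if time.Since(lastReport) >= reportInterval {
				fmt.Println("status:", status) // recoverable/in-progress states are throttled
				lastReport = time.Now()
			}
		}
	}
}
```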

Implementation changes

  1. Add a type for the status checker, very similar to @dgageot's Introduce a StatusChecker type #4605. Kept the changes minimal.
  2. In the diag package, fetch events when a pod is in the "PodInitializing" phase, and fetch logs for any running init containers (see the sketch after this list).
     • A pod stays in the "PodInitializing" phase until all of its init containers complete successfully.
     • Events are fetched because the init containers themselves could be waiting for a free node; in that case, the diag package can detect resource congestion and report it.
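
For item 2, a hedged client-go sketch of what the diag package can do for a pod stuck in PodInitializing: list the pod's events (to surface scheduling or resource-congestion problems) and stream the logs of any running init container. This is not the code in this PR; the helper name podInitializingDiag is illustrative.

```go
package validator

import (
	"context"
	"io"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podInitializingDiag sketches the extra diagnosis for a pod in the
// "PodInitializing" phase: surface pod events and init container logs.
func podInitializingDiag(ctx context.Context, client kubernetes.Interface, pod *v1.Pod, out io.Writer) error {
	// Events for the pod, e.g. "FailedScheduling: insufficient cpu".
	events, err := client.CoreV1().Events(pod.Namespace).List(ctx, metav1.ListOptions{
		FieldSelector: "involvedObject.name=" + pod.Name,
	})
	if err != nil {
		return err
	}
	for _, e := range events.Items {
		io.WriteString(out, e.Reason+": "+e.Message+"\n")
	}

	// Logs of any init container that is currently running.
	for _, cs := range pod.Status.InitContainerStatuses {
		if cs.State.Running == nil {
			continue
		}
		req := client.CoreV1().Pods(pod.Namespace).GetLogs(pod.Name, &v1.PodLogOptions{Container: cs.Name})
		rc, err := req.Stream(ctx)
		if err != nil {
			return err
		}
		io.Copy(out, rc)
		rc.Close()
	}
	return nil
}
```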

New output:

Waiting for deployments to stabilize...
 - deployment/nodejs-guestbook-frontend is ready. [2/3 deployment(s) still pending]
 - deployment/nodejs-guestbook-mongodb: container mongo terminated with exit code 100
    - pod/nodejs-guestbook-mongodb-579dc8b69d-rdlpv: container mongo terminated with exit code 100
      > [nodejs-guestbook-mongodb-579dc8b69d-rdlpv mongo] {"t":{"$date":"2020-08-18T11:53:41.140+00:00"},"s":"I",  "c":"-",        "id":20520,   "ctx":"initandlisten","msg":"Stopping further Flow Control ticket acquisitions."}
      > [nodejs-guestbook-mongodb-579dc8b69d-rdlpv mongo] {"t":{"$date":"2020-08-18T11:53:41.141+00:00"},"s":"I",  "c":"CONTROL",  "id":20565,   "ctx":"initandlisten","msg":"Now exiting"}
      > [nodejs-guestbook-mongodb-579dc8b69d-rdlpv mongo] {"t":{"$date":"2020-08-18T11:53:41.141+00:00"},"s":"I",  "c":"CONTROL",  "id":23138,   "ctx":"initandlisten","msg":"Shutting down","attr":{"exitCode":100}}
      Full logs at /var/folders/gk/s778hvkj0lb0zl95ywvw0snr00cfc0/T/skaffold/statuscheck/nodejs-guestbook-mongodb-579dc8b69d-rdlpv.log
 - deployment/nodejs-guestbook-mongodb failed. Error: container mongo terminated with exit code 100.
 - deployment/nodejs-guestbook-backend: waiting for init container init-db-ready
    - pod/nodejs-guestbook-backend-f7cb86d9f-7tmcf: waiting for init container init-db-ready
      > [nodejs-guestbook-backend-f7cb86d9f-7tmcf init-db-ready] Waiting for mongodb to start...
      > [nodejs-guestbook-backend-f7cb86d9f-7tmcf init-db-ready] Waiting for connection for 2 sec.
      Full logs at /var/folders/gk/s778hvkj0lb0zl95ywvw0snr00cfc0/T/skaffold/statuscheck/nodejs-guestbook-backend-f7cb86d9f-7tmcf.log
 - deployment/nodejs-guestbook-backend failed. Error: waiting for init container init-db-ready.
exiting dev mode because first deploy failed: 2/3 deployment(s) failed

Event API
Note: the STATUSCHECK_POD_INITIALIZING event now has a non-empty message.
Note: the event's entry now carries an explanation, instead of the bare string "Resource pod/nodejs-guestbook-backend-65cdf8b7fc-cn6sp status failed with " reported in the original issue.

{"result":{"timestamp":"2020-08-18T23:52:33.458189Z","event":{"resourceStatusCheckEvent":{"resource":"pod/nodejs-guestbook-frontend-6f7d5965d6-dqzlv","status":"In Progress","message":"creating container frontend","statusCode":"STATUSCHECK_CONTAINER_CREATING","actionableErr":{"errCode":"STATUSCHECK_CONTAINER_CREATING","message":"creating container frontend"}}},"entry":"Resource pod/nodejs-guestbook-frontend-6f7d5965d6-dqzlv status updated to In Progress"}}
{"result":{"timestamp":"2020-08-18T23:52:33.480657Z","event":{"resourceStatusCheckEvent":{"resource":"pod/nodejs-guestbook-backend-65cdf8b7fc-cn6sp","status":"Failed","err":"waiting for init container init-db-ready to start","

statusCode":"STATUSCHECK_POD_INITIALIZING","actionableErr":{"errCode":"STATUSCHECK_POD_INITIALIZING",
"message":"waiting for init container init-db-ready to start"}}},

"entry":"Resource pod/nodejs-guestbook-backend-65cdf8b7fc-cn6sp status failed with waiting for init container init-db-ready to start"}}

@tejal29 force-pushed the pod_init_better branch 2 times, most recently from 0093f05 to f32f508, August 18, 2020 23:59
@tejal29 marked this pull request as ready for review August 19, 2020 00:10
@tejal29 requested a review from a team as a code owner August 19, 2020 00:10
@tejal29 requested a review from dgageot August 19, 2020 00:10
@codecov (bot) commented Aug 19, 2020

Codecov Report

Merging #4690 into master will decrease coverage by 0.07%.
The diff coverage is 45.54%.


@@            Coverage Diff             @@
##           master    #4690      +/-   ##
==========================================
- Coverage   73.45%   73.37%   -0.08%     
==========================================
  Files         343      344       +1     
  Lines       13585    13624      +39     
==========================================
+ Hits         9979     9997      +18     
- Misses       2982     3003      +21     
  Partials      624      624              
| Impacted Files | Coverage Δ |
|---|---|
| pkg/diag/validator/pod.go | 1.60% <0.00%> (-0.08%) ⬇️ |
| pkg/skaffold/runner/runner.go | 0.00% <ø> (ø) |
| pkg/skaffold/deploy/status_check.go | 52.25% <25.53%> (-6.81%) ⬇️ |
| pkg/skaffold/runner/deploy.go | 44.77% <50.00%> (ø) |
| pkg/skaffold/deploy/resource/logfile.go | 84.61% <84.61%> (ø) |
| pkg/skaffold/deploy/resource/deployment.go | 91.54% <88.00%> (+1.54%) ⬆️ |
| pkg/diag/validator/resource.go | 58.82% <0.00%> (+5.88%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tejal29 (Contributor, Author) commented Aug 19, 2020

The Codecov diff coverage is low because I added an interface and structs.
Most of the new lines have code coverage.

@nkubala (Contributor) left a comment

Cool, this is definitely an improvement on the UX. I'd still like to see some sort of visual update here (e.g. scrolling periods), but we can investigate that later.

@briandealwis (Member) left a comment

I don't much like that we have STATUSCHECK_CONTEXT_CANCELED — that's an internal detail that shouldn't get reported upward.

}(d)
}

// Retrieve pending deployments statuses
go func() {
-	printDeploymentStatus(ctx, out, deployments, deadline)
+	s.printDeploymentStatus(sCtx, out, deployments)
Review comment (Member):

Since this doesn't add to the WaitGroup, it's possible this code is called after sCtx has been cancelled.

@tejal29 (Contributor, Author) replied:

No, it should not happen. This WaitGroup is a way to wait for all the pollDeployments calls to finish.
There won't be a goroutine leak from not adding to the wait group; having printDeploymentStatus run in a child context of the parent should guarantee it will get cancelled. A minimal illustration of this pattern is below.
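
For illustration, a minimal sketch of the pattern described above (hypothetical names, not the exact Skaffold code): the printer goroutine is not tracked by the WaitGroup, but because it listens on a child context of the caller's context, cancelling the parent (or running the deferred cancel) stops it as well.

```go
package deploy

import (
	"context"
	"sync"
)

// statusCheckSketch shows why the print goroutine does not need to join the
// WaitGroup: it listens on sCtx, a child of the caller's context, so it exits
// whenever the parent is cancelled or cancel() runs.
func statusCheckSketch(parent context.Context) {
	sCtx, cancel := context.WithCancel(parent)
	defer cancel() // cancelling here also stops the printer below

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-sCtx.Done() // stand-in for polling one deployment until done
		}()
	}

	// Not added to wg: analogous to printDeploymentStatus(sCtx, out, deployments).
	go func() {
		<-sCtx.Done() // exits as soon as sCtx is cancelled
	}()

	wg.Wait() // wait only for the pollers
}
```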

@briandealwis added the kokoro:force-run (forces a kokoro re-run on a PR) label Aug 24, 2020
@kokoro-team removed the kokoro:force-run (forces a kokoro re-run on a PR) label Aug 24, 2020
@briandealwis (Member) commented:
(oops wrong PR on the kokoro:force)

@tejal29 merged commit 53dffe1 into GoogleContainerTools:master Aug 28, 2020
@tejal29 deleted the pod_init_better branch April 15, 2021 07:34

Successfully merging this pull request may close these issues.

Add final pod status to STATUSCHECK_POD_INITIALIZING output