
add pod initialization logic in diag and follow up some minor reporting changes. #4690

Merged: tejal29 merged 13 commits into GoogleContainerTools:master from pod_init_better on Aug 28, 2020

Conversation

@tejal29 (Contributor) commented Aug 18, 2020

Fixes #4668, #4670

Description: In this PR:

Functionality Changes

  1. Fail the StatusCheck phase fast once a deployment fails.
  2. Since we now fail/quit the status check phase on the first failure, I have removed the pending count from the failed log line.
     All the pending deployments are going to fail anyway, so "[x/total deployment(s) still pending]" only adds noise.

     Before:
      - test:deployment/dep8 failed. [8/10 deployment(s) still pending] Error: context deadline expired.
      - test:deployment/dep9 failed. [9/10 deployment(s) still pending] Error: context cancelled
      - test:deployment/dep10 failed. Error: context cancelled

     After:
      - test:deployment/dep8 failed. Error: context deadline expired.
  3. Finally, and most important (a minimal sketch of these two intervals follows this list):
     a. Change the default polling interval from 100 ms to 1 second.
        • This means the Skaffold status check now polls pod and deployment status every 1 second.
        Pros:
        • Fewer updates, since a pod can move from ContainerCreating through several states quickly.
        Cons:
        • If a pod fails within milliseconds, Skaffold will only report it on the CLI and via the Event API up to 1 second later.
          This is not noticeable in the Skaffold CLI UI, but users relying on the Event API to fail fast will see a lag.
          TODO: Create an issue to document StatusCheck lags/SLO on skaffold.dev.
     b. Change the default reporting interval for new changes from 500 milliseconds to 5 seconds.
        • This means the Skaffold status check reports new updates for a deployment or pod every 5 seconds (Periodically provide deployment state-change feedback #4670).
        • If a resource status is failed or succeeded, it is reported immediately.
        • All other updates may not be reported on the CLI or via the Event API for up to 5 seconds, depending on when the update happened within that 5-second report window.
        • Consumers of the Event API might see a delay. However, remember that these are intermediate, recoverable error states, e.g. "kubectl connection error" or "cluster disk pressure error".
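
To make the two intervals above concrete, here is a minimal Go sketch of a 1-second poll loop with 5-second throttled reporting. This is not the actual Skaffold code; the names (pollInterval, reportInterval, pollResource, check) are illustrative.

```go
// Package statuscheck is an illustrative sketch only.
package statuscheck

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical defaults mirroring the values described above; the real
// Skaffold constant names may differ.
const (
	pollInterval   = 1 * time.Second // how often pod/deployment status is polled
	reportInterval = 5 * time.Second // how often non-terminal updates are surfaced
)

// pollResource polls a resource every pollInterval. Terminal states
// (failed/succeeded) are reported immediately; intermediate states are
// reported at most once per reportInterval.
func pollResource(ctx context.Context, check func() (status string, done bool)) {
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()

	lastReport := time.Now()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			status, done := check()
			if done {
				fmt.Println("final status:", status) // success/failure surfaces right away
				return
			}
			if time.Since(lastReport) >= reportInterval {
				fmt.Println("status:", status) // recoverable/in-progress states are throttled
				lastReport = time.Now()
			}
		}
	}
}
```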

Implementation changes

  1. Add a type for the status checker, very similar to @dgageot's Introduce a StatusChecker type #4605. Kept the changes minimal.
  2. In the diag package, fetch events when a pod is in the "PodInitializing" phase, and fetch logs for any running init containers (see the sketch after this list).
     • A pod stays in the "PodInitializing" phase until all of its init containers complete successfully.
     • Events are fetched because the init containers themselves could be waiting for a free node; in that case, the diag package can detect resource congestion and report it.
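
For item 2, a hedged client-go sketch of what the diag package can do for a pod stuck in PodInitializing: list the pod's events (to surface scheduling or resource-congestion problems) and stream the logs of any running init container. This is not the code in this PR; the helper name podInitializingDiag is illustrative.

```go
package validator

import (
	"context"
	"io"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podInitializingDiag sketches the extra diagnosis for a pod in the
// "PodInitializing" phase: surface pod events and init container logs.
func podInitializingDiag(ctx context.Context, client kubernetes.Interface, pod *v1.Pod, out io.Writer) error {
	// Events for the pod, e.g. "FailedScheduling: insufficient cpu".
	events, err := client.CoreV1().Events(pod.Namespace).List(ctx, metav1.ListOptions{
		FieldSelector: "involvedObject.name=" + pod.Name,
	})
	if err != nil {
		return err
	}
	for _, e := range events.Items {
		io.WriteString(out, e.Reason+": "+e.Message+"\n")
	}

	// Logs of any init container that is currently running.
	for _, cs := range pod.Status.InitContainerStatuses {
		if cs.State.Running == nil {
			continue
		}
		req := client.CoreV1().Pods(pod.Namespace).GetLogs(pod.Name, &v1.PodLogOptions{Container: cs.Name})
		rc, err := req.Stream(ctx)
		if err != nil {
			return err
		}
		io.Copy(out, rc)
		rc.Close()
	}
	return nil
}
```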

New output:

Waiting for deployments to stabilize...
 - deployment/nodejs-guestbook-frontend is ready. [2/3 deployment(s) still pending]
 - deployment/nodejs-guestbook-mongodb: container mongo terminated with exit code 100
    - pod/nodejs-guestbook-mongodb-579dc8b69d-rdlpv: container mongo terminated with exit code 100
      > [nodejs-guestbook-mongodb-579dc8b69d-rdlpv mongo] {"t":{"$date":"2020-08-18T11:53:41.140+00:00"},"s":"I",  "c":"-",        "id":20520,   "ctx":"initandlisten","msg":"Stopping further Flow Control ticket acquisitions."}
      > [nodejs-guestbook-mongodb-579dc8b69d-rdlpv mongo] {"t":{"$date":"2020-08-18T11:53:41.141+00:00"},"s":"I",  "c":"CONTROL",  "id":20565,   "ctx":"initandlisten","msg":"Now exiting"}
      > [nodejs-guestbook-mongodb-579dc8b69d-rdlpv mongo] {"t":{"$date":"2020-08-18T11:53:41.141+00:00"},"s":"I",  "c":"CONTROL",  "id":23138,   "ctx":"initandlisten","msg":"Shutting down","attr":{"exitCode":100}}
      Full logs at /var/folders/gk/s778hvkj0lb0zl95ywvw0snr00cfc0/T/skaffold/statuscheck/nodejs-guestbook-mongodb-579dc8b69d-rdlpv.log
 - deployment/nodejs-guestbook-mongodb failed. Error: container mongo terminated with exit code 100.
 - deployment/nodejs-guestbook-backend: waiting for init container init-db-ready
    - pod/nodejs-guestbook-backend-f7cb86d9f-7tmcf: waiting for init container init-db-ready
      > [nodejs-guestbook-backend-f7cb86d9f-7tmcf init-db-ready] Waiting for mongodb to start...
      > [nodejs-guestbook-backend-f7cb86d9f-7tmcf init-db-ready] Waiting for connection for 2 sec.
      Full logs at /var/folders/gk/s778hvkj0lb0zl95ywvw0snr00cfc0/T/skaffold/statuscheck/nodejs-guestbook-backend-f7cb86d9f-7tmcf.log
 - deployment/nodejs-guestbook-backend failed. Error: waiting for init container init-db-ready.
exiting dev mode because first deploy failed: 2/3 deployment(s) failed

Event API
Note: the STATUSCHECK_POD_INITIALIZING event now has a non-empty message.
Note: the event's entry now carries an explanation, instead of the bare string "Resource pod/nodejs-guestbook-backend-65cdf8b7fc-cn6sp status failed with " reported in the original issue.

{"result":{"timestamp":"2020-08-18T23:52:33.458189Z","event":{"resourceStatusCheckEvent":{"resource":"pod/nodejs-guestbook-frontend-6f7d5965d6-dqzlv","status":"In Progress","message":"creating container frontend","statusCode":"STATUSCHECK_CONTAINER_CREATING","actionableErr":{"errCode":"STATUSCHECK_CONTAINER_CREATING","message":"creating container frontend"}}},"entry":"Resource pod/nodejs-guestbook-frontend-6f7d5965d6-dqzlv status updated to In Progress"}}
{"result":{"timestamp":"2020-08-18T23:52:33.480657Z","event":{"resourceStatusCheckEvent":{"resource":"pod/nodejs-guestbook-backend-65cdf8b7fc-cn6sp","status":"Failed","err":"waiting for init container init-db-ready to start","

statusCode":"STATUSCHECK_POD_INITIALIZING","actionableErr":{"errCode":"STATUSCHECK_POD_INITIALIZING",
"message":"waiting for init container init-db-ready to start"}}},

"entry":"Resource pod/nodejs-guestbook-backend-65cdf8b7fc-cn6sp status failed with waiting for init container init-db-ready to start"}}

@tejal29 force-pushed the pod_init_better branch 2 times, most recently from 0093f05 to f32f508, August 18, 2020 23:59
@tejal29 marked this pull request as ready for review August 19, 2020 00:10
@tejal29 requested a review from a team as a code owner August 19, 2020 00:10
@tejal29 requested a review from dgageot August 19, 2020 00:10
@codecov (bot) commented Aug 19, 2020

Codecov Report

Merging #4690 into master will decrease coverage by 0.07%.
The diff coverage is 45.54%.


@@            Coverage Diff             @@
##           master    #4690      +/-   ##
==========================================
- Coverage   73.45%   73.37%   -0.08%     
==========================================
  Files         343      344       +1     
  Lines       13585    13624      +39     
==========================================
+ Hits         9979     9997      +18     
- Misses       2982     3003      +21     
  Partials      624      624              
| Impacted Files | Coverage Δ |
|---|---|
| pkg/diag/validator/pod.go | 1.60% <0.00%> (-0.08%) ⬇️ |
| pkg/skaffold/runner/runner.go | 0.00% <ø> (ø) |
| pkg/skaffold/deploy/status_check.go | 52.25% <25.53%> (-6.81%) ⬇️ |
| pkg/skaffold/runner/deploy.go | 44.77% <50.00%> (ø) |
| pkg/skaffold/deploy/resource/logfile.go | 84.61% <84.61%> (ø) |
| pkg/skaffold/deploy/resource/deployment.go | 91.54% <88.00%> (+1.54%) ⬆️ |
| pkg/diag/validator/resource.go | 58.82% <0.00%> (+5.88%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tejal29 (Contributor, Author) commented Aug 19, 2020

The Codecov diff coverage is low because I added an interface and structs.
Most of the new lines have code coverage.

@nkubala (Contributor) left a comment

Cool, this is definitely an improvement on the UX. I'd still like to see some sort of visual update here (e.g. scrolling periods), but we can investigate that later.

@briandealwis (Member) left a comment

I don't much like that we have STATUSCHECK_CONTEXT_CANCELED — that's an internal detail that shouldn't get reported upward.

}(d)
}

// Retrieve pending deployments statuses
go func() {
-	printDeploymentStatus(ctx, out, deployments, deadline)
+	s.printDeploymentStatus(sCtx, out, deployments)
Review comment (Member):

Since this doesn't add to the WaitGroup, it's possible this code is called after sCtx has been cancelled.

@tejal29 (Contributor, Author) replied:

No, it should not happen. This WaitGroup is a way to wait for all the pollDeployments calls to finish.
There won't be a goroutine leak from not adding to the wait group; having printDeploymentStatus run in a child context of the parent should guarantee it will get cancelled. A minimal illustration of this pattern is below.
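
For illustration, a minimal sketch of the pattern described above (hypothetical names, not the exact Skaffold code): the printer goroutine is not tracked by the WaitGroup, but because it listens on a child context of the caller's context, cancelling the parent (or running the deferred cancel) stops it as well.

```go
package deploy

import (
	"context"
	"sync"
)

// statusCheckSketch shows why the print goroutine does not need to join the
// WaitGroup: it listens on sCtx, a child of the caller's context, so it exits
// whenever the parent is cancelled or cancel() runs.
func statusCheckSketch(parent context.Context) {
	sCtx, cancel := context.WithCancel(parent)
	defer cancel() // cancelling here also stops the printer below

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-sCtx.Done() // stand-in for polling one deployment until done
		}()
	}

	// Not added to wg: analogous to printDeploymentStatus(sCtx, out, deployments).
	go func() {
		<-sCtx.Done() // exits as soon as sCtx is cancelled
	}()

	wg.Wait() // wait only for the pollers
}
```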

@briandealwis added the kokoro:force-run (forces a kokoro re-run on a PR) label Aug 24, 2020
@kokoro-team removed the kokoro:force-run (forces a kokoro re-run on a PR) label Aug 24, 2020
@briandealwis (Member) commented:
(oops wrong PR on the kokoro:force)

@tejal29 merged commit 53dffe1 into GoogleContainerTools:master Aug 28, 2020
@tejal29 deleted the pod_init_better branch April 15, 2021 07:34

Successfully merging this pull request may close these issues.

Add final pod status to STATUSCHECK_POD_INITIALIZING output