Skip to content

Fix flaky of test_cancel_launch_and_exec_async #5456

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

zpoint
Copy link
Collaborator

@zpoint zpoint commented Apr 30, 2025

Resolve #5436

The issue only occurs in the UP state. If we call sky down then, the job won't show as cancelled, and the grep "cancelled" command will fail.

When it's in INIT state, the test succeeds. This failure is more likely to happen in kubernetes tests because the kubernetes cluster provisions too quickly to reach the UP state.

I'm not sure if fixing the test case is the right approach, or if we should modify the sky job system to also cancel jobs for clusters in the UP state when sky down called? @aylei

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Relevant individual tests: /smoke-test --kubernetes -k test_cancel_launch_and_exec_async (CI) or pytest tests/test_smoke.py::test_name (local)

@zpoint
Copy link
Collaborator Author

zpoint commented Apr 30, 2025

/smoke-test --kubernetes -k test_cancel_launch_and_exec_async

@zpoint
Copy link
Collaborator Author

zpoint commented Apr 30, 2025

/smoke-test --kubernetes -k test_cancel_launch_and_exec_async

@zpoint zpoint requested review from aylei and cg505 April 30, 2025 15:13
@zpoint zpoint changed the title fix flaky of test_cancel_launch_and_exec_async Fix flaky of test_cancel_launch_and_exec_async Apr 30, 2025
@cg505
Copy link
Collaborator

cg505 commented Apr 30, 2025

I'm not sure if fixing the test case is the right approach, or if we should modify the sky job system to also cancel jobs for clusters in the UP state when sky down called?

@zpoint It's possible that this was the intended behavior, but I don't know. Can we confirm this behavior in the past (e.g. in 0.8)? Just want to double check that this isn't actually a regression.

Copy link
Collaborator

@cg505 cg505 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise, code changes look good

@zpoint
Copy link
Collaborator Author

zpoint commented May 1, 2025

This test was added by @aylei in #4867 after the client-server architecture change. Confirmed with aylei that INIT state jobs being canceled is the intended behavior.

Since he's OOO this week, we should be good to go with this PR if its a blocker.

@cg505

@zpoint
Copy link
Collaborator Author

zpoint commented May 1, 2025

0.8.x does not have this test.

Copy link
Collaborator

@cg505 cg505 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, thanks for looking into it!

@zpoint
Copy link
Collaborator Author

zpoint commented May 1, 2025

/smoke-test --aws -k test_cancel_launch_and_exec_async

@zpoint
Copy link
Collaborator Author

zpoint commented May 1, 2025

/smoke-test --gcp -k test_cancel_launch_and_exec_async

@zpoint
Copy link
Collaborator Author

zpoint commented May 1, 2025

/smoke-test --azure -k test_cancel_launch_and_exec_async

@zpoint zpoint merged commit 7fa02b5 into skypilot-org:master May 1, 2025
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kubernetes test test_cancel_launch_and_exec_async fail on master
2 participants