[GCP] Remap series-specific disk types #5457

Merged

Conversation

JiangJiaWei1103
Contributor

Tracking issue

Closes #4705.

Why are the changes needed?

Disk type support on GCP differs by machine series, as summarized below:

Machine series   pd-standard   pd-balanced   pd-ssd   pd-extreme
N2               V             V             V        V (>= 64 vCPUs)
N1               V             V             V        -
N1+GPU           V             V             V        -
A2               V             V             V        -
G2               -             V             V        -

So that SkyPilot can pick the best available disk type for each machine series, we remap an unsupported disk type to a supported one rather than raising an error.

What changes were proposed in this pull request?

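  • Propagate the HIGH disk tier (which maps to pd-ssd) to ULTRA, so that a request for ULTRA falls back to pd-ssd on N2 (with fewer than 64 vCPUs), N1, A2, and G2, none of which support pd-extreme
  • Propagate the MEDIUM disk tier (which maps to pd-balanced) to LOW, so that a request for LOW falls back to pd-balanced on G2, which doesn't support pd-standard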
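
The remapping described above can be sketched as follows. This is a minimal illustration and not the code in this PR: `DiskTier`, `supported_disk_types`, and `resolve_disk_type` are hypothetical names, the LOW -> pd-standard and ULTRA -> pd-extreme mappings are inferred from the bullets above, and the vCPU count is assumed to be the last dash-separated field of an N2 instance type name.

```python
from enum import Enum


class DiskTier(Enum):
    """Hypothetical stand-in for the disk tiers named in this PR."""
    LOW = 'low'
    MEDIUM = 'medium'
    HIGH = 'high'
    ULTRA = 'ultra'


# Default tier -> GCP disk type. HIGH and MEDIUM are stated in the PR
# description; LOW and ULTRA are inferred from it (assumption).
_TIER_TO_DISK_TYPE = {
    DiskTier.LOW: 'pd-standard',
    DiskTier.MEDIUM: 'pd-balanced',
    DiskTier.HIGH: 'pd-ssd',
    DiskTier.ULTRA: 'pd-extreme',
}

# Which tier to borrow from when the requested tier's disk type is
# unsupported on the machine series.
_FALLBACK_TIER = {
    DiskTier.ULTRA: DiskTier.HIGH,
    DiskTier.LOW: DiskTier.MEDIUM,
}


def supported_disk_types(instance_type: str) -> set:
    """Return the disk types from the support table for an instance type."""
    series = instance_type.split('-')[0].lower()
    if series == 'n2':
        types = {'pd-standard', 'pd-balanced', 'pd-ssd'}
        # Assumption: names look like 'n2-standard-32', ending in the vCPU count.
        vcpus = int(instance_type.rsplit('-', 1)[1])
        if vcpus >= 64:
            types.add('pd-extreme')
        return types
    if series in ('n1', 'a2'):
        return {'pd-standard', 'pd-balanced', 'pd-ssd'}
    if series == 'g2':
        return {'pd-balanced', 'pd-ssd'}
    raise ValueError(f'Series not covered by this sketch: {series}')


def resolve_disk_type(instance_type: str, tier: DiskTier) -> str:
    """Pick the disk type for a tier, remapping instead of raising."""
    supported = supported_disk_types(instance_type)
    while _TIER_TO_DISK_TYPE[tier] not in supported:
        if tier not in _FALLBACK_TIER:
            raise ValueError(f'No supported disk type for tier {tier.value}')
        tier = _FALLBACK_TIER[tier]
    return _TIER_TO_DISK_TYPE[tier]


if __name__ == '__main__':
    for name, tier in [('n2-standard-32', DiskTier.ULTRA),
                       ('n2-standard-128', DiskTier.ULTRA),
                       ('g2-standard-4', DiskTier.LOW)]:
        print(name, tier.value, '->', resolve_disk_type(name, tier))
```

Under these assumptions, an ULTRA request on n2-standard-32 resolves to pd-ssd, the same request on n2-standard-128 keeps pd-extreme, and a LOW request on g2-standard-4 resolves to pd-balanced. Series outside the table (for example A3) are deliberately left out of the sketch.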

Screenshots

  • sky launch -t n2-standard-32 --disk-tier ultra --dryrun

Screenshot 2025-04-30 at 9 42 55 PM

  • sky launch -t n2-standard-128 --disk-tier ultra --dryrun

Screenshot 2025-04-30 at 9 41 57 PM

  • sky launch -t n1-standard-32 --disk-tier ultra --dryrun

Screenshot 2025-04-30 at 9 43 32 PM

  • sky launch -t g2-standard-4 --disk-tier ultra --dryrun

Screenshot 2025-04-30 at 9 47 47 PM

  • sky launch -t g2-standard-4 --disk-tier low --dryrun

Screenshot 2025-04-30 at 9 47 17 PM

Because we refactored the propagation logic, we reran the experiments on the A3 series (see #5351):

  • sky launch -t a3-highgpu-8g --disk-tier ultra --dryrun

Screenshot 2025-04-30 at 9 36 11 PM

  • sky launch -t a3-highgpu-8g --disk-tier low --dryrun

Screenshot 2025-04-30 at 9 40 50 PM

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

Signed-off-by: JiangJiaWei1103 <[email protected]>
@JiangJiaWei1103
Contributor Author

@SeungjinYang PTAL, thanks a lot!

@JiangJiaWei1103 JiangJiaWei1103 marked this pull request as draft April 30, 2025 14:22
@SeungjinYang SeungjinYang self-requested a review April 30, 2025 15:09
Collaborator

@SeungjinYang SeungjinYang left a comment

Nice! I assume you might not be done with the PR given it is in draft state, but this looks good. I was a bit worried the logic might be a sprawl, but the extra logic added ends up being compensated by the logic sprawl removed in check_disk_tier.

@JiangJiaWei1103
Contributor Author

JiangJiaWei1103 commented May 1, 2025

After applying all the suggested changes to the code, we re-ran the experiments and obtained the same results:

  • sky launch -t n2-standard-32 --disk-tier ultra --dryrun

Screenshot 2025-05-01 at 9 28 23 PM

  • sky launch -t n2-standard-128 --disk-tier ultra --dryrun

Screenshot 2025-05-01 at 9 28 55 PM

  • sky launch -t n1-standard-32 --disk-tier ultra --dryrun

Screenshot 2025-05-01 at 9 29 24 PM

  • sky launch -t g2-standard-4 --disk-tier ultra --dryrun

Screenshot 2025-05-01 at 9 29 51 PM

  • sky launch -t g2-standard-4 --disk-tier low --dryrun

Screenshot 2025-05-01 at 9 30 19 PM

  • sky launch -t a3-highgpu-8g --disk-tier ultra --dryrun

Screenshot 2025-05-01 at 9 30 52 PM

  • sky launch -t a3-highgpu-8g --disk-tier low --dryrun

Screenshot 2025-05-01 at 9 31 32 PM

Btw, I currently skip the a2 tests because there are no VMs available to provision.

@JiangJiaWei1103 JiangJiaWei1103 marked this pull request as ready for review May 1, 2025 13:39
Collaborator

@SeungjinYang SeungjinYang left a comment

Thanks @JiangJiaWei1103! Based on how the code is written, I'm sure the a2 type works with ULTRA if the n1 type works, which you tested.

@SeungjinYang SeungjinYang enabled auto-merge (squash) May 1, 2025 17:23
@SeungjinYang SeungjinYang merged commit a77a95d into skypilot-org:master May 1, 2025
22 checks passed
Development

Successfully merging this pull request may close these issues.

[GCP][Disk] Google Cloud Hyperdisk support
2 participants