[GCP][Disk] Google Cloud Hyperdisk support #4705
Comments
Great catch @skyshard! This definitely should be added. Would you like to submit a PR for this?
Hi @Michaelvll, I'd like to pick this up. Thanks!
Hi @Michaelvll, As far as I know, Lines 1036 to 1046 in c503937
To support new disk types like Also, I have a couple of questions:
I also noticed that disk costs aren’t currently considered during optimization: Lines 957 to 964 in c503937
Thanks a lot!
Hello @JiangJiaWei1103 ! This is such an interesting problem, thanks for raising it. I took a look at the relevant code and found this:
Notably, when I am wondering if we can pass the instance type all the way through to This way we don't have to add more disk types specific to GCP - and specifying edit: As for your question 2:
There is As for question 1:
I don't actually have a good idea on this.
Tangentially related: found this issue re: question 1
Hi @SeungjinYang, Thank you so much for the insightful suggestions! To make sure I fully understand your proposal, I’ve opened a draft PR to illustrate the idea. The core concept is to use both Lines 1034 to 1051 in d645c9b
Running Once confirmed, I’ll close the draft PR and follow up with a proper implementation. Regarding question 2, do you mean we could automatically elevate the disk tier (e.g., from As for question 1, I’d love to dive deeper into that topic. Thanks for sharing the issue link. It’s so helpful to think about incorporating disk cost into optimization!
Took a look at the PR, direction looks good! As for Q2, I was thinking more along the lines of mapping
Sounds great! I’ll draft a mapping proposal for both general CPU and GPU instances and share it here. We can then continue the discussion and see if the mapping aligns with your expectations. Thanks!
Hi @SeungjinYang, As you mentioned, SkyPilot can be more opinionated in selecting a better resource type (e.g., disk type) based on the instance type. Before we decide on the actual mapping between instance types and disk types, I’d like to share some relevant information. According to the GCP doc, Google recommends using Hyperdisk due to several key advantages, such as customizable performance (you can adjust a Hyperdisk volume’s performance without changing its size), as well as superior IOPS and throughput limits. This section also provides guidance on choosing a suitable disk type. One crucial point is:
If I'm not mistaken, this clearly suggests that Hyperdisk is preferred over Persistent Disk whenever it is available. Below is a table summarizing the disk type support for the instance types currently supported by SkyPilot GCP:
Given the diverse combinations, if we aim to subjectively select the most suitable disk type based on a user-specified For N1 and N1+GPU, we can stick with Persistent Disks. Similarly, for A3 Mega and A3 High, it’s reasonable to default to Hyperdisks. However, the mapping for the remaining instance types is still open for discussion. Thanks!
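To make the proposal above concrete, here is a minimal Python sketch of the defaults agreed so far. Everything in it is hypothetical: `DEFAULT_DISK_FAMILY`, `default_disk_family`, and the family labels are illustrative names, not SkyPilot identifiers. It records only the two decisions stated in this comment and returns `None` for every family whose mapping is still open.

```python
from typing import Dict, Optional

# Hypothetical table: only the defaults stated in the comment above.
# N1 (with or without GPUs) stays on Persistent Disk; A3 Mega and A3 High
# default to Hyperdisk. Other families are deliberately left out.
DEFAULT_DISK_FAMILY: Dict[str, str] = {
    'n1': 'persistent-disk',
    'a3 mega': 'hyperdisk',
    'a3 high': 'hyperdisk',
}


def default_disk_family(instance_family: str) -> Optional[str]:
    """Return the proposed default disk family, or None if undecided."""
    return DEFAULT_DISK_FAMILY.get(instance_family.strip().lower())


if __name__ == '__main__':
    for family in ('N1', 'A3 Mega', 'G2'):
        print(family, '->', default_disk_family(family))
```

Returning `None` rather than guessing keeps the undecided families visible to the caller, which can then fall back to whatever the user specified.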
Looking at this document on hyperdisks - there is one restriction that, for better or worse, makes this design simpler:
Which leaves Edit: So in cases like G2, we may have to pin HIGH/ULTRA/BEST to As a meta comment, I'm actually quite happy/excited whenever I encounter less-than-intuitive behaviors like this on cloud providers - to me, being able to navigate things like this is where SkyPilot should shine.
It appears that the ultra disk which maps to pd-extreme does not work with A3 / A2 / G2 gpu machine types in GCP, so it cannot be used for accelerated ml serving workloads (H200 / H100 / A100 / L4).

They support something called hyperdisk instead, but it also varies based on instance type with hyperdisk-ml having the broadest support:

a3 mega, a3 high, a3 edge: hyperdisk-ml, hyperdisk-balanced, hyperdisk-extreme, hyperdisk-throughput, pd-ssd, pd-balanced
a3 ultra: hyperdisk-balanced, hyperdisk-extreme
a2: hyperdisk-ml, pd-ssd, pd-standard, pd-balanced
g2: hyperdisk-ml, hyperdisk-throughput, pd-ssd, pd-balanced

Pricing looks similar to pd-extreme: https://cloud.google.com/compute/disks-image-pricing?hl=en#tg1-t0

Is this something that can be added?
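The support list in this issue can be written down as a small lookup table, which makes it easy to check a disk type before requesting it. The sketch below is illustrative only: `SUPPORTED_DISK_TYPES`, `supports_disk_type`, and `families_supporting` are hypothetical names, not SkyPilot or GCP APIs, and the data is copied from the list in this issue rather than from a live GCP query.

```python
from typing import Dict, FrozenSet, List

# Disk types shared by a3 mega, a3 high and a3 edge, as listed in this issue.
_A3_COMMON: FrozenSet[str] = frozenset({
    'hyperdisk-ml', 'hyperdisk-balanced', 'hyperdisk-extreme',
    'hyperdisk-throughput', 'pd-ssd', 'pd-balanced',
})

# Hypothetical table; keys follow the labels used in the issue text.
SUPPORTED_DISK_TYPES: Dict[str, FrozenSet[str]] = {
    'a3 mega': _A3_COMMON,
    'a3 high': _A3_COMMON,
    'a3 edge': _A3_COMMON,
    'a3 ultra': frozenset({'hyperdisk-balanced', 'hyperdisk-extreme'}),
    'a2': frozenset({'hyperdisk-ml', 'pd-ssd', 'pd-standard', 'pd-balanced'}),
    'g2': frozenset({'hyperdisk-ml', 'hyperdisk-throughput', 'pd-ssd',
                     'pd-balanced'}),
}


def supports_disk_type(machine_family: str, disk_type: str) -> bool:
    """Return True if the family is listed as supporting the disk type."""
    key = machine_family.strip().lower()
    return disk_type in SUPPORTED_DISK_TYPES.get(key, frozenset())


def families_supporting(disk_type: str) -> List[str]:
    """Return the listed families that support the disk type, sorted."""
    return sorted(family for family, types in SUPPORTED_DISK_TYPES.items()
                  if disk_type in types)


if __name__ == '__main__':
    print(families_supporting('pd-extreme'))
    print(families_supporting('hyperdisk-ml'))
```

With this data, `families_supporting('pd-extreme')` comes back empty, which matches the report above, and `hyperdisk-ml` covers every listed family except a3 ultra.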