[GCP][Disk] Google Cloud Hyperdisk support #4705

Closed
skyshard opened this issue Feb 12, 2025 · 10 comments · Fixed by #5457

@skyshard

It appears that the ultra disk tier, which maps to pd-extreme, does not work with the A3 / A2 / G2 GPU machine types on GCP, so it cannot be used for accelerated ML serving workloads (H200 / H100 / A100 / L4).

These machine types support something called Hyperdisk instead, but support also varies by instance type, with hyperdisk-ml having the broadest support:

| instance | disk support | ref |
|---|---|---|
| a3 mega, a3 high, a3 edge | hyperdisk-ml, hyperdisk-balanced, hyperdisk-extreme, hyperdisk-throughput, pd-ssd, pd-balanced | https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-disks |
| a3 ultra | hyperdisk-balanced, hyperdisk-extreme | https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-disks |
| a2 | hyperdisk-ml, pd-ssd, pd-standard, pd-balanced | https://cloud.google.com/compute/docs/accelerator-optimized-machines#a2-disks |
| g2 | hyperdisk-ml, hyperdisk-throughput, pd-ssd, pd-balanced | https://cloud.google.com/compute/docs/accelerator-optimized-machines#g2-disks |

Pricing looks similar to pd-extreme: https://cloud.google.com/compute/disks-image-pricing?hl=en#tg1-t0

Is this something that can be added?

@Michaelvll
Collaborator

Great catch @skyshard! This definitely should be added. Would you like to submit a PR for this?

@JiangJiaWei1103
Contributor

Hi @Michaelvll,

I'd like to pick this up. Thanks!

@JiangJiaWei1103
Contributor

JiangJiaWei1103 commented Apr 7, 2025

Hi @Michaelvll,

As far as I know, DiskTier is currently defined with four levels, low to ultra, and the best option also points to ultra, as shown here. For GCP, each tier maps one-to-one to a specific disk type, as illustrated in the following lines:

skypilot/sky/clouds/gcp.py

Lines 1036 to 1046 in c503937

    @classmethod
    def _get_disk_type(cls,
                       disk_tier: Optional[resources_utils.DiskTier]) -> str:
        tier = cls._translate_disk_tier(disk_tier)
        tier2name = {
            resources_utils.DiskTier.ULTRA: 'pd-extreme',
            resources_utils.DiskTier.HIGH: 'pd-ssd',
            resources_utils.DiskTier.MEDIUM: 'pd-balanced',
            resources_utils.DiskTier.LOW: 'pd-standard',
        }
        return tier2name[tier]

To support new disk types like hyperdisk-ml, I was thinking about introducing a GCP-specific DiskTier that includes additional tiers such as ACC_LOW, ACC_ULTRA, etc., optimized for accelerated workloads. Would that approach make sense to you? I’d love to hear your thoughts!

Also, I have a couple of questions:

  1. How can users know the disk cost before spinning up the VM? I ask because I don’t see disk costs reported in the cost table. For instance:
    Image

I also noticed that disk costs aren’t currently considered during optimization:

skypilot/sky/optimizer.py

Lines 957 to 964 in c503937

    # Warning message for using disk_tier=ultra
    # TODO(yi): Consider price of disks in optimizer and
    # move this warning there.
    if chosen_resources.disk_tier == resources_utils.DiskTier.ULTRA:
        logger.warning(
            'Using disk_tier=ultra will utilize more advanced disks '
            '(io2 Block Express on AWS and extreme persistent disk on '
            'GCP), which can lead to significant higher costs (~$2/h).')

  2. How can users verify whether a given disk type is compatible with a specific instance_type before launching? As shown in the following figure, --disk-tier low should map to pd-standard, but the GCP docs don't list pd-standard as a supported disk type for the a3-high series. Hence, I would expect an error to be raised in this case.

Image
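To make question 2 concrete, here is a minimal sketch of a pre-launch compatibility check. The lookup table is transcribed from the table in the original report at the top of this issue; the function names and the prefix-matching rule are hypothetical and are not part of SkyPilot's actual API.

```python
from typing import Dict, FrozenSet, Optional

# Disk types per machine series, transcribed from the table in the original
# report. Illustrative only: the GCP docs remain the source of truth.
_A3_COMMON = frozenset({
    'hyperdisk-ml', 'hyperdisk-balanced', 'hyperdisk-extreme',
    'hyperdisk-throughput', 'pd-ssd', 'pd-balanced',
})
SUPPORTED_DISK_TYPES: Dict[str, FrozenSet[str]] = {
    'a3-megagpu': _A3_COMMON,
    'a3-highgpu': _A3_COMMON,
    'a3-edgegpu': _A3_COMMON,
    'a3-ultragpu': frozenset({'hyperdisk-balanced', 'hyperdisk-extreme'}),
    'a2': frozenset({'hyperdisk-ml', 'pd-ssd', 'pd-standard', 'pd-balanced'}),
    'g2': frozenset({'hyperdisk-ml', 'hyperdisk-throughput', 'pd-ssd',
                     'pd-balanced'}),
}


def supported_disk_types(instance_type: str) -> Optional[FrozenSet[str]]:
    """Return the known disk types for an instance type, or None if unknown."""
    # Assumption: the machine series can be recognized from the name prefix.
    # Longer prefixes are checked first so the most specific key wins.
    for prefix in sorted(SUPPORTED_DISK_TYPES, key=len, reverse=True):
        if instance_type.startswith(prefix + '-'):
            return SUPPORTED_DISK_TYPES[prefix]
    return None


def check_disk_type(instance_type: str, disk_type: str) -> None:
    """Raise ValueError if disk_type is known to be unsupported."""
    supported = supported_disk_types(instance_type)
    if supported is not None and disk_type not in supported:
        raise ValueError(
            f'{disk_type} is not supported on {instance_type}; '
            f'supported disk types: {sorted(supported)}')
```

With this sketch, `check_disk_type('a3-highgpu-8g', 'pd-standard')` raises a `ValueError`, which is the behavior expected in the question above, while instance types missing from the table are passed through unchecked.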

Thanks a lot!

@SeungjinYang
Collaborator

SeungjinYang commented Apr 8, 2025

Hello @JiangJiaWei1103 ! This is such an interesting problem, thanks for raising it. I took a look at the relevant code and found this:

  1. GCP class's _get_disk_type function is only called from _get_disk_specs within the same class
  2. GCP class's _get_disk_specs is only called from make_deploy_resources_variables within the same class. make_deploy_resources_variables is defined in the parent Cloud class, so its signature must stay the same.

Notably, when _get_disk_specs is called from make_deploy_resources_variables, the caller has access to the instance type.

I am wondering if we can pass the instance type all the way through to _get_disk_type, at which point we can implement all sorts of logic to get the correct disk given the desired type and the instance type. It could make SkyPilot really smart at figuring out the best disk type to use for specific instance type, for example.

This way we don't have to add more disk types specific to GCP - and specifying the ultra disk tier for those high-end GPU machines would work.

edit:

As for your question 2:

How can users verify whether a given disk type is compatible with a specific instance_type before launching?

There is check_disk_tier_enabled in GCP class, but also - with the changes described above, perhaps we can make low disk tier work for a3 instance types by having it pick some other disk tier actually compatible with it.

As for question 1:

How can users know the disk cost before spinning up the VM?

I don't actually have a good idea on this.

@SeungjinYang
Collaborator

SeungjinYang commented Apr 9, 2025

tangentially related: found this issue re: question 1

@JiangJiaWei1103
Contributor

Hi @SeungjinYang,

Thank you so much for the insightful suggestions! To make sure I fully understand your proposal, I’ve opened a draft PR to illustrate the idea. The core concept is to use both instance_type and disk_tier to determine the appropriate disk type. For example, a combination like ("a3-ultragpu-8g", "ultra") could map to hyperdisk-extreme. I’ve included a simple implementation here:

skypilot/sky/clouds/gcp.py

Lines 1034 to 1051 in d645c9b

    @classmethod
    def _get_disk_type(cls, instance_type: Optional[str],
                       disk_tier: Optional[resources_utils.DiskTier]) -> str:
        enable_acc = True if instance_type in service_catalog.gcp_catalog.GCP_ACC_INSTANCE_TYPES else False
        tier = cls._translate_disk_tier(disk_tier)
        if enable_acc:
            tier2name = {
                resources_utils.DiskTier.ULTRA: 'hyperdisk-extreme',
            }
        else:
            tier2name = {
                resources_utils.DiskTier.ULTRA: 'pd-extreme',
                resources_utils.DiskTier.HIGH: 'pd-ssd',
                resources_utils.DiskTier.MEDIUM: 'pd-balanced',
                resources_utils.DiskTier.LOW: 'pd-standard',
            }
        return tier2name[tier]

Running sky launch -t g2-standard-4 --disk-tier ultra --dryrun yields:

Image

Once confirmed, I’ll close the draft PR and follow up with a proper implementation.

Regarding question 2, do you mean we could automatically elevate the disk tier (e.g., from LOW to MEDIUM, which maps to pd-balanced) if a given tier isn’t supported by the selected instance? If so, we might also consider displaying a message to inform users about the tier elevation.

As for question 1, I’d love to dive deeper into that topic. Thanks for sharing the issue link. It’s so helpful to think about incorporating disk cost into optimization!

@SeungjinYang
Collaborator

Took a look at the PR, direction looks good!

As for Q2, I was thinking more along the lines of mapping the LOW tier directly to pd-balanced for higher-specced SKUs that don't support pd-standard. With the hyperdisk-extreme change, SkyPilot is already starting to take an opinionated stance on what performance means in the context of instance type, and I don't see a reason SkyPilot shouldn't be more opinionated elsewhere.
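One possible reading of this idea, combined with the earlier suggestion to tell users when the tier was adjusted, is sketched below. The `DiskTier` enum here is a stand-in for `resources_utils.DiskTier`, the tier-to-disk mapping is the persistent-disk mapping quoted earlier in the thread, and `resolve_disk_type` together with its "prefer the next higher tier, then fall back to lower ones" rule is a hypothetical policy rather than what SkyPilot implements.

```python
import enum
from typing import Iterable, Tuple


class DiskTier(enum.IntEnum):
    """Stand-in for resources_utils.DiskTier, ordered from low to ultra."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    ULTRA = 3


# Persistent-disk mapping quoted earlier in this thread.
PD_BY_TIER = {
    DiskTier.LOW: 'pd-standard',
    DiskTier.MEDIUM: 'pd-balanced',
    DiskTier.HIGH: 'pd-ssd',
    DiskTier.ULTRA: 'pd-extreme',
}


def resolve_disk_type(tier: DiskTier,
                      supported: Iterable[str]) -> Tuple[str, bool]:
    """Pick a supported disk type for the requested tier.

    Returns (disk_type, adjusted), where adjusted is True when the requested
    tier's own disk type was unavailable and another one was substituted.
    """
    supported_set = set(supported)
    # Assumed policy: try the requested tier, then higher tiers in
    # ascending order, then lower tiers in descending order.
    higher = [t for t in DiskTier if t >= tier]
    lower = sorted((t for t in DiskTier if t < tier), reverse=True)
    for candidate in higher + lower:
        name = PD_BY_TIER[candidate]
        if name in supported_set:
            return name, candidate != tier
    raise ValueError(f'No supported disk type found among {sorted(supported_set)}')


disk_type, adjusted = resolve_disk_type(DiskTier.LOW, ['pd-balanced', 'pd-ssd'])
if adjusted:
    print(f'disk_tier=low is not available here; using {disk_type} instead.')
```

The boolean in the return value lets the caller decide whether to log a notice, so the mapping itself stays free of side effects.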

@JiangJiaWei1103
Contributor

Sounds great! I’ll draft a mapping proposal for both general CPU and GPU instances and share it here. We can then continue the discussion and see if the mapping aligns with your expectations. Thanks!

@JiangJiaWei1103
Contributor

JiangJiaWei1103 commented Apr 12, 2025

Hi @SeungjinYang,

As you mentioned, SkyPilot can be more opinionated in selecting a better resource type (e.g., disk type) based on the instance type. Before we decide on the actual mapping between instance types and disk types, I’d like to share some relevant information.

According to the GCP doc, Google recommends using Hyperdisk due to several key advantages, such as customizable performance (you can adjust a Hyperdisk volume’s performance without changing its size), as well as superior IOPS and throughput limits. This section also provides guidance on choosing a suitable disk type. One crucial point is:

use Hyperdisk if it's available for your machine series.

If I'm not mistaken, this clearly suggests that Hyperdisk is preferred over Persistent Disk whenever it is available. Below is a table summarizing the disk type support for the instance types currently supported by SkyPilot GCP:

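| Machine series | hyperdisk-balanced | hyperdisk-extreme | hyperdisk-throughput | hyperdisk-ml | hyperdisk-balanced-high-availability | pd-standard | pd-balanced | pd-ssd | pd-extreme |
|---|---|---|---|---|---|---|---|---|---|
| N2 | | V | V | | | V | V | V | V |
| N1 | | | | | | V | V | V | |
| N1+GPU | | | | | | V | V | V | |
| A3 Mega | V | V | V | V | | | V | V | |
| A3 High | V (only supported for a3-highgpu-8g) | V | V | V | | | V | V | |
| A2 | | | | V | | V | V | V | |
| G2 | | | V | V | | | V | V | |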

References: General-purpose machine family for Compute Engine, Accelerator-optimized machine family

Given the diverse combinations, if we aim to subjectively select the most suitable disk type based on a user-specified disk_tier (ranging from LOW to ULTRA), I’d love to hear your thoughts on what a good strategy would be to map (instance_type, disk_tier) to the supported disk types shown in the table above!

For N1 and N1+GPU, we can stick with Persistent Disks. Similarly, for A3 Mega and A3 High, it’s reasonable to default to Hyperdisks. However, the mapping for the remaining instance types is still open for discussion. Thanks!

@SeungjinYang
Collaborator

SeungjinYang commented Apr 14, 2025

Looking at this document on hyperdisks - there is one restriction that, for better or worse, makes this design simpler:

Hyperdisk Extreme, Hyperdisk ML and Hyperdisk Throughput volumes can't be used as boot disks.

That leaves hyperdisk-balanced as the only Hyperdisk type suitable for our use case, the boot disk. I agree that for A3 Mega and A3 High, it’s reasonable to default to Hyperdisks. Given that other VM types don't support hyperdisk-balanced, we can have these other VMs default to pd types.

Edit:
This does mean for instance types like G2, pd-balanced and pd-ssd are actually the only valid storage options for the boot disk. I didn't want to believe it so I checked the web console - it really is just those two.

Image

So in cases like G2, we may have to pin HIGH/ULTRA/BEST to pd-ssd :< not how I expected the conclusion here to be, but this is indeed how GCP works.
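The boot-disk restriction quoted above can be applied as a simple filter over the disk types a machine series supports. The sketch below uses the G2 and A3 High lists from the table in the original report; `boot_disk_candidates` is a hypothetical helper name, not an existing SkyPilot function.

```python
from typing import Iterable, Set

# Hyperdisk types that cannot be used as boot disks, per the restriction
# quoted above.
NON_BOOTABLE_DISK_TYPES = frozenset({
    'hyperdisk-extreme',
    'hyperdisk-ml',
    'hyperdisk-throughput',
})


def boot_disk_candidates(supported: Iterable[str]) -> Set[str]:
    """Drop disk types that cannot be used as boot disks."""
    return set(supported) - NON_BOOTABLE_DISK_TYPES


# Supported disk types as listed in the original report.
g2_supported = ['hyperdisk-ml', 'hyperdisk-throughput', 'pd-ssd', 'pd-balanced']
a3_high_supported = ['hyperdisk-ml', 'hyperdisk-balanced', 'hyperdisk-extreme',
                     'hyperdisk-throughput', 'pd-ssd', 'pd-balanced']

print(sorted(boot_disk_candidates(g2_supported)))
# ['pd-balanced', 'pd-ssd']
print(sorted(boot_disk_candidates(a3_high_supported)))
# ['hyperdisk-balanced', 'pd-balanced', 'pd-ssd']
```

For G2 the filter leaves only pd-balanced and pd-ssd, which matches what the web console showed, so HIGH, ULTRA and BEST all end up on pd-ssd there.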

As a meta comment, I'm actually quite happy/excited whenever I encounter less-than-intuitive behaviors like this on cloud providers - to me, being able to navigate things like this is where SkyPilot should shine.
