
HDDS-12373. Change calculation logic for volume reserved space #7927

Draft
wants to merge 1 commit into master

Conversation

@symious (Contributor) commented Feb 19, 2025

What changes were proposed in this pull request?

The current logic for "hdds.datanode.dir.du.reserved" is as follows:

// Remaining reserved space = configured reservation minus space already used by non-Ozone data,
// clamped at zero.
private long getRemainingReserved() {
  return Math.max(reservedInBytes - getOtherUsed(), 0L);
}

which means that if getOtherUsed() is larger than reservedInBytes, the remaining reserved space is counted as 0.

When we set "hdds.datanode.dir.du.reserved" to 100GB, we actually want the disk to keep 100GB free to avoid "space not enough" exceptions.

But servers normally have a system-level block reservation of about 5%. For a 10TB disk, the system-level reserved space is therefore about 500GB, so when we set the configuration to 100GB, the remaining reserved space is calculated as 0, and the reservation is not counted in capacity and availability.

With the current calculation logic, in order to reserve 100GB we need to set the configuration to "600GB" (500 + 100) or "0.06" (0.05 + 0.01 when using the percent setting).

This ticket changes the reservation calculation so that it is more intuitive for users: the configured reservation is honored as-is, and users no longer need to account for the other usages themselves.
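
To illustrate the intent, a minimal sketch assuming the field and method names from the snippet above (reservedInBytes, getOtherUsed()); the final patch may differ in detail, see the later discussion:

// Sketch of the proposed behavior: the configured reservation is kept in full,
// independent of how much non-Ozone data already exists on the disk.
private long getRemainingReserved() {
  // Old logic: Math.max(reservedInBytes - getOtherUsed(), 0L), which collapses
  // to 0 once other usage (e.g. the system-level 5% reservation) exceeds the
  // configured value.
  return reservedInBytes;
}

With this, setting "hdds.datanode.dir.du.reserved" to 100GB always keeps 100GB out of the space reported as available, no matter how large the other (non-Ozone) usage is.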

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-12373

How was this patch tested?

Unit test.

@adoroszlai (Contributor) left a comment

Thanks @symious for the patch.

Tests run: 16, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 1.096 s <<< FAILURE! - in org.apache.hadoop.ozone.container.common.volume.TestHddsVolume

Please wait for a clean CI run in your fork before opening the PR (or open it as a draft).

@adoroszlai adoroszlai marked this pull request as draft February 19, 2025 08:34
@adoroszlai (Contributor)

Thanks @symious for updating the patch, LGTM. Added some other reviewers who worked on space logic previously.

@ChenSammi (Contributor)

One thing I just thought about is that with this new change, if a user has already set "hdds.datanode.dir.du.reserved" or "hdds.datanode.dir.du.reserved.percent" to account for non-Ozone services, the reservation may become too large after an Ozone version upgrade. For example, if the Ozone DN is co-deployed with a YARN service, which needs disk space for shuffle data and other data, the user may have set "hdds.datanode.dir.du.reserved.percent" to 20% or 30%. That would then be too much under the new calculation logic.

@symious (Contributor, Author) commented Feb 21, 2025

The user may have set "hdds.datanode.dir.du.reserved.percent" to 20% or 30%. That would then be too much under the new calculation logic.

IMO, the reserved configuration is there to make sure there is enough space left for Ozone usage, and for this goal, 50GB of free space would be enough. Ozone should not worry about the other usages (YARN or the system reserve); even if Ozone spares a 20% reservation for YARN, there is no guarantee that YARN will only use 20%.

It is safer for Ozone to always reserve the configured space regardless of other usages.

@adoroszlai (Contributor) commented Feb 21, 2025

IMO, the reserved configuration is there to make sure there is enough space left for Ozone usage, and for this goal, 50GB of free space would be enough.

According to the config doc, hdds.datanode.volume.min.free.space (and .percent) is for Ozone usage (closing containers), and hdds.datanode.dir.du.reserved (inherited from HDFS) is for non-Ozone usage.

<property>
  <name>hdds.datanode.dir.du.reserved</name>
  <value/>
  <tag>OZONE, CONTAINER, STORAGE, MANAGEMENT</tag>
  <description>Reserved space in bytes per volume. Always leave this much space free for non dfs use.
    Such as /dir1:100B, /dir2:200MB, means dir1 reserves 100 bytes and dir2 reserves 200 MB.
  </description>
</property>

<property>
  <name>hdds.datanode.volume.min.free.space</name>
  <value>5GB</value>
  <tag>OZONE, CONTAINER, STORAGE, MANAGEMENT</tag>
  <description>
    This determines the free space to be used for closing containers
    When the difference between volume capacity and used reaches this number,
    containers that reside on this volume will be closed and no new containers
    would be allocated on this volume.
  </description>
</property>
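
For context, a hypothetical ozone-site.xml override of both settings (the path and values below are illustrative only, not recommendations):

<property>
  <name>hdds.datanode.dir.du.reserved</name>
  <!-- keep 100GB on /data/disk1 free for non-Ozone use (hypothetical path and value) -->
  <value>/data/disk1:100GB</value>
</property>
<property>
  <name>hdds.datanode.volume.min.free.space</name>
  <!-- treat the volume as full for new containers once free space drops below 50GB -->
  <value>50GB</value>
</property>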

Ozone should not worry about the other usages (YARN or the system reserve); even if Ozone spares a 20% reservation for YARN, there is no guarantee that YARN will only use 20%.

I agree, maybe we should consider deprecating this setting?

@adoroszlai (Contributor) left a comment

Thanks @symious for updating the patch, LGTM except for an unused method.

So after this change, the reported datanode disk capacity will exclude the system-reserved space.

@siddhantsangwan siddhantsangwan self-requested a review February 25, 2025 08:58
@ChenSammi (Contributor) commented Feb 26, 2025

@symious, currently we have:

  1. hdds.datanode.dir.du.reserved and hdds.datanode.dir.du.reserved.percent, similar properties inherited from HDFS, which define the space left for non-Ozone use.
  2. hdds.datanode.volume.min.free.space and hdds.datanode.volume.min.free.space.percent, introduced in HDDS-8254, which define that once a volume's free space drops to this value, the volume is treated as full.

I totally agree it's a good idea. Based on the current state, I would propose that instead of changing the hdds.datanode.dir.du.reserved and hdds.datanode.dir.du.reserved.percent concepts, we reuse hdds.datanode.volume.min.free.space and hdds.datanode.volume.min.free.space.percent. Make SCM pipeline allocation, container allocation, container candidate selection in the container balancer, and container selection in the replication manager all aware of this min.free.space, so that disk-full situations can be avoided starting from SCM. What do you think @symious?

Also, the current hdds.datanode.volume.min.free.space and hdds.datanode.volume.min.free.space.percent have quite small default values; we should consider increasing them.
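
Purely as an illustration of the kind of check being described (not Ozone's actual SCM code; the name and signature below are hypothetical):

// Hypothetical helper: a volume is eligible for new pipelines/containers only
// if, after placing the requested data, its free space stays above min.free.space.
static boolean hasEnoughFreeSpace(long capacity, long used, long committed,
    long minFreeSpace, long requestedSize) {
  long free = capacity - used - committed;
  return free - requestedSize >= minFreeSpace;
}

Pipeline allocation, container allocation, the container balancer, and the replication manager could all apply such a check before selecting a target, so that volumes approaching min.free.space are skipped early.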

cc @sadanand48, @vtutrinov .

@symious (Contributor, Author) commented Feb 27, 2025

Make SCM pipeline allocation, container allocation, container candidate selection in the container balancer, and container selection in the replication manager all aware of this min.free.space, so that disk-full situations can be avoided starting from SCM.

@ChenSammi Are you suggesting to use "min.free.space" instead of "dir.du.reserved"?

@ChenSammi (Contributor) commented Feb 27, 2025

Make SCM pipeline allocation, container allocation, container candidate selection in the container balancer, and container selection in the replication manager all aware of this min.free.space, so that disk-full situations can be avoided starting from SCM.

@ChenSammi Are you suggesting to use "min.free.space" instead of "dir.du.reserved"?

Yes, I'm suggesting that we use hdds.datanode.volume.min.free.space and hdds.datanode.volume.min.free.space.percent to achieve the same goal.

@symious (Contributor, Author) commented Mar 13, 2025

Discussed with @ChenSammi: it's better to keep the logic of the "du.reserved" config, so here we only change the log message and remove the system-reserved space from the calculation.

@ChenSammi @adoroszlai @sadanand48 PTAL.

@adoroszlai adoroszlai requested a review from sumitagrawl March 13, 2025 07:03
@siddhantsangwan (Contributor)

@symious I've also been looking into this area and I'm planning to review this pull request soon.
