-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Treat zero cluster UUID as non-fabric-attached #54
Conversation
pkg/nvlib/device/device.go
Outdated
@@ -219,6 +219,9 @@ func (d *device) IsFabricAttached() (bool, error) { | |||
if ret != nvml.SUCCESS { | |||
return false, fmt.Errorf("error getting GPU Fabric Info: %v", ret) | |||
} | |||
if info.ClusterUuid == [16]uint8{} || info.CliqueId == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My one question would be whether ||
is the correct operator here. Is a clique ID of 0 valid if the cluster UUID is set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, clique id 0 is valid if clusterUUID is set -- in fact this is how our GH200 system has it configured
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just do the following (but possibly move it lower depending on my comment below):
if info.ClusterUuid == [16]uint8{} {
pkg/nvlib/device/device.go
Outdated
if info.ClusterUuid == [16]uint8{} || info.CliqueId == 0 { | ||
return false, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this come after verifying that GPU_FABRIC_STATE_COMPLETED
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it probably should.
Signed-off-by: Evan Lezar <[email protected]>
fdb3f60
to
311c577
Compare
On systems which are NVLink-capable, but not attached to the fabric we see:
This change treats zero ClusterUUID values as not-attached.