cmd/run, pkg/nvidia: Detect mismatched NVIDIA kernel & user space driver #1541

@debarshiray debarshiray commented Sep 13, 2024

The proprietary NVIDIA driver has a kernel space part and a user space
part, and their versions must always match. Sometimes, the host
operating system might end up with mismatched parts. One reason could
be that the different third-party repositories used to distribute the
driver are incompatible with each other. For example, on Fedora these
could be RPM Fusion and NVIDIA's own repository.

This shows up in the systemd journal as:

  $ journalctl --dmesg
  ...
  kernel: NVRM: API mismatch: the client has the version 555.58.02, but
          NVRM: this kernel module has the version 560.35.03.  Please
          NVRM: make sure that this kernel module and all NVIDIA driver
          NVRM: components have the same version.
  ...
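
As a rough illustration of what the journal message above encodes, the following sketch pulls the two version strings out of the NVRM text. It is not the detection method used by this pull request; `parseAPIMismatch` and `mismatchPattern` are hypothetical names, and the message layout is assumed to match the sample shown above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// mismatchPattern matches the NVRM "API mismatch" message once the wrapped
// journal lines have been joined into a single line. The wording is taken
// from the sample above and is an assumption about other driver versions.
var mismatchPattern = regexp.MustCompile(
	`API mismatch: the client has the version ([0-9.]+), but ` +
		`this kernel module has the version ([0-9.]+)\.`)

// parseAPIMismatch is a hypothetical helper. It returns the user space
// (client) and kernel module versions if the text reports a mismatch.
func parseAPIMismatch(text string) (client, kernel string, found bool) {
	// Drop the "kernel:" and "NVRM:" prefixes and collapse the whitespace
	// so that the wrapped message reads as one line.
	joined := strings.ReplaceAll(text, "NVRM:", " ")
	joined = strings.ReplaceAll(joined, "kernel:", " ")
	joined = strings.Join(strings.Fields(joined), " ")

	match := mismatchPattern.FindStringSubmatch(joined)
	if match == nil {
		return "", "", false
	}
	return match[1], match[2], true
}

func main() {
	sample := `kernel: NVRM: API mismatch: the client has the version 555.58.02, but
        NVRM: this kernel module has the version 560.35.03.  Please
        NVRM: make sure that this kernel module and all NVIDIA driver
        NVRM: components have the same version.`

	if client, kernel, found := parseAPIMismatch(sample); found {
		fmt.Printf("user space %s, kernel module %s\n", client, kernel)
	}
}
```

Parsing log text like this is brittle, so it is best treated as a diagnostic aid rather than a primary check.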

Without any special handling of this scenario, users would be presented
with a very misleading error:

  $ toolbox enter
  Error: failed to get Container Device Interface containerEdits for
      NVIDIA

Instead, improve the error message to be more self-documenting:

  $ toolbox enter
  Error: the proprietary NVIDIA driver's kernel and user space don't
      match
  Check the host operating system and systemd journal.

debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 13, 2024
NVIDIA Container Toolkit 0.16.0 added a new API to avoid creating a new
info.Interface when creating an nvcdi.Interface, if an info.Interface
already exists [1].  Commit 649d02f already bumped the required
NVIDIA Container Toolkit version to 0.16.0, so take advantage of that.

[1] NVIDIA Container Toolkit commit 8fc4b9c742f894ef
    NVIDIA/nvidia-container-toolkit@8fc4b9c742f894ef
    NVIDIA/nvidia-container-toolkit#516

containers#1541
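
The idea in the commit message above, passing an existing object to a constructor so that it does not build a second one, is commonly expressed in Go with functional options. The sketch below uses only hypothetical stand-in types (`infoProbe`, `specGenerator`, `withInfoProbe`); it does not reproduce the NVIDIA Container Toolkit or go-nvlib APIs.

```go
package main

import "fmt"

// infoProbe is a hypothetical stand-in for an object that is costly to
// create, in the role that info.Interface plays in the commit message.
type infoProbe struct{ id int }

// probesCreated counts how many probes have been constructed.
var probesCreated int

func newInfoProbe() *infoProbe {
	probesCreated++
	return &infoProbe{id: probesCreated}
}

// specGenerator is a hypothetical stand-in for the object that needs a probe.
type specGenerator struct{ probe *infoProbe }

// option configures a specGenerator before it is returned.
type option func(*specGenerator)

// withInfoProbe lets the caller hand over a probe that already exists.
func withInfoProbe(p *infoProbe) option {
	return func(g *specGenerator) { g.probe = p }
}

// newSpecGenerator creates its own probe only if none was supplied.
func newSpecGenerator(opts ...option) *specGenerator {
	g := &specGenerator{}
	for _, opt := range opts {
		opt(g)
	}
	if g.probe == nil {
		g.probe = newInfoProbe()
	}
	return g
}

func main() {
	probe := newInfoProbe()
	g := newSpecGenerator(withInfoProbe(probe))
	fmt.Println("reused:", g.probe == probe, "probes created:", probesCreated)
}
```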
@debarshiray debarshiray force-pushed the wip/rishi/nvidia-take-avantage-of-new-APIs-00 branch from eee9336 to 581e2d5 Compare September 13, 2024 13:10
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 13, 2024
A new API was added to github.com/NVIDIA/go-nvlib 0.4.0 to specify a
logger to be used by an info.Interface [1].  Commit 649d02f
already bumped the required go-nvlib version to 0.6.0, so take advantage
of that.

[1] github.com/NVIDIA/go-nvlib commit 21c8f035ca66b29d
    NVIDIA/go-nvlib@21c8f035ca66b29d
    NVIDIA/go-nvlib#28

containers#1541
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 13, 2024
It's better to avoid abbreviations when the length of the string and the
depth of the indentation are favourable.

containers#1541
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 13, 2024
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 13, 2024

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/7d7dd171d6fc4552b13fa74d396e2c0e

✔️ unit-test SUCCESS in 5m 37s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 11s
✔️ unit-test-restricted SUCCESS in 5m 39s
✔️ system-test-fedora-rawhide SUCCESS in 1h 36m 39s
✔️ system-test-fedora-40 SUCCESS in 1h 34m 47s
✔️ system-test-fedora-39 SUCCESS in 1h 38m 46s

debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 13, 2024
The proprietary NVIDIA driver has a kernel space part and a user space
part, and their versions must always match.  Sometimes, the host
operating system might end up with mismatched parts.  One reason could
be that the different third-party repositories used to distribute the
driver are incompatible with each other.  For example, on Fedora these
could be RPM Fusion and NVIDIA's own repository.

This shows up in the systemd journal as:
  $ journalctl --dmesg
  ...
  kernel: NVRM: API mismatch: the client has the version 555.58.02, but
          NVRM: this kernel module has the version 560.35.03.  Please
          NVRM: make sure that this kernel module and all NVIDIA driver
          NVRM: components have the same version.
  ...

Without any special handling of this scenario, users would be presented
with a very misleading error:
  $ toolbox enter
  Error: failed to get Container Device Interface containerEdits for
      NVIDIA

Instead, improve the error message to be more self-documenting:
  $ toolbox enter
  Error: the proprietary NVIDIA driver's kernel and user space don't match
  Check the systemd journal and the contents of the operating system.

containers#1541
@debarshiray debarshiray changed the title [WIP] ... [WIP] cmd/run, pkg/nvidia: Detect mismatched NVIDIA kernel & user space driver Sep 13, 2024

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/1a9571d60c78424d9dce095a4ae65234

✔️ unit-test SUCCESS in 5m 42s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 24s
✔️ unit-test-restricted SUCCESS in 5m 33s
✔️ system-test-fedora-rawhide SUCCESS in 1h 48m 57s
✔️ system-test-fedora-40 SUCCESS in 1h 48m 12s
system-test-fedora-39 TIMED_OUT in 1h 50m 23s

debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 13, 2024
containers#1541
@debarshiray
Member Author

recheck

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/0cc9e0ebdb63406e85390af6ef604554

✔️ unit-test SUCCESS in 5m 35s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 36s
✔️ unit-test-restricted SUCCESS in 5m 29s
✔️ system-test-fedora-rawhide SUCCESS in 1h 37m 22s
✔️ system-test-fedora-40 SUCCESS in 1h 35m 09s
✔️ system-test-fedora-39 SUCCESS in 1h 41m 21s

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/60c2ec5eb1f94185853f62551542f4e3

✔️ unit-test SUCCESS in 5m 30s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 42s
✔️ unit-test-restricted SUCCESS in 5m 27s
✔️ system-test-fedora-rawhide SUCCESS in 1h 40m 31s
✔️ system-test-fedora-40 SUCCESS in 1h 39m 40s
✔️ system-test-fedora-39 SUCCESS in 1h 42m 20s

@debarshiray debarshiray changed the title [WIP] cmd/run, pkg/nvidia: Detect mismatched NVIDIA kernel & user space driver cmd/run, pkg/nvidia: Detect mismatched NVIDIA kernel & user space driver Sep 14, 2024

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/017f2cb610184ad5b0e9402727453497

✔️ unit-test SUCCESS in 5m 37s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 07s
✔️ unit-test-restricted SUCCESS in 5m 33s
✔️ system-test-fedora-rawhide SUCCESS in 1h 45m 03s
✔️ system-test-fedora-40 SUCCESS in 1h 43m 51s
✔️ system-test-fedora-39 SUCCESS in 1h 45m 20s

@debarshiray debarshiray force-pushed the wip/rishi/nvidia-take-avantage-of-new-APIs-00 branch from f77bbac to ce7a0d4 Compare September 14, 2024 18:02
@debarshiray debarshiray changed the title cmd/run, pkg/nvidia: Detect mismatched NVIDIA kernel & user space driver [WIP] cmd/run, pkg/nvidia: Detect mismatched NVIDIA kernel & user space driver Sep 14, 2024

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/15399988dc164677be40d626ee141e1c

✔️ unit-test SUCCESS in 5m 44s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 45s
✔️ unit-test-restricted SUCCESS in 5m 44s
✔️ system-test-fedora-rawhide SUCCESS in 1h 43m 30s
✔️ system-test-fedora-40 SUCCESS in 1h 48m 00s
✔️ system-test-fedora-39 SUCCESS in 1h 49m 45s

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/1e9b934cd90049ee8d509ac98fbb99da

✔️ unit-test SUCCESS in 5m 35s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 27s
✔️ unit-test-restricted SUCCESS in 5m 39s
✔️ system-test-fedora-rawhide SUCCESS in 1h 49m 34s
✔️ system-test-fedora-40 SUCCESS in 1h 48m 20s
✔️ system-test-fedora-39 SUCCESS in 1h 49m 35s

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/47de222f7e7b4cd6a08580ea16d368b8

✔️ unit-test SUCCESS in 5m 42s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 20s
✔️ unit-test-restricted SUCCESS in 5m 31s
✔️ system-test-fedora-rawhide SUCCESS in 1h 48m 51s
✔️ system-test-fedora-40 SUCCESS in 1h 44m 19s
✔️ system-test-fedora-39 SUCCESS in 1h 48m 22s

@debarshiray debarshiray changed the title [WIP] cmd/run, pkg/nvidia: Detect mismatched NVIDIA kernel & user space driver cmd/run, pkg/nvidia: Detect mismatched NVIDIA kernel & user space driver Sep 15, 2024

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/90ffb538e07f431ab3c2d0234223f323

✔️ unit-test SUCCESS in 5m 27s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 09s
✔️ unit-test-restricted SUCCESS in 5m 27s
✔️ system-test-fedora-rawhide SUCCESS in 1h 53m 18s
✔️ system-test-fedora-40 SUCCESS in 1h 48m 57s
system-test-fedora-39 TIMED_OUT in 1h 50m 29s

@debarshiray
Member Author

recheck

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/34ab4eb0f1314faf8fb01959310d53d5

✔️ unit-test SUCCESS in 5m 26s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 30s
✔️ unit-test-restricted SUCCESS in 5m 28s
✔️ system-test-fedora-rawhide SUCCESS in 1h 46m 49s
✔️ system-test-fedora-40 SUCCESS in 1h 46m 14s
✔️ system-test-fedora-39 SUCCESS in 1h 46m 32s

@debarshiray debarshiray merged commit 8dd2f8e into containers:main Sep 16, 2024
3 checks passed
@debarshiray debarshiray deleted the wip/rishi/nvidia-take-avantage-of-new-APIs-00 branch September 16, 2024 00:12