cmd/run, pkg/nvidia: Detect mismatched NVIDIA kernel & user space driver #1541
NVIDIA Container Toolkit 0.16.0 added a new API to avoid creating a new info.Interface when creating an nvcdi.Interface, if an info.Interface already exists [1]. Commit 649d02f already bumped the required NVIDIA Container Toolkit version to 0.16.0, so take advantage of that.

[1] NVIDIA Container Toolkit commit 8fc4b9c742f894ef
    NVIDIA/nvidia-container-toolkit@8fc4b9c742f894ef
    NVIDIA/nvidia-container-toolkit#516

containers#1541
eee9336 to 581e2d5
A new API was added to github.com/NVIDIA/go-nvlib 0.4.0 to specify a logger to be used by an info.Interface [1]. Commit 649d02f already bumped the required go-nvlib version to 0.6.0, so take advantage of that.

[1] github.com/NVIDIA/go-nvlib commit 21c8f035ca66b29d
    NVIDIA/go-nvlib@21c8f035ca66b29d
    NVIDIA/go-nvlib#28

containers#1541
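The two commits above both rely on option-style constructors: one passes a logger into the info object, the other hands an existing info object to the CDI object so that a second one is not created. A minimal sketch of that pattern follows. Every name in it (`infoLib`, `cdiLib`, `withLogger`, `withInfo`, `newInfo`, `newCDI`) is a hypothetical stand-in written for illustration; none of them is the real go-nvlib or NVIDIA Container Toolkit API.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
)

// infoLib is a hypothetical stand-in for an object such as an info.Interface.
type infoLib struct {
	logger *log.Logger
}

// infoOption configures an infoLib at construction time.
type infoOption func(*infoLib)

// withLogger is a stand-in option that injects the caller's logger.
func withLogger(l *log.Logger) infoOption {
	return func(i *infoLib) { i.logger = l }
}

// newInfo builds an infoLib; without options it logs to a discarding logger.
func newInfo(opts ...infoOption) *infoLib {
	i := &infoLib{logger: log.New(io.Discard, "", 0)}
	for _, opt := range opts {
		opt(i)
	}
	return i
}

// cdiLib is a hypothetical stand-in for an object such as an nvcdi.Interface.
type cdiLib struct {
	info *infoLib
}

// cdiOption configures a cdiLib at construction time.
type cdiOption func(*cdiLib)

// withInfo is a stand-in option that reuses an existing infoLib.
func withInfo(i *infoLib) cdiOption {
	return func(c *cdiLib) { c.info = i }
}

// newCDI builds a cdiLib and only creates its own infoLib if none was given.
func newCDI(opts ...cdiOption) *cdiLib {
	c := &cdiLib{}
	for _, opt := range opts {
		opt(c)
	}
	if c.info == nil {
		c.info = newInfo()
	}
	return c
}

func main() {
	logger := log.New(os.Stderr, "nvidia: ", 0)
	info := newInfo(withLogger(logger))
	cdi := newCDI(withInfo(info))

	// The CDI object shares the caller's info object instead of making a new one.
	fmt.Println(cdi.info == info) // prints true
}
```

The point of the pattern is that the caller keeps one info object, configured once with its logger, and every consumer reuses it.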
It's better to avoid abbreviations when the length of the string and the depth of the indentation are favourable.

containers#1541
The proprietary NVIDIA driver has a kernel space part and a user space part, and they must always have the same version. Sometimes, the host operating system might end up with mismatched parts. One reason could be that the different third-party repositories used to distribute the driver are incompatible with each other. E.g., in the case of Fedora, it could be RPM Fusion and NVIDIA's own repository.

This shows up in the systemd journal as:

    $ journalctl --dmesg
    ...
    kernel: NVRM: API mismatch: the client has the version 555.58.02, but
            NVRM: this kernel module has the version 560.35.03. Please
            NVRM: make sure that this kernel module and all NVIDIA driver
            NVRM: components have the same version.
    ...

Without any special handling of this scenario, users would be presented with a very misleading error:

    $ toolbox enter
    Error: failed to get Container Device Interface containerEdits for NVIDIA

Instead, improve the error message to be more self-documenting:

    $ toolbox enter
    Error: the proprietary NVIDIA driver's kernel and user space don't match
    Check the systemd journal and the contents of the operating system.

containers#1541
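When following the advice to check the systemd journal, the two versions can be pulled out of the NVRM message quoted in the commit message above. The sketch below is a diagnostic helper only and makes no claim about how this pull request detects the mismatch; `parseMismatch` and `mismatchRE` are hypothetical names, and the pattern assumes the message wording shown above.

```go
package main

import (
	"fmt"
	"regexp"
)

// mismatchRE matches the NVRM "API mismatch" message, which the kernel log
// splits across several lines. (?s) lets ".*?" span those line breaks.
var mismatchRE = regexp.MustCompile(
	`(?s)API mismatch: the client has the version (\d+(?:\.\d+)+), but` +
		`.*?this kernel module has the version (\d+(?:\.\d+)+)`)

// parseMismatch returns the user space (client) and kernel module versions
// found in the journal text, or ok == false if no mismatch message is present.
func parseMismatch(journal string) (client, kernel string, ok bool) {
	m := mismatchRE.FindStringSubmatch(journal)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	// Sample text copied from the journal excerpt in the commit message.
	journal := "kernel: NVRM: API mismatch: the client has the version 555.58.02, but\n" +
		"kernel: NVRM: this kernel module has the version 560.35.03. Please\n" +
		"kernel: NVRM: make sure that this kernel module and all NVIDIA driver\n" +
		"kernel: NVRM: components have the same version.\n"

	if client, kernel, ok := parseMismatch(journal); ok {
		fmt.Printf("user space %s, kernel module %s\n", client, kernel)
	}
}
```

In practice the input would come from something like the output of `journalctl --dmesg`, read by the caller and passed in as a string.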
The proprietary NVIDIA driver has a kernel space part and a user space part, and they must always have the same version. Sometimes, the host operating system might end up with mismatched parts. One reason could be that the different third-party repositories used to distribute the driver are incompatible with each other. E.g., in the case of Fedora, it could be RPM Fusion and NVIDIA's own repository.

This shows up in the systemd journal as:

    $ journalctl --dmesg
    ...
    kernel: NVRM: API mismatch: the client has the version 555.58.02, but
            NVRM: this kernel module has the version 560.35.03. Please
            NVRM: make sure that this kernel module and all NVIDIA driver
            NVRM: components have the same version.
    ...

Without any special handling of this scenario, users would be presented with a very misleading error:

    $ toolbox enter
    Error: failed to get Container Device Interface containerEdits for NVIDIA

Instead, improve the error message to be more self-documenting:

    $ toolbox enter
    Error: the proprietary NVIDIA driver's kernel and user space don't match
    Check the host operating system and systemd journal.

containers#1541
f77bbac to ce7a0d4
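The proprietary NVIDIA driver has a kernel space part and a user space
part, and they must always have the same version. Sometimes,
the host operating system might end up with mismatched parts. One
reason could be that the different third-party repositories used to
distribute the driver are incompatible with each other. E.g., in
the case of Fedora, it could be RPM Fusion and NVIDIA's own repository.

This shows up in the systemd journal as:

    $ journalctl --dmesg
    ...
    kernel: NVRM: API mismatch: the client has the version 555.58.02, but
            NVRM: this kernel module has the version 560.35.03. Please
            NVRM: make sure that this kernel module and all NVIDIA driver
            NVRM: components have the same version.
    ...

Without any special handling of this scenario, users would be presented
with a very misleading error:

    $ toolbox enter
    Error: failed to get Container Device Interface containerEdits for NVIDIA

Instead, improve the error message to be more self-documenting:

    $ toolbox enter
    Error: the proprietary NVIDIA driver's kernel and user space don't match
    Check the host operating system and systemd journal.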