Update nvml.py to CUDA 11; smi.py DeviceQuery to R460 (#33)
* Updated for enums and structs for CUDA 9, 10, 10.1, 10.2, 11
* Added Dockerfile for pytest coverage
* adding nvml header files a requirement docs for implementation
* Added bindings for some functions added in CUDA 11.0
* Added support for more MIG functions
* Merged nvml.py and smi.py for CUDA 11 and R460 driver
* Updated for CUDA 11
* Delete nvml_cuda100.h
* Delete nvml_cuda110.h
* Delete nvml_cuda92.h
* Delete nvml_cuda101.h
* Delete nvml_cuda102.h
* Delete nvml_cuda90.h
* Delete nvml_cuda91.h
Co-authored-by: Travis Hester <[email protected]>
-The fan speed value is the percent of maximum speed that the device's fan is currently intended to run at. It ranges from 0 to 100 %. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, this output will not match the actual fan speed. Many parts do not report fan speeds because they rely on cooling via fans in the surrounding enclosure.
+The fan speed value is the percent of the product's maximum noise tolerance fan speed that the device's fan is currently intended to run at. This value may exceed 100% in certain cases. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, this output will not match the actual fan speed. Many parts do not report fan speeds because they rely on cooling via fans in the surrounding enclosure.
 
 "pstate" or NVSMI_PSTATE
 The current performance state for the GPU. States range from P0 (maximum performance) to P12 (minimum performance).
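The fan speed and pstate descriptions above can be illustrated with a small hardware-independent sketch. It assumes the pstate arrives as a string such as "P8" and the fan speed as an integer percentage or `None` when the part does not report one; the helper names `parse_pstate` and `describe_fan_speed` are hypothetical and not part of nvml.py or smi.py.

```python
import re
from typing import Optional


def parse_pstate(value: str) -> int:
    """Return the numeric level of a pstate string such as 'P0' or 'P12'.

    Assumption: the value is formatted as 'P<number>'.
    """
    match = re.fullmatch(r"P(\d+)", value.strip())
    if match is None:
        raise ValueError(f"unrecognized pstate: {value!r}")
    level = int(match.group(1))
    # The help text documents P0 (maximum) through P12 (minimum performance).
    if not 0 <= level <= 12:
        raise ValueError(f"pstate out of documented range: {value!r}")
    return level


def describe_fan_speed(percent: Optional[int]) -> str:
    """Format an intended fan speed without clamping it to 100%."""
    if percent is None:
        # Many parts rely on enclosure fans and report no fan speed.
        return "not reported"
    if percent < 0:
        raise ValueError("fan speed cannot be negative")
    if percent > 100:
        # The updated text says the value may exceed 100% in certain cases.
        return f"{percent}% (above the noise-tolerance maximum)"
    return f"{percent}%"
```

Not clamping at 100 is the point of the sketch: code written against the older wording might reject or truncate values that the newer wording explicitly allows.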
@@ -210,6 +210,9 @@ NVIDIA GPUs can provide error counts for various types of ECC errors. Some ECC e
 "ecc.errors.corrected.volatile.device_memory" or NVSMI_ECC_ERROR_CORRECTED_VOLATILE_DEV_MEM
 Errors detected in global device memory.
 
+"ecc.errors.corrected.volatile.dram" or NVSMI_ECC_ERROR_CORRECTED_VOLATILE_DRAM
+Errors detected in global device memory.
+
 "ecc.errors.corrected.volatile.register_file" or NVSMI_ECC_ERROR_CORRECTED_VOLATILE_REGFILE
 Errors detected in register file memory.
 
@@ -222,12 +225,21 @@ Errors detected in the L2 cache.
 "ecc.errors.corrected.volatile.texture_memory" or NVSMI_ECC_ERROR_CORRECTED_VOLATILE_TEXTURE
 Parity errors detected in texture memory.
 
+"ecc.errors.corrected.volatile.cbu" or NVSMI_ECC_ERROR_CORRECTED_VOLATILE_CBU
+Parity errors detected in CBU.
+
+"ecc.errors.corrected.volatile.sram" or NVSMI_ECC_ERROR_CORRECTED_VOLATILE_SRAM
+Errors detected in global SRAMs.
+
 "ecc.errors.corrected.volatile.total" or NVSMI_ECC_ERROR_CORRECTED_VOLATILE_TOTAL
-Total errors detected across entire chip. Sum of device_memory, register_file, l1_cache, l2_cache and texture_memory.
+Total errors detected across entire chip.
 
 "ecc.errors.corrected.aggregate.device_memory" or NVSMI_ECC_ERROR_CORRECTED_AGGREGATE_DEV_MEM
 Errors detected in global device memory.
 
+"ecc.errors.corrected.aggregate.dram" or NVSMI_ECC_ERROR_CORRECTED_AGGREGATE_DRAM
+Errors detected in global device memory.
+
 "ecc.errors.corrected.aggregate.register_file" or NVSMI_ECC_ERROR_CORRECTED_AGGREGATE_REGFILE
 Errors detected in register file memory.
 
@@ -240,12 +252,21 @@ Errors detected in the L2 cache.
 "ecc.errors.corrected.aggregate.texture_memory" or NVSMI_ECC_ERROR_CORRECTED_AGGREGATE_TEXTURE
 Parity errors detected in texture memory.
 
+"ecc.errors.corrected.aggregate.cbu" or NVSMI_ECC_ERROR_CORRECTED_AGGREGATE_CBU
+Parity errors detected in CBU.
+
+"ecc.errors.corrected.aggregate.sram" or NVSMI_ECC_ERROR_CORRECTED_AGGREGATE_SRAM
+Errors detected in global SRAMs.
+
 "ecc.errors.corrected.aggregate.total" or NVSMI_ECC_ERROR_CORRECTED_AGGREGATE_TOTAL
 Total errors detected across entire chip. Sum of device_memory, register_file, l1_cache, l2_cache and texture_memory.
 
 "ecc.errors.uncorrected.volatile.device_memory" or NVSMI_ECC_ERROR_UNCORRECTED_VOLATILE_DEV_MEM
 Errors detected in global device memory.
 
+"ecc.errors.uncorrected.volatile.dram" or NVSMI_ECC_ERROR_UNCORRECTED_VOLATILE_DRAM
+Errors detected in global device memory.
+
 "ecc.errors.uncorrected.volatile.register_file" or NVSMI_ECC_ERROR_UNCORRECTED_VOLATILE_REGFILE
 Errors detected in register file memory.
 
@@ -258,12 +279,21 @@ Errors detected in the L2 cache.
 "ecc.errors.uncorrected.volatile.texture_memory" or NVSMI_ECC_ERROR_UNCORRECTED_VOLATILE_TEXTURE
 Parity errors detected in texture memory.
 
+"ecc.errors.uncorrected.volatile.cbu" or NVSMI_ECC_ERROR_UNCORRECTED_VOLATILE_CBU
+Parity errors detected in CBU.
+
+"ecc.errors.uncorrected.volatile.sram" or NVSMI_ECC_ERROR_UNCORRECTED_VOLATILE_SRAM
+Errors detected in global SRAMs.
+
 "ecc.errors.uncorrected.volatile.total" or NVSMI_ECC_ERROR_UNCORRECTED_VOLATILE_TOTAL
-Total errors detected across entire chip. Sum of device_memory, register_file, l1_cache, l2_cache and texture_memory.
+Total errors detected across entire chip.
 
 "ecc.errors.uncorrected.aggregate.device_memory" or NVSMI_ECC_ERROR_UNCORRECTED_AGGREGATE_DEV_MEM
 Errors detected in global device memory.
 
+"ecc.errors.uncorrected.aggregate.dram" or NVSMI_ECC_ERROR_UNCORRECTED_AGGREGATE_DRAM
+Errors detected in global device memory.
+
 "ecc.errors.uncorrected.aggregate.register_file" or NVSMI_ECC_ERROR_UNCORRECTED_AGGREGATE_REGFILE
 Errors detected in register file memory.
 
@@ -276,8 +306,14 @@ Errors detected in the L2 cache.
 "ecc.errors.uncorrected.aggregate.texture_memory" or NVSMI_ECC_ERROR_UNCORRECTED_AGGREGATE_TEXTURE
 Parity errors detected in texture memory.
 
+"ecc.errors.uncorrected.aggregate.cbu" or NVSMI_ECC_ERROR_UNCORRECTED_AGGREGATE_CBU
+Parity errors detected in CBU.
+
+"ecc.errors.uncorrected.aggregate.sram" or NVSMI_ECC_ERROR_UNCORRECTED_AGGREGATE_SRAM
+Errors detected in global SRAMs.
+
 "ecc.errors.uncorrected.aggregate.total" or NVSMI_ECC_ERROR_UNCORRECTED_AGGREGATE_TOTAL
-Total errors detected across entire chip. Sum of device_memory, register_file, l1_cache, l2_cache and texture_memory.
+Total errors detected across entire chip.
 
 Section about retired_pages properties
 NVIDIA GPUs can retire pages of GPU device memory when they become unreliable. This can happen when multiple single bit ECC errors occur for the same page, or on a double bit ECC error. When a page is retired, the NVIDIA driver will hide it such that no driver, or application memory allocations can access it.
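The ECC property names in the hunks above follow a regular pattern: correction type, counter lifetime, then memory location. As a rough sketch, the full set of query strings could be generated rather than typed by hand. The function names below are hypothetical, and the `l1_cache` and `l2_cache` locations are assumed from the older "Sum of ..." wording, since their own entries fall outside the hunks shown here.

```python
from itertools import product

# Components of the property names as they appear in the help text.
CORRECTION = ("corrected", "uncorrected")
LIFETIME = ("volatile", "aggregate")
LOCATIONS = (
    "device_memory",
    "dram",            # added in this commit
    "register_file",
    "l1_cache",        # assumed from the "Sum of ..." wording
    "l2_cache",        # assumed from the "Sum of ..." wording
    "texture_memory",
    "cbu",             # added in this commit
    "sram",            # added in this commit
    "total",
)


def ecc_property_names():
    """Build every 'ecc.errors.<correction>.<lifetime>.<location>' name."""
    return [
        f"ecc.errors.{correction}.{lifetime}.{location}"
        for correction, lifetime, location in product(CORRECTION, LIFETIME, LOCATIONS)
    ]


def ecc_query_string():
    """Join the names into one comma-separated query string."""
    return ",".join(ecc_property_names())
```

The sketch requests `total` directly instead of summing the per-location counters, because most of the updated descriptions no longer define the total as a sum of five specific locations. Whether a given query entry point (for example the DeviceQuery mentioned in the commit title) accepts every generated name is not confirmed here.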
@@ -360,6 +396,17 @@ Maximum frequency of SM (Streaming Multiprocessor) clock.
 "clocks.max.memory" or "clocks.max.mem" or NVSMI_CLOCKS_MEMORY_MAX
 Maximum frequency of memory clock.
 
+Section about mig.mode properties
+A flag that indicates whether MIG mode is enabled. May be either "Enabled" or "Disabled". Changes to MIG mode require a GPU reset.
+
+"mig.mode.current" or NVSMI_MIG_MODE_CURRENT
+The MIG mode that the GPU is currently operating under.
+
+"mig.mode.pending" or NVSMI_MIG_MODE_PENDING
+The MIG mode that the GPU will operate under after reset.
+
+Section about commands to return additional properties
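Because a MIG mode change only takes effect after a GPU reset, comparing the current and pending values shows whether a reset is still outstanding. The following is a minimal sketch under the assumption that both properties are returned as the strings "Enabled" or "Disabled", as the help text describes; `mig_reset_needed` is a hypothetical helper, not a function in nvml.py or smi.py.

```python
VALID_MIG_MODES = {"Enabled", "Disabled"}


def mig_reset_needed(current: str, pending: str) -> bool:
    """Return True when the pending MIG mode differs from the current one.

    Assumption: both arguments are the strings reported for
    "mig.mode.current" and "mig.mode.pending".
    """
    for label, value in (("current", current), ("pending", pending)):
        if value not in VALID_MIG_MODES:
            raise ValueError(f"unexpected {label} MIG mode: {value!r}")
    # A difference means the requested change has not been applied yet.
    return current != pending
```

For example, `mig_reset_needed("Disabled", "Enabled")` indicates that MIG has been requested but the GPU has not yet been reset.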