Commit 20e79a8

Update nvml.py to CUDA 11; smi.py DeviceQuery to R460 (#33)
* Updated for enums and structs for CUDA 9, 10, 10.1, 10.2, 11
* Added Dockerfile for pytest coverage
* adding nvml header files a requirement docs for implementation
* Added bindings for some functions added in CUDA 11.0
* Added support for more MIG functions
* Merged nvml.py and smi.py for CUDA 11 and R460 driver
* Updated for CUDA 11
* Delete nvml_cuda100.h
* Delete nvml_cuda110.h
* Delete nvml_cuda92.h
* Delete nvml_cuda101.h
* Delete nvml_cuda102.h
* Delete nvml_cuda90.h
* Delete nvml_cuda91.h

Co-authored-by: Travis Hester <[email protected]>
1 parent 7c78212 commit 20e79a8

9 files changed, +2838 -867 lines changed

LICENSE.txt

+1-1
@@ -1,4 +1,4 @@
-Copyright (c) 2011-2020, NVIDIA Corporation.
+Copyright (c) 2011-2021, NVIDIA Corporation.
 All rights reserved.

 Redistribution and use in source and binary forms, with or without

README.md

+6
@@ -47,6 +47,12 @@ nvsmi = nvidia_smi.getInstance()
 nvsmi.DeviceQuery('memory.free, memory.total')
 ```

+```python
+from pynvml.smi import nvidia_smi
+nvsmi = nvidia_smi.getInstance()
+print(nvsmi.DeviceQuery('--help-query-gpu'), end='\n')
+```
+
 Functions
 ---------
 Python methods wrap NVML functions, implemented in a C shared library.
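
The README example above passes a comma-separated list of field names to `DeviceQuery`. As a minimal sketch, the snippet below builds such a string for the `dram`, `cbu`, and `sram` ECC fields that this commit adds to `help_query_gpu.txt`. The helper name `build_ecc_query` is hypothetical and not part of pynvml; only the field names are taken from the help text in this commit.

```python
from itertools import product

# Field-name parts copied from the help_query_gpu.txt entries added in this commit.
SEVERITIES = ("corrected", "uncorrected")
COUNTERS = ("volatile", "aggregate")
NEW_LOCATIONS = ("dram", "cbu", "sram")


def build_ecc_query(locations=NEW_LOCATIONS):
    """Hypothetical helper: join every severity/counter/location name into one query string."""
    names = [
        f"ecc.errors.{severity}.{counter}.{location}"
        for severity, counter, location in product(SEVERITIES, COUNTERS, locations)
    ]
    # Same ", " separator as the README's 'memory.free, memory.total' example.
    return ", ".join(names)


query = build_ecc_query()
print(query)
```

On a machine with an NVIDIA driver, the resulting string could presumably be passed as `nvsmi.DeviceQuery(query)`. Whether a given pynvml release accepts every one of these names is an assumption to verify against its `--help-query-gpu` output.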

docker/Dockerfile

+39
@@ -0,0 +1,39 @@
+##############################################################
+# This Dockerfile contains the additional NVIDIA compilers,
+# libraries, and plugins to enable OpenACC and NVIDIA GPU
+# acceleration of Devito codes.
+#
+# BUILD: docker build --network=host --file docker/Dockerfile --tag pynvml .
+# RUN: docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 pynvml
+##############################################################
+FROM python:3.6
+
+ENV DEBIAN_FRONTEND noninteractive
+
+ADD ./requirements.txt /app/requirements.txt
+
+RUN pip install --no-cache-dir --upgrade pip && \
+pip install --no-cache-dir -r /app/requirements.txt && \
+rm -rf ~/.cache/pip
+
+ADD ./pynvml /app/pynvml
+ADD ./setup.py /app/
+ADD ./setup.cfg /app/
+ADD ./README.md /app/
+ADD ./PKG-INFO /app/
+ADD ./MANIFEST.in /app/
+ADD ./help_query_gpu.txt /app/
+ADD docker/entrypoint.sh /docker-entrypoint.sh
+
+RUN chmod +x /docker-entrypoint.sh
+
+## Create App user
+# Set the home directory to our app user's home.
+ENV HOME=/app
+ENV APP_HOME=/app
+
+WORKDIR /app
+
+EXPOSE 8888
+ENTRYPOINT ["/docker-entrypoint.sh"]
+CMD ["bash"]

docker/entrypoint.sh

+8
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+
+find /app -type f -name '*.pyc' -delete
+
+export PATH=/venv/bin:$PATH
+export PYTHONPATH=$PYTHONPATH:/app
+
+exec "$@"

help_query_gpu.txt

File mode: 100755 → 100644
+51-4
@@ -107,7 +107,7 @@ The GOM currently in use.
 The GOM that will be used on the next reboot.

 "fan.speed" or NVSMI_FAN_SPEED
-The fan speed value is the percent of maximum speed that the device's fan is currently intended to run at. It ranges from 0 to 100 %. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, this output will not match the actual fan speed. Many parts do not report fan speeds because they rely on cooling via fans in the surrounding enclosure.
+The fan speed value is the percent of the product's maximum noise tolerance fan speed that the device's fan is currently intended to run at. This value may exceed 100% in certain cases. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, this output will not match the actual fan speed. Many parts do not report fan speeds because they rely on cooling via fans in the surrounding enclosure.

 "pstate" or NVSMI_PSTATE
 The current performance state for the GPU. States range from P0 (maximum performance) to P12 (minimum performance).
@@ -210,6 +210,9 @@ NVIDIA GPUs can provide error counts for various types of ECC errors. Some ECC e
 "ecc.errors.corrected.volatile.device_memory" or NVSMI_ECC_ERROR_CORRECTED_VOLATILE_DEV_MEM
 Errors detected in global device memory.

+"ecc.errors.corrected.volatile.dram" or NVSMI_ECC_ERROR_CORRECTED_VOLATILE_DRAM
+Errors detected in global device memory.
+
 "ecc.errors.corrected.volatile.register_file" or NVSMI_ECC_ERROR_CORRECTED_VOLATILE_REGFILE
 Errors detected in register file memory.

@@ -222,12 +225,21 @@ Errors detected in the L2 cache.
 "ecc.errors.corrected.volatile.texture_memory" or NVSMI_ECC_ERROR_CORRECTED_VOLATILE_TEXTURE
 Parity errors detected in texture memory.

+"ecc.errors.corrected.volatile.cbu" or NVSMI_ECC_ERROR_CORRECTED_VOLATILE_CBU
+Parity errors detected in CBU.
+
+"ecc.errors.corrected.volatile.sram" or NVSMI_ECC_ERROR_CORRECTED_VOLATILE_SRAM
+Errors detected in global SRAMs.
+
 "ecc.errors.corrected.volatile.total" or NVSMI_ECC_ERROR_CORRECTED_VOLATILE_TOTAL
-Total errors detected across entire chip. Sum of device_memory, register_file, l1_cache, l2_cache and texture_memory.
+Total errors detected across entire chip.

 "ecc.errors.corrected.aggregate.device_memory" or NVSMI_ECC_ERROR_CORRECTED_AGGREGATE_DEV_MEM
 Errors detected in global device memory.

+"ecc.errors.corrected.aggregate.dram" or NVSMI_ECC_ERROR_CORRECTED_AGGREGATE_DRAM
+Errors detected in global device memory.
+
 "ecc.errors.corrected.aggregate.register_file" or NVSMI_ECC_ERROR_CORRECTED_AGGREGATE_REGFILE
 Errors detected in register file memory.

@@ -240,12 +252,21 @@ Errors detected in the L2 cache.
 "ecc.errors.corrected.aggregate.texture_memory" or NVSMI_ECC_ERROR_CORRECTED_AGGREGATE_TEXTURE
 Parity errors detected in texture memory.

+"ecc.errors.corrected.aggregate.cbu" or NVSMI_ECC_ERROR_CORRECTED_AGGREGATE_CBU
+Parity errors detected in CBU.
+
+"ecc.errors.corrected.aggregate.sram" or NVSMI_ECC_ERROR_CORRECTED_AGGREGATE_SRAM
+Errors detected in global SRAMs.
+
 "ecc.errors.corrected.aggregate.total" or NVSMI_ECC_ERROR_CORRECTED_AGGREGATE_TOTAL
 Total errors detected across entire chip. Sum of device_memory, register_file, l1_cache, l2_cache and texture_memory.

 "ecc.errors.uncorrected.volatile.device_memory" or NVSMI_ECC_ERROR_UNCORRECTED_VOLATILE_DEV_MEM
 Errors detected in global device memory.

+"ecc.errors.uncorrected.volatile.dram" or NVSMI_ECC_ERROR_UNCORRECTED_VOLATILE_DRAM
+Errors detected in global device memory.
+
 "ecc.errors.uncorrected.volatile.register_file" or NVSMI_ECC_ERROR_UNCORRECTED_VOLATILE_REGFILE
 Errors detected in register file memory.

@@ -258,12 +279,21 @@ Errors detected in the L2 cache.
 "ecc.errors.uncorrected.volatile.texture_memory" or NVSMI_ECC_ERROR_UNCORRECTED_VOLATILE_TEXTURE
 Parity errors detected in texture memory.

+"ecc.errors.uncorrected.volatile.cbu" or NVSMI_ECC_ERROR_UNCORRECTED_VOLATILE_CBU
+Parity errors detected in CBU.
+
+"ecc.errors.uncorrected.volatile.sram" or NVSMI_ECC_ERROR_UNCORRECTED_VOLATILE_SRAM
+Errors detected in global SRAMs.
+
 "ecc.errors.uncorrected.volatile.total" or NVSMI_ECC_ERROR_UNCORRECTED_VOLATILE_TOTAL
-Total errors detected across entire chip. Sum of device_memory, register_file, l1_cache, l2_cache and texture_memory.
+Total errors detected across entire chip.

 "ecc.errors.uncorrected.aggregate.device_memory" or NVSMI_ECC_ERROR_UNCORRECTED_AGGREGATE_DEV_MEM
 Errors detected in global device memory.

+"ecc.errors.uncorrected.aggregate.dram" or NVSMI_ECC_ERROR_UNCORRECTED_AGGREGATE_DRAM
+Errors detected in global device memory.
+
 "ecc.errors.uncorrected.aggregate.register_file" or NVSMI_ECC_ERROR_UNCORRECTED_AGGREGATE_REGFILE
 Errors detected in register file memory.

@@ -276,8 +306,14 @@ Errors detected in the L2 cache.
 "ecc.errors.uncorrected.aggregate.texture_memory" or NVSMI_ECC_ERROR_UNCORRECTED_AGGREGATE_TEXTURE
 Parity errors detected in texture memory.

+"ecc.errors.uncorrected.aggregate.cbu" or NVSMI_ECC_ERROR_UNCORRECTED_AGGREGATE_CBU
+Parity errors detected in CBU.
+
+"ecc.errors.uncorrected.aggregate.sram" or NVSMI_ECC_ERROR_UNCORRECTED_AGGREGATE_SRAM
+Errors detected in global SRAMs.
+
 "ecc.errors.uncorrected.aggregate.total" or NVSMI_ECC_ERROR_UNCORRECTED_AGGREGATE_TOTAL
-Total errors detected across entire chip. Sum of device_memory, register_file, l1_cache, l2_cache and texture_memory.
+Total errors detected across entire chip.

 Section about retired_pages properties
 NVIDIA GPUs can retire pages of GPU device memory when they become unreliable. This can happen when multiple single bit ECC errors occur for the same page, or on a double bit ECC error. When a page is retired, the NVIDIA driver will hide it such that no driver, or application memory allocations can access it.
@@ -360,6 +396,17 @@ Maximum frequency of SM (Streaming Multiprocessor) clock.
 "clocks.max.memory" or "clocks.max.mem" or NVSMI_CLOCKS_MEMORY_MAX
 Maximum frequency of memory clock.

+Section about mig.mode properties
+A flag that indicates whether MIG mode is enabled. May be either "Enabled" or "Disabled". Changes to MIG mode require a GPU reset.
+
+"mig.mode.current" or NVSMI_MIG_MODE_CURRENT
+The MIG mode that the GPU is currently operating under.
+
+"mig.mode.pending" or NVSMI_MIG_MODE_PENDING
+The MIG mode that the GPU will operate under after reset.
+
+Section about commands to return additional properties
+
 "supported-clocks" or NVSMI_CLOCKS_SUPPORTED
 List of supported clocks.

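The entries in `help_query_gpu.txt` follow a regular shape: a line of one or more quoted field names joined by "or" and ending in an `NVSMI_*` constant, followed by a description. As a rough sketch, the parser below turns that layout into a dictionary keyed by field name. The function name `parse_help_entries` and the inline `SAMPLE` text are illustrative assumptions; this is not how pynvml itself reads the file.

```python
import re

# Two entries copied from the help text shown in this diff.
SAMPLE = '''\
Section about mig.mode properties
"mig.mode.current" or NVSMI_MIG_MODE_CURRENT
The MIG mode that the GPU is currently operating under.

"clocks.max.memory" or "clocks.max.mem" or NVSMI_CLOCKS_MEMORY_MAX
Maximum frequency of memory clock.
'''


def parse_help_entries(text):
    """Hypothetical helper: map each quoted field name to its description."""
    entries = {}
    current_names = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith('"'):
            # Header line: collect every quoted alias for this field.
            current_names = re.findall(r'"([^"]+)"', stripped)
            for name in current_names:
                entries[name] = ""
        elif stripped and current_names:
            # Description line belonging to the most recent header.
            for name in current_names:
                entries[name] = (entries[name] + " " + stripped).strip()
        else:
            # Blank line or section heading: the current entry is finished.
            current_names = []
    return entries


entries = parse_help_entries(SAMPLE)
for name, description in entries.items():
    print(f"{name}: {description}")
```

Aliases such as `clocks.max.mem` share the description of their primary name, and "Section about ..." headings are skipped because they do not start with a quote.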