Quick Start and Basic Operation
DGX A100 System
DU-09821-001_v06
Perform the following steps to run a health check on the DGX A100 System and to
verify the Docker and NVIDIA driver installation.
1. Establish an SSH connection to the DGX A100 System.
2. Run a basic system check.
$ sudo nvsm show health
Verify that the output summary shows that all checks are Healthy and that the overall
system status is Healthy.
3. Verify that Docker is installed by viewing the installed Docker version.
$ sudo docker --version
This should return a version string such as “Docker version 19.03.5-ce”; the actual version
depends on the specific release of the DGX OS Server software.
4. Verify connection to the NVIDIA repository and that the NVIDIA Driver is installed.
$ sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:11.0-base nvidia-smi
Docker pulls the nvidia/cuda container image layer by layer, then runs nvidia-smi.
When completed, the output should show the NVIDIA Driver version and a description of
each installed GPU.
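The checks above can be chained in a small wrapper script. The following is a minimal sketch and is not part of the DGX OS software: the function names run_step and health_check are hypothetical, and the script assumes that each command exits with a non-zero status on failure. It does not parse the NVSM summary, so you should still confirm that every check reports Healthy.

```shell
#!/usr/bin/env bash
# Sketch: run the health-check commands in order and stop at the first failure.
# Assumption: each command exits non-zero on failure. The nvsm summary is not
# parsed here; the operator still has to read it.

# Run one named step, print PASS or FAIL, and return the command's exit status.
run_step() {
    local name="$1"
    shift
    if "$@"; then
        echo "PASS: ${name}"
    else
        local status=$?
        echo "FAIL: ${name} (exit ${status})"
        return "${status}"
    fi
}

# Run steps 2 to 4 of the procedure; step 1 (SSH login) is assumed done.
health_check() {
    run_step "NVSM health" sudo nvsm show health || return 1
    run_step "Docker version" sudo docker --version || return 1
    run_step "GPU container" sudo docker run --gpus all --rm \
        nvcr.io/nvidia/cuda:11.0-base nvidia-smi || return 1
}
```

After sourcing the file in an SSH session on the system, calling health_check runs the three commands and stops at the first one that fails.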
4.6 Running a Preflight Stress Test
NVIDIA recommends running the preflight stress test before putting a system into a
production environment and after servicing. You can choose which components to test (GPUs, CPU,
memory, and storage) and how long the tests run.
To run the tests, use NVSM.
Syntax:
$ sudo nvsm stress-test [--usage] [--force] [--no-prompt] [<test>...]
    [DURATION]
Getting Help
$ sudo nvsm stress-test --usage
Recommended Test to Run
The following command tests all components (GPU, CPU, memory, storage) and takes about
20 minutes to complete:
$ sudo nvsm stress-test --force
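Because the test runs for about 20 minutes, it can be useful to keep a copy of its output. The following is a minimal sketch of one way to do that. The function names stress_log_path and run_stress_test, the LOG_DIR variable, and the log file naming scheme are all hypothetical choices and are not features of NVSM.

```shell
#!/usr/bin/env bash
# Sketch: run the recommended stress test and save its output to a log file.
# LOG_DIR and the file name pattern are hypothetical, not NVSM settings.

# Build a log path such as ./nvsm-stress-20240101-120000.log
# $1: directory (default: current directory)
# $2: timestamp (default: current date and time)
stress_log_path() {
    local dir="${1:-.}"
    local stamp="${2:-$(date +%Y%m%d-%H%M%S)}"
    echo "${dir}/nvsm-stress-${stamp}.log"
}

# Run the recommended command from this section and copy its output to the log.
run_stress_test() {
    local log
    log="$(stress_log_path "${LOG_DIR:-.}")"
    echo "Logging to ${log}"
    sudo nvsm stress-test --force 2>&1 | tee "${log}"
    # Return the exit status of nvsm rather than that of tee.
    return "${PIPESTATUS[0]}"
}
```

Calling run_stress_test with LOG_DIR set to a writable directory leaves a timestamped log there while still showing the output on the terminal.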