Refer to the output:
~ $ sudo nvsm show healthinfo
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5
Checks
BIOS Revision [5.11].........................
DGX Serial Number [YSY72800016]..................
Verify installed DIMM memory sticks........................Healthy
...[output truncated]
Verify Ethernet controllers...........................Healthy
Verify installed GPU's..............................Unhealthy
Checking output of 'lspci' for expected GPU's
Missing GPU at PCI address '07:00.0'
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
...[output truncated]
What insights can a system administrator gain regarding the DGX system's health?
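One way to read a report like the one above is to scan for checks marked Unhealthy and collect the detail lines printed under them. The following is a minimal Python sketch, assuming the report has already been captured as plain text and that each check line has the form "name, run of dots, status" seen in the sample. The function name `unhealthy_checks` and the regular expression are illustrative assumptions, not part of any NVIDIA tool.

```python
import re

# Sample text copied from the health report in the question above.
REPORT = """\
Verify installed DIMM memory sticks........................Healthy
Verify Ethernet controllers...........................Healthy
Verify installed GPU's..............................Unhealthy
Checking output of 'lspci' for expected GPU's
Missing GPU at PCI address '07:00.0'
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
"""

# Assumed layout: "<check name><three or more dots><status>".
CHECK_LINE = re.compile(r"^(?P<name>.+?)\.{3,}(?P<status>Healthy|Unhealthy)$")


def unhealthy_checks(report):
    """Return {check name: [detail lines]} for every check marked Unhealthy."""
    failures = {}
    current = None  # Unhealthy check whose detail lines are being collected
    for raw in report.splitlines():
        line = raw.strip()
        match = CHECK_LINE.match(line)
        if match:
            # A new check line starts; only track it if it failed.
            current = match["name"] if match["status"] == "Unhealthy" else None
            if current is not None:
                failures[current] = []
        elif current is not None and line and not line.startswith("..."):
            # Non-check lines after a failed check are treated as its details.
            failures[current].append(line)
    return failures


if __name__ == "__main__":
    for name, details in unhealthy_checks(REPORT).items():
        print(name)
        for detail in details:
            print("  " + detail)
```

For the sample report, the only entry returned is the GPU check, with the `lspci` note and the missing PCI address '07:00.0' as its details; every other listed component is reported Healthy.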
A financial services firm is deploying an AI model for fraud detection that requires rapid inference and data retrieval across multiple sites. Which feature should their storage system prioritize?
During a multi-day NeMo burn-in, intermittent "GPU fell off bus" errors occur. Which diagnostic approach isolates hardware faults?
A system administrator receives an alert about a potential hardware fault on an NVIDIA DGX A100. The GPU performance seems degraded, and the system fans are operating loudly. What step should be recommended to identify and troubleshoot the hardware fault?
After initial setup and health checks, the DGX H100 system administrator wants to verify that containers can access GPUs before running production workloads. Which method is recommended for this validation?
What information does the 'ibnodes' command display?
During multi-node HPL burn-in, GPUs show uneven utilization. Which configuration ensures balanced workload distribution?
A system administrator needs to install a GPU or DPU in a server. The server has a free PCIe slot, enough free PCIe lanes, and enough room for the card. Which procedure should be followed?
A user encounters "permission denied" errors when running GPU-accelerated containers on a Secure Boot-enabled system. What resolves this?
During cluster deployment, the UFM Cable Validation Tool reports "Wrong-neighbor" errors on multiple InfiniBand links. What is the most efficient way to resolve this issue?