Open
Labels: nvidia (NVIDIA related)
Description
Describe the bug
When an AMI is created in EC2 Image Builder that installs the NVIDIA CUDA driver, the driver does not work. nvidia-smi outputs this error:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I haven't been able to reproduce this outside of EC2 Image Builder.
To Reproduce
Steps to reproduce the behavior:
- Create an EC2 Image Builder pipeline using base image ssm:/aws/service/ami-amazon-linux-latest/al2023-ami-minimal-kernel-6.12-x86_64 and a 4 GB EBS volume
- Add the update-linux build component
- Add this build component, which installs the NVIDIA driver and the minimum CUDA 13.1 components needed for hashcat:
  # https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/amazon-linux.html
  schemaVersion: 1.0
  phases:
    - name: build
      steps:
        - name: DnfAddRepo
          action: ExecuteBash
          inputs:
            commands:
              - dnf config-manager --add-repo=https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo
        - name: DnfInstall
          action: ExecuteBash
          inputs:
            commands:
              - dnf -y module enable nvidia-driver:latest-dkms
              - dnf -y install cuda-cudart-13-1 cuda-nvrtc-devel-13-1 nvidia-driver-cuda
- Build an AMI
- Launch any GPU instance from the AMI and run nvidia-smi
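When nvidia-smi fails on the launched instance, a few standard checks might help narrow down where the driver setup stops. The sketch below is not from the original report and does not assert a root cause. The function name report_nvidia_state is hypothetical, and the script assumes only that uname is available; dkms and lsmod are used if present.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic helper (not part of the original report).
# Prints the running kernel, any DKMS module state, and whether an
# nvidia kernel module is currently loaded.
report_nvidia_state() {
  echo "running kernel: $(uname -r)"

  # dkms status lists each registered module and the kernels it was built for.
  if command -v dkms >/dev/null 2>&1; then
    dkms status || true
  else
    echo "dkms: not installed"
  fi

  # lsmod shows loaded modules; a missing lsmod is treated as "not loaded".
  if lsmod 2>/dev/null | grep -q '^nvidia'; then
    echo "nvidia module: loaded"
  else
    echo "nvidia module: not loaded"
  fi
}

report_nvidia_state
```

Comparing the kernel version printed on the first line with the kernels listed by dkms status is one way to see whether a module was built for the kernel the instance actually booted.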
Expected behavior
The NVIDIA driver should function.
Additional context
al2023-ami-minimal-2023.9.20251208.0-kernel-6.12-x86_64: Latest good version
al2023-ami-minimal-2023.10.20260105.0-kernel-6.12-x86_64: Known bad version
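To check which side of this window a given base image falls on, the release string can be pulled out of the AMI name and compared. This is a minimal sketch, assuming AMI names follow the pattern of the two names above and that GNU sort (with -V) is available. The helper names ami_release and release_newer_than are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; the names are not from the report.

# Extract the release string (e.g. 2023.9.20251208.0) from an
# al2023-ami-minimal-<release>-kernel-<version>-<arch> AMI name.
ami_release() {
  local name="$1"
  name="${name#al2023-ami-minimal-}"   # strip the fixed prefix
  echo "${name%%-kernel-*}"            # strip the kernel/arch suffix
}

# Succeed when release $1 is strictly newer than release $2.
# sort -V orders 2023.9 before 2023.10, unlike a plain string sort.
release_newer_than() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

last_good="$(ami_release al2023-ami-minimal-2023.9.20251208.0-kernel-6.12-x86_64)"
candidate="$(ami_release al2023-ami-minimal-2023.10.20260105.0-kernel-6.12-x86_64)"

if release_newer_than "$candidate" "$last_good"; then
  echo "$candidate is newer than last known good $last_good"
fi
```

A version-aware comparison matters here because the minor release moves from 9 to 10, which a lexical comparison would order incorrectly.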