GPU Direct Access Speeds Up Deep Learning Training

The BaM system lets GPU threads access NVMe SSDs directly, taking the CPU out of the data path. The aim is to remove the storage bottleneck when training deep learning models whose data does not fit in GPU memory.

As of 19/05/2026, the BaM (Big accelerator Memory) system architecture remains a core reference for high-throughput data processing in deep learning. Originally detailed in the proceedings of ACM ASPLOS 2023, the system is a software-hardware co-design that lets GPUs initiate direct, on-demand access to NVMe SSDs via PCIe Peer-to-Peer (P2P) transfers.

The system removes the CPU from the data-loading path: GPU threads work out which storage blocks they need and issue the I/O requests for those blocks themselves.
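
The index-to-block step can be illustrated with a small CPU-side sketch. This is not BaM code: the block size, element size, and function names are assumptions chosen for illustration, and a real implementation would run this logic inside GPU threads.

```python
# Toy CPU-side model: which storage block would each "thread" request?
# BLOCK_SIZE and ELEM_SIZE are assumed values, not BaM constants.
BLOCK_SIZE = 4096   # bytes per storage block (assumption)
ELEM_SIZE = 4       # bytes per element, e.g. one float32 (assumption)


def block_request(elem_index: int) -> tuple[int, int]:
    """Return (block_id, byte_offset_in_block) for one element index."""
    byte_offset = elem_index * ELEM_SIZE
    return byte_offset // BLOCK_SIZE, byte_offset % BLOCK_SIZE


def requests_for_threads(indices: list[int]) -> dict[int, tuple[int, int]]:
    """Map each hypothetical thread ID to the block request it would issue."""
    return {tid: block_request(idx) for tid, idx in enumerate(indices)}


if __name__ == "__main__":
    for tid, (blk, off) in requests_for_threads([0, 1023, 1024, 5000]).items():
        print(f"thread {tid}: block {blk}, offset {off}")
```

With these assumed sizes, 1,024 elements share one block, so neighbouring indices usually resolve to the same block ID.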

Technical Implementation & Availability

The architecture relies on specific kernel-level integrations to map block storage into a GPU-addressable memory space. Developers can leverage the bam::array abstraction to implement this in custom workloads.
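
To give a feel for what an array abstraction over block storage does, here is a minimal CPU-side sketch in Python. It is not the bam::array API, whose signatures are not shown in this article. The class BlockBackedArray, the helper make_demo_file, the block size, and the LRU cache policy are all hypothetical choices made for illustration.

```python
import os
import struct
import tempfile
from collections import OrderedDict

BLOCK_SIZE = 4096  # bytes per block (assumption)


class BlockBackedArray:
    """Hypothetical read-only float32 array backed by a file, read block by block."""

    def __init__(self, path: str, cache_blocks: int = 2):
        self._f = open(path, "rb")
        self._cache = OrderedDict()   # block_id -> bytes, kept in LRU order
        self._cache_blocks = cache_blocks
        self.block_reads = 0          # counts reads that reached "storage"

    def _block(self, block_id: int) -> bytes:
        if block_id in self._cache:
            self._cache.move_to_end(block_id)      # cache hit
            return self._cache[block_id]
        self._f.seek(block_id * BLOCK_SIZE)        # cache miss: read one block
        data = self._f.read(BLOCK_SIZE)
        self.block_reads += 1
        self._cache[block_id] = data
        if len(self._cache) > self._cache_blocks:
            self._cache.popitem(last=False)        # evict least recently used
        return data

    def __getitem__(self, i: int) -> float:
        block_id, off = divmod(i * 4, BLOCK_SIZE)  # 4 bytes per float32
        return struct.unpack_from("<f", self._block(block_id), off)[0]

    def close(self) -> None:
        self._f.close()


def make_demo_file(n: int) -> str:
    """Write n float32 values 0.0, 1.0, ... to a temporary file; return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(struct.pack(f"<{n}f", *map(float, range(n))))
    return path


if __name__ == "__main__":
    path = make_demo_file(4096)
    arr = BlockBackedArray(path)
    print(arr[0], arr[1], arr[2048], arr.block_reads)
    arr.close()
    os.remove(path)
```

The caller indexes the array as if it were in memory, and the abstraction decides when a block actually has to be fetched. In this sketch, reading elements 0 and 1 costs a single block read because both live in block 0.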

  • System Requirements: An x86 platform is required, along with specific NVIDIA driver kernel modules and headers.

  • Dependency Chain: The project builds upon earlier work, specifically the SmartIO framework for device sharing.

  • Availability: The reproduction package is maintained as an open repository, providing the kernel module, support libraries, and micro-benchmarks.

Feature        Description
Mechanism      GPU-initiated I/O via PCIe P2P
Abstraction    bam::array
Primary Goal   Scaling Deep Learning Recommendation Models (DLRM)
Validation     NYC Taxi dataset / Benchmarks included in repository

Contextual Significance

The proliferation of Deep Learning Recommendation Models has forced hardware architectures to move beyond traditional tiered memory models. Because these models often involve massive embedding tables that exceed GPU memory capacity, standard I/O throughput often creates a stall point in the training pipeline.
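
A quick back-of-the-envelope calculation shows how easily an embedding table can outgrow GPU memory. The row count, embedding dimension, and GPU capacity below are hypothetical figures chosen for illustration, not numbers taken from the BaM work.

```python
def embedding_table_bytes(rows: int, dim: int, bytes_per_value: int = 4) -> int:
    """Size of a dense embedding table: rows x dim x bytes per value."""
    return rows * dim * bytes_per_value


if __name__ == "__main__":
    # All numbers below are hypothetical, chosen only for illustration.
    table = embedding_table_bytes(rows=1_000_000_000, dim=128)  # float32 values
    gpu_memory = 80 * 1024**3                                   # an 80 GiB GPU
    print(f"table: {table / 1024**3:.1f} GiB")
    print(f"fits in GPU memory: {table <= gpu_memory}")
```

Under these assumptions, a single table of one billion 128-dimensional float32 rows needs 512 GB (roughly 477 GiB), several times the capacity of the assumed GPU.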

"The artifact is the source code [for] the BaM system that enables efficient, on-demand accesses to storage from GPU thread." — Project Documentation

The research first appeared as a 2022 pre-print, and its publication at ASPLOS 2023 established it as a reference point for work on GPU-initiated storage access. By moving I/O request management from the operating system kernel running on the CPU to the GPU threads themselves, the system aims to reduce the latency of fetching training data from local non-volatile storage. However, adoption remains gated by specific hardware requirements and by the need to manage PCIe-level data integrity outside of traditional OS buffers.

Frequently Asked Questions

Q: What is the BaM system and what does it do?
The BaM system is a software-hardware co-design that lets GPUs access NVMe SSDs directly over PCIe peer-to-peer transfers. It removes the CPU from the data-loading path, so GPU threads can fetch the data they need on demand.

Q: How does the BaM system make deep learning faster?
By letting GPUs access SSDs directly, it shortens the time spent loading data for training deep learning models. This matters because large models often need more data than fits in GPU memory.

Q: Who can use the BaM system and what is needed?
Developers can use the BaM system on x86 machines with specific NVIDIA driver kernel modules and headers. The code is available in an open repository for use in custom projects.

Q: Why is direct GPU access to storage important for deep learning?
Many deep learning models, especially recommendation models, use embedding tables that exceed GPU memory capacity. The BaM system targets the slow data loading that can stall training by allowing GPUs to fetch data directly from fast SSDs.