As of 19/05/2026, the BaM (Big accelerator Memory) system architecture remains a core reference for high-throughput data processing in deep learning. Originally detailed in the ASPLOS 2023 proceedings, BaM is a software-hardware co-design that lets GPUs initiate direct, on-demand accesses to NVMe SSDs over PCIe Peer-to-Peer (P2P) transfers.
The system removes the CPU from the data-loading path by letting GPU threads address NVMe blocks by ID and issue the I/O requests themselves.
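One common way for many threads to submit requests without a central coordinator is a shared queue whose tail is advanced atomically. The sketch below is a host-side C++ analogy of that idea, not BaM code: `ReadCommand`, `SubmissionQueue`, and `submit_from_threads` are hypothetical names, CPU threads stand in for GPU threads, and the queue has no wraparound or completion handling. A real NVMe submitter would also write a doorbell register to notify the controller, which is omitted here.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical read command: which block to fetch and which thread asked for it.
struct ReadCommand {
    std::uint64_t block_id;
    std::uint32_t thread_id;
};

// Fixed-capacity queue that many threads can append to concurrently.
class SubmissionQueue {
public:
    explicit SubmissionQueue(std::size_t capacity) : slots_(capacity), tail_(0) {}

    // Claim a slot with one atomic increment, then fill it.
    // Returns false when the queue is already full (no wraparound in this sketch).
    bool submit(const ReadCommand& cmd) {
        const std::size_t slot = tail_.fetch_add(1, std::memory_order_relaxed);
        if (slot >= slots_.size()) return false;
        slots_[slot] = cmd;  // each slot is written by exactly one thread
        return true;
    }

    // Copy out the accepted commands. Call only after all submitters have finished.
    std::vector<ReadCommand> drain() const {
        const std::size_t n = std::min(tail_.load(), slots_.size());
        return std::vector<ReadCommand>(
            slots_.begin(), slots_.begin() + static_cast<std::ptrdiff_t>(n));
    }

private:
    std::vector<ReadCommand> slots_;
    std::atomic<std::size_t> tail_;
};

// Launch num_threads workers that each request one block, with no coordinator.
std::vector<ReadCommand> submit_from_threads(std::uint32_t num_threads) {
    SubmissionQueue queue(num_threads);
    std::vector<std::thread> workers;
    for (std::uint32_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&queue, t] {
            // Illustrative mapping only: thread t asks for block 8 * t.
            queue.submit(ReadCommand{std::uint64_t{t} * 8, t});
        });
    }
    for (auto& w : workers) w.join();  // join makes all slot writes visible
    return queue.drain();
}
```

The order of commands in the queue depends on thread scheduling, so callers should rely only on each request appearing exactly once, not on its position.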
Technical Implementation & Availability
The architecture relies on a custom kernel module to make block storage addressable from the GPU. Developers can use the bam::array abstraction to access storage-backed data from custom workloads.
System Requirements: An x86 platform is required, along with specific NVIDIA driver kernel modules and headers.
Dependency Chain: The project builds upon earlier work, specifically the SmartIO framework for device sharing.
Availability: The reproduction package is maintained as an open repository, providing the kernel module, support libraries, and micro-benchmarks.
| Feature | Description |
|---|---|
| Mechanism | GPU-initiated I/O via PCIe P2P |
| Abstraction | bam::array |
| Primary Goal | Scaling Deep Learning Recommendation Models (DLRM) |
| Validation | NYC Taxi dataset / Benchmarks included in repository |
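
To make the array-over-storage idea concrete, the following is a minimal host-side C++ sketch of an array view that maps an element index to a block ID and an offset, fetching blocks on demand through a small software cache. It is not the bam::array API: `MockBlockDevice` and `BlockBackedArray` are hypothetical names, the device is an in-memory buffer rather than an NVMe SSD, and the direct-mapped cache policy is an assumption chosen for brevity. In BaM the analogous accesses originate from GPU threads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an NVMe namespace: fixed-size blocks in host memory.
class MockBlockDevice {
public:
    MockBlockDevice(std::size_t num_blocks, std::size_t block_bytes)
        : block_bytes_(block_bytes), data_(num_blocks * block_bytes, 0) {}

    // Copy one whole block into dst and count the access.
    void read_block(std::uint64_t block_id, std::uint8_t* dst) {
        assert((block_id + 1) * block_bytes_ <= data_.size());
        std::memcpy(dst, data_.data() + block_id * block_bytes_, block_bytes_);
        ++reads_;
    }

    std::uint8_t* raw() { return data_.data(); }  // backdoor for filling test data
    std::size_t block_bytes() const { return block_bytes_; }
    std::size_t reads() const { return reads_; }

private:
    std::size_t block_bytes_;
    std::vector<std::uint8_t> data_;
    std::size_t reads_ = 0;
};

// Read-only array view over the device with a direct-mapped software cache.
template <typename T>
class BlockBackedArray {
public:
    BlockBackedArray(MockBlockDevice& dev, std::size_t cache_lines)
        : dev_(dev),
          per_block_(dev.block_bytes() / sizeof(T)),
          tags_(cache_lines, UINT64_MAX),  // UINT64_MAX marks an empty line
          lines_(cache_lines * dev.block_bytes()) {
        assert(per_block_ > 0 && cache_lines > 0);
    }

    T operator[](std::size_t index) {
        const std::uint64_t block_id = index / per_block_;  // block holding the element
        const std::size_t offset = index % per_block_;      // position inside that block
        const std::size_t line = block_id % tags_.size();   // direct-mapped cache slot
        std::uint8_t* buf = lines_.data() + line * dev_.block_bytes();
        if (tags_[line] != block_id) {                      // miss: fetch on demand
            dev_.read_block(block_id, buf);
            tags_[line] = block_id;
        }
        T value;
        std::memcpy(&value, buf + offset * sizeof(T), sizeof(T));
        return value;
    }

private:
    MockBlockDevice& dev_;
    std::size_t per_block_;
    std::vector<std::uint64_t> tags_;
    std::vector<std::uint8_t> lines_;
};
```

With 512-byte blocks and 4-byte elements, indices 0 through 127 share block 0, so repeated reads in that range cost a single device access; a read at index 128 touches block 1 and triggers a second one. The caller sees ordinary indexing while the cache decides when storage is actually read.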
Contextual Significance
The proliferation of Deep Learning Recommendation Models has pushed hardware architectures beyond traditional tiered memory models. Because these models frequently involve massive embedding tables that exceed GPU memory capacity, limited I/O throughput can become a stall point in the training pipeline.
"The artifact is the source code [for] the BaM system that enables efficient, on-demand accesses to storage from GPU thread." — Project Documentation
The research first appeared as a 2022 pre-print, and its publication at ASPLOS 2023 established it as a reference point for work on GPU-initiated storage access. By shifting I/O request management from the operating system kernel on the CPU to GPU threads, the system aims to reduce the latency of fetching training data from local non-volatile storage. However, adoption remains gated by specific hardware requirements and by the need to manage PCIe-level data integrity outside of traditional OS buffers.