NVIDIA has released SANA-WM, an open-source world model with 2.6 billion parameters. The model generates up to one minute of 720p video from a single image and a camera trajectory, and it runs on a single GPU. The release signals a move toward more accessible, yet capable, video-generation tools.
Key Technical Advancements
SANA-WM distinguishes itself by replacing many of the standard attention blocks with a frame-wise Gated DeltaNet (GDN). Unlike the token-wise GDN used in language models, the frame-wise variant processes an entire latent frame at each recurrent step, which is central to the model's efficiency.
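The idea can be illustrated with a toy sketch of the gated delta rule applied frame-wise. This is not NVIDIA's implementation: the function names, gate values, and shapes are all illustrative, and the per-token inner loop is just a simple serialization of the frame update.

```python
import numpy as np

def gated_delta_update(S, k, v, alpha, beta):
    # One gated-delta-rule write: decay the state with gate alpha,
    # erase the old value stored along key k, then write the new
    # key/value association. S is a (d_k, d_v) matrix-valued memory.
    S = alpha * (S - beta * np.outer(k, k @ S))  # gated decay + erase
    return S + beta * np.outer(k, v)             # write new association

def framewise_gdn(q, k, v, alpha=0.95, beta=0.5):
    # q, k: (frames, tokens, d_k); v: (frames, tokens, d_v).
    # The recurrence runs over *frames* rather than tokens: every token
    # in a frame reads the state left by all previous frames, then the
    # whole frame is folded into the state in one recurrent step.
    d_k, d_v = k.shape[-1], v.shape[-1]
    S = np.zeros((d_k, d_v))
    outputs = []
    for q_f, k_f, v_f in zip(q, k, v):
        outputs.append(q_f @ S)          # all frame tokens read one shared state
        for k_t, v_t in zip(k_f, v_f):   # fold the frame into the state
            S = gated_delta_update(S, k_t, v_t, alpha, beta)
    return np.stack(outputs)             # (frames, tokens, d_v)
```

Because the recurrent state advances once per frame instead of once per token, the cost of generating another frame stays flat as the video grows, which is what makes minute-long rollouts on a single GPU plausible.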
The training regimen for SANA-WM involved a multi-stage process:
The primary diffusion transformer (DiT) was trained with a four-stage progressive schedule taking roughly 15 days; in total, training consumed approximately 18.5 days on 64 H100 GPUs.
The dataset comprised 212,975 public video clips.
Early stages adapted pre-trained SANA-Video models to the frame-wise GDN structure on shorter clips.
Stage 3, which took around 8 days, extended training to sequences of up to 961 frames (60 seconds) and incorporated Dual-Branch Camera Control.
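The staged progression above can be summarized as a simple schedule structure. Only the figures stated in the article are real (64 H100s, ~18.5 days total, ~15 days for the four DiT stages, ~8 days for Stage 3 at 961 frames); the grouping and field names are placeholders.

```python
# Illustrative summary of the reported SANA-WM training schedule.
# Unstated per-stage durations are left as None rather than guessed.
schedule = {
    "hardware": "64x H100",
    "dataset_clips": 212_975,
    "stages": [
        {"name": "early_stages",
         "desc": "adapt pre-trained SANA-Video to frame-wise GDN on shorter clips",
         "days": None},
        {"name": "stage_3",
         "desc": "extend to 961 frames (60 s); add Dual-Branch Camera Control",
         "days": 8},
    ],
    "dit_total_days": 15,
    "total_days": 18.5,
}
```

One takeaway from the breakdown is that the long-video stage alone accounts for over half of the DiT training time, underlining how costly minute-scale sequence lengths are even with an efficient recurrent backbone.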
Performance and Efficiency Claims
NVIDIA reports substantial efficiency gains: paired with a refiner, SANA-WM reportedly achieves 36 times the throughput of models such as LingBot-World while maintaining comparable visual-quality scores on VBench, and it does so with lower computational requirements.
Open-Source Availability and Context
SANA-WM's open-source release makes the technology available for broader experimentation and application development. The project is part of a larger family of SANA models developed by NVIDIA, including those focused on high-resolution image and video synthesis, and the code and project details are accessible via repositories on GitHub and Hugging Face.
Background on SANA Models
The SANA suite, developed by NVIDIA, emphasizes efficient generation of high-resolution content. Previous iterations, such as SANA-Video and LongSANA, explored efficient video generation using techniques like Block Linear Attention. The broader SANA project aims for high-quality image and video synthesis that can be deployed even on modest hardware, such as a laptop GPU, with the apparent goal of lowering the cost of content creation.