WhisperX and a 24B LLM Share One RTX 3090 GPU with 24GB VRAM

A new recipe lets WhisperX and a large language model run side by side on a single RTX 3090 GPU. This is like fitting two big programs into one small box.

A recently published method describes how to run the WhisperX large-v3 model alongside a 24B-parameter large language model (LLM) on a single RTX 3090 GPU with 24GB of VRAM. It centers on a "context-capping recipe" that manages video memory (VRAM) by limiting the size of the LLM's KV cache. The result is simultaneous high-quality speech-to-text transcription and LLM text processing, all within the confines of a consumer-grade GPU.

The core of the recipe is calculating and bounding the KV cache, the memory in which an LLM stores attention keys and values for every token in its context. Because this cache grows with context length, capping the context caps the cache. In the described setup it is held to about 1.25GB, which lets the combined models fit. This configuration supports full speech-to-text (STT) quality from WhisperX while the LLM runs email triage in parallel. The system reportedly keeps approximately 2GB of headroom on the GPU, which suggests a carefully tuned balance between model requirements and available memory.
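
The KV cache arithmetic can be sketched in a few lines. The architecture numbers below (40 layers, 8 key/value heads, head dimension 128, 16-bit cache entries) are assumptions based on the published configuration of Mistral's 24B "Small" family, not values stated in the report, and the function name is illustrative.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_tokens, n_layers=40, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """Bytes needed to cache attention keys and values for n_tokens.

    Defaults are ASSUMED architecture values for a 24B Mistral-family
    model with a 16-bit cache; they are not taken from the report.
    """
    # Factor 2: each layer stores one key tensor and one value tensor.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens

if __name__ == "__main__":
    # Compare a capped 8K window with larger context settings.
    for ctx in (8192, 32768, 131072):
        print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / GIB:.2f} GiB")
```

Under these assumptions an 8,192-token window comes to 1.25 GiB, which is consistent with the reported figure, while a 32K window would need 5 GiB. That difference is why capping the context is what makes room for WhisperX.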

Technical Details and Workflow

The described workflow uses an LLM identified as a Mistral variant with a "native 8K window." That window holds a batch of 10 emails for triage plus 2K tokens for generation without exhausting VRAM. The prompt example provided is: "Analyze my background email triage." The model file is named Modelfile.triage and references FROM devstral-small:latest.
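
The token budget behind that batch size can be checked with a short sketch. It assumes the Modelfile is an Ollama Modelfile (the FROM line suggests this, but the report does not say so), treats 8K and 2K as 8192 and 2048 tokens, and reserves a hypothetical 256 tokens for triage instructions. The function names are illustrative.

```python
CONTEXT_WINDOW = 8192     # "native 8K window" (assumed to mean 8192 tokens)
GENERATION_TOKENS = 2048  # "2K tokens for generation"
BATCH_SIZE = 10           # emails per triage batch

def per_email_budget(context=CONTEXT_WINDOW, generation=GENERATION_TOKENS,
                     batch=BATCH_SIZE, instructions=256):
    """Average tokens available per email in one batch.

    `instructions` is a HYPOTHETICAL allowance for the triage prompt.
    """
    prompt_budget = context - generation - instructions
    if prompt_budget <= 0:
        raise ValueError("generation and instructions exceed the context")
    return prompt_budget // batch

def render_modelfile(base="devstral-small:latest", context=CONTEXT_WINDOW,
                     generation=GENERATION_TOKENS):
    """Return Modelfile text that caps context and output length.

    ASSUMES Ollama, whose Modelfile format accepts PARAMETER num_ctx
    and PARAMETER num_predict.
    """
    lines = [
        f"FROM {base}",
        f"PARAMETER num_ctx {context}",
        f"PARAMETER num_predict {generation}",
    ]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(per_email_budget(), "tokens per email on average")
    print(render_modelfile())
```

With these numbers each email gets a little under 600 tokens on average, so long threads would need truncating or summarizing before they are batched.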

WhisperX itself offers features beyond basic transcription. It provides word-level timestamps and supports speaker diarization, which identifies the different speakers within an audio file. For more accurate timestamps, users can choose larger alignment models, at the cost of more GPU memory. The language-specific alignment model can be detected automatically or set manually, and output formats include srt, vtt, txt, tsv, and json. The whisperx tool can also apply word highlighting, which wraps each word in <u> tags as it is spoken in SRT/VTT outputs, and it exposes thresholds for detecting the start and end of speech (vad_onset, vad_offset).
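
As a rough illustration of how those options combine, the sketch below assembles a whisperx command line without running it. The flag names follow the whisperx CLI as I understand it; the helper function, its defaults, and the file name are hypothetical.

```python
import shlex

# Output formats named in the article.
OUTPUT_FORMATS = {"srt", "vtt", "txt", "tsv", "json"}

def build_whisperx_command(audio_path, output_format="srt", language=None,
                           align_model=None, diarize=False,
                           highlight_words=False, vad_onset=None,
                           vad_offset=None):
    """Build (but do not run) an argv list for the whisperx CLI."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    cmd = ["whisperx", audio_path, "--model", "large-v3",
           "--output_format", output_format]
    if language:
        cmd += ["--language", language]        # skip auto-detection
    if align_model:
        cmd += ["--align_model", align_model]  # larger model, more VRAM
    if diarize:
        cmd.append("--diarize")                # speaker labels
    if highlight_words:
        cmd += ["--highlight_words", "True"]   # <u> tags in SRT/VTT
    if vad_onset is not None:
        cmd += ["--vad_onset", str(vad_onset)]
    if vad_offset is not None:
        cmd += ["--vad_offset", str(vad_offset)]
    return cmd

if __name__ == "__main__":
    cmd = build_whisperx_command("meeting.wav", language="en",
                                 highlight_words=True, vad_onset=0.5)
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine with whisperx installed. Diarization typically also requires a Hugging Face access token, which this sketch leaves out.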

Performance and Hardware Considerations

WhisperX has been reported to reach speeds of up to 70x real-time with its large-v3 model. At that rate, an hour of audio would be transcribed in under a minute. While the primary focus is on single-GPU operation, alternative hardware setups exist for handling larger files or more demanding tasks. For instance, a WhisperX-A40-Large variant runs on an NVIDIA A40 GPU, offering more GPU memory at a higher cost.

Efforts to optimize WhisperX for specific hardware environments are ongoing. The NVIDIA Triton Inference Server, in conjunction with TensorRT-LLM, is being explored for deploying multimodal models, including Whisper. This involves building optimized TensorRT-LLM engines for both the encoder and decoder components of the Whisper model and configuring Triton server parameters for efficient batching and memory management. Parameters such as MAX_ATTENTION_WINDOW_SIZE, MAX_TOKENS_IN_KV_CACHE, and KV_CACHE_FREE_GPU_MEM_FRACTION are critical in these configurations.
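
To show how two of these limits might interact, here is a minimal model of the KV cache sizing. It assumes that the free-memory fraction and the token cap each bound the cache, and that the smaller bound wins when both are set. That is my reading of the TensorRT-LLM backend documentation, not something the report states. The function, its argument names, and the 0.9 default are illustrative and not part of any Triton or TensorRT-LLM API.

```python
def kv_cache_token_capacity(free_gpu_bytes, bytes_per_token,
                            free_mem_fraction=0.9, max_tokens=None):
    """Estimate how many tokens a KV cache can hold.

    free_mem_fraction mirrors KV_CACHE_FREE_GPU_MEM_FRACTION and
    max_tokens mirrors MAX_TOKENS_IN_KV_CACHE. ASSUMPTION: when both
    are set, the more restrictive limit applies.
    """
    if not 0.0 < free_mem_fraction <= 1.0:
        raise ValueError("free_mem_fraction must be in (0, 1]")
    # Tokens that fit in the allowed share of free GPU memory.
    by_fraction = int(free_gpu_bytes * free_mem_fraction) // bytes_per_token
    if max_tokens is None:
        return by_fraction
    return min(by_fraction, max_tokens)

if __name__ == "__main__":
    # Hypothetical numbers: 4 GiB free after loading, 160 KiB per token.
    free_bytes = 4 * 1024 ** 3
    per_token = 160 * 1024
    print(kv_cache_token_capacity(free_bytes, per_token, 0.5))
    print(kv_cache_token_capacity(free_bytes, per_token, 0.5, max_tokens=8192))
```

Setting an explicit token cap is the closer analogue to the context-capping recipe, since it bounds the cache regardless of how much memory happens to be free at startup.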

Another notable integration involves the vLLM library, which has demonstrated Whisper, and in particular its encoder-decoder structure, on multimodal tasks. This work highlights the potential for real-time inference and flexible prompt engineering with models like openai/whisper-large-v3, with separate control over audio and text prompts.

Background

The challenge of fitting large AI models, particularly those with multimodal capabilities like speech recognition and language understanding, onto consumer hardware has been a significant barrier. The 24GB VRAM found in cards like the RTX 3090 represents a popular, albeit limited, frontier for local AI deployments. Techniques such as quantization, model pruning, and efficient memory management strategies, like the KV cache optimization detailed in the primary report, are crucial for overcoming these hardware limitations. WhisperX, building on OpenAI's Whisper model, focuses on enhancing transcription accuracy with word-level timestamps and speaker identification, making it a powerful tool for audio processing tasks. The integration with LLMs broadens its utility into more complex analytical and generative workflows.

Frequently Asked Questions

Q: How can WhisperX and a large language model run on just one 3090 GPU with 24GB VRAM?
A new method uses a 'context-capping recipe' to carefully control the memory (VRAM) used by the models. It specifically manages the KV cache size, keeping it small (around 1.25GB) so both AI models can fit and work together on the GPU.
Q: What does this new method allow people to do with their 3090 GPU?
This technique lets users run high-quality speech-to-text (like from WhisperX) and complex text tasks (like email sorting by an LLM) at the same time. It works even though the GPU has limited memory, leaving about 2GB free.
Q: What are the technical details of this WhisperX and LLM integration on a 3090 GPU?
The system uses a Mistral-based LLM with a native 8K context window, which keeps the KV cache small (about 1.25GB). This setup allows WhisperX to provide accurate transcriptions with word-level timestamps and speaker identification, while the LLM processes emails or other text tasks.
Q: How fast can WhisperX work, and what are the hardware limits?
WhisperX can be very fast, working up to 70 times faster than real-time with its large-v3 model. While this method works on a 3090 GPU, more powerful GPUs like the A40 exist for very large tasks, but they cost more.
Q: Are there other ways to make AI models like Whisper work better on limited hardware?
Yes, developers are exploring tools like NVIDIA Triton Inference Server and TensorRT-LLM to make AI models more efficient. Libraries like vLLM also help manage memory and improve performance for tasks that use both audio and text.