A recently described method shows how to run the WhisperX large-v3 model alongside a 24B-parameter large language model (LLM) on a single RTX 3090 GPU with 24GB of VRAM. The method centers on a "context-capping recipe" that manages VRAM usage by limiting the LLM's context length and, with it, the size of the KV cache. The result is simultaneous high-quality speech-to-text transcription and LLM text processing on one consumer-grade GPU.
The core of the recipe is calculating the size of the KV cache, the memory an LLM uses to store attention keys and values for every token in its context. Because the cache grows linearly with context length, capping the context caps the cache. With the cache held to roughly 1.25GB in the described setup, both models fit in memory together. The configuration supports full speech-to-text (STT) quality from WhisperX while the LLM runs email triage in parallel. The system reportedly retains about 2GB of VRAM headroom.
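The KV cache arithmetic can be sketched in a few lines of Python. The layer count, KV head count, and head dimension below are assumptions based on the published configuration of Mistral's 24B models, not values given in the report; 16-bit cache values and an 8,192-token context are likewise assumed. Under those assumptions the result matches the roughly 1.25GB figure.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV cache size in bytes for a single sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


# Assumed architecture values (not stated in the report).
ASSUMED_LAYERS = 40
ASSUMED_KV_HEADS = 8
ASSUMED_HEAD_DIM = 128

size = kv_cache_bytes(ASSUMED_LAYERS, ASSUMED_KV_HEADS, ASSUMED_HEAD_DIM, context_tokens=8192)
print(f"{size / 2**30:.2f} GiB")  # prints "1.25 GiB"
```

Because the formula is linear in context length, quadrupling the window to 32K tokens would quadruple the cache to about 5GiB, which is why capping the context is the main lever in this setup.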

Technical Details and Workflow
The described workflow uses a Mistral-family LLM with what the recipe calls a "native 8K window." That window accommodates a 10-email triage batch plus 2K tokens for generation without exhausting VRAM. The example prompt provided is: "Analyze my background email triage." The model file is named Modelfile.triage and specifies its base model with the line FROM devstral-small:latest.
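A minimal sketch of the token budget and the corresponding model file follows. It assumes "8K" means 8,192 tokens and "2K" means 2,048; the helper names are hypothetical, and the report does not show the actual contents of Modelfile.triage beyond its FROM line. The num_ctx and num_predict parameters are standard Ollama Modelfile settings, used here as one plausible way to express the cap.

```python
def per_email_budget(context_tokens, generation_tokens, batch_size):
    """Tokens available per email once generation space is reserved."""
    prompt_tokens = context_tokens - generation_tokens
    if prompt_tokens <= 0:
        raise ValueError("generation budget leaves no room for the prompt")
    return prompt_tokens // batch_size


def render_modelfile(base_model, context_tokens, generation_tokens):
    """Render an Ollama-style Modelfile that caps context and output length."""
    lines = [
        f"FROM {base_model}",
        f"PARAMETER num_ctx {context_tokens}",      # context window cap
        f"PARAMETER num_predict {generation_tokens}",  # max generated tokens
    ]
    return "\n".join(lines) + "\n"


print(per_email_budget(8192, 2048, 10))  # prints 614
print(render_modelfile("devstral-small:latest", 8192, 2048))
```

The per-email figure is an upper bound: any system prompt or triage instructions also consume part of the 6,144 prompt tokens.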
WhisperX offers features beyond basic transcription. It provides word-level timestamps and supports speaker diarization, which identifies the different speakers in an audio file. For better timestamp accuracy, users can select larger alignment models at the cost of more GPU memory. Alignment models are language-specific: WhisperX can pick one automatically from the detected language, or the user can set one manually. Supported output formats include srt, vtt, txt, tsv, and json. The whisperx tool can also highlight words in SRT/VTT output by wrapping each word in <u> tags as it is spoken, and it exposes thresholds for detecting where speech starts and ends (vad_onset, vad_offset).
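The options above map onto command-line flags of the whisperx tool. The sketch below only assembles a command line and does not run it; the helper name is hypothetical, the threshold values are illustrative, and the flag spellings should be checked against whisperx --help for the installed version.

```python
import shlex

OUTPUT_FORMATS = {"srt", "vtt", "txt", "tsv", "json"}


def build_whisperx_command(audio_path, output_format="srt", language=None,
                           highlight_words=False, vad_onset=None,
                           vad_offset=None, diarize=False):
    """Assemble a whisperx CLI invocation as an argument list."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    cmd = ["whisperx", audio_path, "--model", "large-v3",
           "--output_format", output_format]
    if language is not None:
        cmd += ["--language", language]  # skip automatic language detection
    if highlight_words:
        cmd += ["--highlight_words", "True"]  # <u> tags in SRT/VTT output
    if vad_onset is not None:
        cmd += ["--vad_onset", str(vad_onset)]  # speech start threshold
    if vad_offset is not None:
        cmd += ["--vad_offset", str(vad_offset)]  # speech end threshold
    if diarize:
        cmd.append("--diarize")  # speaker labels
    return cmd


cmd = build_whisperx_command("meeting.wav", output_format="vtt",
                             highlight_words=True, vad_onset=0.5,
                             vad_offset=0.363, diarize=True)
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the command safe to pass to subprocess.run without shell quoting issues.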

Performance and Hardware Considerations
WhisperX is reported to reach transcription speeds of up to 70x real-time using batched inference. The project's documentation cites that figure for large-v2, so throughput with large-v3 may differ. While the primary focus here is single-GPU operation, other hardware setups exist for larger files or heavier workloads. For instance, a WhisperX-A40-Large variant runs on an NVIDIA A40 GPU, offering more memory at a higher cost.
Efforts to optimize Whisper-based pipelines for specific hardware environments are ongoing. The NVIDIA Triton Inference Server, together with TensorRT-LLM, is being explored for deploying multimodal models, including Whisper. This involves building optimized TensorRT-LLM engines for both the encoder and decoder components of the Whisper model and configuring Triton server parameters for efficient batching and memory management. Parameters such as MAX_ATTENTION_WINDOW_SIZE, MAX_TOKENS_IN_KV_CACHE, and KV_CACHE_FREE_GPU_MEM_FRACTION are critical in these configurations, since they bound how much GPU memory the KV cache may claim.
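How a token cap and a memory-fraction cap might interact can be illustrated with a simplified model. This is not the server's allocator: the function name and all numbers are hypothetical, and the sketch assumes that when both limits are set, the smaller one wins, and that the fraction applies to the GPU memory left free after the model loads.

```python
def kv_cache_token_capacity(free_bytes_after_load, bytes_per_token,
                            free_mem_fraction=None, max_tokens=None):
    """Estimate how many tokens a KV cache can hold under two optional caps."""
    limits = []
    if free_mem_fraction is not None:
        # Cap by a fraction of the memory left after weights are loaded.
        budget = int(free_bytes_after_load * free_mem_fraction)
        limits.append(budget // bytes_per_token)
    if max_tokens is not None:
        # Cap by an explicit token count.
        limits.append(max_tokens)
    if not limits:
        raise ValueError("set at least one of free_mem_fraction or max_tokens")
    return min(limits)


# Hypothetical figures: 4 GiB free after load, 160 KiB of cache per token.
free_bytes = 4 * 2**30
per_token = 160 * 1024

print(kv_cache_token_capacity(free_bytes, per_token, free_mem_fraction=0.5))
print(kv_cache_token_capacity(free_bytes, per_token, free_mem_fraction=0.5, max_tokens=8192))
```

With these illustrative numbers the fraction alone would allow about 13,000 tokens, so the explicit 8,192-token cap becomes the binding limit.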

Another notable integration is the vLLM library, which has demonstrated support for Whisper as an encoder-decoder model for multimodal inference. This work highlights the potential for real-time inference and flexible prompting with models such as openai/whisper-large-v3, with separate control over the audio input and the text prompt.
Background
Fitting large AI models onto consumer hardware has been a significant barrier, particularly when speech recognition and language understanding must run together. The 24GB of VRAM on cards like the RTX 3090 is a popular but limited ceiling for local AI deployments. Techniques such as quantization, model pruning, and careful memory management, including the KV cache capping described above, are crucial for working within these limits. WhisperX builds on OpenAI's Whisper model and adds word-level timestamps and speaker identification, making it a powerful tool for audio processing. Pairing it with an LLM extends its use to more complex analytical and generative workflows.