Foundry is not a single AI model with a polished interface. It is a local audio production system built from several specialized components that work together inside one workflow.
Some of the foundations are open. What matters in practice is everything built around them: custom inference, aggressive quantization, model orchestration, patch workflows, editing, mixing, DSP, and a production pipeline designed for real use on consumer NVIDIA GPUs.
That is why Foundry should not be understood as "just one model in a UI." The product is the system.
Built on strong foundations, then pushed further
Foundry uses proven models where they make sense, but it does not run them in stock form. Music generation is based on ACE-Step 1.5, speech is based on Qwen3-TTS, Creative AI uses Qwen 3, 3.5, and 3.6, and stem separation uses Demucs v4. Those names describe the foundations, not the full product.
What users experience inside Foundry comes from the way those systems are modified, quantized, steered, and integrated. That is where most of the engineering work sits, and that is why Foundry behaves very differently from running the same open models on their own.
Music generation is only one layer
Music generation is a core part of Foundry, but it is not the whole story. The current stack uses a custom quantized 5 Hz 4B audio language model path that cuts VRAM requirements sharply while keeping quality high and generation fast enough for real iteration.
On supported NVIDIA GPUs, generation can run at roughly 10 to 20 times real time. That changes the experience completely. You are not waiting on a cloud queue, not spending credits per attempt, and not sending project data off your machine.
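To make the real-time factor concrete, a quick back-of-the-envelope calculation helps. The 10x to 20x figures come from the text above; the three-minute track length is just an arbitrary example:

```python
def generation_wait_seconds(track_seconds: float, realtime_factor: float) -> float:
    """Time spent generating audio of a given length at a given real-time factor."""
    return track_seconds / realtime_factor

# A 3-minute track (180 s) at the quoted range:
print(generation_wait_seconds(180, 10))  # 18.0 seconds at 10x real time
print(generation_wait_seconds(180, 20))  # 9.0 seconds at 20x real time
```

At those speeds a full render finishes in well under half a minute, which is what makes iterative, try-and-refine work practical locally.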
Just as important, generation is not treated as a one-shot result. Inside Foundry you can patch weak sections, blend alternate takes with spectral crossfades, separate stems, and keep refining a track instead of throwing away the whole render because one part missed.
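Foundry's own crossfade implementation is not public, so as a rough illustration of the idea only, here is one simple way a spectral crossfade can work: blend the short-time spectra of two takes frame by frame, so the transition morphs in the frequency domain rather than just ramping amplitudes. The window, overlap, and blend curve here are illustrative choices, not Foundry's actual parameters:

```python
import numpy as np

def spectral_crossfade(a: np.ndarray, b: np.ndarray, n_fft: int = 1024) -> np.ndarray:
    """Crossfade two equal-length mono signals by linearly blending their
    STFT frames from a (start) to b (end). Hann analysis window, 50% overlap,
    overlap-add resynthesis. A sketch, not a production implementation."""
    assert a.shape == b.shape, "takes must be the same length"
    hop = n_fft // 2
    window = np.hanning(n_fft)                    # 50%-overlap Hann sums to ~1
    n_frames = 1 + (len(a) - n_fft) // hop
    out = np.zeros(len(a))
    for i in range(n_frames):
        s = i * hop
        alpha = i / max(n_frames - 1, 1)          # blend weight: 0 -> 1 across clip
        spec_a = np.fft.rfft(a[s:s + n_fft] * window)
        spec_b = np.fft.rfft(b[s:s + n_fft] * window)
        blended = (1 - alpha) * spec_a + alpha * spec_b
        out[s:s + n_fft] += np.fft.irfft(blended, n=n_fft)
    return out
```

Real tools typically treat magnitude and phase with more care; the sketch only shows why a per-frame spectral blend sounds smoother than a plain gain crossfade between mismatched takes.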
Creative AI handles the musical thinking
Most raw generation models are only as good as the prompt they receive. Foundry's Creative AI sits above the generator and helps turn rough ideas into something musically usable.
It can expand a loose concept into a structured production brief, help write lyrics, shape sections, improve pacing, and refine prompts so the generation model gets clearer direction. For users who want stronger reasoning and writing quality, Foundry also offers larger Qwen-based models, among them a dense 27B-class option.
This layer is also where Foundry improves steering. Instead of dumping negative keywords into a caption and hoping the model interprets them sensibly, Foundry can reshape the request before generation starts. That produces more controlled results and avoids much of the usual prompt friction.
Speech is a first-class part of the system
Speech in Foundry is not an extra feature added on the side. It is a dedicated system built for spoken performance, voice identity, and consistency.
Foundry can generate speech in 10 languages, support cloned or generated voices, handle multi-speaker scenes, and direct delivery with 40 emotions across 5 intensity levels. Lower intensities preserve identity more strictly. Higher intensities can push expression further, even if that means relaxing the speaker match a little. That tradeoff is intentional and often useful in production.
Because speech, music, editing, and mixing all live in the same environment, you can build narration, dialog, trailers, podcasts, or character scenes without exporting back and forth across multiple tools.
Stem separation, editing, and DSP are where projects get finished
A generated result is rarely a finished result. Foundry includes an integrated mixer, arrangement tools, patch workflows, spectral crossfades, and a large DSP toolset so you can take an idea all the way to delivery inside one application.
You can split audio into stems, repair only the part that needs work, process a voice with temporary acoustic effects like phone calls or room tone, or reshape it completely into something stylized like a robot or demon. That matters because real production work is not just generation. It is selection, correction, layering, and finishing.
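How Foundry implements these effects is not documented here, but the classic "phone call" treatment is easy to picture: keep only the narrow band a telephone channel passes, roughly 300 to 3400 Hz. A minimal brick-wall version using an FFT (band edges and the FFT approach are illustrative assumptions, not Foundry's actual DSP):

```python
import numpy as np

def telephone_effect(x: np.ndarray, sr: int, lo: float = 300.0, hi: float = 3400.0) -> np.ndarray:
    """Crude 'phone call' effect: brick-wall band-pass via FFT, keeping
    only the classic telephone band of roughly 300-3400 Hz."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1 / sr)
    spec[(freqs < lo) | (freqs > hi)] = 0   # zero everything outside the band
    return np.fft.irfft(spec, n=len(x))
```

Production-grade versions use proper filters plus saturation and noise, but the band-limiting step above is the core of why a voice suddenly "sounds like a phone".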
Stem separation is built directly into the workflow, which also makes Foundry useful for remixing, post-production, game audio, and video pipelines where exports need to stay flexible.
Long-form narration and agentic workflows
Foundry also includes agentic narration workflows for longer spoken content. It can identify speakers, segment text, and generate narration from imported material such as ebooks. This part of the platform is still developing, but it already shows the broader direction: Foundry is meant to handle full local audio workflows, not just short isolated generations.
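The agentic pipeline itself is not described in detail here, but the segmentation step can be pictured with a toy sketch: split imported prose into narration and quoted dialog, attributing a speaker when one is named after the quote. The regex heuristics below are purely illustrative and not Foundry's actual method:

```python
import re

def segment_chapter(text: str) -> list[tuple[str, str]]:
    """Split prose into (speaker, segment) pairs: double-quoted spans become
    dialog (speaker guessed from a trailing 'said X' tag when present),
    everything else becomes narration. Illustrative heuristics only."""
    segments = []
    pos = 0
    for m in re.finditer(r'"([^"]+)"(?:\s*,?\s*said\s+(\w+))?', text):
        narration = text[pos:m.start()].strip()
        if narration:
            segments.append(("Narrator", narration))
        speaker = m.group(2) or "Unknown speaker"
        segments.append((speaker, m.group(1)))
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("Narrator", tail))
    return segments
```

A real pipeline would use a language model for attribution and handle nested quotes, pronouns, and unmarked dialog; the sketch only shows the shape of the output a narration generator would consume.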
How Foundry runs locally on smaller GPUs
Running several AI systems on one GPU is mostly a memory problem. Foundry solves that with aggressive quantization and an Ultra-VRAM Saver mode that swaps models in and out as needed.
- 6 GB: supported with Ultra-VRAM Saver enabled. Swapping is aggressive, but the full workflow can run locally.
- 8 to 10 GB: a much more comfortable starting point for regular music and speech work.
- 12 GB and above: smoother everyday use with less swapping and faster iteration.
- 16 GB and above: best for heavier projects, larger creative models, and more demanding multi-stage work.
- 24 GB and above: the largest Creative AI models become usable, or several medium-sized models can stay permanently loaded for the highest performance.
- 32 GB and above: a further step up in which models can stay resident, mostly useful for high-performance automated mass production or large parallel batch processing.
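Foundry's actual scheduler is not public, but the core mechanism behind a mode like Ultra-VRAM Saver is a budgeted cache: before loading a model, evict the least recently used resident models until the new one fits. A minimal sketch, with made-up model names and sizes:

```python
from collections import OrderedDict

class VramSaver:
    """Keep loaded models under a VRAM budget, evicting the least recently
    used model when a requested one does not fit. Sizes are illustrative GB
    figures; oversize and fragmentation handling are omitted."""

    def __init__(self, budget_gb: float):
        self.budget = budget_gb
        self.resident: "OrderedDict[str, float]" = OrderedDict()

    def request(self, name: str, size_gb: float) -> list[str]:
        """Ensure `name` is resident; return the models evicted to make room."""
        evicted = []
        if name in self.resident:
            self.resident.move_to_end(name)               # mark as recently used
            return evicted
        while self.resident and sum(self.resident.values()) + size_gb > self.budget:
            victim, _ = self.resident.popitem(last=False)  # LRU model swaps out
            evicted.append(victim)
        self.resident[name] = size_gb
        return evicted

# On a hypothetical 8 GB card: the music model loads, then a TTS request
# exceeds the budget and forces a swap.
saver = VramSaver(budget_gb=8.0)
saver.request("music-4b-quant", 5.0)
evicted = saver.request("tts", 4.0)   # 5 + 4 > 8, so the music model is evicted
```

The tiers above follow directly from this mechanism: more VRAM means fewer evictions, so less time spent reloading weights between stages.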
More VRAM always helps, of course. But the important point is that Foundry no longer needs high-end memory budgets just to be usable. Thanks to the custom quantization and inference work, it can deliver top-quality local generation on much smaller GPUs.
The key idea
Foundry is not one model pretending to be a studio. It is a local AI audio studio where music generation, Creative AI, speech, stem separation, editing, DSP, and narration work together as one system.
The individual models matter. The workflow built around them matters more. That is what makes Foundry useful.