Inside Foundry: How the AI Systems Work Together

Foundry is not a single AI model with a polished interface. It is a local audio production system built from several specialized components that work together inside one workflow.

Some of the foundations are open. What matters in practice is everything built around them: custom inference, aggressive quantization, model orchestration, patch workflows, editing, mixing, DSP, and a production pipeline designed for real use on consumer NVIDIA GPUs.

That is why Foundry should not be understood as "just one model in a UI." The product is the system.

Built on strong foundations, then pushed further

Foundry uses proven models where they make sense, but it does not run them in stock form. Music generation is based on ACE-Step 1.5, speech is based on Qwen3-TTS, Creative AI uses Qwen 3, 3.5, and 3.6, and stem separation uses Demucs v4. Those names describe the foundations, not the full product.

What users experience inside Foundry comes from the way those systems are modified, quantized, steered, and integrated. That is where most of the engineering work sits, and that is why Foundry behaves very differently from running the same open models on their own.

Music generation is only one layer

Music generation is a core part of Foundry, but it is not the whole story. The current stack uses a custom quantized 5 Hz 4B audio language model path that sharply cuts VRAM requirements while keeping quality high and generation fast enough for real iteration.
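
To see why quantization matters so much for a 4B model, it helps to do the arithmetic on weight memory alone. The sketch below is only that arithmetic: the bit widths are illustrative assumptions, not Foundry's actual format, and the figures ignore activations, caches, and the other models in the workflow.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


params = 4e9  # the "4B" model size mentioned in the text

# Bit widths are assumptions chosen for illustration only.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(params, bits):.2f} GiB")
```

At 16 bits per weight, the weights alone come to roughly 7.5 GiB. At 4 bits they drop to under 2 GiB, which is the kind of gap that decides whether a model fits on a small GPU at all.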

On supported NVIDIA GPUs, generation can run at roughly 10 to 20 times real time. That changes the experience completely. You are not waiting on a cloud queue, not spending credits per attempt, and not sending project data off your machine.
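
A quick way to read "10 to 20 times real time" is to turn it into waiting time. The helper below is a hypothetical illustration, not part of Foundry, and the three-minute track length is just an example.

```python
def generation_seconds(track_seconds: float, realtime_factor: float) -> float:
    """Time to render a track when generation runs at `realtime_factor` x real time."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return track_seconds / realtime_factor


track = 180.0  # an example three-minute track, in seconds
slowest = generation_seconds(track, 10)  # lower end of the quoted range
fastest = generation_seconds(track, 20)  # upper end of the quoted range
print(f"{track:.0f} s of audio renders in about {fastest:.0f} to {slowest:.0f} s")
```

Under those assumptions, a three-minute track takes somewhere between 9 and 18 seconds per attempt, which is short enough to try several variations in a row.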

Just as important, generation is not treated as a one-shot result. Inside Foundry you can patch weak sections, blend alternate takes with spectral crossfades, separate stems, and keep refining a track instead of throwing away the whole render because one part missed.
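
Foundry's spectral crossfade is not documented here, so the sketch below shows a simpler relative: a plain time-domain equal-power crossfade between two takes. The function name and the sample-list representation are assumptions for illustration. A spectral variant could apply similar gain curves per frequency band rather than to the whole signal.

```python
import math


def equal_power_crossfade(take_a, take_b, fade_len):
    """Join take_a to take_b, overlapping the end of A with the start of B.

    The last `fade_len` samples of A fade out with a cosine gain while the
    first `fade_len` samples of B fade in with a sine gain, so the squared
    gains always sum to 1 (constant power through the transition).
    """
    if fade_len <= 0 or fade_len > min(len(take_a), len(take_b)):
        raise ValueError("fade_len must be positive and fit inside both takes")

    start_a = len(take_a) - fade_len
    overlap = []
    for i in range(fade_len):
        t = (i + 0.5) / fade_len  # position in the fade, between 0 and 1
        gain_a = math.cos(t * math.pi / 2)
        gain_b = math.sin(t * math.pi / 2)
        overlap.append(take_a[start_a + i] * gain_a + take_b[i] * gain_b)

    return take_a[:start_a] + overlap + take_b[fade_len:]


# Two short placeholder "takes" of constant level, joined over 4 samples.
blended = equal_power_crossfade([1.0] * 8, [0.5] * 8, fade_len=4)
print(len(blended))
```

The joined result is shorter than the two takes laid end to end by exactly the fade length, because the overlapping region is shared.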

Creative AI handles the musical thinking

Most raw generation models are only as good as the prompt they receive. Foundry's Creative AI sits above the generator and helps turn rough ideas into something musically usable.

It can expand a loose concept into a structured production brief, help write lyrics, shape sections, improve pacing, and refine prompts so the generation model gets clearer direction. For users who want stronger reasoning and writing quality, Foundry includes larger Qwen-based models, including a dense 27B class option.

This layer is also where Foundry does a better job with steering. Instead of dumping negative keywords into a caption and hoping the model interprets them sensibly, Foundry can reshape the request before generation starts. That produces more controlled results and avoids a lot of the usual prompt friction.

Speech is a first-class part of the system

Speech in Foundry is not an extra feature added on the side. It is a dedicated system built for spoken performance, voice identity, and consistency.

Foundry can generate speech in 10 languages, support cloned or generated voices, handle multi-speaker scenes, and direct delivery with 40 emotions across 5 intensity levels. Lower intensities preserve identity more strictly. Higher intensities can push expression further, even if that means relaxing the speaker match a little. That tradeoff is intentional and often useful in production.
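
One way to picture the intensity tradeoff is as a small data structure. Everything in this sketch is hypothetical: the `Delivery` class, the 1 to 5 numbering, and especially the `identity_weight` curve are invented for illustration and do not describe Foundry's internals.

```python
from dataclasses import dataclass

# Five intensity levels, per the text. Numbering them 1..5 is an assumption.
MIN_INTENSITY, MAX_INTENSITY = 1, 5


@dataclass(frozen=True)
class Delivery:
    """A hypothetical direction for one spoken line."""
    speaker: str
    emotion: str
    intensity: int

    def __post_init__(self):
        if not MIN_INTENSITY <= self.intensity <= MAX_INTENSITY:
            raise ValueError(
                f"intensity must be between {MIN_INTENSITY} and {MAX_INTENSITY}"
            )

    @property
    def identity_weight(self) -> float:
        """Illustrative only: 1.0 at the lowest intensity, 0.6 at the highest.

        The numbers are made up. They only show the direction of the tradeoff:
        stronger expression, looser speaker match.
        """
        span = MAX_INTENSITY - MIN_INTENSITY
        return 1.0 - 0.4 * (self.intensity - MIN_INTENSITY) / span


calm = Delivery(speaker="narrator", emotion="sad", intensity=1)
raw = Delivery(speaker="narrator", emotion="sad", intensity=5)
print(calm.identity_weight, raw.identity_weight)
```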

Because speech, music, editing, and mixing all live in the same environment, you can build narration, dialog, trailers, podcasts, or character scenes without exporting back and forth across multiple tools.

Stem separation, editing, and DSP are where projects get finished

A generated result is rarely a finished result. Foundry includes an integrated mixer, arrangement tools, patch workflows, spectral crossfades, and a large DSP toolset so you can take an idea all the way to delivery inside one application.

You can split audio into stems, repair only the part that needs work, process a voice with temporary acoustic effects like phone calls or room tone, or reshape it completely into something stylized like a robot or demon. That matters because real production work is not just generation. It is selection, correction, layering, and finishing.

Stem separation is built directly into the workflow, which also makes Foundry useful for remixing, post-production, game audio, and video pipelines where exports need to stay flexible.

Long-form narration and agentic workflows

Foundry also includes agentic narration workflows for longer spoken content. It can identify speakers, segment text, and generate narration from imported material such as ebooks. This part of the platform is still developing, but it already shows the broader direction: Foundry is meant to handle full local audio workflows, not just short isolated generations.
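
The segmentation step can be illustrated with a deliberately naive sketch. It only splits text into narration and quoted dialog using straight double quotes, and it does not attempt speaker identification. The function name and the segment labels are assumptions, not Foundry's actual pipeline.

```python
import re

# Matches text enclosed in straight double quotes.
QUOTE = re.compile(r'"([^"]+)"')


def segment(text):
    """Split text into ("narration", ...) and ("dialog", ...) segments, in order."""
    segments = []
    pos = 0
    for match in QUOTE.finditer(text):
        before = text[pos:match.start()].strip()
        if before:
            segments.append(("narration", before))
        segments.append(("dialog", match.group(1).strip()))
        pos = match.end()
    rest = text[pos:].strip()
    if rest:
        segments.append(("narration", rest))
    return segments


for kind, chunk in segment('She paused. "Are you there?" he asked.'):
    print(f"{kind:>9}: {chunk}")
```

A real ebook needs much more than this, such as curly quotes, nested quotations, and attribution like "he asked" mapped to a named character, which is where an agentic workflow earns its keep.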

How Foundry runs locally on smaller GPUs

Running several AI systems on one GPU is mostly a memory problem. Foundry solves that with aggressive quantization and an Ultra-VRAM Saver mode that swaps models in and out as needed.

  • 6 GB: supported with Ultra-VRAM Saver enabled. Swapping is aggressive, but the full workflow can run locally.
  • 8 to 10 GB: a much more comfortable starting point for regular music and speech work.
  • 12 GB and above: smoother everyday use with less swapping and faster iteration.
  • 16 GB and above: best for heavier projects, larger creative models, and more demanding multi-stage work.
  • 24 GB and above: the largest Creative AI models can run, or all medium-sized models can stay loaded permanently for the highest performance.
  • 32 GB and above: another step up in which models can stay loaded, mostly useful for high-performance automated mass production or large-batch parallel processing.
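
The tiers above can be read as a simple threshold lookup. The sketch below is an illustration only: the function name is hypothetical, the notes are shortened from the list, and treating each tier as "this much or more" (so an 11 GB card falls into the 8 to 10 GB tier) is an assumption.

```python
# (minimum VRAM in GB, short note), highest tier first.
TIERS = [
    (32, "more models stay loaded; suited to batch and automated work"),
    (24, "largest Creative AI models, or medium models kept loaded"),
    (16, "heavier projects and larger creative models"),
    (12, "smoother everyday use with less swapping"),
    (8, "comfortable starting point for music and speech"),
    (6, "supported with Ultra-VRAM Saver enabled"),
]


def tier_for(vram_gb: float):
    """Return the note for the highest tier the card reaches, or None if below 6 GB."""
    for minimum, note in TIERS:
        if vram_gb >= minimum:
            return note
    return None  # below the smallest tier listed in the article


for card in (6, 8, 12, 24):
    print(f"{card:>2} GB: {tier_for(card)}")
```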

More VRAM always helps, of course. But the important point is that Foundry no longer needs high-end memory budgets just to be usable. Thanks to the custom quantization and inference work, it can deliver top-quality local generation on much smaller GPUs.
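
The swapping idea behind a mode like Ultra-VRAM Saver can be sketched as a least-recently-used cache with a memory budget. This is a generic illustration of the technique, not Foundry's implementation: the class, the eviction policy, and the model sizes are all assumptions, and a real system would also move weights between GPU and system memory.

```python
from collections import OrderedDict


class ModelSwapper:
    """Keeps models 'loaded' within a VRAM budget, evicting the least recently used."""

    def __init__(self, budget_gb: float):
        self.budget_gb = budget_gb
        self.loaded = OrderedDict()  # name -> size in GB, least recently used first

    def use(self, name: str, size_gb: float):
        """Mark a model as needed now. Returns the names evicted to make room."""
        if size_gb > self.budget_gb:
            raise ValueError(f"{name} does not fit in the budget at all")
        if name in self.loaded:
            self.loaded.move_to_end(name)  # already resident, just refresh recency
            return []
        evicted = []
        while sum(self.loaded.values()) + size_gb > self.budget_gb:
            oldest, _ = self.loaded.popitem(last=False)
            evicted.append(oldest)
        self.loaded[name] = size_gb
        return evicted


# Model sizes below are invented for the example.
swapper = ModelSwapper(budget_gb=6.0)
swapper.use("music", 4.0)
swapper.use("speech", 2.0)
print(swapper.use("creative", 3.0))  # music is evicted to make room
```

With a small budget the swapper evicts often, which matches the "aggressive swapping" described for 6 GB cards. With a larger budget the same code rarely evicts anything, and iteration gets faster.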

The key idea

Foundry is not one model pretending to be a studio. It is a local AI audio studio where music generation, Creative AI, speech, stem separation, editing, DSP, and narration work together as one system.

The individual models matter. The workflow built around them matters more. That is what makes Foundry useful.
