Voice Cloning and the Emotion Engine

Speech Deep Dive

Flat text-to-speech has been around for years. You paste text, you get audio, and it sounds like a GPS giving directions. Technically correct, emotionally dead. Foundry takes a different approach.

The emotion system

Every voice in Foundry, whether it is a built-in preset or a cloned voice, supports 40 distinct emotions. Each emotion has 5 intensity levels. A whisper at level 1 is barely there. At level 5, it is intense and urgent while still being a whisper.

You can assign different emotions to different parts of your script. The narrator starts calm and measured. By the climax, the voice carries tension. In the resolution, it softens. This is not post-processing or pitch tricks. The generation model produces the emotional quality directly in the audio.

Some examples of what this enables:

  • An audiobook narrator who genuinely sounds heartbroken during a character's loss
  • A podcast host whose excitement sounds real, not performed
  • A game villain whose cold calm shifts to rage at exactly the right moment
  • A children's story narrator who sounds warm and playful throughout
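The emotion-plus-intensity pairing described above can be pictured as a small data model. This is a sketch only: the class, the validation, and the handful of emotion names are invented for illustration (Foundry's full set is 40 emotions), not its actual API.

```python
from dataclasses import dataclass

# Illustrative subset only -- the real system exposes 40 emotions.
EMOTIONS = {"calm", "tense", "whisper", "heartbroken", "excited", "playful"}

@dataclass(frozen=True)
class EmotionCue:
    """An emotion paired with an intensity from 1 (barely there) to 5 (intense)."""
    emotion: str
    intensity: int

    def __post_init__(self):
        if self.emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion: {self.emotion!r}")
        if not 1 <= self.intensity <= 5:
            raise ValueError("intensity must be between 1 and 5")

# A level-5 whisper: still a whisper, but intense and urgent.
cue = EmotionCue("whisper", 5)
```

The point of the pairing is that intensity scales an emotion without changing its character, which is why a whisper stays a whisper at every level.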

Voice cloning

The 60 built-in speaker presets cover a lot of ground. Male, female, young, old, deep, bright, gravelly, smooth. But sometimes you need a specific voice.

Voice cloning creates a new voice profile from a short audio sample. Record yourself, or use an existing recording. Foundry captures the vocal characteristics and builds a reusable voice that works with the full emotion engine.

The cloned voice stays consistent. If you generate a 30-minute audiobook chapter with a cloned voice, it sounds like the same person from start to finish. Different emotions, different energy levels, but always recognizably the same voice.
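One way to picture that consistency guarantee: a cloned profile is built once from a sample and then reused for every generation. Everything below is a hypothetical sketch, not Foundry's real interface; a stable fingerprint stands in for the learned vocal characteristics.

```python
from dataclasses import dataclass
import hashlib

@dataclass
class ClonedVoice:
    """Hypothetical sketch: a reusable voice profile built from one sample."""
    sample_path: str

    def __post_init__(self):
        # Stand-in for captured vocal characteristics: derived once,
        # identical for every generation from this profile.
        self.fingerprint = hashlib.sha256(
            self.sample_path.encode()).hexdigest()[:12]

    def generate(self, text: str, emotion: str = "calm", intensity: int = 3) -> dict:
        # Stand-in for audio generation: emotion and intensity vary per
        # call, but the voice identity is always the same.
        return {"voice": self.fingerprint, "text": text,
                "emotion": emotion, "intensity": intensity}

voice = ClonedVoice("narrator_sample.wav")
a = voice.generate("Chapter one.", emotion="calm", intensity=2)
b = voice.generate("The storm broke loose.", emotion="tense", intensity=5)
assert a["voice"] == b["voice"]  # same voice across different emotions
```

The design choice the sketch mirrors: emotion is a per-generation parameter, while voice identity belongs to the profile, so a 30-minute chapter stays one recognizable person.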

Per-line emotional direction

Scripts rarely have one mood throughout. A conversation shifts. A story builds. An advertisement has energy peaks and quiet moments.

Foundry lets you set emotion per line or per paragraph in your script. Line one: confident. Line two: hesitant. Line three: resolute. The transitions happen naturally in the generated audio. No splicing, no crossfading between separate generations.
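To make the idea concrete, here is one way a per-line annotated script could be parsed into cues. The bracket markup is invented purely for illustration; Foundry's actual script syntax is not shown in this article.

```python
import re

# Invented markup for this sketch: "[emotion]" or "[emotion:intensity]"
# followed by the spoken text. Not Foundry's real script format.
LINE = re.compile(r"^\[(?P<emotion>\w+)(?::(?P<level>[1-5]))?\]\s*(?P<text>.+)$")

def parse_script(script: str) -> list[dict]:
    """Split an annotated script into one emotion cue per line."""
    cues = []
    for raw in script.strip().splitlines():
        m = LINE.match(raw.strip())
        if not m:
            raise ValueError(f"unparseable line: {raw!r}")
        cues.append({
            "emotion": m["emotion"],
            "intensity": int(m["level"] or 3),  # default to mid intensity
            "text": m["text"],
        })
    return cues

script = """
[confident] I can do this.
[hesitant:2] At least, I think I can.
[resolute:5] No. I know I can.
"""
cues = parse_script(script)
```

Each line carries its own direction, matching the confident / hesitant / resolute progression above; in the real system the transitions between cues are produced directly in the audio rather than spliced together.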

All of this runs locally

Voice cloning samples stay on your computer. The emotion model runs on your GPU. Nothing is uploaded to any server. For anyone working with client voices, sensitive scripts, or unreleased material, this matters. Your voice data never leaves your machine.