Nov 09, 2025
MUHAMMAD GHIFARY
For a while now, I’ve wanted to create more educational content, both in writing and in audio-visual forms. So far, my written/static content is holding up pretty well, but when it comes to audio-visual, I’ve been stuck. I feel that inertia creeping in: recording myself speaking, doing the audio/video editing, and all that extra overhead.
Maybe I can stretch my creative comfort zone by sticking to what I’m comfortable with — writing — and then letting AI take care of the audio-visual part. I picture an animated digital character that looks like me, speaking in my natural voice, all triggered by a simple textual script.
To make that vision real, I’d need a few generative AI capabilities: text-to-speech with voice cloning for the audio, image or video generation for the visual character, and lip sync / facial animation to tie the two together.
These days, there are already plenty of no-code AI tools out there that can do these things: ElevenLabs (for TTS and voice cloning), MidJourney, Nano Banana, or Veo (for image or video generation), and HeyGen or Dreamface (for face lip sync / animation). Many content creators have already shown how to pull this off.
Since I’m also curious about how the machinery works behind the scenes, not just about using the tools, I want to explore the with-code approach (though I may try the full flow with no-code too). In this article, I’ll start with the voice side: TTS and voice cloning. Let’s dive in.
Have you ever had your phone read an article to you, or listened to an audiobook? That’s Text-to-Speech (TTS) in action, the tech that turns written text into spoken words. TTS has come a long way — from flat, robotic voices to ones that can whisper, pause, emphasize, or sound emotional. It’s everywhere now: voice assistants, screen readers, podcasts, navigation, etc.
TTS also has a long history, dating back to the early 20th century in the analog era. Here’s a brief overview of its evolution:
Some of the earliest work was in analog / mechanical / signal-processing devices. For instance, Bell Labs’ Voder (Voice Operating Demonstrator) in the 1930s was a machine that could produce recognizable speech by manually controlling parameters. Later, devices like the vocoder were developed for speech coding and synthesis.
In the 1950s-60s, formant synthesis was introduced: modeling the resonant frequencies (formants) of the human vocal tract and generating speech parametrically. While not very natural or expressive, it was very flexible and didn’t need much memory or a large database of recorded speech.
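To make that a bit more concrete, here’s a minimal formant-synthesis sketch in Python (assuming NumPy/SciPy): it drives a cascade of resonators, tuned with rough textbook formant values for an /a/-like vowel, using a crude impulse-train glottal source. It’s a toy illustration of the “source plus resonant filters” idea, not a real synthesizer.

```python
# Toy formant synthesis: glottal pulse train -> cascade of resonant filters.
# Formant values are rough textbook numbers for an /a/-like vowel.
import numpy as np
from scipy.signal import lfilter
from scipy.io import wavfile

SR = 16000          # sample rate (Hz)
F0 = 120            # fundamental frequency (pitch) in Hz
DUR = 1.0           # duration in seconds
FORMANTS = [(730, 90), (1090, 110), (2440, 160)]  # (center freq, bandwidth) pairs

# Glottal source: an impulse train at F0 (a very crude model of vocal-fold pulses)
n = int(SR * DUR)
source = np.zeros(n)
source[::SR // F0] = 1.0

# Vocal tract: a cascade of second-order resonators, one per formant
signal = source
for freq, bw in FORMANTS:
    r = np.exp(-np.pi * bw / SR)                 # pole radius from bandwidth
    theta = 2 * np.pi * freq / SR                # pole angle from center frequency
    a = [1.0, -2 * r * np.cos(theta), r ** 2]    # resonator poles
    b = [1 - r]                                  # crude gain normalization
    signal = lfilter(b, a, signal)

signal /= np.max(np.abs(signal))                 # normalize to [-1, 1]
wavfile.write("vowel_a.wav", SR, (signal * 32767).astype(np.int16))
```

Playing the resulting file should give a buzzy but recognizably vowel-like sound, which is roughly where parametric synthesis stood before rules and recorded units came into the picture.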
As computers became more powerful, TTS systems began to include linguistic rules: text analysis (tokenization, phonemes) and prosody rules (intonation, duration, stress). An early example is Klatt’s speech synthesizer (KlattTalk), developed in the 1980s. DECtalk (1983-84) is another landmark: a commercial system that used rule-based (source-filter) methods and could be customized in terms of speech rate, pitch, and more.
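As a rough sketch of what that front-end stage looks like in code, here’s a toy pipeline: tokenize the text, look up phonemes in a hand-made dictionary, then attach naive duration and pitch targets. The tiny lexicon and the prosody rules below are made up purely for illustration.

```python
# Toy rule-based TTS front end: tokenization -> phoneme lookup -> simple prosody rules.
import re

PHONE_DICT = {            # toy grapheme-to-phoneme lexicon (illustrative only)
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def text_to_phonemes(text):
    tokens = re.findall(r"[a-z']+", text.lower())      # tokenization
    return [(tok, PHONE_DICT.get(tok, ["<UNK>"])) for tok in tokens]

def add_prosody(phoneme_seq, base_pitch=120):
    """Attach naive duration/pitch targets: lengthen the final word,
    let pitch decline gradually across the utterance (declination)."""
    annotated = []
    n = len(phoneme_seq)
    for i, (word, phones) in enumerate(phoneme_seq):
        duration = 0.09 if i < n - 1 else 0.14          # seconds per phone, longer at the end
        pitch = base_pitch * (1.0 - 0.05 * i)           # gradual pitch declination
        annotated.append({"word": word, "phones": phones,
                          "dur_per_phone": duration, "pitch_hz": round(pitch, 1)})
    return annotated

print(add_prosody(text_to_phonemes("Hello world")))
```

Real systems like KlattTalk used far richer rule sets, of course, but the shape of the pipeline (text analysis feeding prosody, feeding a parametric back end) is the same.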
Over time, methods that concatenated actual recorded speech segments (diphones, units, larger chunks) became popular. These relied on large speech databases: recorded pieces already carry natural prosody and timbre, so stitching them together can sound more natural. Research systems in the 1990s and commercial products (e.g., AT&T Natural Voices) were built on this idea.
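Here’s a toy sketch of the concatenative idea, assuming a hypothetical folder of prerecorded diphone WAV files named like HH-AH.wav: look up the units for a word and stitch them together with a short crossfade so the joins are less audible. Real unit-selection systems search large, carefully labeled databases and optimize join and target costs; this only shows the stitching step.

```python
# Toy diphone concatenation: load prerecorded units and join them with a short crossfade.
# The "units/" folder and file naming scheme are hypothetical.
import numpy as np
from scipy.io import wavfile

SR = 16000
XFADE = int(0.01 * SR)    # 10 ms crossfade between units

def load_unit(diphone):
    """Load one diphone recording, e.g. 'HH-AH' -> 'units/HH-AH.wav' (hypothetical path)."""
    _, data = wavfile.read(f"units/{diphone}.wav")
    return data.astype(np.float32) / 32768.0

def concatenate(diphones):
    out = load_unit(diphones[0])
    fade_in = np.linspace(0.0, 1.0, XFADE)
    fade_out = 1.0 - fade_in
    for d in diphones[1:]:
        unit = load_unit(d)
        # Overlap-add at the unit boundary so the join is less audible
        out[-XFADE:] = out[-XFADE:] * fade_out + unit[:XFADE] * fade_in
        out = np.concatenate([out, unit[XFADE:]])
    return out

phones = ["HH", "AH", "L", "OW"]                       # "hello"
diphones = [f"{a}-{b}" for a, b in zip(phones, phones[1:])]
audio = concatenate(diphones)
wavfile.write("hello_concat.wav", SR, (audio * 32767).astype(np.int16))
```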