What Does TTS Mean? Text-to-Speech Explained Without the Jargon

Last Tuesday I was sitting in a dentist's waiting room scrolling through a 6,000-word investigation about microplastics in drinking water. Terrifying stuff. I wanted to finish it before they called my name but I was on paragraph four and the hygienist was already glancing my way. So I tapped a button on my phone, put in one earbud, and a voice started reading the article to me. Not a podcast. Not an audiobook. The actual article, word for word, read aloud by software. I kept listening through the cleaning, through the fluoride, through the part where the dentist told me I need to floss more. Finished the whole thing in the parking lot.

That software has a name. TTS.

TTS stands for text-to-speech. That's it. Three words. Software takes written text as input and produces spoken audio as output. Your eyes don't do the reading. Your ears do. The concept is old enough to collect social security — in 1961, researchers at Bell Labs programmed an IBM 704 mainframe to sing "Daisy Bell" in a voice that sounded like a drowning calculator. But the acronym TTS has seeped into everyday tech vocabulary only in the last decade or so, mostly because the technology finally got good enough that normal people wanted to use it.

I remember the first time I heard a computer read something out loud. I think it was... 2007? Maybe 2008. My roommate in college had a Mac and he typed something into the terminal and this flat, dead, vaguely male voice spoke the words back. We thought it was hilarious. We made it say profane things for twenty minutes and then never used it again. The voice was so obviously fake, so robotic, so devoid of anything resembling human cadence that using it for actual reading felt like a punishment. Like being read a bedtime story by a refrigerator.

That was the state of TTS for a long time. Functional but miserable.

So how does text-to-speech actually work? I'm going to simplify ruthlessly here because the full explanation involves neural networks and mel spectrograms and Fourier transforms and I don't want either of us to suffer through that. The basic pipeline has two stages. First, the system figures out how to pronounce the text — it breaks words into phonemes, which are the smallest units of sound in a language. The word "cat" becomes three phonemes. The word "through" also becomes three phonemes even though it has seven letters, because English spelling is a conspiracy against rational thought. The system handles abbreviations, numbers, dates, acronyms. It sees "Dr." and knows to say "doctor" not "dee arr." It sees "1,200" and says "twelve hundred" or "one thousand two hundred" depending on context. This part is called text analysis or sometimes the front end.
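That front end can be sketched in a few lines of Python. Everything here is a toy: the abbreviation table, the number rule, and the little ARPAbet-style lexicon are all hand-written stand-ins for the large pronunciation lexicons and context models a real engine uses.

```python
import re

# Toy TTS front end: expand abbreviations and numbers to spoken words,
# then look words up as phonemes. Real systems use big lexicons plus a
# grapheme-to-phoneme model for unknown words; this is illustrative.
ABBREVIATIONS = {"Dr.": "doctor", "St.": "saint", "vs.": "versus"}

# Tiny ARPAbet-style lexicon. Note "through": seven letters, three phonemes.
LEXICON = {
    "cat": ["K", "AE", "T"],
    "through": ["TH", "R", "UW"],
    "doctor": ["D", "AA", "K", "T", "ER"],
}

TEENS = {11: "eleven", 12: "twelve", 13: "thirteen"}  # illustrative subset

def normalize(text: str) -> str:
    """Expand abbreviations and comma-grouped numbers to spoken words."""
    for abbr, spoken in ABBREVIATIONS.items():
        text = text.replace(abbr, spoken)

    def number_to_words(match: re.Match) -> str:
        n = int(match.group(0).replace(",", ""))
        # "1,200" reads naturally as "twelve hundred" when the number
        # is a clean multiple of 100 below 10,000
        if 1000 < n < 10000 and n % 100 == 0 and n // 100 in TEENS:
            return f"{TEENS[n // 100]} hundred"
        return match.group(0)  # this toy leaves everything else alone

    return re.sub(r"\d{1,3}(?:,\d{3})+", number_to_words, text)

def to_phonemes(word: str):
    """Lexicon lookup; returns None for words this toy doesn't know."""
    return LEXICON.get(word.lower())

print(normalize("Dr. Smith drank 1,200 ml"))  # abbreviation + number expanded
print(to_phonemes("through"))
```

The interesting part is how much of the work happens before any audio exists: "Dr." and "1,200" have to become words before anything can become sound.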

Second stage is where the actual sound gets made. Old systems used concatenative synthesis — they recorded a human saying thousands of tiny sound fragments and then stitched them together like a quilt made of syllables. The seams showed. You could hear the cuts, the unnatural joins, the way pitch would lurch between fragments like a car with a bad transmission. Modern systems, the ones built in the last three or four years, use neural networks trained on hundreds of hours of human speech. They don't stitch. They generate. The model has internalized what human speech sounds like — the rhythm, the breath, the way pitch drops at the end of a statement and rises at the end of a question — and it produces a waveform from scratch. The difference between old TTS and new TTS is the difference between a ransom note made from magazine cutouts and an actual handwritten letter.
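You can hear why the seams showed with a ten-line experiment. Stitch two fragments of the same tone together, cut at an arbitrary point, and the waveform jumps at the join — that jump is an audible click. Generate the same tone in one continuous pass and there's no jump. This is a cartoon of the real problem (speech fragments mismatch in pitch and timbre, not just phase), but the principle is the same.

```python
import math

# Why concatenative synthesis has audible "seams": prerecorded fragments
# cut at arbitrary points don't line up where they're stitched together.
# A system that generates the waveform in one pass has no discontinuity.
RATE = 16_000  # samples per second

def tone(freq_hz: float, n_samples: int) -> list[float]:
    return [math.sin(2 * math.pi * freq_hz * i / RATE) for i in range(n_samples)]

frag_a = tone(220, 700)       # 700 samples is not a whole number of periods,
frag_b = tone(220, 700)       # so the cut lands mid-cycle
stitched = frag_a + frag_b    # concatenative: click at sample 700

generated = tone(220, 1400)   # one continuous pass: smooth throughout

jump_stitched = abs(stitched[700] - stitched[699])
jump_generated = abs(generated[700] - generated[699])
print(f"jump at the seam: {jump_stitched:.3f} stitched "
      f"vs {jump_generated:.3f} continuous")
```

Real concatenative engines spent enormous effort smoothing exactly this kind of join; neural synthesis sidesteps the problem by never having a join in the first place.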

You've already used TTS. I can almost guarantee it. Every time Siri reads a text message out loud while you're driving, that's TTS. Every time Alexa tells you the weather, TTS. Google Maps telling you to turn left in 200 feet? TTS. The automated voice at the pharmacy saying your prescription is ready? TTS. The announcement at the airport gate that your flight is delayed again? Probably TTS, though at that point you're too angry to appreciate the technology. Screen readers that help blind and visually impaired people navigate the web — JAWS, NVDA, VoiceOver — those are TTS engines at their core, doing essential work every single day for millions of people who literally cannot access the internet without them.

And then there's the use case that pulled me in personally. Reading articles.

I read a lot for work. Not books — articles. Blog posts, research papers, industry reports, those absurdly long Twitter threads that someone should have just made into a blog post. I was drowning in open tabs. Thirty, forty, sometimes sixty tabs, each one an article I intended to read but hadn't gotten to because reading takes time I don't have. My colleague Sarah saw my browser one day and said "that's not a browser, that's a cry for help." She wasn't wrong.

TTS fixed that. Not completely. Not magically. But meaningfully. I started piping articles through a text-to-speech engine and listening while I did other things — cooking, commuting, folding laundry, walking the dog. Tasks that occupy your hands and eyes but leave your ears wide open. Suddenly those sixty tabs started closing. Not because I was reading faster but because I was reading at all, during time that was previously wasted. I listen at about 1.4x speed, which feels brisk but natural. My friend Jake cranks his to 2x and swears he comprehends everything. I tried 2x once and retained approximately nothing. Different brains I guess.
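The time savings are easy to put numbers on. Assuming a baseline narration rate of about 160 words per minute — an assumption; real TTS voices land roughly in the 140–180 wpm range — a playback-speed multiplier divides straight into the listening time:

```python
# Back-of-the-envelope listening time: words / (base rate * speed).
# The 160 wpm baseline is an assumption; voices vary roughly 140-180 wpm.
BASE_WPM = 160

def listening_minutes(word_count: int, speed: float = 1.0) -> float:
    return word_count / (BASE_WPM * speed)

for speed in (1.0, 1.4, 2.0):
    minutes = listening_minutes(6000, speed)
    print(f"6,000 words at {speed}x: about {minutes:.0f} minutes")
```

Which is why that 6,000-word microplastics article fit inside one dental cleaning at a modest speed bump.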

The quality gap between 2020-era TTS and 2026-era TTS is staggering and I don't think most people realize it because they haven't checked back in. If you tried a text-to-speech tool five years ago and bounced off because the voice sounded like a GPS from 2009, I get it. I did the same thing. But the voices now are genuinely difficult to distinguish from a human reading. Not always — proper nouns still trip them up sometimes, and code snippets or heavily formatted text can produce weird results — but for normal prose? An article, a blog post, an essay? The voice breathes. It pauses after commas. It emphasizes words that should be emphasized. My mother-in-law heard me listening to an article through my laptop speakers and asked who the podcast host was. There is no podcast host. It's a neural network pretending to be one, and doing a disturbingly good job.

There's a reason this matters beyond convenience and it's worth talking about even though I'm not an accessibility expert and don't pretend to be one. Estimates vary, but somewhere between 5 and 15 percent of people have some form of dyslexia. For many of those people, reading a long article on a screen is genuinely exhausting in a way that neurotypical readers don't experience. TTS doesn't cure dyslexia. But it removes the bottleneck. The information goes in through a different channel, one that doesn't involve decoding letters on a page, and suddenly a 5,000-word article isn't a wall anymore. It's just... a thing you listen to for fifteen minutes. I've gotten emails from users telling me this. One person wrote "I haven't finished a long article in years and I just finished three today." I sat with that for a while.

And it's not only dyslexia. People with low vision. People recovering from concussions who've been told to limit screen time. People learning English as a second language who benefit from hearing pronunciation while seeing the words. People with ADHD who find that audio plus visual highlighting keeps them focused in a way that silent reading doesn't. The use cases fan out in directions I didn't anticipate when I first started thinking about this stuff.

So where does TTS show up in practice today, outside of voice assistants and navigation? Browser extensions are a big one — tools like CastReader that sit inside your browser and read whatever page you're on. E-readers with built-in speech. Apps like Pocket and Instapaper added TTS features. PDF readers. Email clients. Learning platforms. Language learning apps use TTS extensively so you can hear how a word sounds without recording a human saying every word in the dictionary. Video games use TTS for accessibility options. Discord has TTS built in. Slack does too, though I've never met anyone who actually uses it in Slack.
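Under the hood, most of these reading tools do the same first step: split the page into sentence-sized chunks so they can start speaking immediately and highlight the sentence being read. Here's a naive version of that chunker — splitting on sentence punctuation, which a real tool would refine because it misfires on abbreviations like "Dr." (the `speak` hand-off is left as a comment; the synthesis engine varies by product):

```python
import re

# Naive sentence chunker of the kind a reading tool runs before streaming
# audio. Splitting on ., !, ? after-the-fact breaks on abbreviations like
# "Dr." -- real tools use smarter segmenters. This is a sketch.
def chunk_sentences(text: str) -> list[str]:
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

article = "TTS stands for text-to-speech. That's it. Your ears do the reading!"
for i, sentence in enumerate(chunk_sentences(article), 1):
    # a real extension would hand each chunk to its synthesis engine here,
    # and highlight the chunk on screen while it plays
    print(i, sentence)
```

Chunking is also what makes features like per-sentence skip and synchronized highlighting possible — the tool always knows which unit of text maps to which stretch of audio.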

The business side of TTS has exploded. ElevenLabs, a startup that barely existed three years ago, is now valued in the billions. OpenAI's voice engine made headlines. Google's Cloud Text-to-Speech API serves millions of requests per day. Amazon Polly. Microsoft Azure Speech. The market is crowded and getting more crowded — the Chrome extension space alone is packed — because the barrier to entry dropped: open source models like Coqui, Piper, and Kokoro mean you don't need a hundred million dollars and a data center to build a TTS product anymore. A decent GPU and some patience will do. The voices these open source models produce would have been state of the art at any major tech company three years ago. Not anymore.

I should be honest about what TTS still can't do well. Emotion. If you feed it a passage from a novel where a character is furious, the voice will read it calmly and pleasantly because it doesn't understand narrative emotion — it understands prosody patterns. Sarcasm is invisible to it. Humor mostly lands flat. Poetry sounds technically correct but spiritually empty, like a pianist hitting every note perfectly while feeling nothing. Tables and charts are a disaster — the voice will read cell contents in an order that makes no spatial sense. Mathematical equations come out as gibberish. Code blocks, same. These are not articles, these are structured data, and TTS engines are built for prose.

But prose? Prose it nails now. And prose is what most people want read to them.

I've been using TTS daily for about two years now and it's changed the shape of my days in a way that's hard to overstate without sounding dramatic. I consume roughly three times more written content than I did before, and I do it during time that used to be dead — the commute, the gym, the grocery store. My partner thinks it's antisocial that I walk around the house with one earbud in listening to articles about semiconductor supply chains. She's probably right. But I know things now. Weird, specific, interesting things that I would never have gotten around to reading with my eyes. Last week I listened to a 12,000-word piece about the history of the shipping container while assembling IKEA furniture. Finished both. Not great at either, honestly, but I finished.

TTS means text-to-speech. Three words, one acronym, a technology that's been around for sixty years and only became worth using in the last five. If you tried it once and hated it, try it again. The voices are different now. The experience is different. And there's something oddly freeing about letting your ears do the reading while your eyes do something else entirely. Your dentist's waiting room will never feel the same.