How to Read Any Web Page Aloud with OpenClaw (AI Agent + TTS)

The CastReader skill for OpenClaw converts any URL into an MP3 audio file. Send a link to your OpenClaw agent via Telegram or Discord, and the skill extracts the article text, runs it through the Kokoro TTS model, and sends back a playable audio file. It's currently the only ClawHub skill that handles web page content extraction; kokoro-tts, mac-tts, and openai-tts only convert plain text strings. Install it with clawhub install castreader.

The Problem with Existing TTS Skills

OpenClaw ships with three text-to-speech skills out of the box: kokoro-tts, openai-tts, and mac-tts. They work fine. Give them a sentence, they produce audio. Give them a paragraph, same deal. Solid tools for what they do.

But try this: send your OpenClaw agent a message like "Read this article for me: https://paulgraham.com/greatwork.html" and see what happens.

Nothing useful. The TTS skills receive the URL as a string. They don't fetch the page. They don't parse HTML. They certainly don't figure out which parts of the DOM are the actual article versus the navigation bar, the sidebar, the cookie consent banner, and the "Subscribe to my newsletter" popup. They just see a URL string and either try to pronounce it ("aitch tee tee pee ess colon slash slash paul graham dot com slash great work dot aitch tee em ell") or error out.

Fair enough. That's not what they were built for.

But the gap is real. Half the things I want to listen to live on web pages. Long blog posts. Technical documentation. Research papers on arXiv. News articles. Kindle books in the cloud reader. Notion docs my team shared. The content is right there, behind a URL, and no existing skill can touch it.

What Makes Web Page Extraction Hard

You might think: just fetch the HTML, strip the tags, feed the text to a TTS engine. Twenty lines of code. Done.

That works on maybe 40% of the web. Static HTML pages with clean markup. The other 60% will bite you.
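For reference, here is roughly what that twenty-line version looks like. This is a minimal sketch using only Python's standard library, not CastReader's code. The names NaiveTextExtractor and strip_tags are made up for illustration, and it assumes the HTML has already been fetched.

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Collects every text node, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_tags(html: str) -> str:
    """Return all visible-looking text in document order, joined by spaces."""
    parser = NaiveTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


PAGE = (
    "<nav><a href='/'>Home</a> <a href='/about'>About</a></nav>"
    "<article><p>The actual article text.</p></article>"
)
print(strip_tags(PAGE))                            # Home About The actual article text.
print(repr(strip_tags("<canvas width='800'></canvas>")))  # ''
```

Even on this tiny page the navigation links leak into the output, and a canvas-rendered page yields nothing at all. Both failure modes show up in the platforms described next.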

Kindle Cloud Reader doesn't put readable text in the DOM at all. It uses custom font subsets, and each book gets a different mapping: the character stored as "7" in the page might be drawn with the glyph for "A" from the font file. Copy the text out of the page and you get scrambled garbage. The only way to read it is to decode the font mapping itself.
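To make the idea concrete, here is a toy sketch of the last step, assuming the mapping has already been recovered. The GLYPH_MAP table and the decode function are hypothetical and only illustrate a per-book substitution. Recovering the real mapping from the font file is the hard part, and this sketch does not attempt it.

```python
# Hypothetical per-book mapping: character found in the page -> character the reader sees.
# A real mapping would have to be recovered from the book's font subset.
GLYPH_MAP = {"7": "A", "q": "c", "Z": "t"}


def decode(scrambled: str, glyph_map: dict[str, str]) -> str:
    """Apply the substitution; characters without an entry pass through unchanged."""
    return "".join(glyph_map.get(ch, ch) for ch in scrambled)


print(decode("q7Z", GLYPH_MAP))  # cAt
```

Because the mapping differs per book, a table recovered for one title is useless for the next one.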

WeRead (Tencent's reading platform, 300 million users) renders everything on canvas. The DOM contains zero text nodes. The words you see on screen are painted pixels.

Notion wraps every word in its own <span> with custom styling. Google Docs renders into a deeply nested iframe structure. ChatGPT and Claude mix assistant messages with UI elements, model selectors, copy buttons, and metadata — all in the same DOM tree.

A generic "strip HTML tags" approach produces garbage on all of these. You need platform-specific extraction logic that understands how each site structures its content.

What the CastReader Skill Actually Does

CastReader brings 15+ dedicated extractors to OpenClaw. Each one is purpose-built for a specific platform:

Kindle Cloud Reader: Intercepts the KindleModuleManager, decodes font subset mappings, and uses OCR calibration to reconstruct the actual text from glyph data
WeRead: Hooks into the page's fetch calls at the network level to capture chapter data before it hits the canvas renderer
Notion, Google Docs, Feishu, Yuque: Platform-specific DOM traversal that navigates each app's unique component structure
LLM platforms (ChatGPT, Claude, Gemini, Doubao, DeepSeek, Kimi): Isolate assistant messages from UI chrome, handle markdown rendering, skip code block copy buttons
Everything else: A general-purpose visible-text-block extractor that scores DOM containers by text density, link ratios, and semantic signals — drawing from the same research behind Readability.js and Boilerpipe
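The general-purpose fallback in the last item can be sketched in a few dozen lines. What follows is a simplified illustration of scoring containers by text volume, link ratio, and tag semantics. It is not CastReader's extractor, and the container list, weights, and function names are assumptions chosen for the example.

```python
from html.parser import HTMLParser

CONTAINERS = {"article", "main", "section", "div", "nav", "aside", "footer", "header"}
BONUS = {"article": 1.5, "main": 1.5}  # assumed weights for content-like tags
PENALTY = {"nav": 0.2, "aside": 0.2, "footer": 0.2, "header": 0.5}  # assumed weights


class BlockScorer(HTMLParser):
    """Credits each text node to the innermost open container."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open containers, innermost last
        self.blocks = []  # closed containers with their statistics
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.stack.append({"tag": tag, "text": [], "chars": 0, "link_chars": 0})
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in CONTAINERS and self.stack and self.stack[-1]["tag"] == tag:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        block = self.stack[-1]
        block["text"].append(text)
        block["chars"] += len(text)
        if self.link_depth:
            block["link_chars"] += len(text)


def score(block) -> float:
    """More text is better; link-heavy blocks and chrome-like tags are discounted."""
    if block["chars"] == 0:
        return 0.0
    link_ratio = block["link_chars"] / block["chars"]
    weight = BONUS.get(block["tag"], PENALTY.get(block["tag"], 1.0))
    return block["chars"] * (1.0 - link_ratio) * weight


def best_block(html: str) -> str:
    scorer = BlockScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.blocks:
        return ""
    return " ".join(max(scorer.blocks, key=score)["text"])


PAGE = """
<nav><a href="/">Home</a> <a href="/blog">Blog</a> <a href="/about">About</a></nav>
<article><p>First paragraph of the article body.</p>
<p>Second paragraph with <a href="/ref">one link</a> inside.</p></article>
<aside><a href="/subscribe">Subscribe to my newsletter</a></aside>
"""
print(best_block(PAGE))
```

On this sample the nav and aside blocks score zero because all of their text is link text, so the article wins. A production extractor would also need to handle nested containers, scripts, and malformed markup, which this sketch ignores.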

The extracted text goes through Kokoro TTS, which supports 40+ languages. The output is an MP3 file sent back through your OpenClaw agent. No API key required. No per-character billing.

Three commands are available:

extract — Returns the extracted text as structured JSON (paragraphs, metadata)
generate-audio — Extracts and converts to MP3
read-aloud — Opens the page in a browser with real-time paragraph highlighting (requires the CastReader Chrome extension)

Installation

clawhub install castreader

That's the whole setup. No API key. No YAML configuration. No environment variables. The skill registers its commands and starts accepting URLs immediately.

If you haven't set up OpenClaw yet, you'll need that first — an agent runtime connected to Telegram, Discord, or whatever messaging platform you use. The OpenClaw docs cover the setup, and our complete setup guide walks through the entire process step by step. CastReader is just a skill that plugs into it.

Usage: The Three-Second Version

Send your agent a message on Telegram:

"Read this article for me: https://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html"

The agent invokes the CastReader skill. It fetches the page, extracts 8,000 words of content (skipping the header, sidebar, related posts section, and comment thread), chunks it through the TTS engine, and sends back an MP3. Total time depends on article length — a typical 2,000-word post takes about 15 seconds.
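The chunking step deserves a quick illustration, since TTS engines generally work on bounded inputs and long articles have to be split somewhere sensible. Below is a minimal sketch that splits at sentence boundaries. The chunk_text name and the 500-character limit are assumptions for the example, not CastReader's actual settings.

```python
import re


def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chars=9))  # ['One. Two.', 'Three.']
```

Splitting on sentence boundaries rather than at a fixed character offset avoids cutting a word or clause in half, which would otherwise be audible as an unnatural pause when the audio segments are joined.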

More examples:

"Extract the text from this Notion page: https://notion.so/..." — Returns clean JSON, useful for piping into other skills
"Generate an MP3 of this Wikipedia article: https://en.wikipedia.org/wiki/Diffie-Hellman_key_exchange" — Gets you a 12-minute audio file of the full article
"Read my Kindle book aloud: https://read.amazon.com" — Decodes the font-obfuscated text and reads the current chapter aloud in browser mode (Kindle needs the Chrome extension, as described below)

How It Compares

Approach	Extracts web pages?	Natural voices?	Highlights text?	Free?
CastReader skill	Yes (15+ platforms)	Yes (Kokoro)	Yes (browser mode)	Yes
kokoro-tts	No (plain text only)	Yes (Kokoro)	No	Yes
openai-tts	No (plain text only)	Yes (OpenAI)	No	Needs API key
mac-tts	No (plain text only)	Yes (macOS)	No	Yes
Edge TTS (built-in)	No (agent reply only)	Yes (Edge)	No	Yes

The distinction matters. If someone sends you a text snippet and you want audio, any TTS skill works. If someone sends you a link and you want audio, CastReader is currently the only ClawHub skill that handles the full pipeline: fetch, extract, convert, deliver.

Browser Mode: Paragraph Highlighting

The read-aloud command does something the MP3 pipeline can't — it opens the URL in a real browser with the CastReader Chrome extension loaded, triggers reading via CDP (Chrome DevTools Protocol), and streams audio while highlighting each paragraph in real time.

Each paragraph gets a visible highlight as the voice reaches it. The page auto-scrolls to keep the current paragraph in view. You can click any paragraph to jump to it: if you've already heard the introduction, skip straight to section three, and if you want to re-listen to a dense technical paragraph, click it again.

This mode requires the Chrome extension installed locally. The OpenClaw agent coordinates the browser session, but the actual audio playback and highlighting run in the extension's content script. Because both live in the same browser context, there is no network round trip between the audio timeline and the visual highlight, so the two stay in sync.

For Kindle Cloud Reader specifically, this is the only way to get the text read aloud at all. The font decoding, OCR calibration, and glyph-to-text mapping all happen inside the extension. No server-side extraction can replicate it, because the font subsets are unique per session and per book.

When You'd Use This

The pattern I keep coming back to: someone drops a link in a group chat. A 3,000-word blog post. A technical RFC. A research paper. I don't have 15 minutes to read it right now, but I do have 15 minutes of walking to the coffee shop.

Forward the link to my OpenClaw agent. Get an MP3 back. Listen while walking.

No app switching. No copy-pasting into a separate TTS tool. No manual cleanup of extracted text. One message in, one audio file out.

clawhub install castreader

That's it. Send your agent a URL and listen. Visit the CastReader OpenClaw page for more details on supported platforms and commands.