AMD · Jun 23, 2026 · 6 min read

Sound, not words: the case against transcribing every call

There’s a fashionable idea that the future of answering machine detection is transcription — run speech-to-text on the call, read the words, classify the transcript. It’s clever. For a predictive dialer making a split-second decision, it’s the wrong tool. Here’s why we bet on the sound.

The transcription pitch goes like this: a voicemail is defined by what it says — “you’ve reached,” “please leave a message,” “at the tone.” So transcribe the audio with a speech model, then classify the text. Read the words, know the answer. As a way to understand content, it’s genuinely good. As the engine for the real-time human-or-machine decision on a live dial, it loses — for five concrete reasons.

1. It’s too slow for a dialer

Transcription needs a chunk of speech before it has anything to read. Published transcription-based AMD systems issue their early decision around two seconds in. On a predictive dialer, two seconds is an eternity: it’s the difference between handing an agent a live person and dropping that person while the model waits for a sentence. The decision has to land in a few hundred milliseconds, not a few seconds.

2. It fails when a real person says nothing

A large share of real answers start with silence: the person picks up and waits for you to speak, or they’re in a noisy room, or they just haven’t said a word yet. A model that reads words has nothing to read — so it falls back to “nothing there” and your live prospect gets dropped. A human who answers and simply breathes makes a sound; they don’t always make a sentence.
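To make the point concrete, here is a minimal sketch of a frame-energy check. It is not AMDY's method, only an illustration that wordless audio still carries a measurable signal while a transcript of it would be empty. The function names, the 8 kHz sample rate, the 20 ms frame size, and the 0.01 threshold are all assumptions chosen for the example.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame of float samples in [-1, 1]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def active_frames(samples, sample_rate=8000, frame_ms=20, threshold=0.01):
    """Count the frames whose RMS level exceeds the threshold.

    sample_rate, frame_ms and threshold are illustrative defaults,
    not tuned values from any real detector.
    """
    frame_len = sample_rate * frame_ms // 1000
    count = 0
    # Walk the signal in non-overlapping frames; drop a trailing partial frame.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        if rms(samples[start:start + frame_len]) > threshold:
            count += 1
    return count
```

Fed one second of digital silence, `active_frames` counts zero frames. Fed one second of quiet, wordless sound, it counts every frame, even though a speech-to-text model would return nothing for either. Energy alone cannot tell a person from a machine; it only shows that the information is in the audio before any words are.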

3. It’s bound to language

Many speech-to-text models are trained mostly on clean, native-speaker English. Heavy accents, non-English lists, bilingual markets, and code-switching all degrade them, and those are exactly the conditions a lot of outbound runs in. The sound of a person picking up a phone, and the sound of a machine playing a greeting, are the same in every language. An acoustic decision doesn’t need to understand the words to know what answered.

4. Most of what answers a call has no words at all

Think about everything your dialer actually hits: beeps, carrier tones, fax signals, dead air, carrier false-answers, spam-trap intercepts. None of those are speech. A transcript of them is blank or garbage. But acoustically they’re unmistakable — each has a signature. A detector that only knows how to read words is blind to the majority of the things it needs to catch.
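As one example of an acoustic signature, an answering fax machine emits a 2100 Hz tone (the CED tone defined in ITU-T T.30). The sketch below uses the Goertzel algorithm, a standard single-frequency detector, to measure how much of a frame's energy sits at that frequency. It is an illustration under stated assumptions, not AMDY's implementation: the function names and the 8 kHz / 100 ms framing are hypothetical choices for the example.

```python
import math

def goertzel_power(samples, sample_rate, target_hz):
    """Squared DFT magnitude at the bin nearest target_hz (Goertzel algorithm)."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)  # nearest DFT bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2

def tone_ratio(samples, sample_rate, target_hz):
    """Fraction of the frame's energy at target_hz.

    Close to 1.0 for a pure tone at target_hz, close to 0.0 otherwise.
    Returns 0.0 for an empty or all-zero frame.
    """
    energy = sum(x * x for x in samples)
    if not samples or energy == 0.0:
        return 0.0
    # For a pure sine, |X(k)|^2 == N * energy / 2, so this normalizes to ~1.0.
    return 2.0 * goertzel_power(samples, sample_rate, target_hz) / (len(samples) * energy)

# Usage: a 100 ms frame of a synthetic 2100 Hz tone at 8 kHz.
frame = [math.sin(2 * math.pi * 2100 * i / 8000) for i in range(800)]
is_fax_like = tone_ratio(frame, 8000, 2100) > 0.9  # 0.9 is an illustrative cutoff
```

A transcript of that frame would be blank; the tone ratio is unambiguous after 100 ms. Real detection would need to handle noise, level variation, and tone duration, which this sketch ignores.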

5. It’s heavy for a yes/no

Running a full speech-recognition model on every answered call is a lot of compute to answer a single binary question. At dialer volume, that cost and latency add up — for a decision that doesn’t actually require knowing the words.

What AMDY does instead

AMDY classifies the sound — the acoustic signature of what answered the phone — the way your own ear knows a person picked up before they finish a word. It doesn’t transcribe, so it isn’t waiting on speech, isn’t thrown by silence, isn’t bound to a language, and isn’t blind to the beeps, tones, and dead air that make up most of what a dialer hits. The decision lands in a fraction of a second.

Where transcription does belong

To be fair: reading the words is the right tool for understanding content — summarizing a voicemail, analyzing what was said, post-call intelligence. That’s a real and useful job. It’s just a different job than deciding, in the first half-second of a live dial, whether to hand the call to an agent. One asks “what did they say?” The other asks “is anyone there?” — and for that, sound gets you the answer first.

Hear the difference on your own calls

Sub-second decisions, no telephony migration, free to try: the Sandbox plan is 50,000 detections a month, no card, 5-minute Vicidial install.