Sound, not words: the case against transcribing every call
There’s a fashionable idea that the future of answering machine detection is transcription — run speech-to-text on the call, read the words, classify the transcript. It’s clever. For a predictive dialer making a split-second decision, it’s the wrong tool. Here’s why we bet on the sound.
The transcription pitch goes like this: a voicemail is defined by what it says — “you’ve reached,” “please leave a message,” “at the tone.” So transcribe the audio with a speech model, then classify the text. Read the words, know the answer. As a way to understand content, it’s genuinely good. As the engine for the real-time human-or-machine decision on a live dial, it loses — for five concrete reasons.
1. It’s too slow for a dialer
Transcription needs a chunk of speech before it has anything to read. Published transcription-based AMD approaches issue their early decision around two seconds in. On a predictive dialer, two seconds is an eternity: it’s the difference between handing an agent a live person and dropping that person while the model waits for a sentence. The decision has to land in a few hundred milliseconds, not a few seconds.
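To put that timing gap in concrete terms, here is a rough sketch of how much audio is in hand at each decision point. The 8 kHz sample rate is the standard narrowband telephony rate; the 20 ms frame size and the 300 ms budget are illustrative assumptions standing in for “a few hundred milliseconds,” not AMDY’s actual parameters.

```python
SAMPLE_RATE_HZ = 8000  # standard narrowband telephony sample rate
FRAME_MS = 20          # assumed frame size, for illustration only


def frames_available(decision_ms, frame_ms=FRAME_MS):
    """Whole audio frames received before a decision made at decision_ms."""
    return decision_ms // frame_ms


def samples_available(decision_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Audio samples received before a decision made at decision_ms."""
    return decision_ms * sample_rate_hz // 1000


# Hypothetical acoustic budget vs. the roughly two seconds cited above.
for label, budget_ms in [("acoustic (assumed)", 300), ("transcription", 2000)]:
    print(label, frames_available(budget_ms), samples_available(budget_ms))
# acoustic (assumed) 15 2400
# transcription 100 16000
```

Under these assumptions, the two-second approach holds the call for 85 extra frames of audio before it commits to an answer.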
2. It fails when a real person says nothing
A large share of real answers start with no usable speech: the person picks up and waits for you to speak, or they’re in a noisy room where their words are buried, or they just haven’t said a word yet. A model that reads words has nothing to read, so it falls back to “nothing there” and your live prospect gets dropped. A human who answers and simply breathes makes a sound; they don’t always make a sentence.
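The point that a wordless answer still makes a sound can be illustrated with a simple level check. This is a minimal sketch, not AMDY’s method: the function names, the -60 dBFS floor, and the synthetic “breath” signal are all assumptions chosen for illustration, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def rms_dbfs(samples):
    """RMS level of a frame in dB relative to full scale (1.0)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def has_sound(samples, floor_db=-60.0):
    """True if the frame is louder than an assumed dead-air floor."""
    return rms_dbfs(samples) > floor_db


# 20 ms frames at 8 kHz (160 samples each).
dead_air = [0.0] * 160
# Crude stand-in for faint breath noise: low-level alternating signal.
faint_breath = [0.005 * (-1) ** i for i in range(160)]

print(has_sound(dead_air))      # False
print(has_sound(faint_breath))  # True
```

A transcript of both frames would be equally empty, yet the level check separates them. A real detector would need far more than loudness, but the example shows that the information is in the audio even when no words are.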
3. It’s bound to language
Speech-to-text models are typically trained mostly on clean, native-speaker English. Heavy accents, non-English lists, bilingual markets, and code-switching all degrade them, and those are exactly the conditions a lot of outbound runs in. The sound of a person picking up a phone, and the sound of a machine playing a greeting, have the same acoustic character in every language. An acoustic decision doesn’t need to understand the words to know what answered.
4. Most of what answers a call has no words at all
Think about everything your dialer actually hits: beeps, carrier tones, fax signals, dead air, carrier false-answers, spam-trap intercepts. None of those are speech. A transcript of them is blank or garbage. But acoustically they’re unmistakable — each has a signature. A detector that only knows how to read words is blind to the majority of the things it needs to catch.
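To make “each has a signature” concrete, here is a minimal sketch of one classic way to spot a pure tone such as a beep: the Goertzel algorithm, which measures signal energy at a single frequency. This is a textbook technique used for illustration, not a description of how AMDY works; the function names, the 1000 Hz target, and the 20 ms frame are assumptions.

```python
import math


def goertzel_power(samples, sample_rate, target_hz):
    """Squared DFT magnitude at the bin nearest target_hz (Goertzel)."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def tone_ratio(samples, sample_rate, target_hz):
    """Fraction of frame energy at target_hz: near 1 for a pure tone."""
    total = sum(x * x for x in samples)
    if total == 0:
        return 0.0  # dead air: no energy anywhere
    n = len(samples)
    return 2.0 * goertzel_power(samples, sample_rate, target_hz) / (n * total)


RATE = 8000   # standard narrowband telephony sample rate
FRAME = 160   # 20 ms frame, assumed for illustration

beep = [math.sin(2 * math.pi * 1000 * i / RATE) for i in range(FRAME)]
other = [math.sin(2 * math.pi * 400 * i / RATE) for i in range(FRAME)]

print(tone_ratio(beep, RATE, 1000))   # close to 1.0
print(tone_ratio(other, RATE, 1000))  # close to 0.0
```

A speech recognizer would return nothing useful for either frame, while a single-frequency energy check tells them apart in 20 ms of audio. Real tones, fax signals, and carrier intercepts vary by network and device, so the frequencies to watch for would have to come from measurement rather than from this example.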
5. It’s heavy for a yes/no
Running a full speech-recognition model on every answered call is a lot of compute to answer a single binary question. At dialer volume, that cost and latency add up — for a decision that doesn’t actually require knowing the words.
What AMDY does instead
AMDY classifies the sound — the acoustic signature of what answered the phone — the way your own ear knows a person picked up before they finish a word. It doesn’t transcribe, so it isn’t waiting on speech, isn’t thrown by silence, isn’t bound to a language, and isn’t blind to the beeps, tones, and dead air that make up most of what a dialer hits. The decision lands in a fraction of a second.
Where transcription does belong
To be fair: reading the words is the right tool for understanding content — summarizing a voicemail, analyzing what was said, post-call intelligence. That’s a real and useful job. It’s just a different job than deciding, in the first half-second of a live dial, whether to hand the call to an agent. One asks “what did they say?” The other asks “is anyone there?” — and for that, sound gets you the answer first.
Hear the difference on your own calls
Sub-second decisions, no telephony migration, free to try: the Sandbox plan is 50,000 detections a month, no card, 5-minute Vicidial install.