Vapi Silence Timeout and Endpointing: Complete Configuration Guide

Two of the most complained-about behaviors in Vapi agents come from the same place: the agent cuts the caller off mid-sentence, or the agent waits awkwardly long before responding. Both are controlled by silence detection and endpointing settings. Tune them right and the conversation feels natural. Tune them wrong and every call is frustrating. This guide walks through each of those settings.

What Silence Detection Actually Does

Silence detection is how Vapi decides that the caller finished speaking. After a configurable number of milliseconds of silence on the caller's audio channel, Vapi closes the turn, sends the transcript to the LLM, and starts generating a response. Too short, and the agent interrupts natural pauses. Too long, and the response feels delayed.
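To make the tradeoff concrete, here is a minimal sketch of silence-based turn detection. It is an illustration of the idea only, not Vapi's actual implementation: the function name, the per-frame voice flags, and the frame size are all assumptions made for the example.

```python
from typing import Optional, Sequence


def find_turn_end(voiced: Sequence[bool], frame_ms: int, silence_ms: int) -> Optional[int]:
    """Return the index of the frame where the turn closes, or None.

    voiced[i] is True when frame i contains caller speech. The turn closes
    once the caller has spoken at least once and then stays silent for
    silence_ms in a row. (Hypothetical helper, for illustration only.)
    """
    needed = -(-silence_ms // frame_ms)  # ceiling division: silent frames required
    heard_speech = False
    silent_run = 0
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


if __name__ == "__main__":
    # 100 ms frames: 300 ms of speech, a 200 ms pause, 200 ms more speech, then silence.
    frames = [True] * 3 + [False] * 2 + [True] * 2 + [False] * 6
    print(find_turn_end(frames, frame_ms=100, silence_ms=200))  # closes inside the pause
    print(find_turn_end(frames, frame_ms=100, silence_ms=500))  # waits for the real end
```

With a 200 ms threshold the turn closes during the caller's mid-sentence pause; with 500 ms it closes only after the caller has actually finished, at the cost of a longer wait.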

The Key Settings

silenceTimeoutSeconds: how long the caller can be silent before the agent prompts them (e.g., "are you still there?"). Typical range 20 to 60 seconds.

Endpointing thresholds: how quickly Vapi decides a caller's turn is finished. Controlled via the model and transcriber config.

startSpeakingPlan.waitSeconds: how long to wait after silence before starting the response. Typical 0.4 to 0.8 seconds.

stopSpeakingPlan: controls how the agent handles interruptions. numWords is how many words the caller must say before the agent stops talking (commonly set to 2 to 3). voiceSeconds is how long the caller must speak before the agent stops (commonly around 0.3s).

Recommended Baseline Values

For most inbound receptionist agents: silenceTimeoutSeconds: 30, startSpeakingPlan.waitSeconds: 0.5, stopSpeakingPlan.numWords: 2, stopSpeakingPlan.voiceSeconds: 0.3. These feel natural for typical phone conversation and are a good starting point before tuning.

For outbound cold calling: silenceTimeoutSeconds: 15 (caller is less engaged, cut dead air faster), startSpeakingPlan.waitSeconds: 0.4, stopSpeakingPlan.numWords: 2. Shorter thresholds because outbound calls are more script-driven.
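The two baselines above can be kept side by side as plain data. This is a sketch only: the field names and values are the ones listed in this guide, while the nesting, the helper name baseline_for, and the "inbound"/"outbound" keys are assumptions for the example. Check the current Vapi API reference before sending anything like this in a real request.

```python
import copy
import json

# Values come from the baselines in this guide; the nesting is an assumption.
BASELINES = {
    "inbound": {
        "silenceTimeoutSeconds": 30,
        "startSpeakingPlan": {"waitSeconds": 0.5},
        "stopSpeakingPlan": {"numWords": 2, "voiceSeconds": 0.3},
    },
    "outbound": {
        "silenceTimeoutSeconds": 15,
        "startSpeakingPlan": {"waitSeconds": 0.4},
        "stopSpeakingPlan": {"numWords": 2},
    },
}


def baseline_for(call_type: str) -> dict:
    """Return a fresh copy of the baseline so callers can tune it safely."""
    if call_type not in BASELINES:
        raise ValueError(f"unknown call type: {call_type!r}")
    return copy.deepcopy(BASELINES[call_type])


if __name__ == "__main__":
    config = baseline_for("inbound")
    config["startSpeakingPlan"]["waitSeconds"] = 0.6  # tune the copy, not the baseline
    print(json.dumps(config, indent=2))
```

Returning a copy keeps the starting values intact, which makes it easier to compare a tuned configuration against the baseline later.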

Effect of startSpeakingPlan.waitSeconds on Caller Experience

0.2s (interrupts natural pauses): 28% caller OK
0.4s (slightly too eager): 72% caller OK
0.5s (natural feel): 94% caller OK
0.8s (slight perceived delay): 78% caller OK
1.2s (awkward delay): 42% caller OK

Smart Endpointing

Smart endpointing uses a small model to predict whether the caller actually finished a thought or is just pausing. With smart endpointing enabled, you can use shorter silence thresholds without the agent interrupting, because the model distinguishes between "I'm done" pauses and "I'm thinking" pauses.

Enable it in the assistant's advanced settings. It adds a small latency cost (30 to 80ms) but dramatically improves conversation feel. For most use cases the tradeoff is worth it.

Interruption Handling

stopSpeakingPlan controls what happens when the caller speaks while the agent is talking. numWords is the threshold of words spoken before the agent stops. If you set it to 1, the agent stops every time the caller says "yeah" or "uh-huh" while the agent is explaining something. This is annoying. Set it to 2 or 3 so backchannels do not trigger full interruption.

voiceSeconds is the duration threshold. If you set it low (0.1s), any cough or background noise triggers interruption. Set it around 0.3s to require actual speech.
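The effect of each threshold is easy to see in a toy model. The sketch below treats the two thresholds as independent checks, because this guide does not describe how Vapi combines them internally; the function names are hypothetical and the word count is a naive whitespace split.

```python
def passes_word_threshold(caller_text: str, num_words: int) -> bool:
    """True when the caller has said at least num_words words."""
    return len(caller_text.split()) >= num_words


def passes_voice_threshold(voiced_seconds: float, voice_seconds: float) -> bool:
    """True when the caller's continuous audio lasted at least voice_seconds."""
    return voiced_seconds >= voice_seconds


if __name__ == "__main__":
    # A one-word backchannel interrupts at numWords=1 but not at numWords=2.
    print(passes_word_threshold("yeah", 1), passes_word_threshold("yeah", 2))
    # A 0.15 s cough interrupts at voiceSeconds=0.1 but not at voiceSeconds=0.3.
    print(passes_voice_threshold(0.15, 0.1), passes_voice_threshold(0.15, 0.3))
```

A real interruption such as "wait, hold on" still clears a two-word threshold, so raising numWords from 1 to 2 filters backchannels without blocking genuine interruptions.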

Silence Timeout Behavior

When the caller goes silent for silenceTimeoutSeconds, the agent speaks a configured message (silenceMessage) and resets the timer. After a configurable number of total silence events (maxSilences), Vapi ends the call. Default is usually 3 silences before hanging up.

Tune silenceMessage to be natural. "Are you still there?" is fine the first time. On a second silence, "I'm not hearing anything, feel free to call back when you're ready" is better than repeating the same message. You can configure different messages per silence count.
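The escalation logic described above can be sketched as follows. This is a hypothetical helper, not a Vapi API: the function name and the list-of-messages shape are assumptions, and it assumes the call ends on the silence event that reaches the limit rather than after one more prompt.

```python
from typing import Optional, Sequence


def silence_action(silence_count: int, messages: Sequence[str], max_silences: int = 3) -> Optional[str]:
    """Return the message for the Nth silence (1-based), or None to end the call."""
    if silence_count < 1:
        raise ValueError("silence_count is 1-based")
    if not messages:
        raise ValueError("at least one message is required")
    if silence_count >= max_silences:
        return None  # limit reached: hang up instead of prompting again
    # Reuse the last message if there are more silences than messages.
    index = min(silence_count - 1, len(messages) - 1)
    return messages[index]


if __name__ == "__main__":
    prompts = [
        "Are you still there?",
        "I'm not hearing anything, feel free to call back when you're ready.",
    ]
    for count in (1, 2, 3):
        print(count, silence_action(count, prompts))
```

With the limit of 3 used here, the first silence gets the gentle check-in, the second gets the sign-off message, and the third ends the call.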

Noise and Background Audio

If the caller is in a noisy environment (car, cafe), background noise can trigger the voice activity detector and confuse endpointing. Enable background noise suppression in the assistant settings. This applies denoising before endpointing, which dramatically improves turn detection in noisy calls.

Language-Specific Tuning

Different languages have different natural pause rhythms. Japanese conversation, for example, has longer inter-turn pauses than English. If your agent handles non-English calls, bump silence thresholds by 20 to 40 percent for Japanese, Korean, and some Nordic languages. Spanish and Italian often need shorter thresholds because natural pauses are shorter.
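As a worked example of the 20 to 40 percent adjustment, the sketch below scales a base threshold into a low and high bound. The language codes, the table, and the function name are assumptions for illustration. Only Japanese and Korean are listed, because the guide does not name specific Nordic languages or give a figure for Spanish and Italian.

```python
from typing import Dict, Tuple

# Multipliers for the 20 to 40 percent bump; language codes are illustrative.
SCALE_RANGE: Dict[str, Tuple[float, float]] = {
    "ja": (1.2, 1.4),
    "ko": (1.2, 1.4),
}


def scaled_range(base_seconds: float, language: str) -> Tuple[float, float]:
    """Return (low, high) thresholds for a language; unlisted languages are unchanged."""
    low, high = SCALE_RANGE.get(language, (1.0, 1.0))
    return (round(base_seconds * low, 3), round(base_seconds * high, 3))


if __name__ == "__main__":
    # A 0.5 s English baseline becomes roughly 0.6 to 0.7 s for Japanese.
    print(scaled_range(0.5, "ja"))
    print(scaled_range(0.5, "en"))
```

Treat the resulting range as a starting point for test calls in that language, not as a final value.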

Symptoms and Their Usual Cause

Agent cuts caller off mid-sentence: Tune endpoint
Long awkward delay before agent responds: Tune endpoint
Agent keeps repeating itself when caller pauses: Tune endpoint
Random background noise triggers interruption: Tune endpoint

How to Tune in Practice

Place 10 test calls. Watch the transcripts for turn-taking issues. If the agent cuts callers off, increase startSpeakingPlan.waitSeconds by 0.1s and retest. If the agent feels slow, decrease it by 0.1s and enable smart endpointing. Iterate. The right setting for your use case depends on your caller demographic. Older callers tend to need longer pauses; younger callers find longer pauses awkward.
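One round of that loop can be sketched as a small function. The 0.1s step comes from the procedure above. Everything else is an assumption for the example: the function name, the rule of comparing cut-off counts against slow-response counts, and the 0.2s and 1.2s bounds.

```python
def next_wait_seconds(
    current: float,
    cut_offs: int,
    slow_responses: int,
    step: float = 0.1,
    floor: float = 0.2,
    ceiling: float = 1.2,
) -> float:
    """Suggest the next waitSeconds value after a batch of test calls.

    cut_offs: calls where the agent interrupted the caller.
    slow_responses: calls where the agent felt slow to respond.
    """
    if cut_offs > slow_responses:
        candidate = current + step  # agent is too eager: wait longer
    elif slow_responses > cut_offs:
        candidate = current - step  # agent is too slow: wait less
    else:
        candidate = current  # no clear signal: keep the current value
    return round(min(ceiling, max(floor, candidate)), 2)


if __name__ == "__main__":
    wait = 0.5
    # Batch of 10 calls with 3 cut-offs and no slow responses.
    wait = next_wait_seconds(wait, cut_offs=3, slow_responses=0)
    print(wait)
```

Changing one value by one step per batch keeps the cause of any improvement or regression easy to identify.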

Do not tune based on your own test calls exclusively. You know when you are about to finish a sentence; real callers do not telegraph it as cleanly. Record 20 real calls before you finalize the settings.