April 2026
6 min read

How to Reduce Vapi Latency to Under 800ms (Complete Tuning Guide)


Callers notice any response delay above roughly 1.5 seconds. Above 2.5 seconds they start interrupting. Above 4 seconds they hang up. If your Vapi agent feels slow, it is almost never one thing: it is usually two or three settings compounding into 1.5+ seconds of latency that could have been under 800ms with the right configuration.

This guide walks through the nine settings that actually move the needle on Vapi response time, roughly ordered by impact. The latency budget breaks down into four stages: silence detection (waiting to decide the caller finished talking), transcription (speech-to-text), LLM response generation, and text-to-speech synthesis. Each stage has knobs.

Understanding the Vapi Latency Budget

A well-tuned Vapi agent should hit roughly 300ms of silence detection, 150ms of STT finalization, 250ms of LLM first-token latency, and 100ms of TTS first-audio. That sums to around 800ms from caller silence to the start of the agent's response. Anything above 1200ms feels slow. Anything above 2000ms feels broken.

1. Switch to a Fast Model

The single biggest latency win is picking a fast LLM. GPT-4 Turbo has a first-token latency of roughly 400 to 800ms. GPT-4o Realtime is dramatically faster at 150 to 300ms. Claude 3.5 Haiku hits 200 to 400ms. Groq-hosted Llama 3 models can hit sub-100ms first-token latency when the context is small. The tradeoff is quality, but for most voice use cases like appointment booking, hours-and-location Q&A, or intake flows, a fast model's answers are indistinguishable from a slower, larger model's.

Change the provider and model in your Vapi assistant's model section. If you were on GPT-4 Turbo, switch to GPT-4o or Claude Haiku first and A/B test. Do not jump straight to Groq unless you have measured your current baseline.
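
As a rough sketch, assuming the standard Vapi REST API shape (PATCH /assistant/:id) and an API key plus assistant ID in your environment, the model swap is a one-field update:

```typescript
// Sketch: point an existing Vapi assistant at a faster model.
// Assumes VAPI_API_KEY and ASSISTANT_ID are set; verify field names
// against the current Vapi API reference before shipping.
const res = await fetch(`https://api.vapi.ai/assistant/${process.env.ASSISTANT_ID}`, {
  method: "PATCH",
  headers: {
    Authorization: `Bearer ${process.env.VAPI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: {
      provider: "openai",
      model: "gpt-4o", // measure first, then consider claude-3-5-haiku or a Groq-hosted model
    },
  }),
});
console.log(res.status, await res.json());
```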

2. Shorten the System Prompt

Every token in your system prompt adds to the LLM processing time. A 3000-word system prompt can add 200 to 400ms of latency vs a tight 500-word one. Audit your prompt ruthlessly. Remove examples the model does not need, strip redundant instructions, and move static knowledge to a tool call or a retrieval-augmented generation step instead of baking it into the prompt.

A good target is under 1000 tokens for the system prompt. Everything beyond that should live in a separate knowledge base that gets queried only when relevant.
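
A quick way to audit this is a rough token estimate. The sketch below uses the common ~4 characters-per-token approximation for English text, which is not exact but is close enough to tell you whether you are over the 1000-token target:

```typescript
// Rough system-prompt audit using the ~4 chars/token approximation for English.
const systemPrompt = `...paste your current system prompt here...`;

const approxTokens = Math.ceil(systemPrompt.length / 4);
const approxWords = systemPrompt.trim().split(/\s+/).length;

console.log(`~${approxTokens} tokens (${approxWords} words)`);
if (approxTokens > 1000) {
  console.warn("Over the ~1000-token target: move static knowledge into a tool or knowledge base.");
}
```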

3. Enable Streaming on the Model Response

Vapi supports streaming responses from the LLM, which means TTS starts speaking the first sentence while the model is still generating the rest. If streaming is disabled, Vapi waits for the full response before speaking, adding 500 to 1500ms of perceived latency. This should be on by default but double-check in the advanced settings.
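
The reason streaming helps is simple: the pipeline can hand each completed sentence to TTS while the model keeps generating. The sketch below illustrates the principle only; it is not Vapi's internal pipeline, and speakSentence() is a hypothetical stand-in for whatever TTS call you would use:

```typescript
// Illustration of sentence-level streaming: speak each finished sentence
// as soon as it arrives instead of waiting for the full LLM reply.
// speakSentence() is a hypothetical stand-in for a TTS call.
async function speakStreamed(
  tokens: AsyncIterable<string>,
  speakSentence: (s: string) => Promise<void>,
): Promise<void> {
  let buffer = "";
  for await (const token of tokens) {
    buffer += token;
    const match = buffer.match(/^(.+?[.!?])\s+(.*)$/s); // a complete sentence is ready
    if (match) {
      await speakSentence(match[1]); // audio starts here, long before generation finishes
      buffer = match[2];
    }
  }
  if (buffer.trim()) await speakSentence(buffer); // flush the tail
}
```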

Typical Latency Reduction by Setting Change

Switching GPT-4 Turbo to GPT-4o: 55% faster
Enabling response streaming: 40% faster
Shortening system prompt by 2000 tokens: 30% faster
Switching ElevenLabs to PlayHT Turbo: 25% faster
Tuning endpointing timeout: 20% faster

4. Tune the Endpointing Threshold

Endpointing is how Vapi decides the caller has stopped talking. The default waits for roughly 500ms of silence. If your callers speak in short phrases with natural pauses, this triggers mid-sentence and the agent interrupts. If you bump it up to 1000ms to stop the interruptions, you pay for that directly in response latency.

The right answer for most production agents is 300 to 500ms with smart endpointing enabled. Smart endpointing uses a small model to predict whether the caller is actually finished or just pausing, which lets you use a shorter timeout without the interruption problem.
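
In recent Vapi API versions this lives on the assistant's start-speaking plan. Treat the field names below (startSpeakingPlan, waitSeconds, smartEndpointingEnabled) as a sketch and check them against the current API reference:

```typescript
// Sketch: tighten the silence threshold and turn on smart endpointing.
// Applied with the same PATCH call shown in section 1.
const endpointingUpdate = {
  startSpeakingPlan: {
    waitSeconds: 0.4,              // ~400ms of silence before the agent responds
    smartEndpointingEnabled: true, // small model predicts "finished" vs. "just pausing"
  },
};
```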

5. Pick a Faster TTS Provider

ElevenLabs Turbo is excellent quality but has 300 to 500ms of first-audio latency. PlayHT Turbo lands in the 200 to 300ms range. Deepgram Aura can hit under 200ms. Cartesia Sonic is the fastest of the major providers at 90 to 150ms. The quality gap between ElevenLabs and Cartesia is narrow enough for most agents that the latency win is worth it.
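
Swapping providers is the same kind of one-field change on the assistant's voice config; the voiceId below is a placeholder, and the exact shape should be checked against Vapi's voice docs:

```typescript
// Sketch: switch the TTS provider on the assistant (same PATCH call as section 1).
const voiceUpdate = {
  voice: {
    provider: "cartesia",
    voiceId: "YOUR_CARTESIA_VOICE_ID", // placeholder: pick a voice in the dashboard
  },
};
```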

6. Pre-warm the Pipeline

Cold starts are real. The first call after a period of inactivity is often 500 to 1000ms slower than the second call because the LLM connection, STT pipeline, and TTS are all being established. If your calls are bursty, add a cron-job webhook that fires a synthetic warm-up call every few minutes during business hours. The cost is negligible, and the latency benefit on the first real call of each hour is huge.
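
A minimal sketch of that warm-up job, assuming a /warmup endpoint on your own backend that exercises the same LLM and TTS connections your tools use (the endpoint is hypothetical; a short synthetic Vapi call works just as well):

```typescript
// Sketch: keep the pipeline warm during business hours with node-cron.
import cron from "node-cron";

// Every 5 minutes, weekdays, 09:00-17:59.
cron.schedule("*/5 9-17 * * 1-5", async () => {
  const res = await fetch("https://your-backend.example.com/warmup", { method: "POST" });
  console.log(`warmup ${new Date().toISOString()}: ${res.status}`);
});
```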

7. Eliminate Slow Tool Calls

Every tool call blocks the response until the webhook returns. If you have a HubSpot lookup that takes 3 seconds to return the customer record, the caller is hearing silence for 3 seconds. Audit every tool in the assistant configuration. Any tool that averages over 1 second needs to be either cached, moved off the critical path, or optimized.

For CRM lookups, cache the record by phone number at the start of the call instead of fetching it live every time. For knowledge base queries, pre-compute embeddings and use a local vector DB instead of calling OpenAI at runtime.
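
A simple in-memory cache keyed by caller phone number is usually enough; fetchCrmRecord() below is a stand-in for your real HubSpot (or other CRM) lookup:

```typescript
// Sketch: cache the CRM record per phone number so the tool webhook
// answers instantly on every turn after the first lookup.
type CrmRecord = Record<string, unknown>;

const cache = new Map<string, { record: CrmRecord; expires: number }>();
const TTL_MS = 10 * 60 * 1000; // comfortably longer than a single call

async function getCrmRecord(
  phone: string,
  fetchCrmRecord: (p: string) => Promise<CrmRecord>, // your real CRM lookup
): Promise<CrmRecord> {
  const hit = cache.get(phone);
  if (hit && hit.expires > Date.now()) return hit.record; // fast path
  const record = await fetchCrmRecord(phone);             // slow path, once per call
  cache.set(phone, { record, expires: Date.now() + TTL_MS });
  return record;
}
```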

8. Drop Unnecessary Transcription Providers

Vapi defaults to Deepgram for transcription, which is fast. If you switched to Whisper for accuracy reasons, you are paying 500 to 1000ms extra. For English with clear audio, Deepgram Nova is almost as accurate at a fraction of the latency. Reserve Whisper for multi-language or very noisy environments.
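
This lives in the transcriber block of the assistant config; model naming (nova-2 vs nova-3) depends on what your account exposes, so treat the values below as a sketch:

```typescript
// Sketch: set the transcriber back to Deepgram Nova (same PATCH call as section 1).
const transcriberUpdate = {
  transcriber: {
    provider: "deepgram",
    model: "nova-2",
    language: "en",
  },
};
```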

9. Check Your Network Path

If your n8n or custom backend is hosted far from Vapi's region, every tool call round-trip eats 100 to 300ms just in network latency. Vapi routes through US-East primarily. If your n8n is on a European server, factor in a 150 to 200ms round trip per tool call. Host your backend in the same region as your Vapi deployment or use an edge deployment platform.
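
You can put a number on the geography cost by timing a few round trips from the box that hosts your tools to any URL on the other end (the webhook URL below is a placeholder):

```typescript
// Sketch: measure median network round-trip time from your backend host.
async function medianRoundTripMs(url: string, samples = 5): Promise<number> {
  const times: number[] = [];
  for (let i = 0; i < samples; i++) {
    const start = performance.now();
    await fetch(url, { method: "HEAD" });
    times.push(performance.now() - start);
  }
  times.sort((a, b) => a - b);
  return times[Math.floor(times.length / 2)]; // median ignores one-off spikes
}

console.log(`median RTT: ${(await medianRoundTripMs("https://your-n8n.example.com/healthz")).toFixed(0)}ms`);
```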

Typical Latency Breakdown by Stage (Well-Tuned vs Default)

[Chart: share of the total response-time budget consumed by each stage (endpointing/silence detection, speech-to-text finalization, LLM first-token latency, text-to-speech first-audio) for a default versus a well-tuned configuration.]

How to Measure Vapi Latency Properly

Vapi's own call logs include per-stage latency metrics under the call detail view. Look at messages with role assistant and check the timing breakdown, which shows how long each stage took. Do this for ten real calls across different times of day. Outliers are not the problem. The median is. Optimize for the median call experience, not the worst case.

After making any changes, place twenty test calls and re-measure. Do not trust your ears because your perception of latency is heavily biased by whether you are expecting a delay. Pull the numbers from the Vapi dashboard and compare.
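
If you want the comparison scripted rather than eyeballed, the sketch below pulls recent calls from Vapi's list endpoint (GET /call) and computes a median. The exact location of per-stage timings inside each call object varies, so extractLatencyMs() is a placeholder you adapt to your own payloads:

```typescript
// Sketch: fetch recent calls and compute a median latency figure.
const calls: any[] = await (
  await fetch("https://api.vapi.ai/call?limit=20", {
    headers: { Authorization: `Bearer ${process.env.VAPI_API_KEY}` },
  })
).json();

// Placeholder: map this to wherever the timing you care about lives in your call objects.
const extractLatencyMs = (call: any): number | null => call?.latencyMs ?? null;

const samples = calls
  .map(extractLatencyMs)
  .filter((v): v is number => v !== null)
  .sort((a, b) => a - b);

console.log(`median over ${samples.length} calls: ${samples[Math.floor(samples.length / 2)]}ms`);
```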
