How to Make Vapi Sound More Human (14 Settings That Matter)
The difference between a Vapi agent that sounds like a robot and one that sounds like a real person is not one big setting. It is fourteen small ones stacked together. Most people get the voice right, then leave the prompt, pacing, and conversation controls on default and wonder why callers immediately know they are talking to AI. This guide walks through every setting that actually moves the needle on perceived humanness, ranked by impact.
1. Pick a Voice That Was Trained on Real Conversations
Not all TTS voices are equal. Voices trained on audiobook narration sound professional but stilted on phone calls. Voices trained on podcast and conversation data sound natural. For ElevenLabs, voices like Josh, Rachel, and some of the newer conversational voices are designed for chat. For Cartesia, Sonic voices are specifically tuned for real-time conversation. Avoid anything labeled "narration" or "documentary."
Test 5 voices on a 90-second scripted call and listen to the recordings back-to-back. The gap is obvious. Pick the one that sounds like someone you would hire.
2. Add Filler Words to the System Prompt
Humans say "um," "uh," "let me see," and "one moment" constantly. AI agents by default do not, which is the most immediate tell. Add explicit instructions to the system prompt: "Use natural filler words like ‘let me check’, ‘one moment’, and ‘hmm’ when processing information. Occasionally say ‘sure’ or ‘of course’ when acknowledging requests."
The model will sprinkle these naturally into responses. Overdoing it sounds worse than none, so do not request filler in every sentence.
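A minimal sketch of how that instruction could be assembled in code, so the filler list stays easy to tune between test calls. The function name `build_filler_instructions` and the placeholder base prompt are illustrative assumptions, not part of Vapi.

```python
def build_filler_instructions(fillers, acknowledgements):
    """Return a prompt paragraph asking for occasional, not constant, filler."""
    quoted_fillers = ", ".join(f"'{f}'" for f in fillers)
    quoted_acks = " or ".join(f"'{a}'" for a in acknowledgements)
    return (
        f"Use natural filler words like {quoted_fillers} when processing information. "
        f"Occasionally say {quoted_acks} when acknowledging requests. "
        "Do not use filler in every sentence."
    )


# Placeholder base prompt (assumption); replace with your own.
BASE_PROMPT = "You are a phone receptionist."

system_prompt = BASE_PROMPT + "\n\n" + build_filler_instructions(
    fillers=["let me check", "one moment", "hmm"],
    acknowledgements=["sure", "of course"],
)
print(system_prompt)
```

The last sentence of the generated paragraph carries the "do not overdo it" rule, so the guardrail travels with the instruction.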
3. Enable Backchanneling
Backchannels are small acknowledgements like "mm-hmm," "yeah," and "got it" that humans make while the other person is still talking. They signal active listening. Vapi supports backchanneling natively in the assistant settings. Turn it on. The caller will hear "mm-hmm" and "right" while they are giving information, which makes the conversation feel two-way instead of one-sided.
4. Tune the Speed of the TTS
Default TTS speed is usually slightly slow for phone conversation. Human phone conversation happens at roughly 150 to 160 words per minute. Most TTS defaults land at 130 to 140, which sounds deliberate and robotic. Bump the speed up by 10 to 15 percent in the voice settings. The speech becomes noticeably more natural without sacrificing clarity.
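To turn those numbers into a concrete multiplier, measure the current pace from a test recording and divide the target pace by it. The sketch below does that arithmetic; the helper names are illustrative, and how the multiplier maps onto a given voice provider's speed control is an assumption you should check in its settings.

```python
def measured_wpm(word_count: int, seconds: float) -> float:
    """Words per minute for a clip of known length."""
    return word_count / seconds * 60


def speed_multiplier(current_wpm: float, target_wpm: float) -> float:
    """Factor to apply to the TTS speed to reach the target pace."""
    return round(target_wpm / current_wpm, 2)


# A 20-second test clip containing 45 words.
current = measured_wpm(word_count=45, seconds=20)
print(current)                         # 135.0

# Midpoint of the 150 to 160 WPM target range quoted above.
print(speed_multiplier(current, 155))  # 1.15
```

A result of 1.15 is the 15 percent bump at the top of the suggested range. If your measured pace is already near 150, leave the speed alone.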
[Figure: Settings That Most Move Caller Perception of Humanness]
5. Keep Responses Short
Humans on phone calls say one or two sentences at a time. AI agents default to paragraph-long answers because LLMs trained on text want to be thorough. Instruct the system prompt: "Keep all responses to one or two sentences. Ask follow-up questions instead of giving long explanations. Never give a list of more than 3 items."
Short responses sound conversational. Long responses sound scripted.
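The prompt instruction is not always obeyed, so it helps to audit call transcripts for turns that ran long. The sketch below is a rough offline check, not a Vapi feature: it counts sentences with a simple punctuation heuristic and flags any agent turn over the two-sentence limit.

```python
import re

MAX_SENTENCES = 2


def count_sentences(text: str) -> int:
    """Count sentences by splitting on ., ! or ? followed by space or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])


def is_phone_length(text: str, max_sentences: int = MAX_SENTENCES) -> bool:
    """True if the turn fits the one-or-two-sentence rule."""
    return count_sentences(text) <= max_sentences


agent_turns = [
    "Sure, I can help. What day works for you?",
    "We open at 9. We close at 5. We are closed on Sundays.",
]
for turn in agent_turns:
    print(is_phone_length(turn), "-", turn)
```

The first turn passes and the second is flagged. The heuristic will miscount abbreviations such as "Dr. Lee", which is acceptable for a spot check.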
6. Interruption Handling
Vapi lets you configure how the agent handles interruptions. By default, the agent stops speaking when the caller interrupts, which is the correct behavior. In addition, enable the setting that lets the agent briefly acknowledge the interruption and then continue its response, rather than starting a new turn from scratch. Humans do this naturally; AI agents often do not.
7. Write the First Message Like a Human Would Say It
The first message sets the tone for the whole call. "Hello, thank you for calling Acme Dental. How may I assist you today?" is formal and robotic. "Hey, thanks for calling Acme, this is Sarah. What can I help you with?" sounds like a real receptionist. Use contractions. Use a first name. Drop "may I assist you."
8. Give the Agent a Personality
"You are a helpful assistant" produces a bland agent. "You are Sarah, a friendly and slightly casual front desk receptionist at a busy dental practice. You have been working there for three years and know the regulars by name" produces an agent with personality.
The personality should match the business. A dental receptionist is warm and casual. A law firm intake agent is professional and calm. A collections call agent is firm but respectful. Match the personality to the context.
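If you run agents for several kinds of business, one way to keep personas consistent is to generate the opening of the system prompt from a small table. This is a sketch under assumptions: the table holds only the three pairings named above, and the fallback tone and function name are illustrative choices.

```python
# Tone pairings taken from the examples above.
TONE_BY_BUSINESS = {
    "dental practice": "warm and casual",
    "law firm": "professional and calm",
    "collections agency": "firm but respectful",
}


def persona_prompt(name: str, role: str, business: str, years: int) -> str:
    """Build the persona paragraph that opens the system prompt."""
    # Fallback tone for unlisted businesses is an assumption.
    tone = TONE_BY_BUSINESS.get(business, "friendly and professional")
    return (
        f"You are {name}, a {role} at a {business}. "
        f"Your tone is {tone}. "
        f"You have been working there for {years} years."
    )


print(persona_prompt("Sarah", "front desk receptionist", "dental practice", 3))
```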
9. Handle Silence and Ambient Noise
When the caller pauses to think, the agent should not panic and repeat itself. Vapi has a silence timeout setting that controls how long the agent waits before prompting the caller. Set it to 8 to 10 seconds, not 3. Real humans give each other time to think.
If there is background noise like a car horn or barking dog, the agent should not mistake it for speech and respond. Enable background noise suppression in the assistant settings.
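A sketch of how these two values could be tracked and sanity-checked before you push them to an assistant. The field names `idle_prompt_after_seconds` and `noise_suppression` are hypothetical and are not Vapi's actual keys; look up the real names in the Vapi API reference before using anything like this.

```python
from dataclasses import asdict, dataclass


@dataclass
class ListeningSettings:
    # Hypothetical field names; map them to the real keys in Vapi's API reference.
    idle_prompt_after_seconds: float = 8.0
    noise_suppression: bool = True

    def validate(self) -> None:
        """Reject delays outside the 8 to 10 second range suggested above."""
        if not 8.0 <= self.idle_prompt_after_seconds <= 10.0:
            raise ValueError("idle prompt delay should be 8 to 10 seconds")


settings = ListeningSettings(idle_prompt_after_seconds=9.0)
settings.validate()
print(asdict(settings))
```

Validating in code catches the common mistake of leaving a 3-second value in place after copying an older assistant.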
10. Use Natural Transitions
When moving between topics, humans say things like "so, about that...", "one more question...", "okay, next..." AI agents often jump abruptly. Teach the model to use transitions in the system prompt, especially when moving between intake questions.
11. Handle Names and Numbers Correctly
TTS engines butcher unusual names and long number sequences. If the agent has to say a phone number like 555-1234, the default reading can come out as "five hundred fifty-five one thousand two hundred thirty-four." Use SSML markup or pronunciation hints to spell out numbers digit by digit. For names, include a pronunciation guide in the system prompt for common difficult names like "Siobhan" or "Dhruv."
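If you are not sure your voice provider honors SSML, a plain-text fallback is to convert numbers to digit words before they reach the TTS engine. The sketch below is one way to do that; the function name is illustrative, and it reads 0 as "zero" rather than "oh".

```python
import re

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_digits(number: str) -> str:
    """Spell a number digit by digit, with a comma pause between groups."""
    groups = re.split(r"[^0-9]+", number)  # split on dashes, spaces, parentheses
    return ", ".join(
        " ".join(DIGIT_WORDS[d] for d in group) for group in groups if group
    )


print(spell_digits("555-1234"))  # five five five, one two three four
```

The comma between groups gives most TTS engines a short natural pause, which is how people read phone numbers aloud.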
12. Slow Down for Important Information
When reading back critical information like a confirmation number, address, or appointment time, the agent should slow down and enunciate. Add SSML break tags or instruct the model to say things like "let me repeat that slowly: ..." for confirmations.
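For the SSML route, the standard `<break>` element inserts a timed pause. The sketch below wraps a confirmation code so each character is followed by a pause. Whether your TTS provider accepts SSML input, and how it interprets breaks, is an assumption to verify; the function name is illustrative.

```python
from xml.sax.saxutils import escape


def slow_readback(code: str, pause_ms: int = 400) -> str:
    """Wrap a code in SSML with a pause between each character."""
    chars = [escape(c) for c in code if not c.isspace()]
    pause = f'<break time="{pause_ms}ms"/>'
    return "<speak>" + pause.join(chars) + "</speak>"


print(slow_readback("A7 9Q", pause_ms=300))
# <speak>A<break time="300ms"/>7<break time="300ms"/>9<break time="300ms"/>Q</speak>
```

Start around 300 to 400 milliseconds and adjust by ear. Pauses that are too long sound as unnatural as no pauses at all.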
13. Emotional Context
If the caller is frustrated, the agent should sound empathetic, not cheerful. Some advanced TTS engines like ElevenLabs v3 support emotional tags. Even without that, the system prompt can instruct: "Match the caller's emotional tone. If they sound frustrated, acknowledge it. If they sound excited, match their energy."
14. Test With Real People, Not With Yourself
You know it is an AI, so you are biased. Have a friend or family member who does not know the agent is AI place a 3-minute call. Their reaction tells you everything. If they realize within 30 seconds, you have work to do. If they are confused only after the agent repeats itself or hits a weird edge case, you are close.
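If you run this test with several people, a simple tally makes the results comparable between tuning rounds. The sketch below assumes you note, for each tester, the number of seconds into the call at which they realized it was AI, or `None` if they never did. The data shown is made up for illustration.

```python
from typing import Optional, Sequence


def early_detection_rate(
    detect_times: Sequence[Optional[float]], threshold: float = 30.0
) -> float:
    """Share of testers who spotted the AI within `threshold` seconds."""
    if not detect_times:
        raise ValueError("need at least one test call")
    early = sum(1 for t in detect_times if t is not None and t <= threshold)
    return early / len(detect_times)


# Illustrative results from four test calls (seconds, or None if never detected).
results = [12, None, 95, 28]
print(early_detection_rate(results))  # 0.5
```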
[Figure: Caller Detection Rate by Tuning Level]
The 80/20 Rule
If you only do three things, do these: pick a conversational voice, instruct the model to use filler words and short responses, and enable backchanneling. Those three changes alone take most agents from obviously robotic to plausibly human. The other eleven settings take you the rest of the way, but the first three give you 80 percent of the benefit for 20 percent of the effort.