Voice AI with Gemini Live: Prototype Built in 3 Hours

Last week I had another one of those typical screening calls: friendly, structured — but somehow interchangeable. More digital form than actual conversation.

That got me thinking: this can be done better. So I sat down and spent an afternoon (realistically about 3 hours) finding out how hard it actually is today to build something significantly better.

The result: a simple voice prototype called Uschi — an AI receptionist for a fictional general practice in Cologne-Ehrenfeld.

The Prototype: Uschi, AI Receptionist

Her name is Uschi. Receptionist at a general practice in Cologne-Ehrenfeld. Mid-50s, been doing the job for 25 years. Asks how you're doing before getting to the appointment. Interrupts you when she has something to say. Switches to Turkish if you don't speak German. And looks up a free slot in the system on the side.

Uschi is an AI. Her personality is a system prompt with a few paragraphs. The rest comes from the model.


No scripted lines. A real conversation, in real time. If you hesitate, it waits. Interrupt it, and it stops talking. It also switches languages if you suddenly start speaking Spanish.

The whole thing runs in the browser. Microphone on, talk. No "please press 1." No "your call is important to us."

1. What Does It Cost?

Google and OpenAI both offer Realtime Voice APIs. The price differences are substantial depending on the model.

Cost per 3-Minute Conversation (Estimate)

| Model | Audio Input / 1M Tokens | Audio Output / 1M Tokens | 3-Min. Call | Note |
|---|---|---|---|---|
| Gemini 3.1 Flash Live | $3.00 | $12.00 | a few cents | Cheapest model with native audio processing |
| OpenAI gpt-realtime-mini | $10.00 | $20.00 | under 10 cents | Budget option |
| OpenAI gpt-realtime-1.5 | $32.00 | $64.00 | 30–50 cents | Flagship, best quality, significantly more expensive |

Sources: Google Gemini API Pricing, OpenAI API Pricing

Exact costs per conversation depend on many factors: conversation length, speaking/listening ratio, whether context caching is used, and how much conversation history is reprocessed per turn. The values above are rough orientation, not exact calculations.

Rough Projection for a Medical Practice

Assuming 50 calls per day, 22 working days, averaging 3 minutes per call:

| Model | Per Month (1,100 calls) |
|---|---|
| Gemini 3.1 Flash Live | low double digits € |
| OpenAI Realtime Mini | low double digits € |
| OpenAI Realtime 1.5 | triple digits € |
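The projection above can be sketched as a back-of-the-envelope calculation. The token rate per second of audio and the speaking ratio are my assumptions, not numbers from the pricing pages:

```python
# Rough cost model for a 3-minute call. Assumptions (mine, not official):
# audio is billed at about 32 tokens per second, the bot listens for the
# whole call and speaks roughly half of it.

AUDIO_TOKENS_PER_SEC = 32          # assumed audio tokenization rate
CALL_SECONDS = 180                 # 3-minute call
CALLS_PER_MONTH = 50 * 22          # 50 calls/day, 22 working days

def cost_per_call(in_price_per_m: float, out_price_per_m: float,
                  talk_ratio: float = 0.5) -> float:
    """USD cost of one call, given prices per 1M audio tokens."""
    tokens_in = CALL_SECONDS * AUDIO_TOKENS_PER_SEC                 # listening
    tokens_out = CALL_SECONDS * talk_ratio * AUDIO_TOKENS_PER_SEC   # speaking
    return tokens_in / 1e6 * in_price_per_m + tokens_out / 1e6 * out_price_per_m

gemini_call = cost_per_call(3.00, 12.00)
print(f"Gemini per call:  ${gemini_call:.3f}")
print(f"Gemini per month: ${gemini_call * CALLS_PER_MONTH:.0f}")
```

Under these assumptions a call lands at about 5 cents and the month in the mid double digits, consistent with the table.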

For comparison: a receptionist spending 30–40% of their time on the phone costs around €3,500–4,500 per month including employer contributions. So API costs alone aren't the bottleneck. What makes a production system expensive is development, telephony integration (SIP/PSTN), operations, and compliance.

2. How Does It Work Technically?

Three components, all connected via WebSockets:

Browser → Server → Gemini

The browser streams audio via a Python server to Google's Live API. Gemini responds with audio chunks that are played back directly. The entire server: under 300 lines of code.
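At its core, the server is two concurrent pumps: one forwards mic audio from the browser to Gemini, the other forwards Gemini's audio chunks back to the browser. A minimal sketch of that pattern, with asyncio queues standing in for the actual WebSocket connections (all names are mine, not from the prototype):

```python
import asyncio

async def pump(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    """Forward audio chunks from one connection to the other until EOF (None)."""
    while True:
        chunk = await source.get()
        if chunk is None:              # sender closed the stream
            await sink.put(None)
            return
        await sink.put(chunk)

async def relay(browser_in, gemini_out, gemini_in, browser_out):
    """Run both directions concurrently: mic audio up, model audio down."""
    await asyncio.gather(
        pump(browser_in, gemini_out),  # browser mic -> Gemini
        pump(gemini_in, browser_out),  # Gemini audio -> browser playback
    )

async def demo():
    b_in, g_out, g_in, b_out = (asyncio.Queue() for _ in range(4))
    for chunk in (b"pcm-1", b"pcm-2", None):    # simulated mic chunks
        await b_in.put(chunk)
    for chunk in (b"reply-1", None):            # simulated model reply
        await g_in.put(chunk)
    await relay(b_in, g_out, g_in, b_out)
    up, down = [], []
    while (c := await g_out.get()) is not None:
        up.append(c)
    while (c := await b_out.get()) is not None:
        down.append(c)
    return up, down

up, down = asyncio.run(demo())
print(up, down)
```

In the real server, the two queues on each side are replaced by the browser WebSocket and the Live API session; the pump loop itself barely changes.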

Why it sounds more natural than older voice assistants: Gemini 3.1 Flash Live works natively with audio. The model can process and generate audio directly, without routing through a separate speech-to-text and text-to-speech pipeline. The response sounds less read aloud and more spoken. A transcription function is optionally available if you need the text in parallel.

3. Tool Calling: When the AI Reaches Into the System

Without access to appointments, Uschi would just be a chatbot.

Gemini supports function calling within the live session. You define a function with parameters like date and time of day. When the patient says "Do you have anything free next Tuesday afternoon?", Gemini recognizes the intent and calls the function.

Important: the function call is synchronous. The conversation pauses briefly while the function executes. In practice, you barely notice because the query returns quickly. But it's not a true background action — more like briefly flipping through a calendar.

In the prototype, the function reads a text file. In reality, that would be an API to Doctolib, a practice management system, or a calendar. Defining the interface is simple. The real work is in the backend: authentication, error handling, permissions, logging. That's the part you don't build in an afternoon.
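A sketch of what such a tool definition and its synchronous handler can look like. The JSON-schema style matches how Gemini's function calling declares parameters; the tool name, the slot data, and the handler are illustrative, not the prototype's actual code:

```python
import json

# Tool declaration in the JSON-schema style Gemini's function calling expects.
FIND_SLOT_TOOL = {
    "name": "find_free_slot",
    "description": "Look up a free appointment slot in the practice calendar.",
    "parameters": {
        "type": "object",
        "properties": {
            "date": {"type": "string", "description": "ISO date, e.g. 2025-06-03"},
            "time_of_day": {"type": "string", "enum": ["morning", "afternoon"]},
        },
        "required": ["date"],
    },
}

# Stand-in for the prototype's text file of free slots.
FREE_SLOTS = {
    ("2025-06-03", "afternoon"): ["14:30", "15:45"],
    ("2025-06-03", "morning"): ["09:15"],
}

def handle_tool_call(name: str, args: dict) -> dict:
    """Executed synchronously while the conversation pauses."""
    if name != "find_free_slot":
        return {"error": f"unknown tool: {name}"}
    key = (args["date"], args.get("time_of_day", "morning"))
    return {"free_slots": FREE_SLOTS.get(key, [])}

# Simulated call: "Do you have anything free next Tuesday afternoon?"
result = handle_tool_call("find_free_slot",
                          {"date": "2025-06-03", "time_of_day": "afternoon"})
print(json.dumps(result))   # the model turns this JSON back into speech
```

Swapping the dictionary for a Doctolib or calendar API changes only the handler body; the declaration the model sees stays the same.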

4. Limitations of the Prototype

Uschi is a prototype. Built in an afternoon. Naturally, there are limits.

The voice sounds good but not perfect. In long sentences, you notice the synthetic character. Latency fluctuates between "instant" and "brief pause" depending on server load.

On the plus side: speech recognition works reliably even with dialects and accents. Language switching works better than expected (Gemini supports 97 languages). And the personality comes through: the dry humor, the Cologne charm. Uschi doesn't feel like a bot.

What the prototype doesn't show: the entire telephony layer is missing. No SIP, no real phone number, no routing, no call forwarding. This is a browser demo with a microphone, not a finished phone assistant. From this demo to a system that reliably takes calls in a real practice is a long road.

5. Data Privacy: What You Need to Know

What happens to the voice data?

With Gemini 3.1 Flash Live, audio data is streamed in real time and, according to Google's API terms, not used for model training when you use the paid API. Data runs through Google servers. Details on regions and transient storage are in the terms of service; EU locations are available through Vertex AI.

For production use in Germany, you need a data processing agreement (DPA) with Google, a data protection impact assessment, and an announcement that the caller is speaking with an AI.

Important note for the medical context: Google's API terms of service explicitly exclude use "in clinical practice" and for "medical advice." A voice bot deployed in a medical practice must not provide medical assessments. For pure appointment scheduling and standard questions, that's fine — but the boundary must be clearly defined and technically enforced.

For a demo like this prototype, none of this is critical. For real practice operations with patient data, the architecture needs to be clean: stream audio only, don't store it. Pseudonymize transcripts. And a clear announcement at the start of the call.
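One possible way to pseudonymize transcripts (my sketch, not part of the prototype): replace known patient names with stable, salted hashes, so transcripts remain linkable across calls without being directly identifying:

```python
import hashlib
import re

def pseudonym(name: str, salt: str = "per-deployment-secret") -> str:
    """Stable, non-reversible token for a name (same name -> same token)."""
    digest = hashlib.sha256((salt + name.lower()).encode()).hexdigest()[:8]
    return f"PATIENT_{digest}"

def pseudonymize(transcript: str, known_names: list[str]) -> str:
    """Replace each known patient name with its pseudonym, case-insensitively."""
    for name in known_names:
        transcript = re.sub(re.escape(name), pseudonym(name),
                            transcript, flags=re.IGNORECASE)
    return transcript

text = "Guten Tag, hier ist Ayse Yilmaz, ich braeuchte einen Termin."
clean = pseudonymize(text, ["Ayse Yilmaz"])
print(clean)
```

Name-list matching only covers names you already know; free-text PII detection is a harder problem and would need a dedicated NER step.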

6. Handoff to Humans

In a production system, the most important feature wouldn't be talking — it would be recognizing when a human needs to take over.

A voice bot in a practice would need to recognize when it's out of its depth: keywords like chest pain, shortness of breath, or emergency would need to be routed to a human immediately. And after three unsuccessful attempts, the AI should honestly say: "Let me connect you with someone from the team."

In the prototype, that's a few lines in the system prompt. In a real system, there's escalation logic, call forwarding, and a fallback for when no one is available.

Conclusion

It took me three hours to build a prototype that holds conversations, speaks 97 languages, and looks up appointments from a text file.

For a demo, that holds up. It's not a finished product. Between this prototype and a system that reliably takes calls in a medical practice lie telephony infrastructure, practice software integration, data privacy architecture, and extensive testing.

The conversation feels more natural than you'd expect from an API. And the effort required to experiment with Voice AI has gotten surprisingly low.

Tech Stack & Links

  • Model: Google Gemini 3.1 Flash Live (native audio-to-audio)
  • Backend: Python, FastAPI, WebSockets
  • Frontend: Vanilla HTML/JS, AudioWorklet API
  • Cost: A few cents per 3-minute conversation (Gemini)
  • Languages: 97, seamless switching mid-conversation

Frequently Asked Questions

Is it GDPR compliant? The prototype: not an issue for internal tests. A production system needs a DPA, a data protection impact assessment, and an AI transparency announcement. Google's API terms also exclude use for medical advice.

How far is this from a real phone assistant? A prototype takes an afternoon. A production-ready system with telephony integration (SIP/PSTN), practice software connection, data privacy, and escalation logic takes weeks to months.
