The biggest challenge in building an AI interviewer isn't generating the questions—it's crossing the "Uncanny Valley" of latency.
When I designed AIVA (AI Virtual Interview Assistant), I knew that a standard chatbot architecture would fail. In a real interview, a 3-second pause feels like an eternity. If the AI lags, the immersion breaks, and the user remembers they are talking to a machine.
To build a tool that actually helps candidates practice, the system needed to be multimodal (audio, video, text) and real-time. It had to "hear" interruptions, "think" instantly, and "speak" naturally.
Here is how I used the Google Cloud ecosystem to solve the biggest problems in conversational AI: Latency, Immersion, Scale, and Security, while keeping costs in check.

1. Solving the Latency Problem: The Need for Speed & Structure
The Problem: Standard LLM chains are too slow and prone to formatting errors. Waiting for a transcription, then sending it to an LLM, often results in unstructured text that breaks the frontend parser.
The Solution: Gemini 2.5 on Vertex AI (Native JSON Mode).
I chose Gemini 2.5 not just for its low Time-To-First-Token (TTFT), but for its controlled generation capabilities. By setting response_mime_type="application/json" in the Vertex AI configuration, I enforce strict schema adherence at the model layer.
Instead of using fragile Regex to parse the AI's rating, the system guarantees a valid JSON object containing the score, feedback, and one-liner. We then use an "Async Scatter-Gather" pattern, firing three of these specialized prompts in parallel:
- The Scorer: Returns a numeric 1-10 rating (JSON enforced).
- The Coach: Generates detailed, constructive feedback.
- The Summarizer: Creates a one-line actionable takeaway.
By running these in parallel using Python's asyncio.gather, we cut the user's wait time by 60%. The candidate gets instant, reliably structured feedback the moment they stop speaking.
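Here is a minimal sketch of that scatter-gather pattern, assuming the vertexai SDK (google-cloud-aiplatform). The model name, prompts, and JSON fields are illustrative, not AIVA's production values:

```python
import asyncio
import json

import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="your-project", location="us-central1")

# Force JSON output at the model layer instead of regex-parsing free text.
json_config = GenerationConfig(response_mime_type="application/json")
model = GenerativeModel("gemini-2.5-flash")  # illustrative model ID

async def ask(prompt: str) -> dict:
    response = await model.generate_content_async(prompt, generation_config=json_config)
    return json.loads(response.text)

async def evaluate(answer: str) -> list[dict]:
    # Fire the three specialized prompts concurrently; total latency is
    # roughly the slowest single call, not the sum of all three.
    return await asyncio.gather(
        ask(f'Rate this interview answer 1-10. Reply as JSON {{"score": int}}: {answer}'),
        ask(f'Give constructive feedback. Reply as JSON {{"feedback": str}}: {answer}'),
        ask(f'Give one actionable takeaway. Reply as JSON {{"one_liner": str}}: {answer}'),
    )

score, feedback, summary = asyncio.run(evaluate("Tell me about a conflict you resolved."))
```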
Responsible AI: An AI interviewer must remain professional. I utilized Vertex AI's Safety Settings API to explicitly configure the model's tolerance for toxicity and bias, ensuring the feedback remains constructive and safe for a workplace context, regardless of the user's input.
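As a rough illustration of how that configuration looks in the vertexai SDK (the thresholds and model ID below are placeholders, not AIVA's exact settings):

```python
from vertexai.generative_models import (
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
    SafetySetting,
)

# Placeholder thresholds: block harassing or hateful content aggressively
# so the feedback stays professional regardless of the candidate's input.
safety_settings = [
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    ),
]

model = GenerativeModel("gemini-2.5-flash", safety_settings=safety_settings)
```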
2. Solving the Immersion Problem: A Hybrid Video Engine
The Problem: Generating high-fidelity AI video (like HeyGen) is expensive ($$) and slow. Using basic Text-to-Speech (TTS) is cheap but robotic. We needed a middle ground.
The Solution: The Hybrid Dispatch Engine.
I architected a system that intelligently switches between high-fidelity and low-latency modes based on context:
The "Premium" Tier
For core languages (English, French), we utilize HeyGen APIs to pre-generate realistic video avatars. This provides maximum immersion for primary use cases.
The "Global" Tier
For other languages or dynamic follow-ups, the system seamlessly falls back to Google Cloud Text-to-Speech (TTS) with viseme-based avatar animation.
This ensures the platform is globally accessible (supporting languages HeyGen doesn't) and resilient (if the video API fails, the audio engine takes over instantly).
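A minimal sketch of the dispatch idea, not the production engine: the HeyGen helper below is a hypothetical placeholder, while the fallback path uses the real google-cloud-texttospeech client.

```python
from google.cloud import texttospeech

PREMIUM_LANGUAGES = {"en-US", "fr-FR"}  # languages with pre-generated HeyGen avatars
tts_client = texttospeech.TextToSpeechClient()

def fetch_heygen_video(text: str, language_code: str) -> bytes:
    # Hypothetical stand-in for looking up a pre-generated HeyGen avatar clip.
    raise NotImplementedError

def synthesize_question(text: str, language_code: str) -> dict:
    if language_code in PREMIUM_LANGUAGES:
        try:
            return {"mode": "video", "asset": fetch_heygen_video(text, language_code)}
        except Exception:
            pass  # video engine unavailable: fall through to the audio engine

    # Global tier: plain TTS audio, animated client-side with visemes.
    response = tts_client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code=language_code),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    return {"mode": "audio", "asset": response.audio_content}
```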
3. Solving the Scale Problem: Serverless WebSockets
The Problem: Real-time audio requires persistent full-duplex connections (WebSockets). Standard serverless platforms often drop these connections during scaling events, and load balancers randomly distribute packets, breaking the stream.
The Solution: Google Cloud Run with Session Affinity.
I deployed the architecture on Cloud Run specifically because it creates a unique intersection of serverless economics and stateful capabilities. By enabling Session Affinity in the Cloud Run revision settings, I ensured that once a candidate connects to a container, the load balancer routes all subsequent audio packets to that exact instance.
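In practice this is a pair of deploy flags rather than custom code; the command below is illustrative, and the service and image names are placeholders:

```bash
gcloud run deploy aiva-backend \
  --image gcr.io/your-project/aiva-backend \
  --session-affinity \
  --no-cpu-throttling \
  --min-instances 0
```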
- Session Affinity: Layer 7 load balancing ensures WebSocket packets stick to the originating container—no dropped connections during scale events.
- Cost Efficiency: CPU allocation is set to "always allocated" during the session to prevent throttling, while the service still scales to zero when idle.
- Resilience: I implemented a custom heartbeat mechanism to handle the 4-minute streaming limit of Google's Speech-to-Text API, transparently reconnecting the stream without the user ever knowing.
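A condensed sketch of that reconnect loop, assuming the google-cloud-speech streaming client and an audio queue fed by the WebSocket handler. handle_transcript is a hypothetical callback, and the timings are illustrative:

```python
import queue
import time

from google.cloud import speech

STREAM_LIMIT_SECONDS = 4 * 60  # rotate the stream before Google closes it

client = speech.SpeechClient()
streaming_config = speech.StreamingRecognitionConfig(
    config=speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    ),
    interim_results=True,
)

def transcribe_session(audio_queue: "queue.Queue[bytes]") -> None:
    finished = False

    def request_generator(deadline: float):
        nonlocal finished
        # Yield audio chunks until we approach the per-stream limit.
        while time.monotonic() < deadline:
            chunk = audio_queue.get()
            if chunk is None:  # sentinel: the interview ended
                finished = True
                return
            yield speech.StreamingRecognizeRequest(audio_content=chunk)

    # Reopen the stream each time the limit approaches; the caller never notices.
    while not finished:
        deadline = time.monotonic() + STREAM_LIMIT_SECONDS
        responses = client.streaming_recognize(
            config=streaming_config, requests=request_generator(deadline)
        )
        for response in responses:
            for result in response.results:
                if result.is_final:
                    handle_transcript(result.alternatives[0].transcript)  # hypothetical callback
```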
4. Enterprise-Grade Security on Serverless
In HR tech, PII (Personally Identifiable Information) protection is non-negotiable. "Serverless" doesn't mean "insecure"—in fact, it can be more secure when architected correctly.
Secret Management: Credentials live in Google Secret Manager and are injected directly into Cloud Run environment variables at runtime. This ensures that sensitive credentials (like HeyGen API keys) never touch the filesystem or the Docker image layers.
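Wiring that up is a one-line change to the deploy command; the secret and service names below are placeholders:

```bash
gcloud run deploy aiva-backend \
  --update-secrets HEYGEN_API_KEY=heygen-api-key:latest
```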
Furthermore, the entire deployment is managed via Cloud Build triggers, enforcing a "zero-touch" production environment where no developer needs manual SSH access to the containers. This isn't just a convenience—it's an audit requirement for enterprise clients.
- Secrets at Runtime: Credentials injected from Secret Manager, never baked into images.
- Zero-Touch Deployment: Cloud Build triggers eliminate manual access to production.
- Infrastructure as Code: A single cloudbuild.yaml runs migrations, syncs assets, and deploys; the whole environment is reproducible in minutes.
Why Google Cloud?
AIVA demonstrates that AI Engineering is about more than just writing prompts. It's about understanding the trade-offs between cost, speed, and user experience.
By combining the raw speed of Gemini 2.5, the scaling power of Cloud Run, and a pragmatic Hybrid Audio/Video architecture, we built a platform that doesn't just interview candidates—it engages them.
