Implementation Guide · 10 min read

How to architect your first Vertex AI agent: A practitioner's checklist

Most first Vertex AI agents fail not because of bad code, but because of bad architecture decisions made in the first week. Model selection, orchestration layer, retrieval strategy, security posture — these 12 questions determine whether you ship a production agent or an expensive demo. Answer them before you write a line of code.

Before you build: the 12 questions

Answer every question below before opening the Vertex AI console. These answers become the architecture specification for your agent — they inform your system prompt, your Gemini model selection, your Vertex AI Search index configuration, your IAM setup, and your evaluation pipeline.

01

What is the single job this agent does?

Write it in one sentence. 'Surface relevant product recommendations based on a user's viewing history and current search query.' If you need two sentences, you are scoping two agents. This sentence becomes the foundation of every design decision that follows — your system prompt, your evaluation criteria, your escalation policy.

02

What type of use case is this — Q&A, action-taking, or multi-step reasoning?

The answer determines your orchestration layer. Q&A agents (retrieving and synthesising information) work well with Agent Builder's built-in Vertex AI Search integration. Action-taking agents (writing to systems, triggering workflows) need Reasoning Engine with tool plugins. Multi-step reasoning agents need Reasoning Engine with a ReAct or chain-of-thought loop.
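The mapping above can be captured as a small decision helper. This is an illustrative sketch — the category names and descriptions are ours, not a Vertex AI API:

```python
# Hypothetical helper mapping use-case type to an orchestration layer,
# following the rule of thumb above. Categories are illustrative.
ORCHESTRATION = {
    "qa": "Agent Builder (built-in Vertex AI Search grounding)",
    "action": "Reasoning Engine (tool plugins for writes and workflows)",
    "multi_step": "Reasoning Engine (ReAct / chain-of-thought loop)",
}

def pick_orchestration(use_case_type: str) -> str:
    """Return the recommended runtime for a given use-case type."""
    try:
        return ORCHESTRATION[use_case_type]
    except KeyError:
        raise ValueError(f"Unknown use-case type: {use_case_type!r}")

print(pick_orchestration("qa"))
```

If a single agent needs two of these categories, that is usually a sign it should be two agents (see question 01).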

03

Which Gemini model fits this use case?

Gemini 2.0 Flash handles 90% of enterprise agent tasks at 10x lower cost than Pro. Pro is justified for complex multi-document synthesis, ambiguous long-form reasoning, and tasks where Flash consistently fails your evaluation test cases. Ultra adds native multimodal reasoning at the highest capability tier. Always benchmark Flash first against your eval dataset before paying for Pro.

04

Agent Builder, Reasoning Engine, or direct Vertex AI SDK?

Agent Builder: right for document-grounded Q&A agents with Vertex AI Search integration and a managed runtime. Reasoning Engine: right for multi-agent systems, LangChain/LangGraph workflows, and custom tool orchestration in Python. Direct SDK: right only for simple, stateless generation tasks where you don't need a persistent agent loop. Most production agents live on Agent Builder or Reasoning Engine.

05

What is the data access pattern — RAG, fine-tuning, or grounding?

RAG via Vertex AI Search: right for agents grounded in your own documents, policies, and knowledge bases. Fine-tuning: right when the model needs to adopt a specific vocabulary, tone, or domain behaviour — not to 'know facts'. Grounding with Google Search: right when the agent needs current, publicly available information. You can combine all three, but start with one. RAG is the right default for most enterprise agents.

06

What are all the data sources, and how will they be indexed?

List every source: PDFs in Cloud Storage, BigQuery tables, website URLs, API responses. For each, determine the ingestion path: Document AI for complex PDFs, Vertex AI Search datastore import for structured data, scheduled pipelines via Dataflow or Cloud Composer for refreshing data. Each data source needs an indexing strategy and a refresh cadence before build starts.
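One way to make this inventory concrete is a structured record per source, reviewed before build starts. The source names, locations, and cadences below are made-up examples:

```python
from dataclasses import dataclass

# Illustrative data-source inventory — names, locations, and cadences are examples.
@dataclass
class DataSource:
    name: str
    location: str           # e.g. Cloud Storage bucket, BigQuery table, URL
    ingestion_path: str     # e.g. Document AI, datastore import, Dataflow pipeline
    refresh_cadence_days: int

sources = [
    DataSource("policy PDFs", "gs://acme-policies", "Document AI -> Vertex AI Search import", 30),
    DataSource("product catalogue", "bq://acme.catalog.items", "Vertex AI Search datastore import", 1),
    DataSource("support articles", "https://support.example.com", "website crawl via Vertex AI Search", 7),
]

# Fail the architecture review if any source lacks a refresh cadence.
missing = [s.name for s in sources if s.refresh_cadence_days <= 0]
assert not missing, f"Missing refresh cadence for: {missing}"
for s in sources:
    print(f"{s.name}: {s.ingestion_path} (refresh every {s.refresh_cadence_days}d)")
```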

07

What are the IAM and security requirements?

Define: which service accounts the agent uses (never user credentials), which GCP services are in scope for VPC Service Controls, whether CMEK is required for data-at-rest in BigQuery and Cloud Storage, whether DLP must scan agent outputs for PII before returning to users, and the Cloud Audit Logs retention policy. These decisions are infrastructure-level and must be made before deployment architecture is finalised.

08

What are the latency requirements?

Conversational agents typically need <2 second P95 response latency. Background processing agents may tolerate 30+ seconds. Latency requirements determine: whether you need streaming responses (Gemini supports token streaming), whether RAG retrieval must be cached (Memorystore), whether you need regional endpoints, and whether you can afford multi-step reasoning chains or need single-pass responses.
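Checking a latency budget against logged samples is straightforward — a minimal sketch using the standard library, with made-up sample data:

```python
import statistics

def p95_ms(latencies_ms: list[float]) -> float:
    """95th-percentile latency via inclusive quantiles (needs >= 2 samples)."""
    return statistics.quantiles(latencies_ms, n=100, method="inclusive")[94]

# Illustrative response-time samples in milliseconds.
samples = [850, 920, 1100, 1300, 1450, 1600, 1750, 1900, 2100, 2600]
p95 = p95_ms(samples)
print(f"P95 = {p95:.0f} ms -> {'OK' if p95 < 2000 else 'exceeds 2s budget'}")
```

In production you would compute this over Cloud Logging exports rather than an in-memory list, but the gate is the same: P95, not mean, against the budget.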

09

What is the human-in-the-loop policy?

Define the conditions that require human review or approval: actions above a monetary threshold, content flagged by the safety filter, user requests that the agent cannot resolve with high confidence, and regulated actions (medical advice, legal counsel, financial recommendations). For each condition, define the escalation path — Cloud Tasks queue for async review, real-time human agent transfer, or automatic deflection with an explanation.

10

How will you evaluate this agent before go-live?

You need a minimum of 150 representative test cases: input + expected output + pass/fail criteria. The evaluation framework should run automatically on every deployment using Vertex AI Evaluation Service or a custom pipeline in Cloud Build. Define your go-live threshold: 'agent must achieve ≥85% correct responses on the eval dataset with zero critical failures'. Without this, you do not know your error rate until users find it.
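The go-live gate described above can be sketched in a few lines. Here `agent` is a stub standing in for a real Vertex AI call, and the test cases are made-up examples — only the gating logic (≥85% pass rate, zero critical failures) comes from the text:

```python
# Stub agent for illustration; a real harness would call the deployed endpoint.
def agent(query: str) -> str:
    canned = {"refund policy?": "30-day refund window", "shipping time?": "3-5 business days"}
    return canned.get(query, "I don't know")

test_cases = [
    # (input, expected substring, critical?)
    ("refund policy?", "30-day", True),
    ("shipping time?", "business days", False),
    ("warranty length?", "12 months", False),
]

passed, critical_failures = 0, 0
for query, expected, critical in test_cases:
    ok = expected in agent(query)
    passed += ok
    critical_failures += (not ok and critical)

pass_rate = passed / len(test_cases)
go_live = pass_rate >= 0.85 and critical_failures == 0
print(f"pass rate {pass_rate:.0%}, critical failures {critical_failures}, go-live: {go_live}")
```

Wiring this into Cloud Build means every deployment answers the go-live question automatically instead of hopefully.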

11

What is the monitoring strategy post-launch?

Define upfront: which metrics you will track (response latency, token spend, safety filter trigger rate, user escalation rate, thumbs up/down feedback), where they will be stored (BigQuery via Cloud Logging export, Cloud Monitoring custom metrics), and what alert thresholds will trigger PagerDuty or Cloud Monitoring notifications. The cost of retroactive monitoring instrumentation is 5x the cost of doing it during build.
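The threshold check itself is trivial once the metrics and limits are written down — which is the point of defining them upfront. A sketch with illustrative metric names and values (not a Cloud Monitoring API call):

```python
# Illustrative alert thresholds; names and limits are examples, not GCP metric IDs.
THRESHOLDS = {"p95_latency_ms": 2000, "daily_token_cost_usd": 50.0, "escalation_rate": 0.10}

def alerts(metrics: dict) -> list[str]:
    """Return the names of metrics breaching their threshold."""
    return [name for name, limit in THRESHOLDS.items() if metrics.get(name, 0) > limit]

current = {"p95_latency_ms": 1850, "daily_token_cost_usd": 62.5, "escalation_rate": 0.04}
print(alerts(current))  # -> ['daily_token_cost_usd']
```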

12

What is the cost budget, and what is the rollback plan?

Estimate token volume: (average query tokens + average response tokens) × expected daily queries × 30. For Gemini 2.0 Flash, multiply by the current per-token rate and add ~20% for RAG retrieval overhead. Define the rollback trigger (error rate above X, quality score below Y, cost above Z per day) and the rollback procedure — Cloud Run revision rollback is instant, Agent Builder version rollback requires re-pointing the endpoint.
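The estimate above as a worked function — the per-token rate here is a placeholder, so check current Vertex AI pricing before budgeting:

```python
def monthly_token_cost(avg_query_tokens: int, avg_response_tokens: int,
                       daily_queries: int, usd_per_1k_tokens: float,
                       rag_overhead: float = 0.20) -> float:
    """Monthly cost per the formula above, with ~20% RAG retrieval overhead."""
    tokens = (avg_query_tokens + avg_response_tokens) * daily_queries * 30
    return tokens / 1000 * usd_per_1k_tokens * (1 + rag_overhead)

# Example: 300 query + 500 response tokens, 5,000 queries/day, $0.0002/1k tokens (illustrative rate)
print(f"${monthly_token_cost(300, 500, 5000, 0.0002):,.2f}/month")
```

Run this with your own traffic assumptions before the build starts; it also gives you the "cost above Z per day" number for the rollback trigger.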

Choosing the right Gemini model

Model selection is the most consequential cost decision in a Vertex AI agent build. The price difference between Flash and Pro is roughly 10x. Here is the decision framework we use on every engagement.

| Factor | Gemini 2.0 Flash | Gemini 1.5 Pro | Gemini Ultra |
|---|---|---|---|
| Best for | 90% of enterprise agent tasks | Complex reasoning, long docs | Multimodal, highest capability |
| Context window | 1M tokens | 2M tokens | 1M tokens |
| Tool calling | Full support | Full support | Full support |
| Relative cost | Lowest (~10x cheaper than Pro) | Mid-tier | Highest |
| Latency | Fast (~1–2s typical) | Moderate (~3–5s typical) | Moderate–slow |
| Multimodal | Images + text | Images + text + video | Images + text + video (best) |
| Recommended for | RAG agents, Q&A, tool use | Legal/financial doc analysis | Vision-heavy workflows |

Our recommendation

Start every new agent on Gemini 2.0 Flash. Build your evaluation dataset. Run Flash against it. Only upgrade to Pro if Flash fails more than 10% of your critical test cases. In our experience, Flash passes 85–90% of enterprise agent evaluations, making the 10x cost premium of Pro unjustified for most use cases.
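The upgrade rule above reduces to a one-line check over Flash's results on your critical test cases — a sketch, where `critical_results` is a hypothetical list of pass/fail booleans from your eval run:

```python
# Flash-first rule: upgrade to Pro only if Flash fails more than 10% of
# critical test cases. `critical_results` is hypothetical eval output.
def should_upgrade_to_pro(critical_results: list[bool]) -> bool:
    fail_rate = critical_results.count(False) / len(critical_results)
    return fail_rate > 0.10

print(should_upgrade_to_pro([True] * 18 + [False] * 2))  # 10% failures -> False
```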

Agent Builder vs Reasoning Engine: when to use which

This is the most common architecture confusion in new Vertex AI projects. Both are production-grade managed runtimes for Gemini-powered agents — but they serve very different use cases.

Vertex AI Agent Builder

Choose when:

  • You need a document-grounded Q&A agent with Vertex AI Search integration
  • The agent's knowledge base is in Cloud Storage (PDFs, Docs, web pages)
  • You want managed infrastructure — Google handles scaling, availability, updates
  • You don't need custom Python orchestration logic
  • You want built-in conversation history and session management

Agent Builder is the fastest path from zero to production for RAG agents. Most enterprise knowledge management, customer support, and document Q&A agents belong here.

Reasoning Engine

Choose when:

  • You need multi-agent orchestration (LangChain, LangGraph, custom loops)
  • The agent must call external APIs and take actions in backend systems
  • You need custom tool plugins that go beyond document retrieval
  • You are building a multi-step reasoning pipeline with conditional logic
  • You need to port an existing LangChain agent to a managed GCP runtime

Reasoning Engine is for code-first engineers who need control. It executes your Python application in a managed environment with Vertex AI model access, logging, and tracing built in.

RAG vs fine-tuning: the decision framework

The RAG vs fine-tuning question is one of the most misunderstood decisions in enterprise AI. The answer almost always depends on whether you are solving a knowledge problem or a behaviour problem.

Use RAG when

  • The agent needs to answer questions about your specific documents, policies, or data
  • Your knowledge base changes frequently (product catalogue, support articles, regulations)
  • You need the agent to cite sources or show where an answer came from
  • You are building a document Q&A, knowledge management, or support agent
  • You need to get to production quickly — RAG requires no training time

Use fine-tuning when

  • The model needs to adopt your brand's specific writing style or tone consistently
  • The agent handles a narrow, well-defined domain with specialised vocabulary (legal, medical, financial)
  • You have 1,000+ high-quality labelled training examples
  • The base model consistently fails to follow your output format requirements
  • You have already validated RAG performance and it is insufficient for your use case

Never use fine-tuning to

  • Teach the model new facts — use RAG instead; fine-tuned facts are unreliable and stale
  • Fix a poorly written system prompt — improving the prompt is always cheaper and faster
  • Substitute for RAG when your knowledge base changes more than monthly

Cost note: Fine-tuning a Gemini model on Vertex AI costs roughly $3–8 per 1,000 training examples and requires a training job that takes 2–4 hours. RAG with Vertex AI Search costs approximately $0.40 per 1,000 queries for retrieval. For most enterprise use cases with evolving knowledge bases, RAG delivers better ROI.
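A rough comparison using the figures above makes the ROI point concrete. These rates are illustrative (the midpoint of the quoted range, not a price quote), and the key variable is how often a changing knowledge base forces retraining:

```python
# Illustrative rates from the cost note above — not a pricing quote.
FINE_TUNE_PER_1K_EXAMPLES = 5.50   # midpoint of the $3-8 range
RAG_PER_1K_QUERIES = 0.40

examples = 2000
monthly_queries = 50_000
retrains_per_month = 4             # a weekly-changing knowledge base forces retraining

fine_tune_monthly = examples / 1000 * FINE_TUNE_PER_1K_EXAMPLES * retrains_per_month
rag_monthly = monthly_queries / 1000 * RAG_PER_1K_QUERIES
print(f"fine-tuning: ${fine_tune_monthly:.2f}/month vs RAG retrieval: ${rag_monthly:.2f}/month")
```

With a static knowledge base the one-off fine-tune can be cheaper; the moment the knowledge changes weekly, retraining costs dominate — which is why RAG is the default for evolving knowledge bases.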

Security checklist: non-negotiables from day one

These five security controls are significantly cheaper to implement at the start of a project than to retrofit after go-live. For regulated industries — financial services, healthcare, government — they are not optional.

VPC Service Controls

Create a VPC Service Controls perimeter that encompasses Vertex AI, BigQuery, Cloud Storage, and Secret Manager. Any API call crossing the perimeter is blocked. This prevents data exfiltration even from compromised service accounts. Configure your access policy before any data enters your GCP project.

IAM least privilege

The Vertex AI service account gets only the roles it needs: roles/aiplatform.user for model calls, roles/bigquery.dataViewer on specific datasets, roles/storage.objectViewer on specific buckets. Never use roles/owner or roles/editor on a production service account. Run IAM Recommender quarterly to remove unused permissions.
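A simple guardrail is to diff the service account's actual bindings against an approved allowlist in CI. The role names below are real GCP predefined roles; the binding data and check itself are an illustrative sketch, not an IAM API call:

```python
# Approved roles for the agent's service account (real GCP role names;
# the binding data below is a made-up example).
ALLOWED_ROLES = {
    "roles/aiplatform.user",
    "roles/bigquery.dataViewer",
    "roles/storage.objectViewer",
}

def overprivileged(bindings: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each service account to any roles not on the allowlist."""
    return {sa: roles - ALLOWED_ROLES for sa, roles in bindings.items()
            if roles - ALLOWED_ROLES}

bindings = {
    "agent-sa@proj.iam.gserviceaccount.com": {"roles/aiplatform.user", "roles/editor"},
}
print(overprivileged(bindings))  # flags roles/editor
```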

Customer-Managed Encryption Keys (CMEK)

For regulated data in BigQuery datasets and Cloud Storage buckets, configure CMEK via Cloud KMS. This gives you control over the encryption key lifecycle — you can revoke access to data by disabling the key, independent of Google's infrastructure. Required for most financial services and healthcare workloads.

Cloud Audit Logs

Enable Data Access audit logs for Vertex AI, BigQuery, and Cloud Storage — not just Admin Activity logs. Forward all audit logs to a locked Cloud Storage bucket (retention lock enabled) with a minimum 1-year retention period. For regulated industries, 7-year retention is standard. Audit log gaps are a common compliance finding.

DLP for agent outputs

If the agent may surface PII from your data sources in its responses, configure Cloud Data Loss Prevention (DLP) to scan agent outputs before they are returned to users. Define info types (SSN, payment card, health record ID) and configure the DLP response — de-identify, mask, or block responses containing these patterns. Especially important for agents with broad access to user data.

Key takeaways

Gemini 2.0 Flash handles 90% of enterprise agent tasks at a fraction of Pro pricing — benchmark it first before committing to a more expensive model.

Agent Builder is the right starting point for document-grounded agents; Reasoning Engine is for multi-agent and custom orchestration workflows.

RAG via Vertex AI Search is the correct default retrieval strategy for most enterprise agents — fine-tuning is for domain style, not knowledge.

VPC Service Controls, IAM least privilege, and CMEK are non-negotiable from day one for regulated industries — retrofitting them is expensive.

Build your evaluation dataset of 150+ test cases in parallel with the agent, not after — go-live without an eval pipeline means unknown error rates.

Define latency, escalation, monitoring, and rollback requirements before a line of code is written — they are architecture inputs, not afterthoughts.

2-week risk-free pilot

Ready to build your first Vertex AI agent?

We handle the architecture decisions, configure the GCP security controls, and ship to production. Fixed price. Zero delivery risk.