Most first Vertex AI agents fail not because of bad code, but because of bad architecture decisions made in the first week. Model selection, orchestration layer, retrieval strategy, security posture — these 12 questions determine whether you ship a production agent or an expensive demo. Answer them before you write a line of code.
Answer every question below before opening the Vertex AI console. These answers become the architecture specification for your agent — they inform your system prompt, your Gemini model selection, your Vertex AI Search index configuration, your IAM setup, and your evaluation pipeline.
Write it in one sentence. 'Surface relevant product recommendations based on a user's viewing history and current search query.' If you need two sentences, you are scoping two agents. This sentence becomes the foundation of every design decision that follows — your system prompt, your evaluation criteria, your escalation policy.
The answer determines your orchestration layer. Q&A agents (retrieving and synthesising information) work well with Agent Builder's built-in Vertex AI Search integration. Action-taking agents (writing to systems, triggering workflows) need Reasoning Engine with tool plugins. Multi-step reasoning agents need Reasoning Engine with a ReAct or chain-of-thought loop.
Gemini 2.0 Flash handles 90% of enterprise agent tasks at 10x lower cost than Pro. Pro is justified for complex multi-document synthesis, ambiguous long-form reasoning, and tasks where Flash consistently fails your evaluation test cases. Ultra adds native multimodal reasoning at the highest capability tier. Always benchmark Flash first against your eval dataset before paying for Pro.
Agent Builder: right for document-grounded Q&A agents with Vertex AI Search integration and a managed runtime. Reasoning Engine: right for multi-agent systems, LangChain/LangGraph workflows, and custom tool orchestration in Python. Direct SDK: right only for simple, stateless generation tasks where you don't need a persistent agent loop. Most production agents live on Agent Builder or Reasoning Engine.
RAG via Vertex AI Search: right for agents grounded in your own documents, policies, and knowledge bases. Fine-tuning: right when the model needs to adopt a specific vocabulary, tone, or domain behaviour — not to 'know facts'. Grounding with Google Search: right when the agent needs current, publicly available information. You can combine all three, but start with one. RAG is the right default for most enterprise agents.
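As a rough sketch (assuming the vertexai Python SDK, a placeholder project ID, and an existing Vertex AI Search datastore), grounding a Gemini call in your own index looks roughly like this:

```python
# A minimal sketch of RAG grounding via Vertex AI Search, assuming the
# vertexai SDK. The project, location, and datastore IDs are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel, Tool, grounding

vertexai.init(project="your-project-id", location="us-central1")

# Point the grounding tool at your Vertex AI Search datastore.
retrieval_tool = Tool.from_retrieval(
    grounding.Retrieval(
        grounding.VertexAISearch(
            datastore="projects/your-project-id/locations/global/"
                      "collections/default_collection/dataStores/your-datastore-id"
        )
    )
)

model = GenerativeModel("gemini-2.0-flash")
response = model.generate_content(
    "What is our refund policy for enterprise customers?",
    tools=[retrieval_tool],
)
print(response.text)
```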
List every source: PDFs in Cloud Storage, BigQuery tables, website URLs, API responses. For each, determine the ingestion path: Document AI for complex PDFs, Vertex AI Search datastore import for structured data, scheduled pipelines via Dataflow or Cloud Composer for refreshing data. Each data source needs an indexing strategy and a refresh cadence before build starts.
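If your documents live in Cloud Storage, a scheduled import into a Vertex AI Search datastore can be as simple as the sketch below (assuming the google-cloud-discoveryengine client library; all IDs and bucket paths are placeholders):

```python
# A hedged sketch of bulk-importing PDFs from Cloud Storage into a
# Vertex AI Search datastore. IDs and bucket paths are placeholders.
from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.DocumentServiceClient()

parent = client.branch_path(
    project="your-project-id",
    location="global",
    data_store="your-datastore-id",
    branch="default_branch",
)

request = discoveryengine.ImportDocumentsRequest(
    parent=parent,
    gcs_source=discoveryengine.GcsSource(
        input_uris=["gs://your-bucket/policies/*.pdf"],
        data_schema="content",  # unstructured documents
    ),
    # INCREMENTAL keeps existing documents and adds or updates new ones,
    # which suits a scheduled refresh pipeline.
    reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
)

operation = client.import_documents(request=request)
print(operation.result())  # blocks until the import finishes
```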
Define: which service accounts the agent uses (never user credentials), which GCP services are in scope for VPC Service Controls, whether CMEK is required for data-at-rest in BigQuery and Cloud Storage, whether DLP must scan agent outputs for PII before returning to users, and the Cloud Audit Logs retention policy. These decisions are infrastructure-level and must be made before deployment architecture is finalised.
Conversational agents typically need <2 second P95 response latency. Background processing agents may tolerate 30+ seconds. Latency requirements determine: whether you need streaming responses (Gemini supports token streaming), whether RAG retrieval must be cached (Memorystore), whether you need regional endpoints, and whether you can afford multi-step reasoning chains or need single-pass responses.
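Token streaming is a small change in the vertexai SDK. A minimal sketch, assuming a placeholder project and prompt:

```python
# Streaming lets the user start reading output before the full response
# is generated, which helps a tight P95 latency budget feel responsive.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-2.0-flash")

# stream=True yields partial responses as they are generated.
for chunk in model.generate_content(
    "Summarise our refund policy in three bullet points.",
    stream=True,
):
    print(chunk.text, end="", flush=True)
```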
Define the conditions that require human review or approval: actions above a monetary threshold, content flagged by the safety filter, user requests that the agent cannot resolve with high confidence, and regulated actions (medical advice, legal counsel, financial recommendations). For each condition, define the escalation path — Cloud Tasks queue for async review, real-time human agent transfer, or automatic deflection with an explanation.
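The routing logic itself can stay simple. Below is a hypothetical decision function; the thresholds, field names, and the AgentResult shape are illustrative, not prescriptive:

```python
# A hypothetical escalation check, illustrating the kind of policy described
# above. Thresholds and condition names are placeholders.
from dataclasses import dataclass

@dataclass
class AgentResult:
    answer: str
    confidence: float      # model or retrieval confidence, 0-1
    action_amount: float   # monetary value of any proposed action
    safety_flagged: bool   # set if the safety filter fired

MONETARY_THRESHOLD = 500.0
CONFIDENCE_FLOOR = 0.7

def route(result: AgentResult) -> str:
    """Return 'auto', 'async_review', or 'human_transfer'."""
    if result.safety_flagged:
        return "human_transfer"        # real-time hand-off to a human agent
    if result.action_amount > MONETARY_THRESHOLD:
        return "async_review"          # e.g. push to a Cloud Tasks queue
    if result.confidence < CONFIDENCE_FLOOR:
        return "async_review"
    return "auto"
```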
You need a minimum of 150 representative test cases: input + expected output + pass/fail criteria. The evaluation framework should run automatically on every deployment using Vertex AI Evaluation Service or a custom pipeline in Cloud Build. Define your go-live threshold: 'agent must achieve ≥85% correct responses on the eval dataset with zero critical failures'. Without this, you do not know your error rate until users find it.
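A custom harness does not need to be elaborate. The sketch below assumes a JSONL file of test cases and uses a naive substring check as the grader; in practice you would swap in the Vertex AI Evaluation Service or an LLM-as-judge rubric:

```python
# A minimal evaluation harness. Each JSONL line is assumed to contain
# "input", "expected", and an optional "critical" flag.
import json
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-2.0-flash")

GO_LIVE_THRESHOLD = 0.85

def run_eval(path: str) -> bool:
    with open(path) as f:
        cases = [json.loads(line) for line in f]
    passed, critical_failures = 0, 0
    for case in cases:
        answer = model.generate_content(case["input"]).text
        ok = case["expected"].lower() in answer.lower()  # naive grader
        if ok:
            passed += 1
        elif case.get("critical"):
            critical_failures += 1
    score = passed / len(cases)
    print(f"pass rate {score:.1%}, critical failures {critical_failures}")
    return score >= GO_LIVE_THRESHOLD and critical_failures == 0

# Gate a deployment on the result, e.g. as a Cloud Build step.
if __name__ == "__main__":
    raise SystemExit(0 if run_eval("eval_cases.jsonl") else 1)
```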
Define upfront: which metrics you will track (response latency, token spend, safety filter trigger rate, user escalation rate, thumbs up/down feedback), where they will be stored (BigQuery via Cloud Logging export, Cloud Monitoring custom metrics), and what alert thresholds will trigger PagerDuty or Cloud Monitoring notifications. The cost of retroactive monitoring instrumentation is 5x the cost of doing it during build.
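One low-effort pattern is to emit each request's metrics as a structured log entry and let a Cloud Logging sink export them to BigQuery. A sketch, assuming the google-cloud-logging client and placeholder field names:

```python
# Each structured entry becomes a queryable row once a log sink exports
# this logger to BigQuery. Logger name and fields are placeholders.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()
metrics_logger = client.logger("agent-metrics")

def log_agent_metrics(latency_ms: float, input_tokens: int,
                      output_tokens: int, escalated: bool,
                      safety_triggered: bool) -> None:
    # log_struct writes a structured (jsonPayload) entry.
    metrics_logger.log_struct({
        "event": "agent_request",
        "latency_ms": latency_ms,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "escalated": escalated,
        "safety_triggered": safety_triggered,
    })
```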
Estimate token volume: (average query tokens + average response tokens) × expected daily queries × 30. For Gemini 2.0 Flash, multiply by the current per-token rate and add ~20% for RAG retrieval overhead. Define the rollback trigger (error rate above X, quality score below Y, cost above Z per day) and the rollback procedure — Cloud Run revision rollback is instant, Agent Builder version rollback requires re-pointing the endpoint.
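In code, the estimate is a few lines of arithmetic. The rates below are placeholders; substitute the current published Gemini 2.0 Flash pricing for your region. Note that this sketch splits input and output token rates rather than using a single blended rate:

```python
# Back-of-the-envelope monthly cost estimate following the formula above.
AVG_QUERY_TOKENS = 400        # prompt plus retrieved context
AVG_RESPONSE_TOKENS = 250
DAILY_QUERIES = 5_000
RAG_OVERHEAD = 1.20           # ~20% for RAG retrieval overhead

INPUT_RATE_PER_1K = 0.00015   # placeholder $/1K input tokens
OUTPUT_RATE_PER_1K = 0.0006   # placeholder $/1K output tokens

monthly_queries = DAILY_QUERIES * 30
input_cost = monthly_queries * AVG_QUERY_TOKENS / 1000 * INPUT_RATE_PER_1K
output_cost = monthly_queries * AVG_RESPONSE_TOKENS / 1000 * OUTPUT_RATE_PER_1K
estimate = (input_cost + output_cost) * RAG_OVERHEAD

print(f"Estimated monthly model spend: ${estimate:,.2f}")
```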
Model selection is the most consequential cost decision in a Vertex AI agent build. The price difference between Flash and Pro is roughly 10x. Here is the decision framework we use on every engagement.
| Factor | Gemini 2.0 Flash | Gemini 1.5 Pro | Gemini Ultra |
|---|---|---|---|
| Best for | 90% of enterprise agent tasks | Complex reasoning, long docs | Multimodal, highest capability |
| Context window | 1M tokens | 2M tokens | 1M tokens |
| Tool calling | Full support | Full support | Full support |
| Relative cost | Lowest (~10x cheaper than Pro) | Mid-tier | Highest |
| Latency | Fast (~1–2s typical) | Moderate (~3–5s typical) | Moderate–slow |
| Multimodal | Images + text | Images + text + video | Images + text + video (best) |
| Recommended for | RAG agents, Q&A, tool use | Legal/financial doc analysis | Vision-heavy workflows |
Our recommendation
Start every new agent on Gemini 2.0 Flash. Build your evaluation dataset. Run Flash against it. Only upgrade to Pro if Flash fails more than 10% of your critical test cases. In our experience, Flash passes 85–90% of enterprise agent evaluations, making the 10x cost premium of Pro unjustified for most use cases.
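The benchmarking step is mechanical. A hedged sketch of comparing pass rates across the two models, where grade() stands in for your own pass/fail criterion:

```python
# Run the same eval cases through Flash and Pro and compare pass rates
# before paying for the more expensive model. grade() is a placeholder.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

def grade(answer: str, expected: str) -> bool:
    # Placeholder criterion; substitute an LLM-as-judge or rubric check.
    return expected.lower() in answer.lower()

def pass_rate(model_name: str, cases: list[dict]) -> float:
    model = GenerativeModel(model_name)
    passed = sum(
        grade(model.generate_content(c["input"]).text, c["expected"])
        for c in cases
    )
    return passed / len(cases)

# flash = pass_rate("gemini-2.0-flash", cases)
# pro = pass_rate("gemini-1.5-pro", cases)
# Upgrade only if Flash fails more than 10% of your critical test cases.
```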
This is the most common architecture confusion in new Vertex AI projects. Both are production-grade managed runtimes for Gemini-powered agents — but they serve very different use cases.
Choose Agent Builder when you need a document-grounded Q&A agent with Vertex AI Search integration and a managed runtime. It is the fastest path from zero to production for RAG agents. Most enterprise knowledge management, customer support, and document Q&A agents belong here.
Choose Reasoning Engine when you are building multi-agent systems, LangChain/LangGraph workflows, or custom tool orchestration in Python. It is built for code-first engineers who need control: it executes your Python application in a managed environment with Vertex AI model access, logging, and tracing built in.
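For orientation, here is a hedged sketch of what a Reasoning Engine deployment can look like with the vertexai preview SDK and its LangchainAgent template; the tool function, model name, and staging bucket are illustrative:

```python
# A hedged sketch of deploying a LangChain-style agent to Reasoning Engine,
# assuming the vertexai preview SDK. Names below are placeholders.
import vertexai
from vertexai.preview import reasoning_engines

vertexai.init(project="your-project-id", location="us-central1",
              staging_bucket="gs://your-staging-bucket")

def get_order_status(order_id: str) -> str:
    """Look up an order in your backend (placeholder implementation)."""
    return f"Order {order_id}: shipped"

agent = reasoning_engines.LangchainAgent(
    model="gemini-2.0-flash",
    tools=[get_order_status],
)

# Deploys your Python agent loop to a managed runtime with logging and tracing.
remote_agent = reasoning_engines.ReasoningEngine.create(
    agent,
    requirements=["google-cloud-aiplatform[langchain,reasoningengine]"],
)
print(remote_agent.query(input="Where is order 1234?"))
```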
The RAG vs fine-tuning question is one of the most misunderstood decisions in enterprise AI. The answer almost always depends on whether you are solving a knowledge problem or a behaviour problem.
Use RAG when the agent needs to answer from your own documents, policies, and knowledge bases, especially when that content changes over time.
Use fine-tuning when the model needs to adopt a specific vocabulary, tone, or domain behaviour.
Never use fine-tuning to teach the model facts. Knowledge is a retrieval problem, and RAG solves it without retraining.
Cost note: Fine-tuning a Gemini model on Vertex AI costs roughly $3–8 per 1,000 training examples and requires a training job that takes 2–4 hours. RAG with Vertex AI Search costs approximately $0.40 per 1,000 queries for retrieval. For most enterprise use cases with evolving knowledge bases, RAG delivers better ROI.
These five security controls are significantly cheaper to implement at the start of a project than to retrofit after go-live. For regulated industries — financial services, healthcare, government — they are not optional.
Create a VPC Service Controls perimeter that encompasses Vertex AI, BigQuery, Cloud Storage, and Secret Manager. API calls that cross the perimeter without an explicit ingress or egress rule are blocked, which prevents data exfiltration even from compromised service accounts. Configure your access policy before any data enters your GCP project.
The Vertex AI service account gets only the roles it needs: roles/aiplatform.user for model calls, roles/bigquery.dataViewer on specific datasets, roles/storage.objectViewer on specific buckets. Never use roles/owner or roles/editor on a production service account. Run IAM Recommender quarterly to remove unused permissions.
For regulated data in BigQuery datasets and Cloud Storage buckets, configure CMEK via Cloud KMS. This gives you control over the encryption key lifecycle — you can revoke access to data by disabling the key, independent of Google's infrastructure. Required for most financial services and healthcare workloads.
Enable Data Access audit logs for Vertex AI, BigQuery, and Cloud Storage — not just Admin Activity logs. Forward all audit logs to a locked Cloud Storage bucket (retention lock enabled) with a minimum 1-year retention period. For regulated industries, 7-year retention is standard. Audit log gaps are a common compliance finding.
If the agent may surface PII from your data sources in its responses, configure Cloud Data Loss Prevention (DLP) to scan agent outputs before they are returned to users. Define info types (SSN, payment card, health record ID) and configure the DLP response — de-identify, mask, or block responses containing these patterns. Especially important for agents with broad access to user data.
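A sketch of that output scan, assuming the google-cloud-dlp client library and the info types named above:

```python
# Scan and mask agent output with Cloud DLP before returning it to the user.
# Extend the info types list to match your own data.
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
PROJECT = "your-project-id"

INSPECT_CONFIG = {
    "info_types": [
        {"name": "US_SOCIAL_SECURITY_NUMBER"},
        {"name": "CREDIT_CARD_NUMBER"},
    ],
}

# Replace any matched value with its info type name, e.g. [CREDIT_CARD_NUMBER].
DEIDENTIFY_CONFIG = {
    "info_type_transformations": {
        "transformations": [
            {"primitive_transformation": {"replace_with_info_type_config": {}}}
        ]
    }
}

def redact(agent_output: str) -> str:
    response = dlp.deidentify_content(
        request={
            "parent": f"projects/{PROJECT}",
            "inspect_config": INSPECT_CONFIG,
            "deidentify_config": DEIDENTIFY_CONFIG,
            "item": {"value": agent_output},
        }
    )
    return response.item.value
```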
Gemini 2.0 Flash handles 90% of enterprise agent tasks at a fraction of Pro pricing — benchmark it first before committing to a more expensive model.
Agent Builder is the right starting point for document-grounded agents; Reasoning Engine is for multi-agent and custom orchestration workflows.
RAG via Vertex AI Search is the correct default retrieval strategy for most enterprise agents — fine-tuning is for domain style, not knowledge.
VPC Service Controls, IAM least privilege, and CMEK are non-negotiable from day one for regulated industries — retrofitting them is expensive.
Build your evaluation dataset of 150+ test cases in parallel with the agent, not after — go-live without an eval pipeline means unknown error rates.
Define latency, escalation, monitoring, and rollback requirements before a line of code is written — they are architecture inputs, not afterthoughts.