Case Study · 16 min read

From 6-day claims processing to 18 hours: Building an Azure AI triage agent

A step-by-step walkthrough of an insurance claims triage deployment — from the business problem through architecture decisions, the 4-week build, and 90-day production results. Everything that worked, and three things we'd do differently.

The problem

The client: a regional insurer with 180,000 active policies across personal lines and commercial property. 3,200 new claims arriving each week. 18 FTEs whose primary role was manual triage — reading claim documents, cross-referencing policy terms, assigning claims to the correct team, and flagging missing documents.

The problem: six-day average from claim submission to first triage decision. Three root causes — (1) document volume exceeded team capacity during peak periods, (2) every claim required policy lookup that took 15–20 minutes manually, (3) fraud signal checking was manual and inconsistent.

- 3,200 claims/week
- 18 FTEs on manual triage
- 6 days avg to first decision
- 12+ document formats

Why this was a harder problem than it looked

Three complicating factors emerged in the first discovery session:

1. Document format diversity

Claims arrived in 12+ formats: PDFs (digital and scanned), photographs of handwritten forms, email threads, Excel worksheets, and faxed images. A single claim often contained multiple formats. Standard OCR produced inconsistent extraction quality across formats — Azure AI Document Intelligence was needed as a preprocessing layer, not just a GPT-4o call.

2. Simultaneous cross-referencing requirement

Accurate triage required the agent to hold three things in context simultaneously: the extracted claim fields, the relevant policy terms (from an 8-year archive of 400,000 policy documents), and the fraud signal indicators (from a separate scoring dataset). Standard single-turn RAG wasn't sufficient — multi-step retrieval was required.
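To make the multi-step retrieval requirement concrete, here is a minimal sketch of the orchestration shape: each retrieval step depends on a field surfaced by the previous one, which is what rules out single-turn RAG. The `policy_index` and `fraud_scores` lookups are hypothetical stand-ins for Azure AI Search and the fraud-scoring dataset; the field names are illustrative, not the client's schema.

```python
from dataclasses import dataclass, field

@dataclass
class TriageContext:
    """Accumulates the three evidence sources a triage decision needs."""
    claim_fields: dict
    policy_clauses: list = field(default_factory=list)
    fraud_signals: dict = field(default_factory=dict)

def build_triage_context(claim_fields: dict, policy_index: dict,
                         fraud_scores: dict) -> TriageContext:
    """Multi-step retrieval: each step is keyed on output of the last."""
    ctx = TriageContext(claim_fields=claim_fields)
    # Step 1: policy lookup keyed on the policy number extracted from the claim
    policy_no = claim_fields.get("policy_number")
    ctx.policy_clauses = policy_index.get(policy_no, [])
    # Step 2: fraud signals keyed on claimant, possible only once identity is known
    claimant = claim_fields.get("claimant_id")
    ctx.fraud_signals = fraud_scores.get(claimant, {"score": 0})
    return ctx
```

All three sources then travel together into the model call, so the agent reasons over claim, policy, and fraud signals in one context window.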

3. Compliance risk on every error

Incorrect triage direction — especially wrong fraud classification or wrong policy coverage determination — created downstream regulatory exposure. This meant the agent needed a human-in-the-loop escalation path for low-confidence decisions, not just a binary pass/fail output.

Why Azure AI Foundry (not OpenAI.com or AWS)

Three factors made Azure AI Foundry the unambiguous choice:

  • Existing Azure estate: Dynamics 365, Azure Data Lake, Azure Active Directory (Entra ID) — Azure AI Foundry integrates natively with all three, eliminating integration work that would have taken 2–3 additional weeks on a different platform.
  • HIPAA-adjacent compliance requirements: The insurer's data governance policy required data residency in US East — Azure OpenAI private endpoint configuration provided this. OpenAI.com offered no equivalent data residency guarantee.
  • IT team Azure proficiency: The client's infrastructure team had deep Azure expertise. A deployment on AWS or a standalone OpenAI integration would have required external infrastructure support throughout the engagement.

The build — week by week

Week 1

Document pipeline

Azure AI Document Intelligence deployed as the preprocessing layer for all 12 claim document formats. Custom extraction models trained on 300 historical claim samples for each format category. A field normalisation layer built to standardise output structure regardless of source format — ensuring GPT-4o always receives consistently structured data.
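The normalisation layer is the piece worth sketching: whatever format Document Intelligence extracted from, the output is coerced onto one canonical field set before it reaches the model. A minimal sketch, assuming a simple alias table (the canonical fields and key mappings below are illustrative, not the client's real schema):

```python
# Canonical schema every format is mapped onto before the GPT-4o call.
CANONICAL_FIELDS = ("policy_number", "claimant_name", "loss_date", "claim_amount")

# Format-specific extraction keys -> canonical keys (illustrative aliases).
KEY_ALIASES = {
    "policy_no": "policy_number",
    "policyid": "policy_number",
    "insured_name": "claimant_name",
    "date_of_loss": "loss_date",
    "amount_claimed": "claim_amount",
}

def normalise_fields(raw: dict) -> dict:
    """Return a dict with every canonical field present (None if missing)."""
    out = {k: None for k in CANONICAL_FIELDS}
    for key, value in raw.items():
        canon = KEY_ALIASES.get(key.lower().replace(" ", "_"), key.lower())
        if canon in out:
            out[canon] = value
    return out
```

Making missing fields explicit (`None` rather than absent) is what lets the downstream prompt and the missing-documents checklist stay format-agnostic.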

Week 2

Policy lookup RAG pipeline

Azure AI Search indexed the full 8-year policy archive (400,000 documents, ~2.8GB) with vector + semantic hybrid search. Chunking strategy: 512-token chunks with 64-token overlap, preserving policy clause boundaries. Semantic ranking configured to prioritise temporal recency for policy version lookup. Evaluation: 95% retrieval precision on 200-question test set derived from historical triage decisions.
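The 512-token / 64-token-overlap scheme can be sketched in a few lines. This is the plain sliding-window version only; the production pipeline additionally snapped chunk edges to policy clause boundaries, which this sketch omits.

```python
def chunk_tokens(tokens: list, size: int = 512, overlap: int = 64) -> list:
    """Split a token sequence into fixed-size chunks with overlap.

    Consecutive chunks share `overlap` tokens so a policy clause split
    at a chunk edge still appears whole in at least one chunk.
    """
    step = size - overlap  # 448 tokens of new content per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```

The overlap is the knob that trades index size against retrieval recall: too small and clause boundaries get cut, too large and the index bloats with near-duplicate chunks.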

Week 3

Triage agent

GPT-4o via Azure OpenAI reading extracted document fields + retrieved policy context, outputting a structured JSON triage decision: claim type, priority level (P1–P4), assigned team, fraud risk score (0–100), confidence score (0–1), escalation flag, and missing documents checklist. Prompt Flow pipeline: version-controlled prompt, automated evaluation on every prompt change, A/B testing framework for refinements.
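The decision contract above can be pinned down as a validated type, which is how a structured-output pipeline like this usually catches malformed model responses before they reach the claims system. The field names mirror the case study; the exact types and range checks are our assumption:

```python
from dataclasses import dataclass, asdict

@dataclass
class TriageDecision:
    """Structured output contract for the triage agent."""
    claim_type: str
    priority: str            # "P1".."P4"
    assigned_team: str
    fraud_risk_score: int    # 0-100
    confidence: float        # 0.0-1.0
    escalate: bool
    missing_documents: list  # checklist of document names still required

    def __post_init__(self):
        # Reject out-of-contract model output early, before it is acted on.
        assert self.priority in {"P1", "P2", "P3", "P4"}, "bad priority"
        assert 0 <= self.fraud_risk_score <= 100, "fraud score out of range"
        assert 0.0 <= self.confidence <= 1.0, "confidence out of range"
```

Parsing the model's JSON into this type (rather than passing the raw dict along) means a hallucinated `"P5"` priority or a fraud score of 140 fails loudly at the boundary instead of silently mis-routing a claim.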

Week 4

Integration, security & compliance

Azure API Management gateway connecting the triage agent to the claims management system. Managed Identity authentication — zero stored credentials. Private endpoint on Azure OpenAI resource. Full audit logging to Azure Monitor: every triage decision, document hash, GPT-4o model version, confidence score, and processing time. Content safety configured with financial services harm thresholds. UAT: 200 historical claims re-triaged by agent vs. original human decisions — 96.4% agreement rate.
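One audit-log line per decision, carrying the fields listed above, might look like the following. This is our sketch of a custom-log record shipped to Azure Monitor, not the client's actual schema; field names are illustrative.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(document_bytes: bytes, decision: dict,
                 model_version: str, elapsed_ms: int) -> str:
    """Serialise one triage decision as a JSON audit-log line."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Hash, not content: proves which document was triaged without
        # duplicating claimant data into the log store.
        "document_sha256": hashlib.sha256(document_bytes).hexdigest(),
        "model_version": model_version,
        "confidence": decision.get("confidence"),
        "decision": decision,
        "processing_ms": elapsed_ms,
    })
```

Recording the model version per decision is the detail that matters for regulators: it lets any historical triage outcome be traced to the exact model that produced it.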

Results — 90 days post-launch

- 71% claims auto-triaged
- 18 hrs avg processing (was 6 days)
- 3 FTEs redeployed to complex claims
- $1.4M projected annual savings
- 96.4% triage accuracy on UAT set
- Zero compliance incidents

The 29% of claims not auto-triaged fall into three categories: low-confidence agent decisions (escalated with full context for human review), novel claim types outside the training distribution, and claims explicitly flagged by fraud scoring above the P1 threshold. Human reviewers reported that escalated claims arrived with significantly better context than before the agent was deployed — its structured output pre-populated the reviewer interface with extracted fields and retrieved policy clauses.
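The three escalation categories reduce to a small routing function. A minimal sketch, where the confidence and fraud thresholds are illustrative assumptions (the case study does not publish the actual cut-offs):

```python
def route_claim(confidence: float, fraud_score: int, known_type: bool,
                conf_threshold: float = 0.85,
                fraud_p1_threshold: int = 70) -> str:
    """Route a triaged claim to auto-processing or a human-review bucket."""
    if fraud_score >= fraud_p1_threshold:
        return "human_review:fraud_flag"        # explicit fraud flag wins
    if not known_type:
        return "human_review:novel_claim_type"  # outside training distribution
    if confidence < conf_threshold:
        return "human_review:low_confidence"    # escalate with full context
    return "auto_triage"
```

Ordering matters here: the fraud check runs first so a high-confidence but fraud-flagged claim is never auto-triaged.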

The most impactful decision

Building the Prompt Flow evaluation pipeline before go-live. We ran 200 historical claims through the agent before launch and caught three failure modes — incorrect policy version selection for multi-year policies, a chunking artifact creating hallucinated clause references, and an edge case in the fraud scoring threshold logic. Fixing these in staging saved an estimated 3–4 weeks of post-launch remediation.

What we'd do differently

Start evaluation dataset earlier

We needed 300+ labelled claim examples for Prompt Flow evaluation. Compiling and labelling these took two weeks — the work ran in parallel with architecture, but it created a dependency that blocked Week 3. Starting dataset compilation in Week 0 would have compressed the timeline by a week.

Invest in document normalisation earlier

We underestimated format diversity. The 12 identified formats became 18 by Week 2 as edge cases emerged from the archive. Building the normalisation layer as a configurable module (rather than per-format handlers) would have been faster to extend.

Build the human-review UI in parallel

Human reviewers needed UI changes to consume the agent's structured output efficiently. We scoped this as post-launch work, but reviewer feedback during UAT identified requirements that added 1 week to the timeline. Starting UI work in Week 3 would have removed this delay.

2-week risk-free pilot

Ready to build your Azure AI agent?

We scope, build, and deploy your first production agent in 2 weeks. Fixed price. Zero delivery risk.