A step-by-step walkthrough of an insurance claims triage deployment — from the business problem through architecture decisions, the 4-week build, and 90-day production results. Everything that worked, and three things we'd do differently.
The client: a regional insurer with 180,000 active policies across personal lines and commercial property. 3,200 new claims arriving each week. 18 FTEs whose primary role was manual triage — reading claim documents, cross-referencing policy terms, assigning claims to the correct team, and flagging missing documents.
The problem: six-day average from claim submission to first triage decision. Three root causes — (1) document volume exceeded team capacity during peak periods, (2) every claim required policy lookup that took 15–20 minutes manually, (3) fraud signal checking was manual and inconsistent.
Three complicating factors emerged in the first discovery session:
Claims arrived in 12+ formats: PDFs (digital and scanned), photographs of handwritten forms, email threads, Excel worksheets, and faxed images. A single claim often contained multiple formats. Standard OCR produced inconsistent extraction quality across formats — Azure AI Document Intelligence was needed as a preprocessing layer, not just a GPT-4o call.
Accurate triage required the agent to hold three things in context simultaneously: the extracted claim fields, the relevant policy terms (from an 8-year archive of 400,000 policy documents), and the fraud signal indicators (from a separate scoring dataset). Standard single-turn RAG wasn't sufficient — multi-step retrieval was required.
Incorrect triage direction — especially wrong fraud classification or wrong policy coverage determination — created downstream regulatory exposure. This meant the agent needed a human-in-the-loop escalation path for low-confidence decisions, not just a binary pass/fail output.
Three factors made Azure AI Foundry the unambiguous choice:
Azure AI Document Intelligence deployed as the preprocessing layer for all 12 claim document formats. Custom extraction models trained on 300 historical claim samples for each format category. A field normalisation layer built to standardise output structure regardless of source format — ensuring GPT-4o always receives consistently structured data.
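A minimal sketch of what that normalisation layer might look like. The `NormalisedClaim` structure, the per-format field maps, and the simplified dict view of the Document Intelligence result are illustrative assumptions, not the client's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative canonical structure; the production schema carried more fields.
@dataclass
class NormalisedClaim:
    claim_id: str
    policy_number: Optional[str]
    claimant_name: Optional[str]
    incident_date: Optional[str]        # ISO 8601
    claim_amount: Optional[float]
    source_format: str                  # e.g. "scanned_pdf", "handwritten_photo"
    extraction_confidence: dict = field(default_factory=dict)  # per-field scores

# Hypothetical per-format mappings: each custom extraction model emits its own
# field names, so every format gets a translation table into the canonical names.
FIELD_MAPS = {
    "scanned_pdf": {"PolicyNo": "policy_number", "Claimant": "claimant_name"},
    "handwritten_photo": {"policy_num": "policy_number", "name": "claimant_name"},
}

def normalise(raw_fields: dict, source_format: str, claim_id: str) -> NormalisedClaim:
    """Map a Document Intelligence result (simplified here as
    {field: {"value": ..., "confidence": ...}}) into the canonical structure."""
    mapping = FIELD_MAPS.get(source_format, {})
    canonical = {mapping.get(k, k): v.get("value") for k, v in raw_fields.items()}
    confidence = {mapping.get(k, k): v.get("confidence", 0.0) for k, v in raw_fields.items()}
    amount = canonical.get("claim_amount")
    return NormalisedClaim(
        claim_id=claim_id,
        policy_number=canonical.get("policy_number"),
        claimant_name=canonical.get("claimant_name"),
        incident_date=canonical.get("incident_date"),
        claim_amount=float(amount) if amount is not None else None,
        source_format=source_format,
        extraction_confidence=confidence,
    )
```

The point of the extra layer: GPT-4o's prompt never has to reason about which extraction model produced the fields, only about the fields themselves.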
Azure AI Search indexed the full 8-year policy archive (400,000 documents, ~2.8GB) with vector + semantic hybrid search. Chunking strategy: 512-token chunks with 64-token overlap, preserving policy clause boundaries. Semantic ranking configured to prioritise temporal recency for policy version lookup. Evaluation: 95% retrieval precision on 200-question test set derived from historical triage decisions.
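A sketch of the chunking approach under the stated parameters (512-token chunks, 64-token overlap), packing whole clauses into each chunk so a clause is never cut mid-sentence. The clause-boundary regex and the tokeniser choice are assumptions:

```python
import re
import tiktoken  # assumed tokeniser; any counter consistent with the embedding model works

enc = tiktoken.get_encoding("cl100k_base")

def chunk_policy(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split a policy document into ~512-token chunks with 64-token overlap,
    breaking only at clause boundaries rather than mid-clause."""
    # Hypothetical clause-boundary pattern: numbered clauses like "4.2 ..." on a new line.
    clauses = re.split(r"\n(?=\d+(?:\.\d+)*\s)", text)
    chunks, current = [], []
    for clause in clauses:
        candidate = current + [clause]
        if len(enc.encode("\n".join(candidate))) > max_tokens and current:
            chunks.append("\n".join(current))
            # Carry roughly `overlap` tokens of trailing context into the next chunk.
            tail = enc.decode(enc.encode("\n".join(current))[-overlap:])
            current = [tail, clause]
        else:
            current = candidate
    if current:
        chunks.append("\n".join(current))
    return chunks  # a single oversized clause would still need splitting; omitted here
```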
GPT-4o via Azure OpenAI reading extracted document fields + retrieved policy context, outputting a structured JSON triage decision: claim type, priority level (P1–P4), assigned team, fraud risk score (0–100), confidence score (0–1), escalation flag, and missing documents checklist. Prompt Flow pipeline: version-controlled prompt, automated evaluation on every prompt change, A/B testing framework for refinements.
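The decision contract, sketched as a Pydantic model. Field names follow the list above; the exact types and validation bounds are assumptions about how we'd express them, not the production schema:

```python
from enum import Enum
from pydantic import BaseModel, Field

class Priority(str, Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"
    P4 = "P4"

class TriageDecision(BaseModel):
    """Structured output GPT-4o is prompted to return for every claim."""
    claim_type: str
    priority: Priority
    assigned_team: str
    fraud_risk_score: int = Field(ge=0, le=100)
    confidence: float = Field(ge=0.0, le=1.0)
    escalate: bool
    missing_documents: list[str] = Field(default_factory=list)

# Validating the model's JSON before anything reaches the claims system:
# decision = TriageDecision.model_validate_json(raw_json_from_gpt4o)
```

Validating against a typed schema at the boundary is what makes the downstream integration safe: a malformed response fails loudly here instead of silently mis-routing a claim.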
Azure API Management gateway connecting the triage agent to the claims management system. Managed Identity authentication — zero stored credentials. Private endpoint on Azure OpenAI resource. Full audit logging to Azure Monitor: every triage decision, document hash, GPT-4o model version, confidence score, and processing time. Content safety configured with financial services harm thresholds. UAT: 200 historical claims re-triaged by agent vs. original human decisions — 96.4% agreement rate.
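The Managed Identity pattern for calling Azure OpenAI, roughly as documented for the openai SDK. The endpoint, deployment name, API version, and the audit-record fields below are placeholders, not the client's values:

```python
import json
import logging
import time

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Token-based auth via Managed Identity: no API keys stored anywhere in the pipeline.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    azure_ad_token_provider=token_provider,
    api_version="2024-06-01",
)

audit_log = logging.getLogger("triage.audit")  # collected into Azure Monitor by the platform

def triage(prompt: str, document_hash: str) -> str:
    start = time.monotonic()
    response = client.chat.completions.create(
        model="gpt-4o-triage",  # deployment name is a placeholder
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    decision = response.choices[0].message.content
    # One audit record per triage decision, mirroring the fields listed above.
    audit_log.info(json.dumps({
        "document_hash": document_hash,
        "model_version": response.model,
        "processing_seconds": round(time.monotonic() - start, 2),
        "decision": decision,
    }))
    return decision
```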
The 29% of claims not auto-triaged fall into three categories: low-confidence agent decisions (escalated with full context for human review), novel claim types outside the training distribution, and claims flagged by fraud scoring above the P1 threshold. Human reviewers reported that escalated claims arrived with significantly better context than before the agent: the structured output pre-populated the reviewer interface with extracted fields and retrieved policy clauses.
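The routing between auto-triage and escalation is simple threshold logic over the structured decision. A sketch, consuming the `TriageDecision` model above; the thresholds and the known-claim-type set are illustrative, not the production values:

```python
CONFIDENCE_FLOOR = 0.85    # illustrative threshold, not the production value
FRAUD_P1_THRESHOLD = 80    # illustrative fraud-score cut-off
KNOWN_CLAIM_TYPES = {"motor", "property_damage", "liability", "theft"}  # placeholder set

def route(decision: "TriageDecision") -> str:
    """Decide whether a triage decision is applied automatically or escalated,
    mirroring the three escalation categories described above."""
    if decision.fraud_risk_score >= FRAUD_P1_THRESHOLD:
        return "escalate:fraud_review"
    if decision.claim_type not in KNOWN_CLAIM_TYPES:
        return "escalate:novel_claim_type"
    if decision.escalate or decision.confidence < CONFIDENCE_FLOOR:
        return "escalate:low_confidence"
    return "auto_triage"
```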
The most impactful decision
Building the Prompt Flow evaluation pipeline before go-live. We ran 200 historical claims through the agent before launch and caught three failure modes — incorrect policy version selection for multi-year policies, a chunking artifact creating hallucinated clause references, and an edge case in the fraud scoring threshold logic. Fixing these in staging saved an estimated 3–4 weeks of post-launch remediation.
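The core of that evaluation is just replaying labelled historical claims and measuring agreement with the original human decisions. A stripped-down version of the check (our pipeline ran it inside Prompt Flow's batch evaluation; the field names here are assumptions):

```python
def agreement_rate(agent_decisions: list[dict], human_decisions: list[dict]) -> float:
    """Fraction of historical claims where the agent matched the original human
    triage on the fields that matter: claim type, priority, and assigned team."""
    keys = ("claim_type", "priority", "assigned_team")
    matches = sum(
        all(agent[k] == human[k] for k in keys)
        for agent, human in zip(agent_decisions, human_decisions, strict=True)
    )
    return matches / len(human_decisions)

# Reviewing the disagreements from the 200-claim run, not the headline rate,
# is what surfaced the three failure modes described above.
```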
Start evaluation dataset earlier
We needed 300+ labelled claim examples for Prompt Flow evaluation. Compiling and labelling them took two weeks; the work ran in parallel with the architecture build but still created a dependency that landed in Week 3. Starting dataset compilation in Week 0 would have compressed the timeline by a week.
Invest in document normalisation earlier
We underestimated format diversity. The 12 identified formats became 18 by Week 2 as edge cases emerged from the archive. Building the normalisation layer as a configurable module (rather than per-format handlers) would have been faster to extend.
Build the human-review UI in parallel
Human reviewers needed UI changes to consume the agent's structured output efficiently. We scoped this as post-launch work, but reviewer feedback during UAT identified requirements that added 1 week to the timeline. Starting UI work in Week 3 would have removed this delay.