European Warranty Insurer
Vehicle Warranty Insurance · 84 claims in pilot

Automating Vehicle Warranty Claims with AI Document Intelligence

From 18% to 98% Decision Accuracy in 7 Days

MIRA · ContextBuilder · ClaimEval
7 days from baseline to 98% accuracy
  • 98% decision accuracy (from 18% baseline)
  • 7 days to results (full pipeline live)
  • 80-90% no-LLM resolution of line items
  • 3 languages (DE / FR / IT)

The Problem

A vehicle warranty claim lands on an adjuster's desk: five documents, three languages, a cost estimate with 30 line items, a policy with coverage tiers, mileage-dependent reimbursement rates and component exclusion lists.

Processing takes 15-30 minutes per claim. Multiply that by hundreds of claims per month.

76% of denied claims fail for a single reason: the part is not covered by the policy - a deterministic lookup, not a judgment call.

  • 84 claims analyzed in pilot
  • 50/50 approval-to-denial split
  • CHF 1,450 average approved payout
  • 23 vehicle brands represented
  • 14 distinct document types
  • 4.9 documents per claim on average

Why Existing Approaches Fall Short

Most document processing tools extract text from PDFs - that gets you 10% of the way. The real challenge is cross-document reasoning: a cost estimate means nothing without the policy, a mileage reading means nothing without the coverage cap.

Generic OCR pipelines lack domain depth

Swiss repair invoices contain line items in German with part numbers, labor codes, and carry-forward subtotals that span pages. Standard OCR extracts the text, but not the structured data that carries coverage implications.

Single-model approaches plateau early

Sending an entire claim to an LLM produces inconsistent results - hallucinated part names, invented coverage rules, unreliable financial calculations. Baseline LLM-only accuracy: 18%.

Rule-based systems cannot scale to vocabulary

Thousands of part names across multiple languages. A "Wasserpumpe" in German is a "pompe à eau" in French. Pure keyword matching breaks on the first synonym it has not seen.

The solution requires a hybrid architecture: deterministic rules where they work, keyword matching for known vocabulary and LLM reasoning as a calibrated fallback - each layer with explicit confidence scoring and provenance tracking.
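The tiered dispatch described above can be sketched as follows. This is a minimal illustration, not the production code: the function and tier names (`match_line_item`, `rule_engine`, `keyword_matcher`, `llm_fallback`) and the 0.70 keyword threshold are assumptions chosen to mirror the confidence bands stated in this case study.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MatchResult:
    category: Optional[str]  # matched component category, None if unmatched
    confidence: float        # explicit confidence score, set per tier
    source: str              # provenance: which tier produced the decision

def match_line_item(description: str,
                    rule_engine, keyword_matcher, llm_fallback) -> MatchResult:
    """Dispatch one line item through the three tiers, cheapest first."""
    # Tier 1: deterministic rules (fees, known exclusions, consumables)
    category = rule_engine(description)
    if category is not None:
        return MatchResult(category, confidence=1.0, source="rule_engine")
    # Tier 2: keyword/synonym matching over known vocabulary
    category, score = keyword_matcher(description)
    if category is not None and score >= 0.70:
        return MatchResult(category, score, source="keyword_matcher")
    # Tier 3: LLM fallback, reserved for genuinely ambiguous items
    category, score = llm_fallback(description)
    return MatchResult(category, score, source="llm_fallback")
```

Because each result records its `source` and `confidence`, every decision stays auditable regardless of which tier produced it.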

The Architecture: A Multi-Stage Pipeline

An end-to-end claims processing pipeline moves raw documents through five stages, each with quality gates and audit logging.

Stage 1: Ingestion → Stage 2: Classification → Stage 3: Extraction → Stage 4: Screening → Stage 5: QA Review
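The staged flow with quality gates and audit logging can be sketched as a simple driver loop. This is an assumption-laden sketch, not the actual pipeline: the `run_pipeline` signature, the `needs_review` status flag, and the dict-shaped claim are all hypothetical.

```python
from typing import Callable

Stage = Callable[[dict], dict]

def run_pipeline(claim: dict, stages: list[tuple[str, Stage]],
                 audit_log: list[dict]) -> dict:
    """Run a claim through named stages; every transition is audit-logged,
    and a failed quality gate short-circuits the claim to manual review."""
    for name, stage in stages:
        claim = stage(claim)
        audit_log.append({"stage": name, "claim_id": claim["id"],
                          "status": claim.get("status", "ok")})
        if claim.get("status") == "needs_review":
            break  # quality gate failed: stop and route to human QA
    return claim
```

The design choice worth noting is that the audit log is written at every stage boundary, so a reviewer can see exactly where a claim stopped and why.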

Coverage Analysis: Three-Tier Matching

Each line item goes through a three-tier matching pipeline, from fastest/highest-confidence to slowest/lowest-confidence:

Tier 1: Rule Engine
Confidence: 1.0 · 40-50% of items

Deterministic matches for fee items, known exclusions, consumables. Zero ambiguity, zero latency.

Tier 2: Keyword Matcher
Confidence: 0.70-0.90 · 30-40% of items

Maps German/French repair terms to 30+ component categories with synonyms and umlaut normalization.

Tier 3: LLM Fallback
Confidence: 0.60-0.85 · 10-20% of items

GPT-4o with structured prompts for genuinely ambiguous items. Concurrency-optimized with 10 parallel calls.
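The Tier 2 matcher's umlaut normalization and multilingual synonyms can be illustrated with Python's standard `unicodedata` module. The synonym table below is a tiny illustrative excerpt (the case study states 30+ categories); the fixed 0.85 score stands in for whatever scoring the real matcher uses.

```python
import unicodedata

# Illustrative excerpt; the real system maps 30+ component categories.
SYNONYMS = {
    "water_pump": ["Wasserpumpe", "pompe à eau", "pompa dell'acqua"],
    "turbocharger": ["Turbolader", "turbocompresseur"],
}

def normalize(text: str) -> str:
    """Lowercase and strip diacritics so 'pompe à eau' matches 'pompe a eau'
    and umlauts like 'Kühler' match 'kuhler'."""
    text = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in text if not unicodedata.combining(c))

def keyword_match(description: str):
    """Return (category, confidence) or (None, 0.0) for a line-item text."""
    desc = normalize(description)
    for category, terms in SYNONYMS.items():
        for term in terms:
            if normalize(term) in desc:
                return category, 0.85  # within the tier-2 band 0.70-0.90
    return None, 0.0
```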

The Implementation Journey: From 18% to 98%

Days 1-3 · 18%

Baseline and Architecture

The initial LLM-only approach produced plausible-sounding but unreliable results: hallucinated part names, invented coverage rules, inconsistent calculations.

Key insight: LLMs are good at language understanding but bad at deterministic business rules. The architecture had to separate judgment from precision.
Days 3-5 · 76%

Screening + Coverage Pipeline

The 11-check screening pipeline and three-tier coverage matching drove the biggest single improvement. Deterministic checks alone caught most denials correctly.

Days 5-8 · 88% → 94% → 98%

Iteration and Refinement

60+ evaluation iterations solving increasingly subtle problems: substring matching bugs, labor demotion logic, and part-number normalization across OCR outputs.

Holdout Test · 76.7%

Unseen Data Validation

30 previously unseen claims revealed failure modes the development set did not cover: causal exclusion clauses, missing document scenarios and sub-component interpretation gaps.

The holdout gap (98% vs 77%) is the honest measure of generalization.
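Part-number normalization across OCR outputs, one of the refinement-phase fixes, can be sketched like this. The separator stripping and the O→0 / I→1 confusion table are illustrative assumptions, not the insurer's actual rules.

```python
import re

def normalize_part_number(raw: str) -> str:
    """Canonicalize an OCR'd part number: drop spaces/dots/dashes/slashes,
    uppercase, and undo common OCR confusions (letter O for zero, I for one).
    The confusion table is an illustrative assumption."""
    s = re.sub(r"[\s./-]", "", raw).upper()
    return s.replace("O", "0").replace("I", "1")
```

With this, the same part cited as "06J-115 403 Q" on the estimate and "o6j.115403q" in an OCR'd invoice resolves to one canonical key.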

Key Highlights

  • 14 document types classified at near-100% accuracy
  • 51 fields extracted per warranty policy with confidence scoring
  • Full provenance chain from decision back to source page and character position
  • Three-tier coverage matching: deterministic rules, keyword synonyms, LLM fallback
  • Multilingual keyword matching across German, French and Italian
  • 850+ unit tests covering the full pipeline
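The provenance chain in the highlights above can be modeled with a couple of small records. The class and field names here are hypothetical; the point is the shape of the data: every decision carries the document, page, and character span it was derived from.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SourceSpan:
    document: str    # source file name
    page: int        # page number within that document
    char_start: int  # character offsets into the extracted text
    char_end: int

@dataclass
class Decision:
    line_item: str
    verdict: str     # e.g. "covered" / "not_covered"
    confidence: float
    evidence: list[SourceSpan] = field(default_factory=list)

    def provenance(self) -> list[str]:
        """Render the audit trail from decision back to source positions."""
        return [f"{s.document} p.{s.page} chars {s.char_start}-{s.char_end}"
                for s in self.evidence]
```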

Tech Stack

Python · FastAPI · Pydantic · React 18 · TypeScript · Tailwind CSS · GPT-4o · Azure Doc Intelligence

Ready to Achieve Similar Results?

Start with a pilot on your closed claims and see the impact for yourself.

Request a Pilot