Executive Summary
A European vehicle warranty provider needed to automate the adjudication of multilingual warranty claims. Each claim involved up to five documents in three languages (German, French, Italian), with processing taking more than 30 minutes on average per claim due to manual cross-referencing of cost estimates, coverage tiers, mileage caps and component exclusion lists.
True Aim deployed the full product stack: ContextBuilder for document ingestion and structured extraction, MIRA for hybrid deterministic and AI coverage matching, and ClaimEval for quality assurance with full provenance tracking. A multi-tier matching architecture routes 80%-90% of line items through deterministic rules and keyword matching with zero LLM cost, reserving AI reasoning for genuinely ambiguous cases. Within seven days and through multiple evaluation iterations, decision accuracy reached 98% on the development evaluation set.
Operational Context
A vehicle warranty claim lands on an adjuster's desk: five documents, three languages, a cost estimate with 30 line items, a policy with coverage tiers, mileage-dependent reimbursement rates and component exclusion lists.
Standard processing takes 30+ minutes per claim. Multiply that by thousands of claims per month.
76% of denied claims fail for a single reason: the part is not covered by the policy - a deterministic lookup, not a judgment call.
Architectural Approach
Most document processing tools extract text from PDFs - that gets you 10% of the way. The real challenge is cross-document reasoning: a cost estimate means nothing without the policy, a mileage reading means nothing without the coverage cap.
Generic OCR pipelines lack domain depth
Swiss repair invoices contain line items in German with part numbers, labor codes, carry-forward subtotals across pages. Standard OCR extracts text but not structured data with coverage implications.
Single-model approaches plateau early
Sending an entire claim to an LLM produces inconsistent results - hallucinated part names, invented coverage rules, unreliable financial calculations.
Rule-based systems cannot scale to vocabulary
Thousands of part names across multiple languages. A "Wasserpumpe" in German is a "pompe à eau" in French. Pure keyword matching breaks on the first synonym it has not seen.
The solution requires a hybrid architecture: deterministic rules where they work, keyword matching for known vocabulary and LLM reasoning as a calibrated fallback - each layer with explicit confidence scoring and provenance tracking.
Multi-Stage Pipeline
An end-to-end claims processing pipeline moves raw documents through five stages, each with quality gates and audit logging.
Coverage Analysis: Multi-Tier Matching
Each line item goes through a multi-tier matching pipeline, prioritizing deterministic approaches before falling back to AI:
Tier 1 - Deterministic rules: Exact matches for known items, exclusions, and consumables. Zero ambiguity, instant resolution.
Tier 2 - Keyword matching: Maps multilingual repair terms to component categories with synonym normalization.
Tier 3 - LLM fallback: Structured prompts for genuinely ambiguous items that require contextual reasoning.
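The tiered flow above can be sketched as a single dispatch function. The exclusion set, synonym map, and `llm_classify` stub below are illustrative placeholders, not the production implementation:

```python
# Minimal sketch of the three-tier matcher. Vocabulary and the LLM
# fallback are hypothetical stand-ins for the real components.
from dataclasses import dataclass


@dataclass
class MatchResult:
    category: str      # resolved component category or disposition
    method: str        # "exact" | "keyword" | "llm"
    confidence: float  # 1.0 for deterministic tiers


# Example entries only; the real lists span thousands of terms.
EXCLUSIONS = {"bremsbelag", "plaquette de frein"}
SYNONYMS = {"wasserpumpe": "water_pump", "pompe à eau": "water_pump"}


def llm_classify(term: str) -> MatchResult:
    # Placeholder for the structured-prompt LLM call (Tier 3).
    return MatchResult(category="unknown", method="llm", confidence=0.6)


def match_line_item(term: str) -> MatchResult:
    key = term.strip().lower()
    if key in EXCLUSIONS:                        # Tier 1: deterministic rule
        return MatchResult("excluded", "exact", 1.0)
    if key in SYNONYMS:                          # Tier 2: synonym-normalized keyword
        return MatchResult(SYNONYMS[key], "keyword", 1.0)
    return llm_classify(term)                    # Tier 3: LLM fallback
```

Because each result records its match method and confidence, downstream quality gates can treat deterministic and AI-derived decisions differently.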
Deployment and Iteration
Baseline and Architecture
The initial LLM-only approach produced plausible-sounding but unreliable results: hallucinated part names, invented coverage rules, inconsistent calculations.
Screening + Coverage Pipeline
The deterministic screening pipeline and multi-tier coverage matching drove the biggest single improvement. Rule-based checks alone caught most denials correctly.
Iteration and Refinement
More than 60 evaluation iterations addressed increasingly subtle problems: substring matching bugs, labor-demotion logic, and part-number normalization across OCR outputs.
Unseen Data Validation
30 previously unseen claims revealed failure modes the development set did not cover: causal exclusion clauses, missing document scenarios and sub-component interpretation gaps.
The holdout gap (98% vs 77%) is the honest measure of generalization.
Measured Outcomes
Initial Evaluation Set
| Metric | Result |
| --- | --- |
| Decision accuracy | 98% |
| Approved claims correctly identified | 100% |
| Denied claims correctly identified | 96% |
| False reject rate | 0% |
| False approve rate | 4% |
Holdout Set
| Metric | Result |
| --- | --- |
| Decision accuracy | 77% |
| Approved claims correctly identified | 73% |
| Denied claims correctly identified | 80% |
| False reject rate | 26.7% |
| False approve rate | 20.0% |
Edge Cases and Operational Insights
Holdout Failure Analysis
Key Lessons
Deterministic checks beat LLM judgment for structured rules.
76% of claim denials are "part not covered" - a lookup, not a judgment call. The screening pipeline handles these with 100% confidence and zero LLM cost.
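"Part not covered" reduces to a set lookup against the policy's coverage tier. A minimal sketch, with hypothetical tier names and components:

```python
# Coverage-by-tier lookup: the dominant denial reason needs no model call.
# Tier names and component lists are illustrative examples.
POLICY_COVERAGE = {
    "comfort": {"engine", "gearbox", "water_pump"},
    "basic": {"engine", "gearbox"},
}


def is_covered(component: str, tier: str) -> bool:
    # Unknown tiers conservatively cover nothing.
    return component in POLICY_COVERAGE.get(tier, set())
```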
The accuracy curve has diminishing returns.
Reaching 76% accuracy took three days; going from 76% to 98% took four more. The holdout gap (98% vs 77%) confirmed that real-world generalization requires significantly more data.
The payout formula is harder than the AI.
Getting the financial calculation to match the insurer's exact formula requires sitting with an adjuster and walking through calculations by hand. No amount of prompt engineering solves a business logic bug.
Provenance tracking is non-negotiable.
In a regulated industry, every extracted value and coverage decision must trace back to a source document. Fundamental to trust and compliance.
Production Readiness
- Expanded the ground-truth dataset
- Integrated with insurer's policy management API for real-time status checks
- Implemented causal exclusion reasoning for cascading-damage denials
- Deployed production triage: auto-reject high-confidence denials, auto-approve simple approvals, route edge cases for human review
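The triage routing can be expressed as a simple threshold policy over the decision and its confidence score. The threshold values here are illustrative, not the deployed settings:

```python
# Production triage sketch: route each scored decision to one of three
# outcomes. Thresholds are hypothetical examples, not calibrated values.
def triage(decision: str, confidence: float,
           deny_threshold: float = 0.95,
           approve_threshold: float = 0.98) -> str:
    if decision == "deny" and confidence >= deny_threshold:
        return "auto_reject"
    if decision == "approve" and confidence >= approve_threshold:
        return "auto_approve"
    return "human_review"   # edge cases go to an adjuster
```

Note the asymmetric thresholds: auto-approvals pay out money, so they demand a higher bar than auto-rejections, which the claimant can still contest.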
The goal is not to replace adjusters but to give them back their time on the 76% of denials that are straightforward lookups, so they can focus their expertise on the 24% that actually require judgment.
Technical Considerations
Multi-Language Processing
Language-aware classification and extraction across German, French and Italian produced comparable approval rates across all three languages, with no measurable language bias.
Auditability and Compliance
Full provenance chain: every extracted field links to its source document and page. Every coverage decision includes match method, confidence score and explanation.
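A provenance record of this kind can be modeled as an immutable value object; the field names below are an assumed shape, not the actual schema:

```python
# Sketch of a provenance record attached to every extracted value and
# coverage decision. Field names are illustrative, not the real schema.
from dataclasses import dataclass


@dataclass(frozen=True)  # immutable: audit records should not be mutated
class Provenance:
    document: str       # source document the value was extracted from
    page: int           # page within that document
    field: str          # extracted field name
    value: str          # extracted value as it appeared
    match_method: str   # "exact" | "keyword" | "llm"
    confidence: float   # confidence attached to the decision
    explanation: str    # human-readable rationale for auditors
```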
Payout Formula Precision
Order-of-operations errors in financial calculations can cause significant overpayment. Finding them requires walking through calculations claim by claim with adjusters.
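A worked example of how ordering alone changes the payout. The formula below is hypothetical; it only illustrates that applying a deductible before versus after a mileage-dependent rate yields different amounts:

```python
# Illustrative payout formulas: same inputs, different order of operations.
# Neither is the insurer's actual formula.
def payout_rate_then_deductible(cost: float, rate: float, deductible: float) -> float:
    # Apply the mileage-dependent reimbursement rate first, then the deductible.
    return max(cost * rate - deductible, 0.0)


def payout_deductible_then_rate(cost: float, rate: float, deductible: float) -> float:
    # Deductible first, then rate: plausible-looking, but a different number.
    return max((cost - deductible) * rate, 0.0)
```

For a 1000 CHF repair at an 80% rate with a 200 CHF deductible, the first formula pays 600 and the second pays 640, a 40 CHF discrepancy per claim that only an adjuster walking through real calculations would catch.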
Cost Control
Deterministic rules and keyword matching handle the majority of items with zero LLM cost. Token usage monitoring with configurable thresholds prevents runaway costs.
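A cumulative budget guard is one simple way to enforce such thresholds; this sketch assumes a per-claim token limit, which is not necessarily how the production monitor works:

```python
# Minimal token-budget guard: refuse LLM calls once a configurable
# cumulative limit would be exceeded. Limit semantics are illustrative.
class TokenBudget:
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def charge(self, tokens: int) -> bool:
        # Returns False, without charging, if the call would bust the budget.
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True
```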
“The business logic matters more than the AI. Matching the insurer's payout calculations requires working closely with adjusters and walking through real claims to ensure the logic reflects how decisions are actually made.”