
Build vs Buy: What to Demand from IDP Vendors

A practical framework for insurance teams evaluating claims automation solutions

February 2025
15 min read
Strategic Guide

The Wrong Question

“Should we build or buy our claims AI?”

This is the wrong question. Vendors will show you polished demos. Your developers will promise they can “just use GPT-4.” Neither perspective captures reality.

The right question: What do you need to be true for this to work in production, and who can actually make it true?

We've worked through 47 evaluation iterations on a claims automation pipeline, watching accuracy climb from 18% to 98%. Along the way, we documented every failure mode, every false assumption, and every hard-won lesson. Here's how to evaluate claims AI vendors without getting fooled by demos.

When to Build

Build is a capability investment, not a cost savings. Choose it for strategic reasons, not fear of vendors.

Build Makes Sense When:

  • Your policy logic is genuinely unique. If your coverage rules can't be configured in a vendor product, you'll fight their system forever.
  • You have ML engineering capacity and can retain it. One departing engineer shouldn't cripple your system.
  • Claims processing is your competitive advantage. Some insurers differentiate on claims speed. Most don't.
  • Regulatory requirements demand full control. Swiss and EU data residency, audit trails, explainability: some regulators want everything in-house.
  • You're prepared for 12-18 months before production. That's realistic: our 47 evaluation iterations alone took 2 months of focused work.

Build Traps to Avoid:

“We'll just use GPT-4.”

You'll hit the same accuracy ceiling everyone else does. Our baseline LLM-only approach achieved 18% accuracy. Not 80%. Not 60%. Eighteen percent.

“Our developers can figure it out.”

The claims domain is deep. German compound words like Abgasrückführungsventil (exhaust gas recirculation valve) require automotive vocabulary that doesn't come from Stack Overflow.

“We don't want vendor lock-in.”

You'll trade it for internal lock-in: dependency on engineers who understand your bespoke system.

The Real Cost of Build:

  • 2-4 ML engineers for 12+ months
  • Domain expert time for labeling and validation (~20 person-hours for 50-claim ground truth)
  • Ongoing maintenance: models drift, document formats change
  • Opportunity cost of not shipping sooner

When to Buy

Buy is a speed-to-market trade-off. Choose it because you value time, not because you think it's “easier.”

Buy Makes Sense When:

  • You need production results in 3-6 months. Vendors have solved problems you haven't encountered yet.
  • You don't have (or want to build) ML capacity. Hiring and retaining ML engineers is hard.
  • Your claim types are relatively standard. Motor, property, health: vendors have seen these before.
  • You want IT focused on integration, not model building. Your competitive advantage is probably not in AI research.

Buy Traps to Avoid:

“They showed a great demo.”

Demo data is curated. Our system looked great on German documents until we ran French claims: all 25 claims that should have been approved were wrongly rejected.

“They said 99% accuracy.”

Ask: accuracy on what? Their test set or your documents? Our 98% accuracy came with 36% payout accuracy. Decision accuracy ≠ business accuracy.
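
To make that distinction concrete, here's a minimal sketch of tracking the two numbers separately. The claim records and field layout are hypothetical:

```python
# Hypothetical claim records: (predicted_decision, true_decision,
# predicted_payout, true_payout). Layout is illustrative only.
claims = [
    ("approve", "approve", 1200.00, 1450.00),  # right decision, wrong amount
    ("reject", "reject", 0.00, 0.00),          # right decision, right amount
    ("approve", "approve", 800.00, 800.00),    # right decision, right amount
]

decision_hits = sum(1 for pd, td, _, _ in claims if pd == td)
payout_hits = sum(
    1 for pd, td, pp, tp in claims
    if pd == td and abs(pp - tp) < 0.01  # amount must also match to the cent
)

print(f"Decision accuracy: {decision_hits / len(claims):.0%}")  # 100%
print(f"Payout accuracy:   {payout_hits / len(claims):.0%}")    # 67%
```

A system can get every approve/reject call right and still pay out the wrong amounts. Demand both numbers.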

“It's plug and play.”

Integration is always harder than promised. Document formats vary. Field names differ. Your policy structure has quirks.

The Real Cost of Buy:

  • License fees (per claim, per document, or platform)
  • Integration effort (typically 3-6 months, not 3-6 weeks)
  • Ongoing optimization (you'll still need to tune)
  • Vendor dependency (what if they pivot, fail, or get acquired?)

The Vendor Evaluation Framework

A systematic approach to comparing vendors: weight each category by your priorities. A scoring sketch follows the five categories below.

Category 1: Accuracy & Evaluation (30% weight)

Question → What you're looking for
  • What's your holdout accuracy? → Specific %, methodology explained
  • What's your false approve rate? → Tracked separately, <5%
  • Can you show error categories? → Taxonomy exists, improvement tracked
  • How many iterations to reach current accuracy? → Shows discipline, not luck

Red flags: “Our accuracy is 99%” without explaining holdout methodology. We tracked 14 distinct failure modes across matching, extraction, coverage logic, and calculation.
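
A minimal sketch of why false approves need their own metric, using hypothetical holdout results:

```python
from collections import Counter

# Hypothetical (predicted, actual) decision pairs from a holdout set.
results = [
    ("approve", "approve"), ("approve", "reject"),  # the second one pays out money it shouldn't
    ("reject", "reject"), ("reject", "approve"),
    ("approve", "approve"),
]

counts = Counter(results)
false_approves = counts[("approve", "reject")]  # paid when we shouldn't have: costly
false_rejects = counts[("reject", "approve")]   # withheld when we should pay: recoverable

actual_rejects = sum(1 for _, actual in results if actual == "reject")
overall_accuracy = sum(1 for p, a in results if p == a) / len(results)

print(f"Overall accuracy:   {overall_accuracy:.0%}")                 # 60%
print(f"False approve rate: {false_approves / actual_rejects:.0%}")  # 50%
```

An overall accuracy figure can hide a false-approve rate that would sink your loss ratio. Make vendors report them separately.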

Category 2: Auditability (25% weight)

Question → What you're looking for
  • Can I trace any field to its source document? → Page and character level
  • Do you log model version and confidence? → Complete audit trail
  • Can you reproduce a 6-month-old decision? → Full reproducibility
  • How do you capture human overrides? → With justification required

For Swiss and EU regulated workflows, auditability isn't negotiable.
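
As a sketch of what those questions imply, here's a hypothetical per-field audit record. The schema and field names are our illustration, not any vendor's actual format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldProvenance:
    field_name: str           # e.g. "invoice_total"
    value: str                # extracted value as seen by downstream logic
    source_document: str      # document ID or file name
    page: int                 # 1-based page number
    char_start: int           # character offsets within the page text
    char_end: int
    model_version: str        # exact model that produced the extraction
    confidence: float         # score logged at decision time
    overridden_by: str | None = None    # reviewer ID if a human corrected it
    override_reason: str | None = None  # required justification for the override

total = FieldProvenance(
    field_name="invoice_total", value="1450.00",
    source_document="claim-4711/invoice.pdf", page=2,
    char_start=812, char_end=819,
    model_version="extractor-2025-01", confidence=0.97,
)
```

If a vendor can't populate something like this for every field, "auditability" is a slide, not a feature.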

Category 3: Architecture (20% weight)

Question → What you're looking for
  • What % is deterministic vs ML? → Higher deterministic = lower risk
  • What are your confidence thresholds? → Asymmetric, configurable
  • How do you detect distribution shift? → Monitoring in place
  • What's your human-in-loop workflow? → Full QA console

In our architecture, 57% of items are handled by deterministic rules at zero LLM cost. A higher deterministic share means lower cost and higher explainability.
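
A minimal sketch of that routing shape, with asymmetric thresholds. rule_engine and llm_classify are hypothetical stand-ins, and the threshold values are illustrative, not recommendations:

```python
APPROVE_THRESHOLD = 0.98  # approvals move money: demand near-certainty
REJECT_THRESHOLD = 0.90   # auto-rejects are cheaper to undo on appeal

def rule_engine(item: dict) -> str | None:
    """Deterministic coverage rules; returns a decision, or None if no rule fires."""
    if item.get("amount", 0.0) <= 0.0:
        return "reject"
    return None

def llm_classify(item: dict) -> tuple[str, float]:
    """Hypothetical model call returning (decision, confidence)."""
    return "approve", 0.95

def route(item: dict) -> tuple[str, str]:
    verdict = rule_engine(item)            # rules first: zero LLM cost
    if verdict is not None:
        return verdict, "rules"
    decision, confidence = llm_classify(item)
    if decision == "approve" and confidence >= APPROVE_THRESHOLD:
        return decision, "llm"
    if decision == "reject" and confidence >= REJECT_THRESHOLD:
        return decision, "llm"
    return "human_review", "escalated"     # graceful degradation to a person

print(route({"amount": 1450.00}))  # ('human_review', 'escalated'): 0.95 < 0.98
```

The asymmetry is the point: a confident-but-not-certain approval still goes to a human, because the two error types don't cost the same.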

Category 4: Cost & Operations (15% weight)

  • Cost per claim breakdown
  • Token limits and circuit breakers (sketched below)
  • P50, P95, P99 latencies
  • Native language handling
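
The circuit-breaker bullet deserves a concrete picture. A minimal sketch of a per-claim token budget; the budget value and class names are assumptions for illustration, not any vendor's API:

```python
class TokenBudgetExceeded(Exception):
    pass

class TokenCircuitBreaker:
    def __init__(self, max_tokens_per_claim: int = 50_000):
        self.max_tokens = max_tokens_per_claim
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record token usage; trip if the per-claim budget is blown."""
        self.used += tokens
        if self.used > self.max_tokens:
            # Stop calling the model and send the claim to a human instead
            # of silently burning budget on a pathological document.
            raise TokenBudgetExceeded(f"{self.used} > {self.max_tokens} tokens")

breaker = TokenCircuitBreaker(max_tokens_per_claim=10_000)
breaker.charge(6_000)      # fine
try:
    breaker.charge(6_000)  # trips: 12,000 > 10,000
except TokenBudgetExceeded:
    decision = "human_review"
```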

Category 5: Risk & Safety (10% weight)

  • Documented failure modes
  • Automated regression testing
  • Graceful degradation to human review
  • EU/Swiss data residency
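
To turn the five categories into a comparable number, a minimal scoring sketch using the weights above; the vendor ratings are hypothetical 0-10 scores taken from your own demo notes:

```python
# Weights from this framework; adjust to your priorities before the first call.
WEIGHTS = {
    "accuracy_and_evaluation": 0.30,
    "auditability": 0.25,
    "architecture": 0.20,
    "cost_and_operations": 0.15,
    "risk_and_safety": 0.10,
}

def weighted_score(ratings: dict[str, float]) -> float:
    assert set(ratings) == set(WEIGHTS), "rate every category"
    return sum(WEIGHTS[cat] * ratings[cat] for cat in WEIGHTS)

vendor_a = {"accuracy_and_evaluation": 8, "auditability": 9,
            "architecture": 6, "cost_and_operations": 7, "risk_and_safety": 8}
vendor_b = {"accuracy_and_evaluation": 9, "auditability": 5,
            "architecture": 8, "cost_and_operations": 8, "risk_and_safety": 6}

print(f"Vendor A: {weighted_score(vendor_a):.2f}")  # strong audit trail
print(f"Vendor B: {weighted_score(vendor_b):.2f}")  # strong demo, weak auditability
```

Fix the weights before you see any demo, so the flashiest pitch can't quietly rewrite your priorities.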

The Demo Checklist

What to demand when vendors present.

Before the Demo

  • Send 10-20 of YOUR claims
  • Include edge cases: multilingual, poor scan quality
  • Define success criteria upfront

During the Demo

  • Run on YOUR documents
  • Ask to see a failure
  • Click through to field provenance
  • Request confidence scores

After the Demo

  • Compare claimed vs actual accuracy
  • Count “we'll tune that” responses
  • Evaluate: could your team use this daily?

The 5-Minute Stress Test

Pick your weirdest claim: a German compound word like Zylinderkopfdichtung (cylinder head gasket) on a French-format invoice with a borderline coverage decision. If they can't process it live, their “99% accuracy” doesn't apply to your reality.

The Hybrid Path

Consider the middle ground. Build some, buy some.

Buy: document extraction, OCR, and classification. Vendors are good at generic document understanding.

Build: policy-specific business rules and coverage logic. You're better at your specific domain.

Own: the evaluation framework, ground truth, and quality monitoring. Maintain control of the decision layer.

Why this works: Vendors handle commodity parts. You handle differentiated parts. You can swap vendors without rewriting rules. You maintain audit control over final decisions.
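
A minimal sketch of that separation: extraction sits behind an interface you can re-point at another vendor, while the coverage rules stay yours. All names and the stub vendor call are illustrative:

```python
from typing import Protocol

class Extractor(Protocol):
    """The commodity layer you buy: swappable behind one interface."""
    def extract(self, document: bytes) -> dict: ...

class VendorAExtractor:
    def extract(self, document: bytes) -> dict:
        # In reality this would call the vendor's API; stubbed here.
        return {"invoice_total": 1450.00, "part": "Zylinderkopfdichtung"}

def coverage_decision(fields: dict, policy: dict) -> str:
    """The differentiated layer you build: your policy logic, your audit trail."""
    if fields["invoice_total"] > policy["per_claim_limit"]:
        return "human_review"
    return "approve"

extractor: Extractor = VendorAExtractor()   # swap vendors here; rules untouched
fields = extractor.extract(b"%PDF-1.7 ...")
print(coverage_decision(fields, {"per_claim_limit": 5000.00}))  # approve
```

The interface is the contract: as long as a replacement vendor fills the same fields, your decision layer, and its audit trail, never changes.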

Next Steps

1. Define must-haves vs nice-to-haves. Auditability is non-negotiable; automatic optimization is not.

2. Prepare your demo claim set: twenty claims, including your edge cases and multilingual documents.

3. Build your scorecard before the first vendor call. Don't let demos anchor your evaluation criteria.

4. Decide what you want to own long-term. The decision layer? The extraction layer? The whole stack?

The goal isn't to avoid vendors or avoid building. It's to make an informed decision based on your specific situation, timeline, and capabilities.

True Aim AG builds auditable AI systems for Swiss and EU regulated markets. We've learned these lessons the hard way: 47 iterations' worth.

Key Takeaways

  • Build for strategic capability, not cost savings
  • Buy for speed-to-market, not simplicity
  • 5-category vendor evaluation framework
  • Demo checklist: before, during, after
  • The hybrid path: buy commodity, build differentiation
  • Auditability is non-negotiable

Related Topics

Vendor Evaluation · Build vs Buy · IDP · Claims AI · Insurance · Due Diligence

Evaluating Claims AI Vendors?

We can help you build an evaluation framework tailored to your requirements and run objective vendor comparisons.