
Auditability in Claims AI: Building Evidence Chains That Auditors Actually Accept

Everyone says their AI is explainable. Few can prove it to an auditor. Here's the architecture that makes it possible.

February 2025
12 min read
Technical Deep Dive

When regulators ask “why did your system approve this claim?”, they don't want to hear about attention weights or SHAP values. They want to click on a field and see exactly where it came from. They want to re-run a decision from six months ago and get the same answer. They want to know when a human intervened and why.

We learned this the hard way. Our claims automation system went through 47 evaluation iterations, improving from 18% to 98% accuracy. Every decision has a complete audit trail. Here's the architecture that makes it possible, and the regulatory requirements it satisfies.

What Regulators Actually Want

Let's translate regulatory language into technical specifications.

FINMA Guidance 08/2024 (Switzerland)

Switzerland has no AI-specific law. Instead, FINMA applies technology-neutral governance requirements to AI systems, framing AI risks as operational risks:

  • Model risks: robustness, correctness, bias, explainability
  • IT/cyber risks: security, availability, integrity
  • Third-party dependency risks: vendor lock-in, concentration

FINMA explicitly notes that AI “results often cannot be understood, explained, or reproduced.” This is the problem you must solve.

EU AI Act

The EU AI Act classifies life and health insurance risk assessment and pricing as high-risk. Claims automation isn't explicitly listed as high-risk, but transparency obligations apply. EIOPA emphasizes governance and risk management aligned with existing insurance regulation.

What This Means in Practice

  • Every decision must have a traceable evidence chain
  • Human overrides must be captured with justification
  • You must be able to reproduce a decision from 6 months ago
  • "The model said so" is not an acceptable explanation

Key insight: Regulators don't care about your model architecture. They care about your evidence trail.

The Evidence Chain Architecture

We use three layers of traceability. Each layer answers a different audit question.

Layer 1: Field-to-Source Provenance

“Where did this number come from?”

Every extracted field must link to: source document, page number, location (coordinates), and extraction confidence.

Field: [extracted_field_name]
Value: [extracted_value]
Source: [document.pdf], Page X
Location: [bounding box coordinates]
Confidence: [high/medium/low]

When an auditor asks “where did this value come from?”, you click and show them the exact location in the source document. No guessing, no searching.
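The record template above can be sketched as a small data structure. This is an illustrative sketch, not the system's actual schema; the class and field names (Provenance, source_doc, bbox) are assumptions:

```python
from dataclasses import dataclass, asdict

# Illustrative provenance record: every extracted field carries a
# pointer back to the exact location in its source document.
@dataclass(frozen=True)
class Provenance:
    field: str
    value: str
    source_doc: str   # e.g. "invoice.pdf"
    page: int
    bbox: tuple       # (x0, y0, x1, y1) in page coordinates
    confidence: str   # "high" / "medium" / "low"

rec = Provenance(
    field="invoice_total",
    value="1240.50",
    source_doc="invoice.pdf",
    page=2,
    bbox=(102.0, 530.4, 188.2, 545.1),
    confidence="high",
)
print(asdict(rec))
```

Making the record immutable (frozen) matters for audit purposes: a provenance entry should never be edited after the fact.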

Layer 2: Decision Audit Trail

“Why was this claim approved?”

Every decision must record: input hash, model version, confidence score, timestamp, and decision tier.

Decision: [APPROVED/DENIED/REVIEW]
Input hash: [reproducibility hash]
Model version: [version identifier]
Confidence: [score]
Decision method: [deterministic/AI/hybrid]
Timestamp: [ISO timestamp]

Our system uses a multi-tier decision cascade, starting with deterministic rules and falling back to AI only when needed. Each tier is logged, so you know exactly which method made the decision.
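A minimal sketch of that pattern, assuming decisions are logged with a hash of their canonical inputs. The rule logic, version string, and thresholds here are illustrative stand-ins, not the production cascade:

```python
import hashlib
import json
from datetime import datetime, timezone

MODEL_VERSION = "claims-model-2025.02"  # illustrative version tag

def input_hash(payload: dict) -> str:
    """Hash the canonical JSON form of the inputs so the exact
    decision context can be reproduced later."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def decide(claim: dict) -> dict:
    # Tier 1: deterministic rules handle the clear cases.
    if claim.get("coverage_confirmed") and claim.get("amount", 0) <= claim.get("policy_limit", 0):
        decision, method, confidence = "APPROVED", "deterministic", 1.0
    else:
        # Tier 2: fall back to a model (stubbed here as a fixed result).
        decision, method, confidence = "REVIEW", "AI", 0.55
    return {
        "decision": decision,
        "input_hash": input_hash(claim),
        "model_version": MODEL_VERSION,
        "confidence": confidence,
        "decision_method": method,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

record = decide({"amount": 900, "policy_limit": 5000, "coverage_confirmed": True})
```

Sorting keys and fixing separators before hashing is what makes the hash stable: the same inputs always produce the same digest, regardless of dict ordering.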

Layer 3: Override Capture

“Why did a human change this decision?”

Every human override must capture: original AI decision, human decision, override reason (required field), reviewer identity, and timestamp.

Original decision: [AI decision with confidence]
Override decision: [human decision]
Reason: [required justification text]
Reviewer: [authenticated user ID]
Timestamp: [ISO timestamp]

This creates a feedback loop: override patterns reveal system gaps. If adjusters consistently override denials for the same reason, that's a signal to improve the system.
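The "reason is a required field" rule is worth enforcing in code, not just in the UI. A hedged sketch, with illustrative field names:

```python
from datetime import datetime, timezone

def record_override(original: dict, new_decision: str, reason: str, reviewer_id: str) -> dict:
    """Capture a human override. The justification is a required
    field: an empty reason is rejected, not silently accepted."""
    if not reason.strip():
        raise ValueError("Override reason is required")
    return {
        "original_decision": original["decision"],
        "original_confidence": original.get("confidence"),
        "override_decision": new_decision,
        "reason": reason,
        "reviewer": reviewer_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

entry = record_override(
    {"decision": "DENIED", "confidence": 0.81},
    new_decision="APPROVED",
    reason="Policy endorsement 12b covers this procedure",
    reviewer_id="adjuster-0417",
)
```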

Quality Gates: The Audit Checkpoint

We use a three-state quality gate that determines when automation proceeds and when humans must intervene.

PASS: All required fields present, confidence above threshold.
Action: Proceed automatically. Log decision with full evidence chain. No human review required.

WARN: Confidence below threshold OR non-critical field missing.
Action: Proceed with flag. Route to sampling queue. Human may review or skip.

FAIL: Critical field missing OR confidence very low OR rule violation.
Action: Block automation. Route to human adjuster. Require explicit human decision.

Confidence Asymmetry

We use asymmetric thresholds:

  • For approvals: higher confidence threshold required
  • For denials: lower confidence threshold acceptable
  • Below thresholds: routed to human review

Why asymmetric? False approvals cost more: financial loss plus regulatory risk. False denials can be appealed by the customer. This isn't bias; it's risk management.

Result: We reduced our REFER_TO_HUMAN rate from 60% to less than 5%.
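The asymmetry is easiest to see as a routing function. The threshold values here are illustrative, not the ones the source system uses:

```python
APPROVE_THRESHOLD = 0.95  # illustrative: approvals need a higher bar
DENY_THRESHOLD = 0.85     # illustrative: denials are appealable

def route(decision: str, confidence: float) -> str:
    """Asymmetric routing: a false approval costs more than a false
    denial, so approvals require more confidence to bypass review."""
    if decision == "APPROVED" and confidence >= APPROVE_THRESHOLD:
        return "AUTO"
    if decision == "DENIED" and confidence >= DENY_THRESHOLD:
        return "AUTO"
    return "REFER_TO_HUMAN"
```

Note that the same confidence score (say 0.90) automates a denial but refers an approval, which is exactly the asymmetry described above.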

Failure Modes That Break Audit Trails

From our 47 evaluation iterations, here are failures that regulators will catch:

Extraction Errors Without Provenance
  • Symptom: A value was extracted incorrectly, but there's no link back to the source.
  • Audit Impact: Without provenance, you can't explain the error.
  • Fix: Every extracted value links to its source location.

Decisions Based on Unavailable Data
  • Symptom: System made a decision, but key information wasn't in the document set.
  • Audit Impact: System was correct given available data, but the auditor doesn't know that.
  • Fix: Document what information was available at decision time.

Approve-by-Default Logic
  • Symptom: High accuracy overall, but wrong for wrong reasons.
  • Audit Impact: Can't explain why a claim was approved, just "no failures found".
  • Fix: Require explicit coverage confirmation, not absence of rejection.

What Doesn't Work for Explainability

  • SHAP/LIME values: Technically interesting, meaningless to adjusters. “Feature X contributed 0.3 to the decision” is not an explanation.
  • Attention weights: “The model focused on these words” doesn't explain WHY the decision was made.
  • Confidence scores alone: “87% confident” isn't an explanation. You need: “87% confident BECAUSE [evidence].”
  • Post-hoc rationales: Generating explanations after the decision risks mismatch between explanation and actual logic. Auditors will test this.

Key insight: Explainability means showing your work, not describing your feelings about the answer.

Regulatory Mapping Quick Reference

Requirement | FINMA 08/2024 | EU AI Act | How to Address
Documentation | Material applications | Transparency obligations | Model cards, decision logs, version control
Traceability | Results must be explainable | Record-keeping for high-impact | Field-to-source provenance
Reproducibility | "Often cannot be reproduced" (the problem) | Implied | Input hashing, model versioning
Human oversight | Required for material decisions | Required for high-risk | Quality gates, override capture
Fallback mechanisms | Explicitly required | Expected | Tier cascade, human queue
Risk classification | Centralized inventory expected | High-risk for life/health pricing | Use-case registry with risk tags

Core Principles for Auditable AI

Building audit-ready AI systems requires commitment to four foundational principles. The specific implementation will vary by organization, but these principles remain constant.

Complete Provenance

Every extracted value must be traceable to its source document and location.

Decision Reproducibility

Given the same inputs and model version, you must be able to reproduce any historical decision.

Human Accountability

Every human override must be captured with required justification and reviewer identity.

Continuous Validation

Regular testing of historical decisions to verify your audit trail remains intact.
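A minimal sketch of such a replay check, assuming decisions are logged with their inputs and an input hash. The decide stub here stands in for the real pipeline, which must be deterministic given the same inputs and model version:

```python
import hashlib
import json

def input_hash(payload: dict) -> str:
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode("utf-8")).hexdigest()

def decide(inputs: dict) -> dict:
    # Stand-in for the real pipeline: deterministic given its inputs.
    decision = "APPROVED" if inputs["amount"] <= inputs["limit"] else "REVIEW"
    return {"decision": decision, "input_hash": input_hash(inputs)}

def replay_check(logged: dict) -> bool:
    """Re-run a historical decision and verify both the outcome and
    the input hash still match the audit log."""
    fresh = decide(logged["inputs"])
    return (fresh["decision"] == logged["decision"]
            and fresh["input_hash"] == logged["input_hash"])

logged = {
    "inputs": {"amount": 300, "limit": 1000},
    "decision": "APPROVED",
    "input_hash": input_hash({"amount": 300, "limit": 1000}),
}
```

Running this probe on a sample of historical decisions after every deployment is one way to catch silent drift before an auditor does.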

Next Steps

1. Audit your current evidence chain: Can you trace any field to its source?

2. Test reproducibility: Re-run a 3-month-old decision. Same output?

3. Document your quality gate criteria: What triggers human review?

4. Map your architecture to FINMA/EU AI Act: Use the table above as a starting point.

The goal isn't perfect AI. It's defensible AI: systems where every decision can be explained, every field can be traced, and every human intervention is documented. That's what regulators want. That's what auditors accept.

Based on internal research from 47 evaluation iterations (18% → 98% accuracy) and regulatory analysis of FINMA Guidance 08/2024, EU AI Act, and EIOPA governance frameworks.

Key Takeaways

  • Three-layer evidence chain architecture
  • Field-to-source provenance tracking
  • Decision audit trail with input hashing
  • Human override capture with required justification
  • Quality gates that satisfy auditors
  • FINMA & EU AI Act compliance mapping

Related Topics

Auditability, FINMA, EU AI Act, Claims AI, Compliance, Evidence Chains
