Model Validation for AML: Testing and Performance Metrics
Best practices for validating ML models in production AML systems, including backtesting strategies, A/B testing approaches, and ongoing performance monitoring frameworks.
Why Model Validation Matters
Deploying an ML model to production without rigorous validation is like launching a rocket without testing—catastrophic failures are inevitable. In AML, the stakes are particularly high: missed detections allow financial crime to proliferate, while excessive false positives overwhelm compliance teams and damage customer relationships.
Regulators increasingly require documented model validation. The OCC's Model Risk Management guidance (OCC Bulletin 2011-12, which mirrors the Federal Reserve's SR 11-7) and similar frameworks worldwide mandate independent review, ongoing monitoring, and clear documentation of model limitations.
Validation Framework
Our validation approach spans the entire model lifecycle:
1. Pre-Deployment Validation
Before any model reaches production:
Holdout Testing
- Temporal Split: Train on older data, test on recent data (simulates real-world deployment)
- 20% Holdout Set: Never seen during training or hyperparameter tuning
- Stratified Sampling: Ensure rare events (confirmed money laundering cases) are represented in the test set
Cross-Validation
- Time Series CV: Rolling window validation preserving temporal ordering
- 5-Fold Validation: Assess model stability across different data subsets
- Consistency Check: Performance should not vary wildly across folds
Example: Temporal Cross-Validation
- Fold 1: Train on Jan-Jun 2024, Test on Jul 2024
- Fold 2: Train on Jan-Jul 2024, Test on Aug 2024
- Fold 3: Train on Jan-Aug 2024, Test on Sep 2024
- Fold 4: Train on Jan-Sep 2024, Test on Oct 2024
- Fold 5: Train on Jan-Oct 2024, Test on Nov 2024

Average performance across folds: Precision 88.2% (±2.1%), Recall 93.7% (±1.8%)
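An expanding-window split like this is short to script. The sketch below is illustrative only: it assumes a pandas DataFrame with a timestamp column, scikit-learn-style models, and placeholder column names rather than the production pipeline.

```python
# Illustrative expanding-window temporal validation.
# Column names, label name, and model_factory are assumptions.
import pandas as pd
from sklearn.metrics import precision_score, recall_score

def temporal_cv(df, model_factory, feature_cols,
                label_col="is_suspicious", time_col="timestamp",
                test_months=("2024-07", "2024-08", "2024-09", "2024-10", "2024-11")):
    """Train on everything before each test month, evaluate on that month."""
    month = df[time_col].dt.to_period("M").astype(str)
    folds = []
    for test_month in test_months:
        train, test = df[month < test_month], df[month == test_month]
        model = model_factory()  # fresh, untrained model for each fold
        model.fit(train[feature_cols], train[label_col])
        preds = model.predict(test[feature_cols])
        folds.append({"test_month": test_month,
                      "precision": precision_score(test[label_col], preds),
                      "recall": recall_score(test[label_col], preds)})
    return pd.DataFrame(folds)
```

Averaging the per-fold precision and recall, and looking at their spread, gives the consistency check described above.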
2. Performance Metrics
Accuracy alone is meaningless in AML: when 99.9% of transactions are legitimate, a model that never alerts is still 99.9% accurate. We track:
| Metric | Definition | Target |
|---|---|---|
| Precision | Of flagged transactions, % truly suspicious | >85% |
| Recall (TPR) | Of true money laundering cases, % detected | >95% |
| False Positive Rate | Of legitimate transactions, % flagged | <5% |
| AUC-ROC | Overall discrimination ability | >0.98 |
| AUC-PR | Precision-recall trade-off | >0.90 |
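These metrics can be computed directly from a labeled holdout set with scikit-learn. The snippet below is a minimal sketch; the 0.5 threshold and variable names are illustrative, and production thresholds are tuned separately.

```python
# Minimal sketch of the detection metrics on a labeled holdout set.
# y_true: 1 for confirmed suspicious, 0 otherwise; y_score: model scores.
import numpy as np
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

def detection_metrics(y_true, y_score, threshold=0.5):
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),               # true positive rate
        "false_positive_rate": fp / (fp + tn),
        "auc_roc": roc_auc_score(y_true, y_score),
        "auc_pr": average_precision_score(y_true, y_score),   # AUC-PR
    }
```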
Business Metrics
Technical metrics must translate to business value:
- Alert Volume: Daily alerts generated (target: <200 for 1M transactions)
- Investigation Time: Average time per alert (target: <20 minutes)
- SAR Conversion Rate: % of alerts that result in a Suspicious Activity Report (target: >15%)
- Cost Per Alert: Total compliance cost divided by alerts investigated
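Each of these reduces to simple ratios over operational counts. A sketch of the arithmetic, with every input assumed rather than pulled from a real system:

```python
# Illustrative arithmetic only; all inputs here are assumptions.
def business_metrics(alerts_investigated, transactions_scored, sars_filed,
                     total_investigation_minutes, total_compliance_cost):
    return {
        "alerts_per_million_txns": alerts_investigated / transactions_scored * 1_000_000,
        "avg_investigation_minutes": total_investigation_minutes / alerts_investigated,
        "sar_conversion_rate": sars_filed / alerts_investigated,
        "cost_per_alert": total_compliance_cost / alerts_investigated,
    }
```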
3. Backtesting
Run the new model on historical data where outcomes are known:
Backtesting Protocol
1. Select historical period (e.g., last 90 days)
2. Run new model on all transactions from that period
3. Compare model alerts to:
   - Previously filed SARs (should catch these)
   - Cases marked as false positives (should avoid these)
   - Transactions later confirmed as money laundering (critical test)
4. Calculate precision, recall, and false positive rate on known outcomes
5. Identify edge cases where the model fails
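A minimal sketch of the comparison in steps 3 and 4, assuming an alert table from the new model and a historical disposition table keyed by transaction ID. The table layout, column names, and label values are illustrative assumptions.

```python
# Sketch of the backtest comparison; table and column names are assumptions.
import pandas as pd

def backtest(model_alerts: pd.DataFrame, known_outcomes: pd.DataFrame) -> dict:
    """model_alerts: transaction_id for every alert raised by the new model.
    known_outcomes: transaction_id plus a 'label' column taking the values
    'sar_filed', 'confirmed_laundering', or 'false_positive'."""
    alerted = model_alerts[["transaction_id"]].drop_duplicates().assign(alerted=True)
    merged = known_outcomes.merge(alerted, on="transaction_id", how="left")
    merged["alerted"] = merged["alerted"].fillna(False).astype(bool)

    positive = merged["label"].isin(["sar_filed", "confirmed_laundering"])
    tp = (positive & merged["alerted"]).sum()
    fn = (positive & ~merged["alerted"]).sum()
    fp = (~positive & merged["alerted"]).sum()
    tn = (~positive & ~merged["alerted"]).sum()

    return {
        "recall_on_known_cases": tp / (tp + fn),
        "precision_on_labeled_set": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
        "missed_cases": merged[positive & ~merged["alerted"]],  # edge cases to review
    }
```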
4. A/B Testing in Production
Shadow mode and gradual rollout minimize risk:
Phase 1: Shadow Mode (4 weeks)
- New model runs alongside existing system
- New model alerts logged but NOT acted upon
- Compare alerts: what does the new model catch? What does it miss? (see the sketch after this list)
- Analysts review sample of new model alerts for quality
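The catch/miss comparison reduces to set differences over alert identifiers. A tiny sketch under that assumption:

```python
# Sketch of the shadow-mode comparison; alert IDs held as plain sets is an assumption.
def compare_alert_sets(existing_alerts: set, shadow_alerts: set) -> dict:
    return {
        "flagged_by_both": len(existing_alerts & shadow_alerts),
        "only_new_model": len(shadow_alerts - existing_alerts),   # potential new catches
        "only_existing": len(existing_alerts - shadow_alerts),    # potential misses to review
    }
```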
Phase 2: Canary Deployment (2 weeks)
- Route 5% of traffic to new model
- Monitor error rates, latency, alert quality
- Immediate rollback capability if issues detected
Phase 3: Gradual Rollout (4 weeks)
- Week 1: 25% traffic
- Week 2: 50% traffic
- Week 3: 75% traffic
- Week 4: 100% traffic (full deployment)
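One common way to implement the percentage split is deterministic hashing on a stable key, so a given customer stays on the same model at a given rollout level. The sketch below illustrates the idea; it is not the production router, and the key choice is an assumption.

```python
# Sketch of deterministic traffic splitting for canary / gradual rollout.
import hashlib

def route_to_new_model(entity_id: str, rollout_pct: int) -> bool:
    """True if this entity should be scored by the new model at the current
    rollout percentage (5, 25, 50, 75, 100). Hashing a stable key keeps each
    entity on the same model as the percentage increases."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100   # stable bucket in [0, 100)
    return bucket < rollout_pct
```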
Ongoing Monitoring
Validation doesn't end at deployment. We continuously monitor model health:
Daily Checks
- Alert Volume: Sudden spikes or drops indicate problems
- Score Distribution: Should remain stable day-to-day
- Latency: Inference time within acceptable bounds
- Error Rates: Failed predictions, timeouts, exceptions
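As one example, the alert-volume check can be as simple as a z-score against a trailing baseline; the window and threshold below are illustrative assumptions.

```python
# Illustrative daily alert-volume check against a trailing baseline.
import numpy as np

def alert_volume_check(todays_alerts, trailing_daily_counts, z_threshold=3.0):
    """Flag today's alert count if it sits far outside the recent norm."""
    mean = np.mean(trailing_daily_counts)
    std = np.std(trailing_daily_counts)
    z = (todays_alerts - mean) / std if std > 0 else 0.0
    return {"z_score": float(z), "anomalous": abs(z) > z_threshold}
```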
Weekly Analysis
- Feature Drift: Are input features changing distribution?
- Prediction Drift: Are model outputs shifting?
- Analyst Feedback: Review true/false positive labels from investigations
- Precision/Recall Trends: Calculate on labeled cases
Monthly Review
- Confusion Matrix: Detailed breakdown of TP, FP, TN, FN
- Error Analysis: Deep dive into false positives and false negatives
- Feature Importance: Has it changed? Why?
- Regulatory Review: Present findings to compliance team
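A sketch of the confusion-matrix breakdown plus the case pulls that feed the error analysis, assuming a DataFrame of investigated alerts with boolean label and prediction columns (the column names are assumptions):

```python
# Sketch of the monthly breakdown; 'label' (investigator-confirmed suspicious)
# and 'alerted' (model flagged) are assumed boolean columns.
import pandas as pd
from sklearn.metrics import confusion_matrix

def monthly_review(cases: pd.DataFrame) -> dict:
    tn, fp, fn, tp = confusion_matrix(
        cases["label"], cases["alerted"], labels=[False, True]).ravel()
    return {
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
        "false_positive_cases": cases[~cases["label"] & cases["alerted"]],
        "false_negative_cases": cases[cases["label"] & ~cases["alerted"]],
    }
```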
Monitoring Dashboard Metrics
Real-Time
- Requests per second
- p50, p95, p99 latency
- Error rate
- Queue depth
Daily Aggregates
- Total transactions scored
- Alerts generated (by severity)
- Score distribution histogram
- Feature value ranges
Detecting Model Degradation
Models degrade over time as the world changes. Key warning signs:
Data Drift
Input feature distributions shift from training data. Use statistical tests:
- Kolmogorov-Smirnov Test: Compare current vs training distributions
- Population Stability Index: Quantify distribution drift
- Alert Threshold: PSI > 0.25 triggers retraining
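Both tests are straightforward to implement. The sketch below uses SciPy's two-sample KS test and a standard quantile-binned PSI with the 0.25 retraining threshold from above; the bin count and epsilon are conventional choices, not tuned values.

```python
# Sketch of the drift tests: two-sample KS test and quantile-binned PSI.
import numpy as np
from scipy.stats import ks_2samp

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI for one feature: 'expected' = training values, 'actual' = current."""
    expected, actual = np.asarray(expected), np.asarray(actual)
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    # Clip current values into the training range so nothing falls outside the bins.
    actual = np.clip(actual, edges[0], edges[-1])
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac, a_frac = np.clip(e_frac, eps, None), np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

def drift_report(train_values, current_values):
    ks_stat, ks_pvalue = ks_2samp(train_values, current_values)
    psi = population_stability_index(train_values, current_values)
    return {"ks_statistic": ks_stat, "ks_pvalue": ks_pvalue,
            "psi": psi, "retrain_flag": psi > 0.25}
```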
Concept Drift
Relationship between features and outcomes changes (criminals adapt tactics):
- Performance Degradation: Precision/recall decline over time
- New Typologies: Emerging schemes model wasn't trained on
- Regulatory Changes: New thresholds or requirements
Challenger Models
Always maintain alternative models for comparison:
- Simpler Baseline: Logistic regression as sanity check
- Rule-Based System: Compare to legacy approach
- Alternative Architecture: Different ML approach (e.g., XGBoost vs Neural Network)
- Ensemble Challenger: Combination of multiple models
Monthly Challenger Comparison
| Model | Precision | Recall | Alerts/Day |
|---|---|---|---|
| Production GNN | 88.2% | 94.1% | 187 |
| Challenger XGBoost | 86.7% | 92.3% | 203 |
| Baseline Logistic | 72.1% | 88.9% | 412 |
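A comparison like the one above can be produced by scoring every candidate on the same labeled evaluation sample. The sketch below assumes fitted models that expose predict_proba; the registry, threshold, and daily volume are illustrative.

```python
# Sketch of a monthly challenger comparison; models, data, and volume are assumptions.
import pandas as pd
from sklearn.metrics import precision_score, recall_score

def compare_models(models: dict, X_eval, y_true, daily_txn_volume, threshold=0.5):
    """models: name -> fitted model exposing predict_proba."""
    rows = []
    for name, model in models.items():
        scores = model.predict_proba(X_eval)[:, 1]
        preds = (scores >= threshold).astype(int)
        rows.append({
            "model": name,
            "precision": precision_score(y_true, preds),
            "recall": recall_score(y_true, preds),
            "alerts_per_day": preds.mean() * daily_txn_volume,
        })
    return pd.DataFrame(rows)
```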
Documentation Requirements
Regulatory compliance requires comprehensive documentation:
- Model Card: Intended use, training data, performance, limitations
- Validation Report: Pre-deployment testing results
- Monitoring Logs: Ongoing performance metrics
- Incident Reports: Model failures and remediation
- Retraining Logs: When and why models are updated
- Independent Review: Third-party validation findings
Conclusion
Model validation is not a checkbox exercise; it is an ongoing commitment to quality, safety, and regulatory compliance. At nerous.ai, we've built validation into every stage of the model lifecycle.
The result: ML models that maintain 95%+ recall with <5% false positive rates in production, backed by documentation that satisfies the most demanding regulators.
Dr. James Liu
Head of ML Engineering at nerous.ai
James leads model development and validation at nerous.ai, ensuring production models meet rigorous quality and regulatory standards.