Financial Services · Fraud Detection

Real-Time Fraud Detection for a Top 20 Global Bank: From Failed Program to Production in 14 Weeks

Client Type Top 20 Global Bank
Engagement Duration 14 Weeks
Markets Deployed 8 Countries
Daily Transaction Volume 2.3 Million
Services AI Strategy, Implementation, Vendor Selection
94.7% Fraud Detection Rate
$12M Annual Fraud Savings
0.03% False Positive Rate
14 Weeks PoC to Production
Situation

22 Months and Two Failed Programs Later

When the bank's Chief AI Officer reached out to us in early 2025, the fraud detection AI program had already consumed 22 months of development time and two full systems integrator (SI) engagements. The first attempt, a bespoke rule-engine hybrid, was abandoned after 8 months when detection rates stalled at 71%. The second, a major cloud AI platform implementation led by a global SI, achieved 83% detection in controlled testing but failed the bank's model risk management validation because its explainability documentation was insufficient for SR 11-7 purposes.

The organization faced genuine urgency: card fraud losses in their Asia-Pacific markets alone were running at $2.8M per month. The existing legacy system, a vendor-managed rules engine installed in 2014, detected only 69% of fraud events and generated a 1.4% false positive rate. At the bank's transaction volumes, that meant roughly 32,000 legitimate transactions declined daily, each one a customer experience failure.

The core problem was not technical. Both previous programs had the right data, the right platforms, and reasonably capable teams. The failure was architectural: neither program had been designed around the bank's specific model risk governance requirements from the start. Both had tried to retrofit explainability and audit trail capability after the fact, which is almost always fatal to a financial services AI deployment.

Challenge

Four Constraints That Previous Programs Had Ignored

The first thing we did was conduct a structured failure analysis of both prior programs. This is standard practice for us when taking on a rescue engagement. You need to understand exactly why the previous attempts failed before committing to a new approach. In this case, four constraints had been systematically underweighted:

  • SR 11-7 Model Risk Management compliance: Every model required documented pre-implementation conceptual soundness review, ongoing performance monitoring, and outcome analysis. The previous programs had produced model performance documentation but not the conceptual soundness documentation that the bank's Model Risk Management team required for sign-off.
  • Sub-100ms inference latency: The bank's payment processing infrastructure required fraud scoring decisions within 85ms of transaction initiation. This ruled out several architectures that could have delivered higher accuracy but would have added latency.
  • Multi-jurisdiction feature constraints: Eight markets meant eight different sets of permissible data attributes for scoring. What was legally available for fraud scoring in the UK differed from Singapore differed from Australia. Previous programs had built for a single feature set and then tried to strip features out market-by-market, causing severe accuracy degradation.
  • Explainability requirements for disputes: When a transaction was declined, customer service representatives needed a human-readable explanation. Previous programs had produced model output but no interpretable reasoning layer for operational use.

Understanding these four constraints properly changed the entire architecture. We did not begin with the question "what model produces the best detection rate?" We began with the question "what model architecture can satisfy all four constraints simultaneously?"

Approach

Architecture-First, Compliance-In, Not Retrofit

Our approach was structured around a single principle: build SR 11-7 compliance into the model architecture and development process from week one, not as a final validation step. This sounds obvious. In practice, virtually every AI program in financial services treats regulatory compliance as a documentation exercise at the end of development. That is why programs fail.

Model Architecture: Gradient Boosted Ensemble with SHAP Explainability

We selected a gradient boosted decision tree ensemble (XGBoost base, LightGBM secondary) as the primary scoring layer. This decision was not about optimizing for detection rate in isolation. Tree ensembles admit exact, efficient per-feature attribution via SHAP (SHapley Additive exPlanations) values, satisfying the explainability requirement at both the model validation level and the operational customer service level.
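The operational side of this explainability requirement can be sketched as follows. This is an illustrative reconstruction, not the bank's code: the feature names, contribution values, and reason texts are hypothetical. Given per-feature SHAP contributions for a declined transaction, it surfaces the top drivers as human-readable reason codes for customer service.

```python
# Illustrative sketch: map per-feature SHAP contributions for a declined
# transaction to human-readable reason codes. Feature names, values, and
# reason texts are hypothetical, not the bank's actual schema.

REASON_CODES = {
    "txn_velocity_24h": "Unusually high number of transactions in the last 24 hours",
    "geo_displacement": "Transaction location inconsistent with recent card activity",
    "merchant_seq_risk": "High-risk sequence of merchant categories",
    "device_mismatch": "Transaction device does not match known devices",
}

def explain_decline(shap_values: dict[str, float], top_n: int = 2) -> list[str]:
    """Return the top_n human-readable reasons pushing the score toward fraud.

    Positive SHAP contributions push the score toward the fraud class.
    """
    drivers = sorted(
        ((f, v) for f, v in shap_values.items() if v > 0),
        key=lambda fv: fv[1],
        reverse=True,
    )
    return [REASON_CODES.get(f, f) for f, _ in drivers[:top_n]]

contributions = {
    "txn_velocity_24h": 0.42,
    "geo_displacement": 0.31,
    "merchant_seq_risk": -0.05,
    "device_mismatch": 0.08,
}
print(explain_decline(contributions))
```

The key design point is that the same attribution values serve two audiences: the MRM team sees them in validation documentation, and customer service sees them translated into plain language.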

A secondary neural scoring layer was added for complex pattern sequences (card-not-present fraud chains across multiple merchants) where the GBDT ensemble showed weaker performance. The neural layer outputs were passed through a calibration module that converted raw probabilities to bank-interpretable risk bands, which the MRM team had pre-agreed as acceptable for the SR 11-7 conceptual soundness review.
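A minimal sketch of a calibration module of this kind appears below. The band boundaries, names, and actions are hypothetical placeholders, not the thresholds the bank's MRM team actually agreed.

```python
# Illustrative sketch of a probability-to-risk-band calibration module.
# Band boundaries and labels are hypothetical, not the bank's agreed values.

RISK_BANDS = [
    (0.02, "LOW"),       # approve without friction
    (0.15, "ELEVATED"),  # step-up verification (e.g. 3-D Secure)
    (0.50, "HIGH"),      # decline and queue for analyst review
    (1.01, "CRITICAL"),  # decline and block card pending customer contact
]

def to_risk_band(p_fraud: float) -> str:
    """Map a calibrated fraud probability in [0, 1] to a risk band label."""
    for upper, band in RISK_BANDS:
        if p_fraud < upper:
            return band
    raise ValueError(f"probability out of range: {p_fraud}")

print(to_risk_band(0.30))  # → HIGH
```

Discretizing the neural layer's output this way is what made it reviewable: MRM signs off on a small, fixed set of band-to-action mappings rather than on a continuous score.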

Feature Engineering: Market-Stratified from Day One

Rather than building a single global feature set, we designed a market-stratified feature engineering pipeline from the outset. Each market had a defined permissible feature set reviewed by local legal counsel. A feature availability matrix was published in week two and became the governing document for all feature engineering work. When a feature was permissible in 6 of 8 markets, the model architecture accommodated its absence for the remaining 2 rather than treating it as a degradation.
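In code, a feature availability matrix of this kind can be as simple as a mapping from feature to permitted markets, queried at both training and inference time. The market codes and feature names below are hypothetical illustrations, not the bank's actual matrix.

```python
# Illustrative sketch of a market-stratified feature availability matrix.
# Market codes, feature names, and permissions are hypothetical.

FEATURE_MATRIX = {
    "device_fingerprint": {"UK", "SG", "AU", "HK", "US", "CA"},
    "geo_displacement":   {"UK", "SG", "AU", "HK", "US", "CA", "DE", "FR"},
    "txn_velocity_24h":   {"UK", "SG", "AU", "HK", "US", "CA", "DE", "FR"},
    "merchant_seq_risk":  {"UK", "AU", "US", "CA", "DE", "FR"},
}

def features_for_market(market: str) -> list[str]:
    """Permissible scoring features for one market, per the matrix."""
    return sorted(f for f, markets in FEATURE_MATRIX.items() if market in markets)

print(features_for_market("DE"))
```

A feature that is impermissible in a market is presented to the model as missing, which gradient boosted trees handle natively, rather than being stripped out after the fact.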

Core features included: transaction velocity patterns (24h, 7d, 30d windows), merchant category sequence analysis, device fingerprint matching, geographic displacement scoring, and time-of-day behavioral anomaly detection. The real-time inference pipeline was built on Apache Kafka with a feature serving layer that pre-computed 180-day rolling behavioral baselines, ensuring that even complex feature lookups stayed within the 85ms latency budget.
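The velocity features above can be sketched as rolling-window counts over a cardholder's transaction history. This is a simplified illustration under assumed data structures; in production these counts come from the pre-computed Kafka feature serving layer rather than a per-request scan.

```python
# Illustrative sketch of transaction-velocity features: counts of a
# cardholder's prior transactions inside rolling windows ending at scoring
# time. Window sizes mirror the 24h/7d/30d windows described above; the
# in-memory data layout is hypothetical.
from datetime import datetime, timedelta

WINDOWS = {
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}

def velocity_features(txn_times: list[datetime], now: datetime) -> dict[str, int]:
    """Count prior transactions falling inside each rolling window."""
    return {
        name: sum(1 for t in txn_times if now - span <= t < now)
        for name, span in WINDOWS.items()
    }

now = datetime(2025, 3, 1, 12, 0)
history = [now - timedelta(hours=h) for h in (1, 5, 30, 100, 500)]
print(velocity_features(history, now))  # → {'24h': 2, '7d': 4, '30d': 5}
```

Pre-computing these aggregates per cardholder is what keeps the lookup inside the 85ms budget: at inference time the pipeline reads maintained counters instead of scanning raw history.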

Model Risk Documentation: Concurrent, Not Sequential

Every model development decision was documented concurrently with its implementation. The conceptual soundness documentation was written in real time by a practitioner who was simultaneously participating in the technical decisions. This is the single most important process change we made. It is also the most unusual: most programs treat documentation as a separate workstream handled by a different team at the end of development.

By week 10, the full SR 11-7 documentation package was complete, reviewed by the bank's MRM team, and approved in principle. This had never happened in either previous program, where MRM review was treated as a final gate that invariably identified documentation gaps requiring weeks of remediation.

Production Deployment: Blue-Green with Shadow Mode Validation

Rather than a direct cutover from the legacy system, we deployed the new models in shadow mode alongside the live legacy system for three weeks. Shadow mode means the new model scored every transaction but did not take action on those scores. This produced a 21-day comparison dataset showing exactly where the new model agreed with the legacy system, where it disagreed, and what the outcomes of those disagreements were.
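The disagreement analysis at the heart of shadow mode reduces to a simple tally over jointly scored transactions, once eventual fraud labels are known. The record layout below is a hypothetical simplification of what the comparison pipeline would consume.

```python
# Illustrative sketch of the shadow-mode comparison: both systems score the
# same live transactions, only the legacy decision takes effect, and
# disagreements are classified against eventual fraud labels. The record
# layout is hypothetical.
from collections import Counter

def shadow_compare(records: list[dict]) -> Counter:
    """Tally agreement and classified disagreement between the two systems.

    Each record: {"legacy": bool, "shadow": bool, "fraud": bool},
    where True means the system flagged the transaction as fraud.
    """
    tally = Counter()
    for r in records:
        if r["legacy"] == r["shadow"]:
            tally["agree"] += 1
        elif r["shadow"] and r["fraud"]:
            tally["shadow_caught_missed_fraud"] += 1   # incremental detection
        elif r["shadow"] and not r["fraud"]:
            tally["shadow_false_positive"] += 1
        elif not r["fraud"]:
            tally["legacy_false_positive_avoided"] += 1
        else:
            tally["legacy_caught_shadow_missed"] += 1  # regression to watch
    return tally

records = [
    {"legacy": False, "shadow": True,  "fraud": True},   # new model caught missed fraud
    {"legacy": True,  "shadow": True,  "fraud": True},   # both flagged
    {"legacy": True,  "shadow": False, "fraud": False},  # legacy false positive avoided
]
print(shadow_compare(records))
```

Aggregated over 21 days, exactly these buckets yield the headline figures cited below: incremental fraud the new model would have caught, and legitimate declines it would have avoided, with zero production risk.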

The shadow mode data was the single most powerful artifact in the final MRM sign-off. Seeing 3.2 million real transactions on which the new model would have detected $2.1M in additional fraud while generating 94% fewer false positives resolved the remaining internal skepticism about the program.

14-Week Deployment Timeline

Wks 1-2

Failure Analysis and Architecture Design

Structured review of both prior programs. Feature permissibility legal review across all 8 markets. Architecture decision with SR 11-7 compliance mapped in. MRM team alignment meeting and documentation framework agreed.

Wks 2-5

Data Pipeline and Feature Engineering

Market-stratified feature engineering pipeline built on Kafka. 180-day behavioral baselines computed for 14M active cardholders. Feature availability matrix finalized and approved by legal. Latency testing confirms 42ms p99 inference time.

Wks 5-9

Model Development and Concurrent Documentation

GBDT ensemble trained and validated against 18-month historical fraud labels. SHAP integration and operational explainability layer built. Neural layer for complex sequence patterns developed. All decisions documented concurrently for SR 11-7.

Wks 9-11

Shadow Mode Deployment and MRM Review

New model deployed in shadow mode alongside legacy system. 3.2M live transactions scored and compared. Full SR 11-7 documentation package submitted to MRM. MRM approval received in week 11, first time in program history.

Wks 11-14

Live Production Cutover Across 8 Markets

Phased cutover: Singapore and Hong Kong week 11, UK and Australia week 12, Germany and France week 13, US and Canada week 14. Monitoring dashboards configured for daily performance reporting to MRM and operational teams.

Measured Results at 6 Months Post-Deployment

Primary Outcome 94.7%
Fraud detection rate across all 8 markets combined. Up from 69% on legacy system. Benchmarked against industry average of 88% for comparable transaction profiles.
Annual Fraud Savings $12M
Direct fraud loss reduction attributable to improved detection. Asia-Pacific region alone accounts for $4.8M. Calculated on 6-month actual post-deployment data extrapolated to annual rate.
False Positive Rate 0.03%
Down from 1.4% on legacy system. Represents roughly 31,000 fewer legitimate transaction declines daily at current volumes. Customer satisfaction scores in fraud-impacted cohorts improved 14 points (NPS).
Regulatory Standing 100%
SR 11-7 compliant with full conceptual soundness, performance monitoring, and ongoing validation documentation. First AI model in the bank to achieve clean MRM sign-off on initial submission.

5 Things This Engagement Proved

01
Compliance-in beats compliance-retrofit, always. Both previous programs failed at the final MRM gate. We passed on first submission because we treated SR 11-7 as an architectural constraint, not a documentation exercise. That single change accounts for half the time savings versus the prior attempts.
02
Shadow mode is not optional for financial services AI. The 21 days of shadow mode data were worth more than 6 months of internal program advocacy. Seeing $2.1M in incremental fraud detection on real transactions, with no production risk, resolved every remaining concern the MRM team and CRO had about the program.
03
Multi-jurisdiction design from day one, not day 80. Designing for market-stratified features at the start meant the system performed well across all 8 markets. Any architecture that treats international deployment as a late-stage feature stripping exercise will produce models that are either non-compliant in some markets or severely degraded in others.
04
Latency is a first-class model constraint, not an engineering afterthought. The 85ms requirement eliminated several architectures that would have produced marginally higher detection rates. Accepting that constraint early produced a system that actually works in production. Ignoring it would have produced a system that performs well in isolation and fails in operation.
05
The best fraud model is the one that gets deployed. Two previous programs built technically capable models that never reached production. A 94.7% detection rate running in production delivers more value than a 97% detection rate still in the MRM review queue. Deployment feasibility is as important as model performance.

"We had been at this for nearly two years and had nothing in production. In 14 weeks, AI Advisory Practice not only shipped a production system; it delivered the first AI model our MRM team had ever approved on first submission. The shadow mode data alone was worth the entire engagement fee. We now have a playbook we are using for every subsequent AI deployment."

Chief AI Officer
Top 20 Global Bank
Start Your Program

Your AI Program Should Be in Production, Not Stuck in Governance Review

The bank in this case study had all the ingredients for success before we arrived. What they lacked was a deployment approach built around their compliance requirements from day one. That is a solvable problem. Tell us where you are in your AI program and we will tell you exactly what needs to change.

  • Free assessment with no obligation or sales pressure
  • Senior practitioners only, no junior analyst delivery
  • Response within 1 business day
  • Relevant to programs at any stage, including rescue engagements

Request a Free AI Readiness Assessment

Tell us about your program and we will follow up within 1 business day.
