1. Context and Background
In an investment bank, Value at Risk (VaR) is a critical daily metric used by Risk Managers to assess market exposure. VaR is calculated as of the T-1 Close of Business (COB) and must be delivered within strict timelines to support:
- Regulatory compliance
- Risk oversight
- Trading and capital decisions
The VaR calculation depends on multiple upstream data feeds, each with defined SLAs. A delay or failure in any single feed can block the entire batch, leading to late or missed VaR reporting.
2. Problem Statement
Problem:
SLA breaches in upstream feeds were often detected only after the VaR batch was already delayed, leaving RTB teams with little time to act.
Why this was critical:
- VaR is time-sensitive and market-critical
- Late detection caused panic-driven firefighting
- Downstream systems were also impacted
- No proactive visibility into which feed caused the delay
3. Existing Workflow (Before the Change)
- VaR batch runs for T-1 COB
- Multiple feeds load sequentially
- If a feed fails or breaches SLA:
  - Issue is often discovered at the end of the process
  - RTB teams react late, under time pressure
  - VaR delivery and downstream processes are delayed
This was a reactive system, heavily dependent on manual investigation.
4. User Personas
Primary Users
Run-the-Bank (RTB) Support Teams
- Responsible for feed monitoring and issue resolution
- Need early signals to act quickly
Secondary Users
Risk Managers & Downstream Consumers
- Depend on timely VaR delivery
- Impacted indirectly by upstream delays
5. Root Cause Analysis (PM Thinking)
The core issue was a lack of real-time SLA visibility.
Key gaps:
- SLA performance was calculated but not surfaced proactively
- No immediate alert when a breach occurred
- No prioritisation of issues based on business criticality
- No simple signal indicating system health
6. Opportunity Identification
This was an opportunity to shift from:
Reactive incident handling → Proactive risk prevention
By detecting SLA breaches as soon as they occurred, RTB teams could:
- Act earlier
- Resolve issues before VaR timelines were impacted
- Reduce operational stress and escalation cycles
7. Proposed Solution
Feature: SLA Breach Monitoring & Alerting System
A traffic-light–based alerting mechanism was introduced to monitor feed completion against SLAs and notify RTB teams immediately upon breach.
Key Capabilities
1. SLA Tracking
  - SLA defined per feed
  - Daily comparison of actual completion time vs SLA threshold
2. Traffic Light Status
  - 🟢 Green: Feed completed within SLA
  - 🟡 Amber: Feed still running and approaching SLA threshold
  - 🔴 Red: SLA breached
3. Real-Time Notifications
  - Automated alerts sent to RTB teams via internal messaging channels (e.g., Themes)
4. Actionability
  - Each alert clearly identified:
    - Affected feed
    - The SLA that was breached
    - Potential impact on VaR timelines
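The capabilities above can be sketched in a few lines of Python. This is an illustrative sketch only, not the production implementation: the names `FeedSla`, `classify` and `format_alert`, the `amber_window` parameter, the feed name and the message wording are all assumptions made for the example. It also assumes that a feed still running well ahead of its deadline is shown as green.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Status(Enum):
    GREEN = "green"
    AMBER = "amber"
    RED = "red"


@dataclass(frozen=True)
class FeedSla:
    feed: str                # hypothetical feed name
    deadline: datetime       # SLA threshold for this feed
    amber_window: timedelta  # assumed lead time that counts as "approaching"


def classify(sla: FeedSla, now: datetime, completed_at: Optional[datetime]) -> Status:
    """Return the traffic-light status of one feed as seen at time `now`."""
    if completed_at is not None:
        # A finished feed is judged on its completion time alone.
        return Status.GREEN if completed_at <= sla.deadline else Status.RED
    if now > sla.deadline:
        return Status.RED      # still running after the deadline: breached
    if now >= sla.deadline - sla.amber_window:
        return Status.AMBER    # still running inside the warning window
    return Status.GREEN        # assumption: not yet at risk is shown as green


def format_alert(sla: FeedSla, status: Status, now: datetime) -> str:
    """Build a one-line message naming the feed, the SLA and the likely impact."""
    minutes = int(abs(now - sla.deadline).total_seconds() // 60)
    if status is Status.RED:
        detail = f"SLA {sla.deadline:%H:%M} breached {minutes} min ago; VaR batch may be delayed"
    else:
        detail = f"{minutes} min left before SLA {sla.deadline:%H:%M}"
    return f"[{status.name}] {sla.feed}: {detail}"


# Example with made-up values: a feed due at 06:00 with a 30-minute amber window.
sla = FeedSla("positions", datetime(2024, 1, 2, 6, 0), timedelta(minutes=30))
now = datetime(2024, 1, 2, 6, 10)
status = classify(sla, now, completed_at=None)
if status is not Status.GREEN:
    print(format_alert(sla, status, now))
```

Keeping classification separate from message formatting means the same status can drive both a dashboard and a chat notification; how the message is actually delivered is left out because it depends on the messaging channel in use.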
8. Why This Was the Right Solution
- Minimal disruption to existing systems
- No change to VaR calculation logic
- Focused on visibility and early intervention
- Aligned with bank-wide operational risk principles
This was a low-risk, high-impact internal product enhancement.
9. Success Metrics
North Star Metric
On-Time VaR Delivery Rate
Input Metrics
- Number of SLA breaches detected proactively
- Average time between SLA breach and RTB action
- Reduction in manual investigation time
Outcome Metrics
- Reduction in VaR delays caused by upstream feeds
- Reduction in downstream processing delays
- Improved operational stability during EOD processing
Guardrail Metrics
- False-positive alert rate
- Alert fatigue for RTB teams
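As a rough illustration of how three of these metrics could be computed from logged data, the sketch below derives the on-time VaR delivery rate, the average time between an alert and RTB action, and the false-positive alert rate. The record types `DailyRun` and `Alert` and their fields are hypothetical; the real data model is not described here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class DailyRun:
    var_deadline: datetime      # agreed delivery time for the day's VaR
    var_delivered_at: datetime  # when VaR was actually delivered


@dataclass(frozen=True)
class Alert:
    raised_at: datetime
    actioned_at: Optional[datetime]  # None if nobody acted on the alert
    genuine: bool                    # True if a real feed problem was confirmed


def on_time_rate(runs: Sequence[DailyRun]) -> Optional[float]:
    """North Star: share of days on which VaR met its deadline."""
    if not runs:
        return None
    return sum(r.var_delivered_at <= r.var_deadline for r in runs) / len(runs)


def mean_minutes_to_action(alerts: Sequence[Alert]) -> Optional[float]:
    """Input metric: average minutes from alert to first RTB action."""
    delays = [
        (a.actioned_at - a.raised_at).total_seconds() / 60
        for a in alerts
        if a.actioned_at is not None
    ]
    return sum(delays) / len(delays) if delays else None


def false_positive_rate(alerts: Sequence[Alert]) -> Optional[float]:
    """Guardrail: share of alerts that did not correspond to a real problem."""
    if not alerts:
        return None
    return sum(not a.genuine for a in alerts) / len(alerts)
```

Each function returns `None` rather than zero when there is no data, so an empty reporting period is not mistaken for a perfect or a failing one.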
10. Impact & Results (Qualitative, Safe to Share)
- RTB teams received early visibility into feed issues
- Panic-driven escalations were significantly reduced
- Issues were resolved earlier in the processing window
- Downstream systems experienced fewer cascading delays
- Overall confidence in VaR delivery improved
11. Risks & Trade-offs
- Too many alerts → mitigated with threshold tuning
- Alert fatigue → mitigated via traffic-light prioritisation
- Dependency on SLA accuracy → SLAs reviewed and standardised
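One possible way to approach threshold tuning is to replay historical feed runs against several candidate amber windows and count how many amber alerts would have fired for feeds that still finished on time. The sketch below assumes that each historical run is summarised by its margin (SLA deadline minus completion time, negative for a breach); the function name `noisy_amber_counts` and this input shape are illustrative and not taken from the actual system.

```python
from datetime import timedelta
from typing import Dict, Sequence


def noisy_amber_counts(
    margins: Sequence[timedelta], windows: Sequence[timedelta]
) -> Dict[timedelta, int]:
    """For each candidate amber window, count runs that would have gone amber
    but still completed within SLA.

    A run goes amber when it is still running once the window opens, which is
    the case when its margin is smaller than the window. Only runs with a
    non-negative margin are counted, since those did not end in a breach.
    """
    zero = timedelta(0)
    return {w: sum(zero <= m < w for m in margins) for w in windows}
```

A wider window gives RTB teams more warning but raises more amber alerts for feeds that would have finished in time anyway, so a count like this gives a starting point for choosing the trade-off.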
12. Final Impact Statement (Portfolio-Ready)
By introducing proactive SLA breach alerting with a traffic-light system, the VaR reporting pipeline shifted from reactive firefighting to early risk mitigation. This significantly improved operational resilience, reduced last-minute escalations, and helped ensure timely delivery of one of the bank’s most critical risk metrics.