
What an airline seat-upgrade problem taught me about constrained decision-making, and why the same framework applies everywhere from oil and gas to last-mile delivery.
Every high-stakes industry shares a common problem: how do you make the right decision, at the right moment?
An airline with 10 empty seats in business class a few hours before departure faces a choice: discount and provide attractive upgrades to fill the plane, or hold price and risk flying with revenue left on the table.
Now scale that up. Imagine handling not a single flight but thousands of flights per day across an airline ticketing system. Each traveler checking in has a different loyalty history, point balance, and price sensitivity. The system has to pick the right promotional offer for this passenger, right now, so that they upgrade.
Simple enough on the surface. But here’s what makes it challenging: you can’t just maximize revenue. The question is: what’s the right offer, for the right person, at the right time?
This sounds like a pricing problem. It’s actually a sequential decision-making problem, and solving it well requires a fundamentally different class of thinking than standard machine learning. I built a system that solves exactly this. Here’s the story of how I designed an adaptive policy that learns which offer to show each passenger, in real time, without ever crossing the business constraint, what I learned along the way, and why the framework matters well beyond airlines or loyalty programs.
The Tension That Makes This Interesting
Imagine the business objective is to maximize revenue, and the system offers three upgrade discount tiers: 40%, 45%, and 50% off. But there’s a constraint: deeper discounts fill more seats, yet they erode revenue per seat. Say the business requires, to stay profitable, that revenue per seat remain above a threshold at all times. The 50% discount is the most compelling to passengers, but its unit price falls below that threshold. A system that simply chases the highest upgrade rate would hand out 50% discounts to everyone and blow through the pricing floor within hours.
On the other hand, playing it safe with only 40% discounts would leave empty seats.
The real opportunity is in the middle: figure out which passengers need the deeper push and which will upgrade at a smaller discount, then allocate accordingly, all while keeping the average revenue per seat above the floor at every point in the campaign.
This isn’t a prediction problem. It’s a decision problem. And the constraint has to hold in real time, not just on average at the end.
Sound familiar? This is how a lot of industries still operate: defaulting to the most generous offer because it’s the safest bet, not because it’s the optimal one. The question is: can you build a system that finds the optimal offer for each passenger without breaking the rules?
Why Standard ML Gets This Wrong
Before building anything, I had to be honest about what kind of problem this actually is. You might ask: the airline has catered to millions of passengers and has years of behavioral data. Why not just frame this as a prediction problem? Train a model on historical conversions, pick the offer with the highest expected revenue. Clean, familiar, auditable.
But three key things break this framing.
You can’t observe the road not taken. The issue is counterfactuality! If a member was shown the 50% offer and bought, you have no idea whether they would have bought at 40%. This is the fundamental problem of causal inference: you only ever see one branch of the decision tree. An airline knows that a passenger paid $450 for seat 14A. It has no idea whether they would have paid $520.
Historical data encodes historical decisions. The training data reflects the choices the old system made, not a fair experiment. If the previous system heavily favored one offer over the others, any model trained on that data inherits the skew: it has far more evidence about what that offer does than about the alternatives. You’re not learning about the world. You’re learning about what one biased policy decided to do.
Constraints don’t care about averages. A pricing floor that holds in expectation can still breach in any specific rolling window. An airline hitting its pricing targets across the network can still be giving away business class on a specific route. Real-time constraints require real-time enforcement, not retrospective averaging.
The Framework: Learn, Predict, Constrain
To solve this problem, I used a historical storefront dataset of 10,000 sessions, each recording the passenger’s context (loyalty tier, point balance, purchase history, visit recency), which discount they were shown, and whether they upgraded. Not a large dataset by modern standards, which made every design choice matter more.
I built the system around three components that work together on every single decision.
Learn — Thompson Sampling
Thompson Sampling is the decision engine. It maintains a probabilistic belief about how well each offer performs for different member types and updates that belief continuously as new outcomes arrive. This is the quintessential exploration-versus-exploitation problem, and TS handles it natively.
The key property: exploration is automatic and proportional to uncertainty. If the 40% offer has sparse historical data, TS explores it aggressively because it genuinely doesn’t know how members respond to it. If the 50% offer has thousands of sessions, TS exploits that knowledge confidently. No manual tuning, no hyperparameter adjustment, no exploration schedule to maintain.
Think of it like a geologist deciding where to drill. With limited seismic data in a new basin, you explore widely. As evidence accumulates around a productive formation, you focus your drilling program there. TS does the same thing but in real time, on every decision.
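The mechanics are easier to see in code. Below is a minimal Beta-Bernoulli Thompson Sampling sketch over the three discount tiers, tracking conversion only; the production system models revenue and context as well, so treat this as the skeleton of the idea rather than the implementation.

```python
import random

class ThompsonSampler:
    """Minimal Beta-Bernoulli Thompson Sampling over a fixed set of arms."""

    def __init__(self, arms):
        # One (successes, failures) count pair per arm; Beta(1, 1) uniform prior.
        self.posteriors = {arm: [1, 1] for arm in arms}

    def choose(self):
        # Sample a plausible conversion rate from each arm's posterior
        # and play the arm whose draw is highest. Uncertain arms produce
        # spread-out draws, so they get explored automatically.
        draws = {arm: random.betavariate(a, b)
                 for arm, (a, b) in self.posteriors.items()}
        return max(draws, key=draws.get)

    def update(self, arm, converted):
        # Bayesian update: increment the success or failure count.
        self.posteriors[arm][0 if converted else 1] += 1

sampler = ThompsonSampler(["40%", "45%", "50%"])
arm = sampler.choose()               # explore/exploit in one step
sampler.update(arm, converted=True)  # fold the observed outcome back in
```

Over many sessions the counts concentrate, the draws tighten, and the sampler settles on the best-performing arms without any explicit exploration schedule.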
Predict — Two-Stage Reward Modeling
Before TS can make a decision, it needs to estimate the value of each offer for this specific member. I use a two-stage model:
Stage 1 predicts purchase probability: how likely this person is to buy at all. Stage 2 predicts spend conditional on purchase: how much revenue the sale would generate if they do. Separating them matters because 90% of sessions end with zero revenue, which would dominate a single-stage model and obscure the signal.
Both models are arm-specific: the same member gets three different predictions, one per offer, using one-hot encoded arm indicators in the feature vector. This means the system genuinely learns that high-balance members respond differently to the 40% vs 50% offer, not just that they’re more likely to buy in general.
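Here is a sketch of the two-stage structure on synthetic data. The feature names, coefficients, and data-generating process are invented for illustration; the point is the shape: one-hot arm indicators alongside the member context, stage 2 trained only on purchasers, and expected revenue per arm as the product of the two stages.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(0)
n, n_arms = 2000, 3

context = rng.normal(size=(n, 4))        # stand-ins: tier, balance, history, recency
arm = rng.integers(0, n_arms, size=n)    # which discount tier was shown
arm_onehot = np.eye(n_arms)[arm]         # arm-specific indicators
X = np.hstack([context, arm_onehot])

# Synthetic outcomes: deeper discounts convert better but earn less per sale.
p_buy = 1 / (1 + np.exp(-(context[:, 0] + 0.5 * arm - 1.5)))
bought = rng.random(n) < p_buy
spend = np.where(bought, 500 - 50 * arm + rng.normal(0, 20, n), 0.0)

# Stage 1: purchase probability. Stage 2: spend, fit on purchasers only,
# so the mass of zero-revenue sessions never drowns the spend signal.
stage1 = LogisticRegression(max_iter=1000).fit(X, bought)
stage2 = Ridge().fit(X[bought], spend[bought])

def expected_revenue(member_context):
    """One expected-revenue estimate per arm for a single member."""
    rows = np.hstack([np.tile(member_context, (n_arms, 1)), np.eye(n_arms)])
    return stage1.predict_proba(rows)[:, 1] * stage2.predict(rows)

print(expected_revenue(context[0]))  # three values, one per discount tier
```

Because the arm indicator is a feature, scoring the same member under all three indicators yields three genuinely different predictions, which is exactly what TS needs as input.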
Constrain — Three-Layer PPP Enforcement
Revenue maximization alone would converge on the 50% offer for everyone. The constraint mechanism, built around a price-per-point (PPP) floor, prevents that without sacrificing adaptivity.
Layer 1 — Hard guardrail: Before serving the 50% offer, simulate its PPP impact using the calibrated purchase model. If it would push campaign or rolling window PPP below $0.016, block it. No exceptions.
Layer 2 — Lagrangian penalty: An adaptive multiplier λ that increases as PPP approaches the floor. This softly steers the policy away from the boundary before the hard block kicks in — like a speed limiter that activates before the brakes.
Layer 3 — Usage floor: A rolling 200-session window enforces minimum 10% session share per offer. Prevents starvation. Ensures all three offers continuously accumulate real outcome data — which feeds better posterior estimates, which feeds better decisions.
These three layers operate in priority order on every session. Usage floor first. PPP guardrail second. TS among the remaining feasible offers third.
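To make the priority order concrete, here is a minimal sketch of the per-session decision flow. The thresholds are the ones stated above; `projected_ppp`, `ts_sample`, and the small boundary margin inside the penalty are hypothetical stand-ins, and the exact interplay between layers is my reading of the design rather than production logic.

```python
PPP_FLOOR = 0.016    # hard pricing floor (dollars per point)
USAGE_FLOOR = 0.10   # minimum session share per offer
WINDOW = 200         # rolling window length

def choose_offer(arms, recent_arms, projected_ppp, ts_sample, lam=0.0):
    """arms: offer ids in fixed order; recent_arms: past decisions;
    projected_ppp(arm): simulated PPP if this arm is served now;
    ts_sample(arm): one Thompson draw of expected revenue;
    lam: adaptive Lagrangian multiplier (0 disables the soft penalty)."""
    # Priority 1 -- usage floor: revive any starving arm, but never
    # at the cost of breaching the hard PPP guardrail.
    window = recent_arms[-WINDOW:]
    for arm in arms:
        if window and window.count(arm) / len(window) < USAGE_FLOOR:
            if projected_ppp(arm) >= PPP_FLOOR:
                return arm
    # Priority 2 -- hard guardrail: drop arms that would breach the floor.
    feasible = [a for a in arms if projected_ppp(a) >= PPP_FLOOR] \
        or [max(arms, key=projected_ppp)]          # fallback: safest arm
    # Priority 3 -- Thompson Sampling among feasible arms, with a soft
    # Lagrangian penalty that grows as projected PPP nears the floor.
    margin = 0.002                                  # assumed boundary margin
    def score(a):
        slack = projected_ppp(a) - PPP_FLOOR
        return ts_sample(a) - lam * max(0.0, margin - slack)
    return max(feasible, key=score)
```

The hard block guarantees feasibility; the penalty and usage floor shape behavior well before the block ever has to fire.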
The Honest Evaluation
Building the policy is half the problem. The other half, and arguably the harder half, is answering a deceptively simple question: does it actually work? You could deploy it and find out, but first you have to prove it on the same historical data that was generated by the system you’re trying to replace. Evaluating an adaptive policy on biased historical data is harder than it sounds, and it’s where most systems get sloppy.
Because TS is an adaptive policy, it would have made different decisions than the historical policy. In our case, for 75% of the 3,000 test sessions it chose a different offer than the logged policy, meaning there is no real observed outcome for those sessions. You can’t evaluate a counterfactual with the data that actually exists.
I handled this with a two-pass approach. Pass 1 runs TS adaptively through all test sessions, using real outcomes where available and stochastically simulated outcomes (from the reward models) where not. Pass 2 evaluates the frozen decisions using three Off-Policy Evaluation estimators with increasing model dependence:
- Rejection Sampling — real outcomes only, from the 25% of sessions where TS matched the historical policy. Most trustworthy. Rests on roughly 75 actual purchase events.
- Doubly Robust — all sessions, IPS correction on matched sessions, model predictions elsewhere. Unbiased if either the propensity model or reward model is correct.
- Model-Conditional — all sessions, all model-predicted. Smoothest estimate, least trustworthy.
All three agree directionally: $17–$19 per session versus an $11.38 historical baseline. Three independent methods with different failure modes pointing the same way is the strongest signal offline evaluation can provide.
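For concreteness, the doubly robust estimator can be sketched like this. The inputs are synthetic; `q_hat` and the propensities stand in for the fitted reward model and the logging policy's action probabilities.

```python
import numpy as np

def doubly_robust_value(logged_arm, logged_reward, propensity,
                        target_arm, q_hat):
    """DR off-policy value of a target policy from logged data.
    q_hat[i, a]: model-predicted reward for session i under arm a;
    propensity[i]: logging policy's probability of the logged arm."""
    n = len(logged_arm)
    idx = np.arange(n)
    # Model baseline: predicted reward of the target policy's choice.
    direct = q_hat[idx, target_arm]
    # IPS correction: applied only where the target matches the log,
    # using the real reward minus the model's prediction there.
    match = (logged_arm == target_arm).astype(float)
    correction = match / propensity * (logged_reward - q_hat[idx, logged_arm])
    return float(np.mean(direct + correction))
```

If the reward model is exact, the correction term vanishes; if the propensities are exact, the correction removes the model's bias on matched sessions. Either one being right is enough, which is what makes DR the natural middle ground between rejection sampling and a fully model-based estimate.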
What I’m Confident About and What I’m Not
This split matters more than the headline number.
Confirmed:
- Zero PPP violations across 2,800 rolling windows in all 13 sensitivity configurations
- Offer mix rebalances from 78% to 22% on the 50% arm, directly observable, no model dependence
- Sub-4ms decision latency, production ready
Directional but unconfirmed:
- Revenue lift is positive across all three OPE estimators but confidence intervals overlap with baseline. The +63% DR estimate is a model-dependent offline estimate. I wouldn’t present it as a proven result.
Unknown until A/B test:
- True member response to offer rebalancing at scale
- Feedback loop dynamics over weeks of real traffic
The wide confidence intervals aren’t a methodological failure; they’re an honest reflection of how much you can actually know from fixed historical data when 75% of test outcomes are counterfactual. Any system that claims tight uncertainty estimates in this setting is hiding something.
The Hard Parts
A few things I found genuinely difficult.
Bias propagates through everything. The reward models train on historically skewed data. In simulation, biased predictions feed into TS posterior updates. The policy you observe offline may partly reflect model bias rather than genuine learning. Calibration, the usage floor, and the separation of evaluation from learning all mitigate this but the only real fix is real traffic.
Calibration is load-bearing, not cosmetic. The upstream purchase probability feature predicted 46% conversion when reality was 14%, a 3x overestimate. Using that directly would have made the PPP guardrail dangerously optimistic. Isotonic calibration isn’t a nicety; it’s what makes the constraint mechanism actually work.
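A small synthetic illustration of the fix, with numbers shaped like the ones above (the data-generating details are invented): an inflated raw score is mapped back onto the outcome frequencies actually observed at each score level.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
n = 5000
raw = rng.beta(2, 2.5, n)     # overconfident upstream score, mean around 0.44
true_p = 0.3 * raw            # actual conversion runs roughly 3x lower
outcome = rng.random(n) < true_p

# Fit a monotone map from raw score to observed conversion frequency.
iso = IsotonicRegression(out_of_bounds="clip").fit(raw, outcome)
calibrated = iso.predict(raw)

print(raw.mean(), calibrated.mean(), outcome.mean())
# The calibrated mean tracks the observed rate, not the inflated score.
```

Monotonicity is the only assumption: higher raw scores still mean higher conversion, but the absolute levels now match reality, which is what a guardrail simulating dollar impact actually needs.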
You’re always evaluating a hypothesis, not a fact. 75% of test outcomes are model-simulated. The ground truth in this evaluation rests on roughly 75 real purchase events. That’s honest. It’s also why the A/B test isn’t optional.
What Comes Next
Shadow deployment → 10% A/B test → ramp if PPP holds → full rollout.
If the test confirms the lift, the natural evolution is richer context modeling: LinUCB or neural bandits if member features prove strongly predictive of arm-specific response. Eventually, reinforcement learning with lifetime value as the reward signal, once multi-visit member trajectories are available. That’s probably 18–24 months out, but the modular architecture doesn’t block it.
The more interesting question is scale. A system like this running on millions of members, across multiple products, with dynamic offer structures: that’s where the framework’s generality becomes its biggest asset. The constraint mechanism is agnostic to arm count and objective. The evaluation methodology transfers. The core bandit logic doesn’t care whether the arms are loyalty offers, fare classes, or well operating configurations.
The Same Framework, Everywhere
Here’s why I find this problem genuinely exciting beyond loyalty programs: the framework is domain-agnostic.
Airlines face this every time they decide whether to offer a discount fare, upgrade a passenger, or hold inventory for late-booking business travelers. The constraint isn’t PPP, it’s load factor, yield targets, or minimum revenue per available seat mile. The bandit framework handles all of these. The arms are fare classes. The context is booking window, route, member tier, competitive pricing. The constraint is yield.
The oil and gas industry faces the same structure in operational optimization. During hydraulic fracturing, multiple pumping units work in tandem to maintain the pressure and flow rate needed to fracture the formation. An operator has to continuously decide how to configure each unit, adjusting gears and modulating load and horsepower, to sustain the target pump rate without overstressing the equipment. Push too hard and you burn out an engine mid-job, which can stall the entire operation and cascade into costly maintenance cycles. Play it too safe and you lose production efficiency. The arms are operating configurations. The context is real-time pressure data, engine load, and wellbore conditions. The constraint is equipment health. I’ve worked in this space directly, and these decisions are still made by experienced operators reading gauges and relying on intuition. The bandit framing maps cleanly onto them.
Energy trading and optimization — battery operators deciding in real time when to charge and discharge face exactly this structure. The reward is arbitrage revenue. The constraint is battery degradation limits or grid stability requirements. The arms are charge/discharge rates. The context is spot price forecasts, state of charge, time of day.
Climate and carbon markets — as carbon credit markets mature, similar frameworks could optimize the timing and size of credit issuance or retirement decisions subject to regulatory constraints and market price floors.
The common thread: a real-time decision with multiple feasible actions, a measurable reward signal, and a hard operational constraint that can’t be violated. That’s the problem class. The algorithm is the same.
The Bigger Point
Sequential decision-making under constraint is one of the most common and most underserved problem classes in applied ML.
Most industries still solve it with rules: always offer this tier to this segment, hold inventory until this threshold, produce at this rate until this price. Rules are auditable, explainable, and often wrong. They don’t update. They don’t explore. They don’t learn that a certain type of member is actually more price-sensitive than the historical data suggests.
Bandits do. And with the right constraint architecture, they do it without violating the operational boundaries that make the business work.
That gap between what rules-based systems do today and what adaptive constrained learning could do is where I think some of the most valuable applied ML work lives right now. Not just in loyalty programs. In every industry where the right decision depends on context, the constraint is real, and the data to learn from already exists.


