Results

How to read this page

Sharpe ratios on this page are reported in two complementary forms:

  • Sharpe (excess): the headline metric, computed as (annualized return − annualized risk-free rate) / annualized volatility.
  • Deflated Sharpe (DSR): corrects for selection bias when many baskets are tested. It uses an implied-independent-trials count rather than the raw trial count, because adjacent baskets share underlying-instrument exposure, which inflates the apparent number of independent searches.

Other reported metrics:

  • PBO (Probability of Backtest Overfitting) via CSCV (combinatorially symmetric cross-validation): the probability that the in-sample-best basket underperforms the median out of sample (OOS), computed by leave-out partitioning of the per-instrument return matrix.
  • n_trades and win rate per instrument.
  • Max drawdown (signed, negative) and annualized return (geometric).

Risk-free rate: The risk-free rate is the CBOE 13-week T-bill yield (IRX), averaged over the OOS window 2018-2024, which gives 2.33% annualized. Excess Sharpe = (annualized return − 2.33%) / annualized volatility. This figure reflects the realized T-bill yield over the test period, which included near-zero rates in 2020-2021 and 4-5% rates from late 2022 through 2024.
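
As a minimal sketch of the excess-Sharpe definition above, assuming daily simple returns and a 252-day annualization factor (neither convention is confirmed by the project code), the calculation could look like this. The function name and signature are illustrative only.

```python
import numpy as np

TRADING_DAYS = 252  # assumed annualization factor


def excess_sharpe(daily_returns, rf_annual=0.0233, periods=TRADING_DAYS):
    """(annualized geometric return - annual risk-free) / annualized volatility."""
    r = np.asarray(daily_returns, dtype=float)
    years = r.size / periods
    # Geometric annualized return from compounded daily returns.
    ann_return = np.prod(1.0 + r) ** (1.0 / years) - 1.0
    # Annualized volatility from the sample standard deviation of daily returns.
    ann_vol = r.std(ddof=1) * np.sqrt(periods)
    return (ann_return - rf_annual) / ann_vol
```

The default `rf_annual=0.0233` is the 2.33% average IRX yield quoted above; pass a different value to test sensitivity to the risk-free assumption.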

Acceptance gates we set:

  • DSR (PSR threshold) ≥ 0.95 against the implied-independent-trials benchmark
  • PBO via CSCV ≤ 0.30 on the per-instrument return matrix
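
The DSR gate can be sketched as follows, using the standard Probabilistic Sharpe Ratio form with the expected-maximum-Sharpe benchmark of Bailey and López de Prado. This is an illustration under assumptions, not the project's implementation: Sharpe inputs are per-period (non-annualized), `sr_std` is the cross-trial standard deviation of Sharpe estimates, and all function names are hypothetical.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, sr_std):
    """Expected maximum Sharpe across n_trials unskilled trials (DSR benchmark)."""
    return sr_std * (
        (1 - EULER_GAMMA) * _NORMAL.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * _NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    )


def probabilistic_sharpe(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds sr_benchmark (per-period units)."""
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return _NORMAL.cdf((sr - sr_benchmark) * sqrt(n_obs - 1) / denom)


def passes_dsr_gate(sr, n_trials, sr_std, n_obs, skew=0.0, kurt=3.0):
    """Gate from the list above: DSR >= 0.95 against the trial-adjusted benchmark."""
    benchmark = expected_max_sharpe(n_trials, sr_std)
    return probabilistic_sharpe(sr, benchmark, n_obs, skew, kurt) >= 0.95
```

Passing the implied-independent-trials count as `n_trials`, rather than the raw count, lowers the benchmark the observed Sharpe has to clear.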

Headline

The headline is a 4-instrument cross-asset put-credit-spread basket with the halt framework engaged and a calibrated regime-stress ML overlay. The overlay scales book exposure by (1 − p_stress), where p_stress is the model’s probability of a stress event in the next quarter. Short strikes sit at 16-delta with 5-point wing protection, and entries are made weekly on Mondays.

Composition Asset class Strike grid
AAPL Single-stock equity $1
MSFT Single-stock equity $1
WMT Single-stock equity $1
GLD Commodity ETF $1

Headline metric Value
Sharpe ratio (excess of risk-free, ML overlay engaged) +0.371
Sharpe ratio (excess), without ML overlay +0.359
SPX baseline (single-instrument, predecessor implementation) +0.286
Δ vs SPX baseline +0.085
Risk-free baseline (avg IRX 2018-2024) 2.33%
Geometric mean return (GMRR, annualized) +2.41%
Annualized volatility 0.21% (with ML) / 0.23% (without)
Alpha vs SPY (annualized OLS intercept × 252) +2.37%
Beta vs SPY (OLS slope, daily simple returns) +0.0014
Correlation with SPY (daily returns) 0.12
Max drawdown over OOS (7 years) -0.12% (with ML) / -0.13% (without)
Trades total (sum across 4 instruments) 437
Average trades per year 62.8
Average return per trade +8.25% (median +$17.85 net P&L per spread)
Win rate (basket aggregate) 73.0%
OOS span 2018-01-01 → 2024-12-31 (1,760 trading days)
ML overlay Regime-stress scaler (1 − p_stress) applied to daily book exposure
ML acceptance gate Brier-score reduction +14.1% vs naive baseline (gate at ≥5%) ✓

Mapping to the HW3 / HW4 rubric: GMRR is the geometric mean return (annualized) row; Alpha and Beta are computed by daily OLS of basket returns vs SPY over the full 1,760-day OOS window (per the project’s src/metrics/portfolio.py); Sharpe, Annualized Volatility, Max Drawdown, Avg Return per Trade, Trades per Year, and Total Trades are reported above. Beta is essentially zero because the strategy harvests volatility risk premium on a defined-risk wing-protected structure rather than holding directional equity beta.
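
The alpha and beta rows can be reproduced in spirit with a one-variable OLS fit of daily basket returns on daily SPY returns. The sketch below is an assumption about the shape of that calculation, not a copy of src/metrics/portfolio.py; the function name is hypothetical.

```python
import numpy as np


def alpha_beta(strategy_returns, benchmark_returns, periods=252):
    """OLS of strategy on benchmark daily returns.

    Returns (annualized alpha, beta), where alpha is the daily
    intercept multiplied by `periods` and beta is the slope.
    """
    y = np.asarray(strategy_returns, dtype=float)
    x = np.asarray(benchmark_returns, dtype=float)
    # np.polyfit with degree 1 returns [slope, intercept].
    beta, intercept = np.polyfit(x, y, 1)
    return intercept * periods, beta
```

With a slope near zero, almost all of the annualized return lands in the intercept, which is why the reported alpha (+2.37%) sits so close to the annualized return (+2.41%).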

Multiple-testing validation

The basket was selected from a tested universe of 12 instruments (3 ETFs: SPX, TLT, GLD; 9 single stocks: AAPL, MSFT, GOOGL, JNJ, KO, PG, WMT, JPM, PEP). Selection bias is corrected via DSR and PBO:

Correction Value Acceptance Pass?
Raw trials tested (N) 12 informational n/a
Avg pairwise return correlation (ρ̄) 0.261 informational n/a
Implied independent trials (N̂) 9 informational n/a
Deflated Sharpe (PSR) 1.0000 ≥ 0.95 ✓
PBO via CSCV (S=16, 12,870 logits) 0.0402 ≤ 0.30 ✓

Both gates pass. The headline is statistically significant after correction for the 12-trial selection.
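
The table's intermediate quantities can be sanity-checked with a short sketch. One common estimate of the implied independent trial count, N̂ = ρ̄ + (1 − ρ̄)·N, reproduces the N̂ = 9 row from N = 12 and ρ̄ = 0.261, and C(16, 8) gives the 12,870 CSCV splits. Whether the project uses exactly this estimator, and a mean/std Sharpe as the CSCV performance measure, is an assumption; the function names are hypothetical.

```python
from itertools import combinations
from math import comb

import numpy as np


def implied_independent_trials(n_trials, avg_corr):
    """N_hat = rho_bar + (1 - rho_bar) * N."""
    return avg_corr + (1.0 - avg_corr) * n_trials


def _sharpe(block):
    # Per-column mean/std; annualization is irrelevant for ranking.
    return block.mean(axis=0) / block.std(axis=0, ddof=1)


def pbo_cscv(returns, n_blocks=16):
    """PBO estimate for a (T, N) return matrix, one column per candidate."""
    returns = np.asarray(returns, dtype=float)
    blocks = np.array_split(np.arange(returns.shape[0]), n_blocks)
    n = returns.shape[1]
    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        is_rows = np.concatenate([blocks[i] for i in is_ids])
        oos_rows = np.concatenate(
            [blocks[i] for i in range(n_blocks) if i not in is_ids]
        )
        best = int(np.argmax(_sharpe(returns[is_rows])))
        oos_perf = _sharpe(returns[oos_rows])
        # Rank of the in-sample winner out of sample: 1 = worst, n = best.
        rank = int((oos_perf <= oos_perf[best]).sum())
        w = rank / (n + 1)
        logits.append(np.log(w / (1 - w)))
    # PBO is the share of splits where the winner lands at or below the median.
    return float(np.mean(np.array(logits) <= 0))


print(round(implied_independent_trials(12, 0.261)))  # implied trial count
print(comb(16, 8))                                   # number of CSCV splits
```

A PBO of 0.0402 would then mean that the in-sample winner fell to or below the OOS median in roughly 4% of the splits.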

Equity curve

The headline ends at 1.18× starting equity over 7 years (CAGR 2.41%). The SPX baseline (Sharpe 0.286, annualized return 2.49%) ends very close in absolute terms. The Sharpe advantage (+0.085 with the ML overlay engaged) comes from lower volatility rather than higher return: the headline runs at 0.21% annualized volatility versus the SPX baseline’s 0.55%, so it earns a similar return with less than half the risk.

Drawdown

Maximum drawdown is -0.13% over the 7-year OOS sample without the ML overlay (-0.12% with it). The defined-risk put-credit-spread structure plus the halt framework absorbs every named stress event with at most about a 0.1% peak-to-trough dip.

Per-instrument breakdown

Ticker Sharpe (excess) n_trades Win rate Max DD Ann. return Final $ (from $50K start)
AAPL +0.264 112 74.1% -0.50% +2.44% $59,170
MSFT +0.269 112 73.2% -0.37% +2.42% $59,124
WMT +0.263 107 69.2% -0.18% +2.40% $59,017
GLD +0.138 106 75.5% -0.23% +2.40% $59,008
Aggregate (equal-weight 4) +0.359 437 73.0% -0.13% +2.41% $236,319

The book aggregate has higher Sharpe than the per-instrument average because cross-instrument correlation is low (mean pairwise 0.26), so equal-weighting reduces volatility without proportional return loss.
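
A back-of-envelope check of this diversification effect, assuming (as a simplification not stated in the source) that the four instruments have equal volatility and a uniform pairwise correlation of 0.26: the equal-weight Sharpe is then the average per-instrument Sharpe scaled by sqrt(n / (1 + (n − 1)ρ)).

```python
from math import sqrt


def equal_weight_sharpe(avg_sharpe, n, avg_corr):
    """Sharpe of an equal-weight book of n equal-vol assets with uniform correlation."""
    return avg_sharpe * sqrt(n / (1 + (n - 1) * avg_corr))


per_instrument = [0.264, 0.269, 0.263, 0.138]  # AAPL, MSFT, WMT, GLD from the table
avg = sum(per_instrument) / len(per_instrument)
estimate = equal_weight_sharpe(avg, n=4, avg_corr=0.26)
print(round(estimate, 3))
```

The estimate comes out at roughly 0.35, close to the reported aggregate of +0.359; the gap is plausibly due to the unequal volatilities and correlations the simplification ignores.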

Anchor comparison vs SPX baseline

Strategy Architecture Sharpe (excess) Trades OOS window
SPX baseline (predecessor implementation, historical reference) SPX put-only with halts engaged +0.286 210 2018–2024
Headline 4-instrument basket, put-only, halts engaged, regime-stress ML overlay +0.371 437 2018–2024
Δ vs SPX baseline (same OOS window, both halts engaged put-only) +0.085 +227 2018–2024

The new headline beats the SPX baseline on excess Sharpe, with roughly 2× the trade sample, comparable max drawdown, and cross-asset diversification (equity + commodity) the baseline lacks. The SPX baseline (the predecessor implementation) is included as a historical reference; reproducing its exact equity curve is not supported on the current engine version because the engine has materially evolved since the baseline was recorded.

Variants tested: alternative baskets

All baskets below are run with halts engaged, equal-weighted at $50,000 per instrument. Basket (D) is the headline; the others were not selected and are shown to demonstrate what the headline was chosen against.

Basket n_inst Sharpe (excess) AnnRet AnnVol MaxDD Δ vs SPX baseline (+0.286)
(A) SPX put-only alone 1 -0.347 +2.04% 0.84% -0.78% -0.633
(B) 3-name basket (AAPL+MSFT+WMT) 3 +0.350 +2.42% 0.26% -0.19% +0.064
(C) 3-ETF basket (SPX + TLT + GLD put-only) 3 -0.312 +2.22% 0.35% -0.31% -0.598
(D) the 4-instrument basket ← HEADLINE 4 +0.359 +2.41% 0.23% -0.13% +0.073
(E) Full 6-instrument book (3-ETF basket + 3-name basket) 6 -0.036 +2.32% 0.26% -0.17% -0.322

Pattern: adding SPX or TLT to the headline drags the Sharpe down because their individual put-credit-spread excess Sharpes are negative (-0.35 and -0.43 respectively over OOS). GLD is different: although its standalone excess Sharpe is only modestly positive (+0.138), adding it to the 3-name basket lowers volatility (0.26% → 0.23%) by more than it dilutes return, so the basket Sharpe rises from +0.350 to +0.359.

Variants tested: iron condor architecture

We tested both architectures on the same engine and underlying universe. Iron condor (put + call wings on the same expiry) underperforms put-only in every like-for-like comparison over OOS 2018-2024:

Mode Universe Sharpe (excess) Δ vs SPX baseline Notes
Iron condor, SPX only 1 inst -1.882 -2.168 Call wing destroyed by post-2020 SPX rally
Iron condor, 3-instrument cluster SPX/TLT/GLD -2.292 -2.578 Same pattern across all three IC instruments
Put-only, SPX 1 inst -0.347 -0.633 Same engine, IC removed
Put-only, 3-instrument cluster SPX/TLT/GLD -0.369 -0.655 Same engine, IC removed

The call-wing leg of the iron condor systematically lost money in 2018-2024 because of the trending equity-index regime. The iron condor is therefore reported as a tested extension that did not add value, not as the headline.

Halts engaged vs disengaged

This comparison shows that the halt framework is doing real work: every instrument has a higher excess Sharpe with halts engaged than with halts disengaged.

Ticker Sharpe (halts disengaged) Sharpe (halts engaged) Δ from halts
AAPL +0.155 +0.264 +0.109
MSFT +0.037 +0.269 +0.232
WMT +0.063 +0.263 +0.200
GLD -0.971 +0.138 +1.109

The halt framework’s contribution is measurable per instrument and consistently positive. GLD has the largest gap because, with halts disengaged, its exposure stays fully on through every regime; the halt framework gates out the worst stretches.

Stress-event behavior

Computed on the headline basket equity curve, halts engaged, equal-weight $50K per instrument (basket starting equity $200K). Drawdown is peak-to-trough WITHIN each window using a running cumulative max.
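
The within-window drawdown described above can be sketched as follows, assuming the window's equity values are available as a simple array. The function name and return shape are illustrative, not taken from the project.

```python
import numpy as np


def window_drawdown(equity):
    """Peak-to-trough drawdown inside one window, using a running cumulative max.

    Returns (drawdown, trough_index). The drawdown is zero or negative; the
    trough index is None when the curve never falls below its running peak.
    """
    eq = np.asarray(equity, dtype=float)
    running_max = np.maximum.accumulate(eq)
    dd = eq / running_max - 1.0
    trough = int(np.argmin(dd))
    if dd[trough] == 0.0:
        return 0.0, None
    return float(dd[trough]), trough
```

Because the running max restarts at the first value in the window, a drawdown that began before the window opened is measured only from the window's own peak.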

Event Window probed Net P&L Peak-to-trough DD Trough date
Volmageddon 2018-01-22 → 2018-02-16 -$35 -0.099% 2018-02-07
Q4 2018 selloff 2018-11-26 → 2019-01-24 +$783 0.000% (curve monotonic up)
COVID crash 2020-02-24 → 2020-04-23 +$102 0.000% (curve monotonic up)
2022 bear market 2022-01-03 → 2022-12-30 +$4,223 0.000% (curve monotonic up)
Banking crisis 2023-02-13 → 2023-04-13 +$1,652 0.000% (curve monotonic up)

The 0.000% peak-to-trough entries are not measurement error. They reflect a structural feature of the halt-gated put-credit-spread architecture: during these stress windows the halt framework reduced or paused new entries, the open positions either expired profitably or hit stop-loss within their wing-width bound, and the unutilized capital continued earning the realized T-bill rate. Net trading P&L plus cash carry was positive on every trading day through these windows, so the equity curve never dipped below its running peak within the window.

Volmageddon is the one exception. The early-2018 timing meant the basket was fully deployed when the VIX spike hit, and the resulting -$35 net P&L (-0.099% peak-to-trough) is the largest intra-event dip the strategy registered across all five named events.

Trade fates and rates

Every trade exits via one of five fates. The distribution across the 437-trade headline blotter:

Fate Trigger Count % of trades
profit_target Exit debit ≤ 50% of entry credit 280 64.1%
stop_loss Exit debit ≥ 200% of entry credit (gap-aware fill) 81 18.5%
time_exit DTE ≤ 21 75 17.2%
emergency |short_delta| > 0.50 1 0.2%
eos_force End-of-OOS forced close 0 0.0%
Total 437 100%
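
The triggers in the table can be expressed as a small daily check. The thresholds come from the table; the order in which the rules are evaluated, the function name, and the argument names are assumptions made for illustration.

```python
from typing import Optional


def exit_fate(entry_credit: float, exit_debit: float, dte: int,
              short_delta: float, last_oos_day: bool = False) -> Optional[str]:
    """Return the fate that fires today, or None to keep holding.

    Evaluation order is assumed: risk exits first, then profit and time exits.
    """
    if abs(short_delta) > 0.50:
        return "emergency"
    if exit_debit >= 2.00 * entry_credit:
        return "stop_loss"
    if exit_debit <= 0.50 * entry_credit:
        return "profit_target"
    if dte <= 21:
        return "time_exit"
    if last_oos_day:
        return "eos_force"
    return None
```

For example, a spread opened for a $22.45 credit and quoted at a $0.90 debit classifies as profit_target, provided the delta and DTE rules have not fired first.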

Reading the rates the rubric asks for:

Rate Value
Success rate (P&L > 0) 73.0%
Stop-loss rate 18.5%
Timeout rate (time_exit) 17.2%
Emergency-exit rate 0.2%

Per-trade summary statistics

Metric Value
Total trades 437
Winning trades 319
Losing trades 118
Mean P&L per spread +$3.42
Median P&L per spread +$17.85
Standard deviation of P&L $37.66
Largest single win +$49.50
Largest single loss -$186.00
Mean trade return +8.25%
Mean trade lifetime 6.5 days
Median trade lifetime 7 days
Profit factor (gross win / gross loss) 1.26

The 6.5-day mean lifetime reflects how the strategy actually deploys capital: profit-target exits fire quickly in calm regimes, and the basket spends most of its capital sitting on T-bill carry between trade cycles. Median holding period is 7 days; the 25th-to-75th percentile window is 3 to 10 days; no trade exceeds 14 days because the time-exit rule forces a close at DTE ≤ 21 against the 30-45 DTE entry window.

Per-instrument trade breakdown

Ticker Trades Wins Win rate Mean P&L Total P&L
AAPL 112 83 74.1% +$4.11 +$460.23
MSFT 112 82 73.2% +$3.54 +$396.33
GLD 106 80 75.5% +$3.06 +$324.70
WMT 107 74 69.2% +$2.92 +$312.87

P&L per trade across OOS

Trade-return distribution

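The distribution is left-skewed (negatively skewed) by design: many small wins and a thin tail of larger losses, consistent with the mean P&L (+$3.42) sitting below the median (+$17.85). A 16-delta short put expires worthless roughly 84% of the time under lognormal assumptions, and profit_target closes wins early at 50% of credit. Losses are bounded by the wing-width stop. The 1.26 profit factor reflects the gap between implied and realized variance (the volatility risk premium), not directional alpha on the underlying.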

Ledger (monthly P&L sample)

Month Trades closed Net P&L Cumulative P&L
2018-01 7 +$36.75 +$36.75
2018-02 5 -$193.88 -$157.13
2018-05 9 +$97.05 -$60.08
2018-08 17 +$191.15 +$179.58
2018-09 12 +$175.50 +$355.08
2018-10 11 -$290.25 +$64.83
2019-07 19 +$315.85 +$369.58
2019-12 18 +$321.70 +$529.38
2020-03 14 -$11.20 +$612.10
2024-08 5 -$361.65 +$1,237.08
2024-11 12 +$181.80 +$1,418.88
2024-12 12 +$75.25 +$1,494.13

Showing 12 of 84 months from January 2018 to December 2024. Net P&L in dollars per spread (per-contract basis at $100 multiplier). Full monthly ledger: monthly_ledger.csv (renders as a sortable table on GitHub).

Blotter

Random sample of 10 trades from the 437-row blotter (seed=42).

trd_prd Entry Exit Ticker Side Qty Entry credit Exit debit Fate P&L Return % Success
2018.19 2018-05-07 2018-05-16 WMT P 1 $22.45 $0.90 profit_target $+21.55 +96.0% True
2018.39 2018-09-24 2018-10-05 WMT P 1 $35.80 $60.55 time_exit $-24.75 -69.1% False
2018.40 2018-10-01 2018-10-11 GLD P 1 $15.05 $4.50 profit_target $+10.55 +70.1% True
2019.07 2019-02-11 2019-02-15 MSFT P 1 $38.70 $15.30 profit_target $+23.40 +60.5% True
2019.30 2019-07-22 2019-08-01 WMT P 1 $38.85 $88.55 stop_loss $-49.70 -127.9% False
2019.48 2019-11-25 2019-12-04 WMT P 1 $29.85 $17.75 profit_target $+12.10 +40.5% True
2019.52 2019-12-23 2019-12-26 GLD P 1 $19.70 $9.95 profit_target $+9.75 +49.5% True
2024.07 2024-02-12 2024-02-23 MSFT P 1 $52.80 $34.25 time_exit $+18.55 +35.1% True
2024.23 2024-06-03 2024-06-07 GLD P 1 $54.40 $147.50 stop_loss $-93.10 -171.1% False
2024.23 2024-06-03 2024-06-05 WMT P 1 $14.45 $6.20 profit_target $+8.25 +57.1% True

The trd_prd index encodes year + ISO week as a single decimal. Entries on the same Monday share the same trd_prd.
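
A minimal sketch of this encoding, assuming the year component is the ISO year (which can differ from the calendar year for a few days around New Year) and that the week is zero-padded to two digits, as the sample rows suggest. The function name is hypothetical.

```python
from datetime import date


def trd_prd(entry: date) -> str:
    """Encode an entry date as '<ISO year>.<ISO week, two digits>'."""
    iso_year, iso_week, _ = entry.isocalendar()
    # A string keeps the trailing zero in values such as 2018.40.
    return f"{iso_year}.{iso_week:02d}"


print(trd_prd(date(2018, 5, 7)))   # first row of the sample
print(trd_prd(date(2018, 10, 1)))  # third row of the sample
```

Returning a string rather than a float avoids 2018.40 collapsing to 2018.4 when the value is printed or written to CSV.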

Showing 10 of 437 trades. Full blotter: blotter.csv (all 437 entries, renders as a sortable table on GitHub).
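
The per-trade statistics reported earlier (win rate, mean P&L, profit factor) can be recomputed from the blotter's P&L column. The sketch below uses only the ten sampled rows above, hard-coded because the CSV column names are not documented here; the helper name is hypothetical.

```python
def trade_stats(pnl):
    """Win rate, mean P&L, and profit factor for a list of per-trade P&L values."""
    wins = [p for p in pnl if p > 0]
    losses = [-p for p in pnl if p < 0]
    return {
        "win_rate": len(wins) / len(pnl),
        "mean_pnl": sum(pnl) / len(pnl),
        # Gross wins divided by gross losses; infinite when there are no losses.
        "profit_factor": sum(wins) / sum(losses) if losses else float("inf"),
    }


# P&L column of the ten sampled trades, in table order.
sample_pnl = [21.55, -24.75, 10.55, 23.40, -49.70,
              12.10, 9.75, 18.55, -93.10, 8.25]
stats = trade_stats(sample_pnl)
print(stats)
```

A ten-trade sample is far too small to match the full-blotter figures: two stop-losses among ten rows are enough to pull its profit factor well below the 1.26 reported for all 437 trades.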

How will you know the strategy is performing as expected?

A rolling 60-trade window of the realized win rate is compared against the OOS baseline of μ = 0.730. The Hoeffding inequality bounds the probability that a shortfall as large as the one observed would occur by chance if the true win rate were still μ. While the bound stays at or above 50%, the strategy is operating within its modeled regime and trading continues at full size. Backtested over the 2018-2024 OOS sample, the bound was at or above 50% on 88% of post-warmup trades.

How will you quantify when the strategy stops working?

The same Hoeffding bound is used. When the bound drops below 25%, position size is cut; when it drops below 10%, entries are halted entirely and the strategy is reviewed. The thresholds are pre-set, distribution-free, and apply uniformly across the 4-instrument basket. The OOS sample produced no critical signal across 1,760 trading days; the full bound trace and a worked example are on the Live Monitoring page.
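
A minimal sketch of the monitoring rule, assuming the one-sided Hoeffding bound exp(−2·n·t²) for a win-rate shortfall t over n = 60 trades. The exact form used on the Live Monitoring page may differ, and the label for the 25-50% band is a placeholder because the text does not specify an action there.

```python
from math import exp


def hoeffding_bound(observed_rate, baseline=0.730, n=60):
    """Upper bound on P(win rate <= observed) if the true rate is still baseline."""
    shortfall = max(0.0, baseline - observed_rate)
    return exp(-2.0 * n * shortfall ** 2)


def monitoring_action(bound):
    """Map the bound to the pre-set thresholds described above."""
    if bound < 0.10:
        return "halt"
    if bound < 0.25:
        return "reduce"
    if bound < 0.50:
        return "watch"  # band not specified in the text; placeholder label
    return "full_size"


for rate in (0.70, 0.60, 0.55):
    b = hoeffding_bound(rate)
    print(rate, round(b, 3), monitoring_action(b))
```

Under these assumptions, a 60-trade win rate of about 65% or higher keeps the bound above 50%, while a win rate below roughly 59% pushes it under the 10% halt threshold.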

For the data sources behind these numbers, see Data and Sources.