Verdict Thresholds: From Score to Action

The Strategy Health Card converts a numeric composite into a four-level verdict: HEALTHY, MONITOR, REVIEW, RECONSIDER. The bucketing is intentionally coarse because a single score does not have enough resolution to support fine distinctions. This article explains the choice of cutoffs, what each verdict implies for what to do next, and why a "RECONSIDER" verdict on a high-return backtest is often the correct read.

The cutoffs

≥ 70: HEALTHY. Move to paper trading and design the risk overlay.
50 to < 70: MONITOR. Identify the weakest sub-score and apply a targeted fix.
30 to < 50: REVIEW. Two or more pillars are weak; redesign, do not tune.
< 30: RECONSIDER. Not deployable; abandon or reframe entirely.
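
The bucketing above can be sketched as a small lookup. This is a minimal illustration, not the Health Card's actual implementation: the function name `verdict`, the assumed 0 to 100 score range, and the choice to place a fractional score such as 69.9 in the lower band are all assumptions made for the example.

```python
def verdict(score: float) -> str:
    """Map a composite score to one of the four verdict labels.

    Assumption: scores live on a 0-100 scale and each boundary value
    (70, 50, 30) belongs to the higher band.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in [0, 100], got {score}")
    if score >= 70:
        return "HEALTHY"
    if score >= 50:
        return "MONITOR"
    if score >= 30:
        return "REVIEW"
    return "RECONSIDER"


if __name__ == "__main__":
    for s in (82, 69.9, 50, 29.5):
        print(s, verdict(s))
```

Checking the bands from the top down keeps each comparison to a single lower bound, so no score can fall between two bands.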

Why these cutoffs?

The cutoffs are calibrated against empirical observation across thousands of factor strategies in academic literature and practitioner backtests. The boundaries align with three implicit thresholds:

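70 is the level at which all three pillars are typically at least adequate. Below 70, a strategy is usually either failing one pillar outright or weak across the board. Because the composite is a weighted average, this is a tendency rather than a guarantee, so the sub-scores should always be read alongside the verdict.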
50 is the level at which the strategy passes a coin-flip test: it has at least one pillar that is genuinely strong, but at least one that is genuinely weak.
30 is the level below which a strategy would be unlikely to pass robust statistical scrutiny. Bailey, Borwein, López de Prado & Zhu (2014) show how readily backtest overfitting produces strong in-sample results that fail out of sample; backtests of that kind tend to fall below this level once stress-tested.

What HEALTHY does and does not mean

A HEALTHY verdict means: this strategy is structurally sound across factor signal, risk discipline, and cost survivability, in the period sampled, with the configuration sampled. It does NOT mean:

The strategy will work going forward (out-of-sample performance is not measured here).
The strategy is appropriate for your capital base, time horizon, or tax status (those are situational).
The strategy is original (you may be re-discovering a known anomaly that is already arbitraged out).
The strategy will work in your live execution environment (slippage, latency, fill quality differ).

HEALTHY is necessary, not sufficient. The next step is paper trading (see Paper Trading) or walk-forward validation if applicable.

RECONSIDER on a high-return backtest

The most common surprise to new users is a RECONSIDER verdict on a backtest reporting +120% total return. The composite catches this because the score weighs structural soundness, not return level. Three classic patterns produce this:

Single-regime miracle. All gains compressed into one regime (e.g., 2020 Q2 rebound). Risk sub-score collapses on regime-breadth penalty.
Concentrated three-name bet. Effective N (1/HHI) near 3 even with 30 holdings, because three names dominate the weights (market-cap weighting plus runaway winners). HHI penalty hits Risk sub-score hard.
Factor-rotation overfit. Backtest accidentally enables rotation that depends on knowing the future macro regime. Factor sub-score collapses on negative IC + low specific return.

In all three, the headline return is real (it happened in the data) but the structure that produced it is not repeatable. The verdict catches this where the headline number does not.
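
The concentration pattern can be made concrete with the standard inverse-Herfindahl definition of effective N. The sketch below is illustrative only: the function name `effective_n` and the sample weights are hypothetical, long-only weights are assumed, and how the Health Card turns this number into a Risk sub-score penalty is not specified here.

```python
def effective_n(weights):
    """Effective number of holdings, defined as 1 / HHI.

    HHI is the sum of squared portfolio weights. Weights are
    normalized to sum to 1 first. Assumes long-only (non-negative)
    weights.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    normalized = [w / total for w in weights]
    hhi = sum(w * w for w in normalized)
    return 1.0 / hhi


if __name__ == "__main__":
    # 30 equally weighted holdings: effective N equals the holding count.
    equal = [1 / 30] * 30
    # Hypothetical portfolio: three names at 30% each, 27 names share 10%.
    concentrated = [0.30, 0.30, 0.30] + [0.10 / 27] * 27
    print(round(effective_n(equal), 2))
    print(round(effective_n(concentrated), 2))
```

In the hypothetical concentrated portfolio the effective N comes out below 4 despite 30 line items, which is the gap between nominal and actual diversification that the HHI penalty is meant to expose.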

Sub-score weighting

composite = 0.40 × factor + 0.35 × risk + 0.25 × cost

A useful intuition: a strategy with sub-scores (factor=80, risk=50, cost=40) lands at:

0.40 × 80 + 0.35 × 50 + 0.25 × 40 = 32 + 17.5 + 10 = 59.5 → MONITOR

The same total return profile with (factor=50, risk=80, cost=70) lands at:

0.40 × 50 + 0.35 × 80 + 0.25 × 70 = 20 + 28 + 17.5 = 65.5 → MONITOR

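Both land in MONITOR, but the actionable next step differs. In the first case the weak pillars are cost (40) and risk (50), so the work is on cost survivability and the risk overlay; in the second the weak pillar is factor (50), which demands deeper investigation of where the factor signal actually lives.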
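
The two worked examples can be reproduced with a few lines of code. This is a sketch under stated assumptions: the weights come from the formula above, while the names `WEIGHTS`, `composite`, and `pillars_weakest_first` are hypothetical helpers introduced for illustration.

```python
# Pillar weights taken from the composite formula in this article.
WEIGHTS = {"factor": 0.40, "risk": 0.35, "cost": 0.25}


def composite(sub_scores):
    """Weighted average of the three pillar sub-scores."""
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


def pillars_weakest_first(sub_scores):
    """Pillar names sorted from lowest to highest sub-score."""
    return sorted(WEIGHTS, key=lambda name: sub_scores[name])


if __name__ == "__main__":
    first = {"factor": 80, "risk": 50, "cost": 40}
    second = {"factor": 50, "risk": 80, "cost": 70}
    for scores in (first, second):
        print(round(composite(scores), 1), pillars_weakest_first(scores))
```

Listing the pillars weakest first alongside the composite keeps the diagnosis attached to the score, so two strategies with the same verdict are not mistaken for having the same problem.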

The verdict is not a destination

The most disciplined users treat the verdict as a triage decision: HEALTHY → continue, others → diagnose. The mistake to avoid is treating MONITOR as "almost HEALTHY" and tweaking parameters until the verdict flips. Parameter-tuning to flip a verdict is the textbook mechanism of backtest overfitting (Bailey & López de Prado 2014). The correct response to MONITOR is to understand the weak sub-score, not to tune around it.

References

Bailey, D. H. & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5), 94–107.
Bailey, D. H., Borwein, J. M., López de Prado, M. & Zhu, Q. J. (2014). Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance. Notices of the AMS, 61(5), 458–471.

Textbook references

Pardo, R. (2008). The Evaluation and Optimization of Trading Strategies (2nd ed.). Wiley.
Grinold, R. C. & Kahn, R. N. (1999). Active Portfolio Management: A Quantitative Approach for Producing Superior Returns and Controlling Risk (2nd ed.). McGraw-Hill.

Try it in QuanterLab

Treat MONITOR or REVIEW as a request to diagnose, not to tune. The discipline of asking "why is this sub-score weak?" before changing any parameter is what separates a research process from a backtest-fitting loop.