The Counterfactual Sweep re-runs the same strategy across a grid of parameter values — varying top-N (number of holdings) or rebalance stride (how many periods between rebalances) — and plots how Sharpe responds. A robust strategy shows a flat Sharpe-vs-parameter curve. A fragile strategy shows a sharp peak at the chosen value and collapse elsewhere. This is one of the cleanest tests for parameter overfitting.
What "counterfactual" means
The original backtest selected a single (top_n, frequency) pair. The counterfactual asks: what would have happened across the realistic alternative choices? If Sharpe is similar across all alternatives, the chosen configuration is not load-bearing. If Sharpe varies wildly, the chosen configuration is the dominant driver of the result, and you cannot tell whether it was chosen for a sound reason or simply because it happened to work in-sample.
The two sweep variables
Top-N sweep
Varies the number of holdings selected per period. Default values: 15, 25, 50, 75, 100. The sweep keeps factor weights, rebalance frequency, and universe constant; only top_n changes.
Interpretation:
- Sharpe flat across top_n: Strategy works across reasonable concentration choices. Robust.
- Sharpe peaks at small top_n and collapses for large: Strategy concentrates alpha in the top few names. Concentrated bet dressed up as a factor strategy.
- Sharpe peaks at large top_n and collapses for small: Strategy benefits from broad diversification — alpha is in the cross-section, not in the top names.
- Sharpe peaks at one specific top_n: Likely parameter overfit. The chosen top_n is the one that happened to work in-sample.
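The top-N sweep described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the sub-pill's actual implementation: the function names, the equal-weight portfolio, the zero risk-free rate, and the quarterly annualisation are all assumptions made for the example.

```python
import math
import random
import statistics

def sharpe(returns, periods_per_year=4):
    """Annualised Sharpe of per-period returns (risk-free rate assumed to be 0)."""
    sd = statistics.stdev(returns)
    return 0.0 if sd == 0 else statistics.mean(returns) / sd * math.sqrt(periods_per_year)

def top_n_returns(scores, fwd_returns, top_n):
    """Equal-weight return of the top_n highest-scored names in each period."""
    out = []
    for s, r in zip(scores, fwd_returns):
        picks = sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:top_n]
        out.append(statistics.mean(r[i] for i in picks))
    return out

def top_n_sweep(scores, fwd_returns, grid=(15, 25, 50, 75, 100)):
    """Sharpe at each top_n; scores, universe and cadence are held constant."""
    return {n: sharpe(top_n_returns(scores, fwd_returns, n)) for n in grid}

# Synthetic data (assumption): 40 quarters, 200 names, forward returns
# weakly related to the factor score plus noise.
rng = random.Random(0)
scores = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(40)]
fwd_returns = [[0.01 * s + rng.gauss(0.02, 0.10) for s in row] for row in scores]

sweep = top_n_sweep(scores, fwd_returns)
for n, sr in sweep.items():
    print(f"top_n={n:>3}  Sharpe={sr:.2f}")
```

Only top_n changes between sweep points, so any difference in Sharpe across the printed rows is attributable to concentration alone.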
Stride sweep
Varies how many periods elapse between rebalances. Default values: 1, 2, 4, 8, in multiples of the backtest's base period (so a quarterly backtest with stride=2 effectively rebalances semi-annually). The sweep tests how sensitive the result is to the original rebalance frequency.
Interpretation:
- Sharpe flat across strides: Rebalance cadence is not critical — the signal is slow enough to survive holding longer or shorter.
- Sharpe peaks at stride=1 (original cadence) and falls at longer strides: Signal is fast-decaying. Holding longer loses material edge. Cross-check with Factor Decay.
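- Sharpe rises at longer strides: Strategy was over-rebalancing. A slower cadence keeps the signal, and because the sweep uses gross returns, the lower turnover and tax drag come as an additional benefit on top of the Sharpe improvement shown.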
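One way to picture the stride mechanic is a loop that re-ranks only on every stride-th period and holds the previous picks in between. The sketch below assumes equal weighting and ignores weight drift between rebalances; `stride_returns` and `rebalances_per_year` are hypothetical helper names, not part of the product.

```python
import statistics

def stride_returns(scores, fwd_returns, top_n, stride):
    """Per-period portfolio returns when holdings refresh only every `stride` periods."""
    out, picks = [], []
    for t, (s, r) in enumerate(zip(scores, fwd_returns)):
        if t % stride == 0:
            # Rebalance date: re-rank on current scores and reselect the top names.
            picks = sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:top_n]
        # Between rebalances the previous picks are simply held (equal weight).
        out.append(statistics.mean(r[i] for i in picks))
    return out

def rebalances_per_year(periods_per_year, stride):
    """Effective cadence, e.g. a quarterly backtest with stride=2 rebalances twice a year."""
    return periods_per_year / stride

# Tiny deterministic example: 3 names, 4 periods, the top-scored name rotates.
scores = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
fwd = [[0.1, 0.2, 0.3]] * 4

for stride in (1, 2, 4):
    print(stride, stride_returns(scores, fwd, top_n=1, stride=stride))
```

With stride=1 the portfolio follows the rotating leader every period; with stride=2 or 4 it keeps stale picks for longer, which is exactly the effect the sweep measures.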
The robustness verdict
FM103 summarises the sweep with a Sharpe range metric:
- Range < 0.30: High robustness. The strategy is parameter-insensitive.
- Range 0.30–0.70: Moderate. The strategy has parameter sensitivity but no collapse.
- Range > 0.70: Low. The strategy is parameter-fragile; the chosen configuration is doing material work.
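The thresholds above map directly onto a small classifier. The sketch assumes the range is the highest minus the lowest Sharpe across sweep points; the article does not spell out the exact definition, so treat that as an assumption, along with the function names.

```python
def sharpe_range(sharpes):
    """Spread between the best and worst Sharpe in a sweep (assumed definition)."""
    return max(sharpes) - min(sharpes)

def robustness_verdict(sharpes):
    """Map the Sharpe range onto the High / Moderate / Low bands."""
    spread = sharpe_range(sharpes)
    if spread < 0.30:
        label = "High"
    elif spread <= 0.70:
        label = "Moderate"
    else:
        label = "Low"
    return spread, label

# Hypothetical sweep results for illustration only.
for sweep in ([0.9, 1.0, 1.1], [0.5, 0.8, 1.0], [0.2, 0.6, 1.1]):
    spread, label = robustness_verdict(sweep)
    print(f"range={spread:.2f}  robustness={label}")
```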
Why this matters
Bailey, Borwein, López de Prado & Zhu (2014) and Pardo (2008) both emphasise that backtest overfitting most often manifests as parameter brittleness. The strategy that works at exactly top_n = 30 and quarterly rebalance, but breaks at 25 or 35 holdings or at monthly/semi-annual cadence, is overfit to the search procedure that found it. A sweep that checks whether the Sharpe response stays flat is a cheap structural test for this pathology.
Reading the cumulative-return overlay
The sub-pill plots both Sharpe and cumulative return per sweep point. A pattern to watch: Sharpe similar across sweep points but cumulative return varying widely. This usually indicates the configurations have different variance levels — the Sharpe ratio normalises this out, but the absolute return matters for funding decisions. A more concentrated configuration may have similar Sharpe with much higher cumulative return at the cost of much higher drawdowns.
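A small worked example shows why similar Sharpe can coexist with very different cumulative returns. Here the second series is just the first scaled by 2, a stand-in (by assumption) for a more concentrated, higher-variance configuration; the return numbers are invented for illustration.

```python
import math
import statistics

def sharpe(returns, periods_per_year=4):
    """Annualised Sharpe of per-period returns (risk-free rate assumed to be 0)."""
    return statistics.mean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)

def cumulative_return(returns):
    """Compounded total return over the whole series."""
    return math.prod(1 + r for r in returns) - 1

def max_drawdown(returns):
    """Worst peak-to-trough loss of the compounded wealth curve (a negative number)."""
    wealth, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        wealth *= 1 + r
        peak = max(peak, wealth)
        worst = min(worst, wealth / peak - 1)
    return worst

base = [0.04, -0.03, 0.05, 0.02, -0.06, 0.07, 0.03, -0.02]
concentrated = [2 * r for r in base]  # same shape, twice the variance level

for name, series in (("base", base), ("concentrated", concentrated)):
    print(f"{name:>12}  Sharpe={sharpe(series):.2f}  "
          f"cum={cumulative_return(series):.1%}  maxDD={max_drawdown(series):.1%}")
```

Scaling returns leaves the mean-to-volatility ratio unchanged, so both rows report the same Sharpe, while the scaled series compounds to a higher total return and suffers a deeper drawdown.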
Best configuration label
The sub-pill highlights the configuration with the highest Sharpe within the sweep. This is informative but should not be used as a tuning target. Picking the best sweep point and treating it as a new chosen configuration is the start of an in-sample optimisation loop — the exact pattern Bailey et al. warn against. The label is a description of where Sharpe peaks, not a recommendation to switch.
Stride sweep + cost overlay
Stride sweep Sharpe ignores transaction costs (it uses gross returns). A stride that delivers higher gross Sharpe through higher turnover will lose more to costs. Cross-reference with the Cost sub-pill: after costs, the high-turnover stride may underperform the lower-turnover stride even though its gross Sharpe is higher.
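To see how costs can reverse a gross ranking, the sketch below applies a simple linear cost (turnover times a cost in basis points) to two hypothetical strides. The cost model, the 50 bps figure, and all return and turnover numbers are assumptions for illustration; they are not how the Cost sub-pill computes net returns.

```python
import math
import statistics

def sharpe(returns, periods_per_year=4):
    """Annualised Sharpe of per-period returns (risk-free rate assumed to be 0)."""
    return statistics.mean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)

def net_returns(gross, turnover, cost_bps):
    """Subtract a linear trading cost: per-period turnover times cost in basis points."""
    return [g - t * cost_bps / 10_000 for g, t in zip(gross, turnover)]

# Hypothetical gross returns: stride=1 earns slightly more before costs.
gross_fast = [0.035, 0.010, 0.040, 0.015, 0.030, 0.020, 0.045, 0.005]
gross_slow = [0.033, 0.008, 0.038, 0.013, 0.028, 0.018, 0.043, 0.003]

# Hypothetical turnover: stride=1 trades every period, stride=2 every other period.
turnover_fast = [0.9] * 8
turnover_slow = [0.4, 0.0] * 4

COST_BPS = 50  # assumed cost per unit of turnover
net_fast = net_returns(gross_fast, turnover_fast, COST_BPS)
net_slow = net_returns(gross_slow, turnover_slow, COST_BPS)

print(f"gross Sharpe  stride=1: {sharpe(gross_fast):.2f}  stride=2: {sharpe(gross_slow):.2f}")
print(f"net Sharpe    stride=1: {sharpe(net_fast):.2f}  stride=2: {sharpe(net_slow):.2f}")
```

In this constructed case the faster stride wins on gross Sharpe and loses on net Sharpe, which is the pattern the cross-check is meant to catch.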
Caveats
- Sweep is in-sample. All sweep points use the same historical data. A different out-of-sample period may produce a different best configuration.
- Sweep is not exhaustive. The default grids cover reasonable ranges but not all conceivable values. Edge cases (top_n = 5, stride = 16) are not tested.
- Sweep keeps everything else constant. Joint variation (e.g., top_n × sectors_excluded) is not tested; only the marginal effect of one variable.
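The gap between marginal and joint coverage is easy to quantify. The sketch below uses top_n and stride as the two axes and a hypothetical chosen configuration (`BASE`); it only counts configurations and does not run any backtest.

```python
from itertools import product

TOP_N_GRID = (15, 25, 50, 75, 100)
STRIDE_GRID = (1, 2, 4, 8)
BASE = {"top_n": 25, "stride": 1}  # hypothetical chosen configuration

def marginal_configs(base, top_n_grid, stride_grid):
    """One variable moves at a time; the other stays at its base value."""
    configs = [{**base, "top_n": n} for n in top_n_grid]
    configs += [{**base, "stride": s} for s in stride_grid]
    return configs

def joint_configs(top_n_grid, stride_grid):
    """Every combination of both variables (not what the sweep runs)."""
    return [{"top_n": n, "stride": s} for n, s in product(top_n_grid, stride_grid)]

marginal = marginal_configs(BASE, TOP_N_GRID, STRIDE_GRID)
joint = joint_configs(TOP_N_GRID, STRIDE_GRID)
uncovered = [c for c in joint if c not in marginal]

print(len(marginal), "marginal runs,", len(joint), "joint combinations,",
      len(uncovered), "combinations never evaluated")
```

Any interaction between the two variables lives in those unevaluated combinations, which is why a flat marginal curve does not rule out joint sensitivity.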
Further Reading
Foundational papers
- Bailey, D. H., Borwein, J. M., López de Prado, M. & Zhu, Q. J. (2014). Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance. Notices of the AMS, 61(5), 458–471.
- Bailey, D. H. & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5), 94–107.
- Harvey, C. R. & Liu, Y. (2015). Backtesting. Journal of Portfolio Management, 42(1), 13–28.
Textbook references
- Pardo, R. (2008). The Evaluation and Optimization of Trading Strategies (2nd ed.). Wiley.
- López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.
Try it in QuanterLab
A Sharpe range < 0.30 across the top_n sweep is the most reassuring single robustness check available. A range > 0.70 should make you suspicious of the chosen configuration.