p-Hacking and Curve-Fitting in Quant Research

"If you torture the data long enough, it will confess to anything," as the economist Ronald Coase is often quoted. p-hacking, the practice of running many tests and reporting only the ones that worked, is the largest source of false positives in quantitative research. It is usually unintentional and almost always invisible to the person doing it.

What p-Hacking Looks Like

You build a strategy. The first backtest shows Sharpe 0.6 — disappointing. So you:

  • Try a different RSI period. Sharpe 0.8.
  • Add a volume filter. Sharpe 1.2.
  • Switch from daily to 4-hour bars. Sharpe 1.5.
  • Tweak the stop loss. Sharpe 1.8.
  • Restrict to "high-momentum" names. Sharpe 2.4.
  • Save. Move on.

Each individual change felt like an improvement. Each was tested in good faith. The final result looks beautiful. And it is, with high probability, almost entirely overfitting.
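
That climb is reproducible on pure noise. Here is a minimal simulation; all numbers are hypothetical (each "tweak" is modeled as a fresh random daily return series with zero true edge, over roughly a year and a half of bars), yet greedy selection alone reproduces the trajectory:

    import numpy as np

    rng = np.random.default_rng(7)
    n_days = 378  # ~1.5 years of daily bars (illustrative assumption)

    def sharpe(returns):
        # Annualized Sharpe of a daily return series
        return returns.mean() / returns.std() * np.sqrt(252)

    # Every "variant" is pure noise: zero true edge by construction.
    best = sharpe(rng.normal(0.0, 0.01, n_days))
    print(f"first backtest: Sharpe {best:+.2f}")
    for step in range(5):
        # Each round of tweaking secretly searches a small grid of
        # variants and keeps whichever looks best in-sample.
        candidates = [sharpe(rng.normal(0.0, 0.01, n_days)) for _ in range(10)]
        best = max(best, *candidates)
        print(f"after tweak round {step + 1}: Sharpe {best:+.2f}")

A typical run climbs from near 0 toward 2, even though a random variant has expected Sharpe 0 and nothing here has any edge.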

The Garden of Forking Paths

You did not run "one backtest." You ran a tree of dozens of possible backtests, and the data guided you to one branch of that tree. The Sharpe of "the branch the data picked" is much higher than the Sharpe of "any random branch," even when no real edge exists. This is the garden of forking paths, a term coined by the statisticians Andrew Gelman and Eric Loken.

The Effective Number of Trials

Every parameter you tweaked is a trial. Every indicator you swapped is a trial. Every dataset, every timeframe, every universe filter. The total effective trial count for a typical retail backtest run is often in the dozens — sometimes hundreds.

With 100 effective trials and no real edge, the expected best Sharpe is around 2.0 (the exact figure depends on how long the backtest is; the sketch below assumes roughly a year and a half of daily data). Reporting 2.4 from such a search is reporting noise.
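
That number comes from the expected maximum of N independent Sharpe estimates, per the False Strategy theorem of Bailey and López de Prado. A minimal sketch, assuming independent trials and a normal approximation to the Sharpe estimator:

    import numpy as np
    from scipy.stats import norm

    EULER_GAMMA = 0.5772156649

    def expected_max_sharpe(n_trials: int, years: float) -> float:
        # Expected best annualized Sharpe across n_trials independent
        # zero-edge backtests of the given length. The standard error
        # of an annualized Sharpe estimate is ~1/sqrt(years) when the
        # true Sharpe is zero.
        se = 1.0 / np.sqrt(years)
        z1 = norm.ppf(1.0 - 1.0 / n_trials)
        z2 = norm.ppf(1.0 - 1.0 / (n_trials * np.e))
        return se * ((1.0 - EULER_GAMMA) * z1 + EULER_GAMMA * z2)

    print(expected_max_sharpe(100, 1.5))  # ~2.1 on 1.5 years of data
    print(expected_max_sharpe(100, 5.0))  # ~1.1: longer samples lower the bar

Note that the same search over a longer sample inflates less, because noise averages out; short backtests are the easiest to p-hack.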

Detection: Was Your Process p-Hacked?

  • Did you change anything in response to seeing in-sample (IS) results? If yes, you are partly p-hacking. The fix is to commit to choices before looking, or to use walk-forward to validate afterward.
  • How many backtests did you run today before settling on "the one"? If > 5, your effective p-value is much worse than you think (see the sketch after this list).
  • Does your strategy use unusual or specific parameters? RSI(14) is innocent — it's a Wilder default and not data-derived. RSI(13) is suspicious — it likely came from a sweep.
  • Did the strategy improve monotonically as you tweaked? Real edges show up early or not at all. A strategy that needed 20 incremental refinements to look good is mostly a record of those refinements, not of an underlying edge.
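
The backtest count in the second question compounds faster than intuition suggests. A Šidák-style correction makes it concrete (the trial counts below are illustrative):

    def effective_p(p_single: float, n_trials: int) -> float:
        # Probability of at least one false positive among n_trials
        # independent tests, each run at level p_single (Sidak correction).
        return 1.0 - (1.0 - p_single) ** n_trials

    print(effective_p(0.05, 1))   # 0.05: one pre-committed test
    print(effective_p(0.05, 6))   # ~0.26: six "quick checks"
    print(effective_p(0.05, 30))  # ~0.79: a morning of tweaking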

Prevention: The Disciplined Workflow

  1. Decide before you measure. Specify the strategy in advance — indicator, period, threshold, stop — and run it once. Whatever Sharpe falls out is honest. If it disappoints, accept it and move on.
  2. Sweep deliberately, not iteratively. A planned sweep over a defined parameter grid (then DSR-corrected) is honest. Ad-hoc tweaks one at a time are not.
  3. Use walk-forward. Walk-forward neutralizes most p-hacking because each fold re-optimizes from scratch. The composite out-of-sample (OOS) Sharpe is what is left after the search bias washes out.
  4. Use Time Travel. Building strategies as if you were on a past date prevents the most insidious form of p-hacking: subconsciously selecting strategies that you already know worked in the recent past.
  5. Track DSR, not headline Sharpe. DSR explicitly accounts for the search you performed (a minimal computation is sketched after this list). A high Sharpe with low DSR is a data confession, not a finding.
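
For item 5, here is a minimal sketch of the DSR computation from Bailey and López de Prado (2014). The trial count and trial variance below are illustrative assumptions, not platform defaults, and Sharpe is per-period (daily), not annualized:

    import numpy as np
    from scipy.stats import norm

    EULER_GAMMA = 0.5772156649

    def deflated_sharpe_ratio(sr, n_obs, n_trials, var_trials,
                              skew=0.0, kurt=3.0):
        # Probability that the observed per-period Sharpe sr beats the
        # best Sharpe expected from n_trials zero-edge trials, where
        # var_trials is the variance of the Sharpe estimates across the
        # trials actually run.
        sr0 = np.sqrt(var_trials) * (
            (1.0 - EULER_GAMMA) * norm.ppf(1.0 - 1.0 / n_trials)
            + EULER_GAMMA * norm.ppf(1.0 - 1.0 / (n_trials * np.e))
        )
        num = (sr - sr0) * np.sqrt(n_obs - 1)
        den = np.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
        return norm.cdf(num / den)

    # Hypothetical: daily Sharpe 0.11 (~1.7 annualized), 2 years of data,
    # found after ~100 tweaks whose Sharpe estimates had variance 0.005.
    print(deflated_sharpe_ratio(0.11, n_obs=504, n_trials=100,
                                var_trials=0.005))  # ~0.06: not a finding

A DSR near 1 says the Sharpe survives the search that produced it; the ~0.06 here says an annualized 1.7 found after a hundred tweaks almost certainly does not.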

The Bottom Line

p-hacking is not a moral failing — it is the natural consequence of an inquisitive researcher with a flexible toolkit. The fix is not to be smarter; it is to use procedures (walk-forward, DSR, Time Travel) that are robust to your inquisitiveness. The platform's rigor stack exists to protect you from yourself.

Further Reading

Foundational papers

  • Bailey, D. H., Borwein, J. M., López de Prado, M. & Zhu, Q. J. (2014). Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance. Notices of the AMS, 61(5), 458–471.
  • Harvey, C. R. & Liu, Y. (2015). Backtesting. Journal of Portfolio Management, 42(1), 13–28.
  • Sullivan, R., Timmermann, A. & White, H. (1999). Data-Snooping, Technical Trading Rule Performance, and the Bootstrap. Journal of Finance, 54(5), 1647–1691.

Textbook references

  • López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.
  • Pardo, R. (2008). The Evaluation and Optimization of Trading Strategies (2nd ed.). Wiley.

Try it in QuanterLab

Track how many distinct backtests you ran today before settling on "the one." If the count is > 5, your effective p-value is much worse than the headline Sharpe suggests. Use DSR or a robustness sweep to recover honest significance.
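
One way to make the count honest is to measure it rather than estimate it. A hypothetical wrapper (not a QuanterLab API), sketched here:

    import functools

    def counted(backtest_fn):
        # Tally every invocation of your backtest so the day's trial
        # count is a measured fact, not a flattering guess.
        @functools.wraps(backtest_fn)
        def wrapper(*args, **kwargs):
            wrapper.n_trials += 1
            return backtest_fn(*args, **kwargs)
        wrapper.n_trials = 0
        return wrapper

Decorate your backtest once, read its n_trials attribute at the end of the day, and feed the number into the effective-p or DSR corrections above.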
