Walk-forward vs in-sample backtesting
Why walk-forward backtests are honest and in-sample backtests are not. A short technical post with code.
The single most common fake-alpha generator in retail backtesting is using full-sample data to set thresholds and then "backtesting" the strategy on that same sample. This post explains exactly why that's wrong, what to do instead, and how to verify your code isn't accidentally doing it.
The core problem
Suppose you have 540 days of price data and you want to backtest a mean-reversion strategy that swaps when the log-ratio hits the 5th or 95th percentile of its historical range.
The naive (and wrong) approach:
# WRONG: in-sample backtest
import numpy as np

p5 = np.percentile(log_ratio, 5)    # uses ALL 540 days
p95 = np.percentile(log_ratio, 95)  # including days that are in the future relative to t=0
for t in range(540):
    position = (log_ratio[t] - p5) / (p95 - p5)
    # decisions based on FUTURE percentiles
The bug is subtle: at t=0, the thresholds p5 and p95 were computed from all 540 days (t=0 through t=539), including all the future data your strategy at t=0 couldn't have known. You're effectively cheating.
The result: backtest returns inflate by 10-30% because at every decision point you're using "what we now know about the full range" to decide where the boundaries are.
The walk-forward fix
Walk-forward only uses data up to (but not including) the decision point:
# RIGHT: walk-forward backtest
LOOKBACK = 180  # days
for t in range(LOOKBACK, len(log_ratio)):
    window = log_ratio[t - LOOKBACK:t]  # NOT [t - LOOKBACK:t + 1]
    p5 = np.percentile(window, 5)
    p95 = np.percentile(window, 95)
    position = (log_ratio[t] - p5) / (p95 - p5)
    # decisions based ONLY on past data
The critical detail: the slice ends at t, not t + 1. If you include index t, you've included today's value in today's threshold, which is a mild form of lookahead but still a bias.
The thresholds get recomputed every day on the trailing window. This honestly simulates what you'd have seen running the strategy in real time.
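The same loop can also be written in vectorized form. What follows is a minimal sketch, assuming log_ratio is a 1-D NumPy array. The function name walk_forward_positions is hypothetical and not taken from the PairScan code base, and returning NaN for the warm-up days and for flat windows is a choice made here for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def walk_forward_positions(log_ratio, lookback=180):
    """Position at each t from percentiles of log_ratio[t - lookback:t].

    Hypothetical helper for illustration. The first `lookback` entries are
    NaN because no full trailing window exists for them yet.
    """
    x = np.asarray(log_ratio, dtype=float)
    positions = np.full(x.shape, np.nan)
    if x.size <= lookback:
        return positions

    # Row i of the view is x[i:i + lookback]. Dropping the last row leaves
    # exactly the windows that end just before t = lookback, ..., len(x) - 1.
    windows = sliding_window_view(x, lookback)[:-1]
    p5, p95 = np.percentile(windows, [5, 95], axis=1)

    today = x[lookback:]
    width = p95 - p5
    valid = width > 0  # a flat window has p95 == p5; keep NaN there
    out = np.full(today.shape, np.nan)
    out[valid] = (today[valid] - p5[valid]) / width[valid]
    positions[lookback:] = out
    return positions


# Usage on synthetic data (a random walk standing in for a real log-ratio)
rng = np.random.default_rng(0)
log_ratio = np.cumsum(rng.normal(0.0, 0.01, 540))
positions = walk_forward_positions(log_ratio, lookback=180)
```

Because every row of the view stops at index t - 1, today's value never enters today's threshold, which is the same guarantee the slice [t - LOOKBACK:t] gives in the loop.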
How to verify your backtest is honest
The simplest test: corrupt the future and check that past trades don't change.
def test_no_lookahead(backtest_fn, prices, midpoint=300):
    """Past decisions must not change when future data is corrupted."""
    # Run on clean data
    trades_clean = backtest_fn(prices)

    # Replace future data with garbage (assumes prices is a NumPy array)
    prices_corrupt = prices.copy()
    prices_corrupt[midpoint:] = 999_999.0
    trades_corrupt = backtest_fn(prices_corrupt)

    # Filter to trades before the midpoint
    clean_before = [t for t in trades_clean if t.time < midpoint]
    corrupt_before = [t for t in trades_corrupt if t.time < midpoint]
    assert clean_before == corrupt_before, "Lookahead bias detected"
If this test fails, your backtest is using future information to make past decisions. Find where and fix it.
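To see the check in action, here is a small self-contained sketch that runs it against one leaky and one walk-forward toy backtest. Everything in it is an assumption made for illustration: the Trade record, the two toy strategies, the 90-day lookback, and the use of a single oscillating price series in place of a real pair ratio. The series is deterministic so the outcome does not depend on a random seed.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class Trade:
    time: int  # index of the day the signal fired
    side: str  # "long" or "short"


def leaky_backtest(prices):
    """In-sample version: thresholds use the whole series, future included."""
    x = np.log(prices)
    p5, p95 = np.percentile(x, [5, 95])
    trades = []
    for t in range(len(x)):
        if x[t] <= p5:
            trades.append(Trade(t, "long"))
        elif x[t] >= p95:
            trades.append(Trade(t, "short"))
    return trades


def honest_backtest(prices, lookback=90):
    """Walk-forward version: thresholds use only x[t - lookback:t]."""
    x = np.log(prices)
    trades = []
    for t in range(lookback, len(x)):
        p5, p95 = np.percentile(x[t - lookback:t], [5, 95])
        if x[t] <= p5:
            trades.append(Trade(t, "long"))
        elif x[t] >= p95:
            trades.append(Trade(t, "short"))
    return trades


def has_lookahead(backtest_fn, prices, midpoint=300):
    """Same corruption check as above, returning a bool instead of asserting."""
    corrupt = prices.copy()
    corrupt[midpoint:] = 999_999.0
    before = [t for t in backtest_fn(prices) if t.time < midpoint]
    after = [t for t in backtest_fn(corrupt) if t.time < midpoint]
    return before != after


# Deterministic toy series: 540 days oscillating with a 90-day period.
days = np.arange(540)
prices = 100.0 * np.exp(0.05 * np.sin(2.0 * np.pi * days / 90.0))

print(has_lookahead(leaky_backtest, prices))   # True
print(has_lookahead(honest_backtest, prices))  # False
```

The leaky version is caught because corrupting the second half moves its full-sample 95th percentile, which erases the short signals it had generated near the early peaks.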
We have this exact test in our open-source utility's CI: github.com/pairscan/ratio-mean-reversion/tests/test_no_lookahead.py. It's the gate that prevents subtle bias bugs from making it into production.
Practical implications
If a backtest claims +60% returns over 360 days but doesn't show walk-forward methodology, discount the number heavily. Real-world returns will be 10-30% lower at minimum, often more.
If you're evaluating a screener or strategy:
- Ask whether the backtest is walk-forward.
- Ask what lookback window is used for percentiles.
- If they can't answer or don't know, treat published returns as marketing, not data.
PairScan publishes its full backtest implementation (open source, MIT). Every pair shown on the dashboard uses walk-forward percentile bounds with a 180-day rolling window, no lookahead, with the explicit corruption test in CI. The same code that runs in the production screener runs in the test suite.
A note on lookback choice
The lookback window itself involves a trade-off:
- Short lookback (60-90 days): percentiles adapt quickly to regime changes. But you have less data to set robust thresholds, so signal noise increases.
- Long lookback (360+ days): stable thresholds, less noise. But slow to adapt to genuine regime changes.
We use 180 days as a compromise. It's long enough to capture seasonal-ish patterns but short enough to update on regime shifts within a quarter or two. This is a parameter you can tune for your own use case if you fork the open-source utility.
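If you want a feel for the trade-off on your own data, one rough approach is to compute the trailing band for several lookbacks and compare how much the upper threshold moves from day to day. The sketch below does that on synthetic data. The helper names trailing_band and band_jitter are hypothetical, and the jitter measure is a crude proxy chosen for illustration, not a metric from the PairScan code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def trailing_band(log_ratio, lookback):
    """Return (p5, p95) for t = lookback .. len(log_ratio) - 1, past data only."""
    x = np.asarray(log_ratio, dtype=float)
    # Row i is x[i:i + lookback]; dropping the last row keeps day t out of its own window.
    windows = sliding_window_view(x, lookback)[:-1]
    p5, p95 = np.percentile(windows, [5, 95], axis=1)
    return p5, p95


def band_jitter(log_ratio, lookback):
    """Mean absolute day-to-day move of the upper threshold (crude noise proxy)."""
    _, p95 = trailing_band(log_ratio, lookback)
    return float(np.mean(np.abs(np.diff(p95))))


# Synthetic stand-in for a real log-ratio series
rng = np.random.default_rng(7)
log_ratio = np.cumsum(rng.normal(0.0, 0.01, 1080))

for lookback in (60, 180, 360):
    print(lookback, round(band_jitter(log_ratio, lookback), 5))
```

A jitter number alone doesn't tell you which lookback is best. It only makes the stability side of the trade-off visible, so you can weigh it against how quickly each band follows a regime change.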
What to avoid
Three more common mistakes worth flagging:
1. Look-ahead via dependent variables. If your strategy uses a volatility-adjusted entry threshold, make sure the volatility is also computed only from past data. Easy to forget when there are multiple inputs.
2. Survivorship bias. Backtesting only on pairs that exist today excludes pairs that delisted along the way. The realistic universe should include pairs that broke during the period.
3. Optimistic execution assumptions. Walk-forward fixes lookahead but doesn't address slippage, fees, or partial fills. Assume real returns are 10-30% lower than even an honest backtest would suggest.
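For the first item, the same slicing discipline applies to every derived input. Here is a minimal sketch of a past-only volatility estimate, assuming volatility means the sample standard deviation of daily log-ratio changes over a trailing window. The function name and the 30-day default are hypothetical, and how the volatility then scales the entry threshold is left out because the post doesn't specify it.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def past_only_volatility(log_ratio, lookback=30):
    """Volatility usable on day t, built only from values up to day t - 1.

    Hypothetical helper for illustration. Entries without enough history are NaN.
    """
    x = np.asarray(log_ratio, dtype=float)
    vol = np.full(x.shape, np.nan)
    changes = np.diff(x)  # changes[i] = x[i + 1] - x[i]
    if changes.size < lookback:
        return vol

    # stds[i] covers changes[i:i + lookback], i.e. values x[i] .. x[i + lookback].
    stds = sliding_window_view(changes, lookback).std(axis=1, ddof=1)

    # Day t may only see values up to x[t - 1], so it takes stds[t - 1 - lookback].
    # Dropping the last row excludes the window that would contain x[t] itself.
    vol[lookback + 1:] = stds[:-1]
    return vol


# Usage on synthetic data
rng = np.random.default_rng(3)
log_ratio = np.cumsum(rng.normal(0.0, 0.01, 540))
vol = past_only_volatility(log_ratio, lookback=30)
```

The corruption test from earlier works on inputs like this too: overwrite the second half of the series and check that vol is unchanged before the midpoint.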
Bottom line
Walk-forward isn't optional for honest backtesting. Any retail "backtest" that doesn't use it is either marketing or a bug. The no-lookahead test takes 10 minutes to write and protects you from months of bad decisions.
If you're using a third-party tool, ask the question. If you're building your own, write the test before writing the strategy. The test is the gate that turns a backtest into evidence.