👀 PairScan

10 min read · #methodology #hurst #adf #walk-forward #backtest #python

Hurst, ADF and walk-forward backtest: a practical guide

Practical walkthrough of Hurst exponent, ADF test, and walk-forward backtesting for crypto pairs trading. With Python code and real data.

Most "pair trading" content for crypto leans on one of two extremes: either it's pure marketing copy ("our proprietary algorithm finds winning trades!") or it's an academic paper aimed at PhDs. This post sits in the middle. If you've heard of mean-reversion testing but never coded it, this should walk you through enough math to be dangerous, with working Python at every step.

We'll cover three things, in order: how to test whether a pair mean-reverts (Hurst exponent + ADF), how to combine those tests with operational filters (range width, alternating touches), and how to backtest the resulting strategy without lookahead bias.

All code is runnable. All references are to actual academic papers. By the end, you should be able to evaluate any crypto pair yourself in about 50 lines of Python.

What "mean-reversion in a ratio" actually means

Take two assets, A and B. Compute the log-ratio:

import numpy as np
log_ratio = np.log(price_a / price_b)

The log here is important. Without log, the ratio is multiplicative and the deviation from "fair value" doesn't sit symmetrically around a mean. With log, deviations are additive and standard statistical tests work cleanly.
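A quick numeric check of that symmetry claim. The ratio values here are made up for illustration (a doubling and a halving around a "fair" level of 1.0), not market data:

```python
import numpy as np

# Hypothetical ratio values: "fair" level, then a doubling and a halving
fair, doubled, halved = 1.0, 2.0, 0.5

# Raw deviations are lopsided: +1.0 up versus -0.5 down
raw_up, raw_down = doubled - fair, halved - fair

# Log deviations are mirror images of each other
log_up, log_down = np.log(doubled / fair), np.log(halved / fair)

print(f"raw: {raw_up:+.3f} / {raw_down:+.3f}")
print(f"log: {log_up:+.3f} / {log_down:+.3f}")
```

This prints +1.000 / -0.500 for the raw ratio and +0.693 / -0.693 for the log-ratio: the same proportional move up or down is the same distance from the mean once you take logs.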

A pair "mean-reverts" if when log_ratio drifts above its average, it tends to come back down (and vice versa). Visually: imagine the log-ratio bouncing inside a horizontal band, never escaping. That's the picture.

If the log-ratio drifts in one direction without returning, as a trending asset would, the pair is not mean-reverting and pair trading on it loses money.

So the question becomes: how do you tell mean-reversion from trending statistically?

Test 1: Hurst exponent

The Hurst exponent (H. E. Hurst, 1951, originally developed for Nile river floods) measures long-term memory of a time series. It's a single number, typically between 0 and 1:

  • H < 0.5: anti-persistent / mean-reverting (we want this)
  • H = 0.5: random walk (no useful pattern)
  • H > 0.5: persistent / trending (we filter out)

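There are several ways to estimate Hurst. The most classical is rescaled range (R/S) analysis. The function below uses a simpler, widely used shortcut instead: it fits how the standard deviation of lagged differences scales with the lag.

import numpy as np

def hurst_exponent(series, max_lag=20):
    """Hurst via scaling of lagged differences. < 0.5 = mean-reverting."""
    series = np.asarray(series, dtype=float)
    if len(series) < max_lag * 2:
        raise ValueError("Series too short")

    lags = np.arange(2, max_lag)
    tau = []
    for lag in lags:
        # std(x[t+lag] - x[t]) grows like lag**H, so its sqrt grows like lag**(H/2)
        diff = series[lag:] - series[:-lag]
        tau.append(np.sqrt(np.std(diff)))

    # Slope of the log-log fit is H/2; multiply by 2 to recover H
    poly = np.polyfit(np.log(lags), np.log(tau), 1)
    return poly[0] * 2.0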

Apply this to the log-ratio, not raw prices:

log_ratio = np.log(price_a / price_b)
h = hurst_exponent(log_ratio)
print(f"Hurst: {h:.3f}")  # < 0.5 = mean-reverting
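Before trusting the number on real pairs, it helps to check the estimator on series whose behavior you already know. The sketch below is an assumption-laden toy: it copies the lagged-difference logic of hurst_exponent under a different name so the block runs on its own, and it uses a seeded random walk plus an AR(1) series with an arbitrarily chosen coefficient of 0.9 as the mean-reverting case.

```python
import numpy as np

def hurst_lagged_diff(series, max_lag=20):
    """Same lagged-difference logic as hurst_exponent, repeated so this block runs alone."""
    series = np.asarray(series, dtype=float)
    lags = np.arange(2, max_lag)
    tau = [np.sqrt(np.std(series[lag:] - series[:-lag])) for lag in lags]
    return np.polyfit(np.log(lags), np.log(tau), 1)[0] * 2.0

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 2000
shocks = rng.normal(size=n)

# Random walk: cumulative sum of shocks, no pull toward any level
random_walk = np.cumsum(shocks)

# AR(1) with coefficient 0.9 (an assumed value): each step keeps 90% of the
# previous deviation, so the series is pulled back toward zero
mean_reverting = np.zeros(n)
for t in range(1, n):
    mean_reverting[t] = 0.9 * mean_reverting[t - 1] + shocks[t]

h_rw = hurst_lagged_diff(random_walk)
h_mr = hurst_lagged_diff(mean_reverting)
print(f"random walk H:    {h_rw:.3f}")
print(f"mean-reverting H: {h_mr:.3f}")
```

The random walk should land near 0.5 and the AR(1) series clearly below it. If your own implementation can't separate these two, it won't separate real pairs either.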

Practical observations from running this on real crypto pairs:

  • Memes (DOGE, SHIB, PEPE, WIF) almost never show H < 0.5. They trend in one direction violently.
  • Cross-asset pairs (crypto vs tokenized equities) cluster around H ≈ 0.4, solidly mean-reverting, because the underlying drivers (BTC dominance, USD strength) reset frequently.
  • Pure crypto-vs-crypto pairs cluster around H ≈ 0.45, with high variance.

This estimator is sensitive to the max_lag choice: different lag ranges give slightly different values on the same series. The default max_lag=20 works well for daily data; experiment if you're working with intraday data.

Test 2: Augmented Dickey-Fuller

ADF (Dickey & Fuller, 1979) is a unit-root test. The intuition: if a series has a unit root, it behaves like a random walk and doesn't return to a stable mean. If we can reject the unit-root hypothesis (low p-value), the series is stationary, meaning it has a stable mean to revert to.

We use statsmodels:

from statsmodels.tsa.stattools import adfuller

def adf_pvalue(series):
    result = adfuller(series, autolag='AIC')
    return result[1]  # p-value

p = adf_pvalue(log_ratio)
print(f"ADF p-value: {p:.3f}")  # lower = stronger evidence against a unit root
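To make the unit-root intuition concrete, here is a minimal sketch of the regression behind the test, written with NumPy only. It is the plain Dickey-Fuller form (a constant, no lagged-difference terms), so it is a simplification of what adfuller does, not a replacement for it. The function name, the synthetic series and the AR(1) coefficient are all assumptions made for illustration.

```python
import numpy as np

def dickey_fuller_tstat(series):
    """t-statistic of beta in: x[t] - x[t-1] = alpha + beta * x[t-1] + error.
    Plain Dickey-Fuller with a constant and no lagged-difference terms."""
    x = np.asarray(series, dtype=float)
    dx = np.diff(x)                                   # x[t] - x[t-1]
    X = np.column_stack([np.ones(len(dx)), x[:-1]])   # constant and lagged level
    coef, *_ = np.linalg.lstsq(X, dx, rcond=None)     # OLS fit
    resid = dx - X @ coef
    sigma2 = resid @ resid / (len(dx) - 2)            # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # OLS coefficient covariance
    return coef[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
shocks = rng.normal(size=2000)

random_walk = np.cumsum(shocks)          # has a unit root: beta is near 0
mean_reverting = np.zeros(2000)          # AR(1), coefficient 0.9: beta is near -0.1
for t in range(1, 2000):
    mean_reverting[t] = 0.9 * mean_reverting[t - 1] + shocks[t]

t_rw = dickey_fuller_tstat(random_walk)
t_mr = dickey_fuller_tstat(mean_reverting)

# About -2.86 is the large-sample 5% critical value for this constant-only form
print(f"random walk t-stat:    {t_rw:.2f}")
print(f"mean-reverting t-stat: {t_mr:.2f}")
```

A strongly negative t-statistic means the level of the series predicts its next change with the opposite sign, which is exactly what "pulled back to a mean" looks like in a regression. The p-value that adfuller reports comes from comparing a statistic of this kind against the Dickey-Fuller distribution rather than the usual t-distribution.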

Here's where I diverge from textbook practice. Standard ADF threshold is p < 0.05. For crypto data, that's too strict.

Crypto time series are noisier than equity series: single bad candles, low-volume gaps, and exchange-specific quirks all introduce noise that affects ADF p-values without affecting actual mean-reversion behavior. If you use p < 0.05 strictly, you reject genuinely mean-reverting pairs.

We use p < 0.7 combined with other filters. Loose threshold plus complementary filters does the discrimination job better than strict ADF alone. This is what we landed on after running the filter on hundreds of pairs and comparing classifications against visual inspection of the log-ratio charts.

If you only had ADF and nothing else, p < 0.05 would be defensible. With Hurst plus operational filters in the mix, p < 0.7 is the right level.

Operational filters: range width and alternating touches

Two filters that aren't statistical but matter operationally:

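Range width. The span of the log-ratio over the window (max minus min) must be at least 0.4, which is roughly a 49% move in the ratio from low to high. If the ratio only oscillates ±5%, realistic entries capture only part of that move, and the 0.1% × 2 fees per round-trip plus slippage take a meaningful share of what's left.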

def range_width(log_ratio):
    return np.max(log_ratio) - np.min(log_ratio)
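To put a log-ratio span in price terms, convert it with exp(span) - 1. The sketch below uses the 0.4 value that appears as min_range later in the post and the 0.1% per-swap fee mentioned above. The helper name log_span_to_pct is made up for this example.

```python
import numpy as np

def log_span_to_pct(span):
    """Percentage move of the ratio from range low to range high."""
    return np.exp(span) - 1.0

min_range = 0.4               # log-ratio span required by the filter
fee_round_trip = 2 * 0.001    # two swaps at 0.1% each

pct_move = log_span_to_pct(min_range)
print(f"log span {min_range} = {pct_move:.1%} move low to high")
print(f"round-trip fees = {fee_round_trip:.1%}")
```

This prints a 49.2% move for a span of 0.4, against 0.2% in round-trip fees. The filter is deliberately demanding: it only keeps pairs whose range dwarfs the transaction costs.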

Alternating boundary touches. The series must touch both top and bottom of its range multiple times alternately. This distinguishes a genuinely oscillating series from one that visited an extreme once and then settled on one side (which would generate a single trade, not a strategy).

def alternating_touches(log_ratio, p_low_pct=5, p_high_pct=95):
    low = np.percentile(log_ratio, p_low_pct)
    high = np.percentile(log_ratio, p_high_pct)

    touches_low, touches_high = 0, 0
    last = None  # 'low', 'high', or None

    for value in log_ratio:
        if value <= low and last != 'low':
            touches_low += 1
            last = 'low'
        elif value >= high and last != 'high':
            touches_high += 1
            last = 'high'

    return touches_low, touches_high

We require ≥ 2 alternating touches per side, plus an underlying volume filter ($1M+ daily spot volume for both legs) to keep slippage under control.
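A small sketch of how the touch counter separates an oscillating series from a drifting one. The function is repeated so the block runs on its own, and both input series are synthetic stand-ins (a sine wave and a straight line), not real log-ratios.

```python
import numpy as np

def alternating_touches(log_ratio, p_low_pct=5, p_high_pct=95):
    # Same logic as the function above, repeated so this block runs alone
    low = np.percentile(log_ratio, p_low_pct)
    high = np.percentile(log_ratio, p_high_pct)
    touches_low, touches_high, last = 0, 0, None
    for value in log_ratio:
        if value <= low and last != 'low':
            touches_low += 1
            last = 'low'
        elif value >= high and last != 'high':
            touches_high += 1
            last = 'high'
    return touches_low, touches_high

# Three full oscillations: both band edges are hit repeatedly
oscillating = np.sin(np.linspace(0, 6 * np.pi, 600))
# Steady drift: starts at the bottom, ends at the top, never comes back
drifting = np.linspace(0.0, 1.0, 600)

osc_touches = alternating_touches(oscillating)
drift_touches = alternating_touches(drifting)
print(osc_touches, drift_touches)
```

The sine wave gives (3, 3) and passes a two-per-side requirement. The drifting line gives (1, 1): it technically visits both extremes, but only once each, so it fails.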

Combining the filters

def is_mean_reverting(price_a, price_b,
                     hurst_max=0.5,
                     adf_max=0.7,
                     min_range=0.4,
                     min_touches=2):
    log_ratio = np.log(np.asarray(price_a) / np.asarray(price_b))

    h = hurst_exponent(log_ratio)
    if h >= hurst_max:
        return False, f"Hurst {h:.3f} >= {hurst_max}"

    p = adf_pvalue(log_ratio)
    if p >= adf_max:
        return False, f"ADF p {p:.3f} >= {adf_max}"

    rw = range_width(log_ratio)
    if rw < min_range:
        return False, f"Range {rw:.3f} < {min_range}"

    low, high = alternating_touches(log_ratio)
    if low < min_touches or high < min_touches:
        return False, f"Touches {low}/{high} < {min_touches}"

    return True, f"H={h:.3f}, p={p:.3f}, range={rw:.3f}, touches={low}/{high}"

About 30-40% of pairs pass the Hurst and ADF filters in any given window. That drops to 10-15% after adding range and volume filters. So the four-filter combination is restrictive enough to give you confidence, but not so strict that you have nothing to trade.

Now the harder part: walk-forward backtest

This is the section where most retail-targeted pair trading content makes a critical mistake. They run an "in-sample" backtest where the entry/exit thresholds are computed from the full sample, including future data. This produces inflated returns because at decision time t, you're effectively cheating by knowing the full historical range.

The correct approach is walk-forward. At each decision point t, only data up to t is used to set the thresholds. The percentile bounds get recomputed every day on a trailing window. This is the only way to honestly simulate "what would have happened if I'd been running this in real time."

Here's the difference, in code. The wrong way:

# WRONG: uses full-sample percentiles, has lookahead
p5_full = np.percentile(log_ratio, 5)
p95_full = np.percentile(log_ratio, 95)

for t in range(len(log_ratio)):
    # uses p5_full and p95_full, which are derived from FUTURE data!
    position = (log_ratio[t] - p5_full) / (p95_full - p5_full)
    ...

The right way:

# RIGHT: rolling percentiles, no lookahead
def walk_forward_backtest(price_a, price_b,
                          lookback=540,
                          entry_low=0.2,
                          entry_high=0.8,
                          fee_pct=0.001,
                          initial_a=100.0):
    log_ratio = np.log(np.asarray(price_a) / np.asarray(price_b))

    a_qty = initial_a
    b_qty = 0.0
    holding_a = True
    trades = []

    for t in range(lookback, len(log_ratio)):
        # CRITICAL: only data up to (not including) t
        window = log_ratio[t - lookback:t]
        p5 = np.percentile(window, 5)
        p95 = np.percentile(window, 95)

        current = log_ratio[t]
        if p95 > p5:
            position = (current - p5) / (p95 - p5)
        else:
            position = 0.5

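        if position < entry_low and not holding_a:
            # Ratio is low, so A is cheap relative to B: swap B into A.
            # ratio = price_a / price_b, so 1 B buys 1 / ratio units of A.
            ratio = np.exp(current)
            a_qty = b_qty / ratio * (1 - fee_pct)
            b_qty = 0.0
            holding_a = True
            trades.append(('B->A', t, position))

        elif position > entry_high and holding_a:
            # Ratio is high, so A is rich relative to B: swap A into B.
            # 1 A buys ratio units of B.
            ratio = np.exp(current)
            b_qty = a_qty * ratio * (1 - fee_pct)
            a_qty = 0.0
            holding_a = False
            trades.append(('A->B', t, position))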

    return a_qty, b_qty, trades

The critical line is window = log_ratio[t - lookback:t]. If you write [t - lookback:t + 1] instead, including index t, you've introduced lookahead. Your backtest results will be 10-30% better than real-world execution would be. This is the #1 source of fake alpha in retail backtesting.
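The off-by-one is easy to see on a stand-in series where each value equals its own index. The lookback and t here are arbitrary small numbers chosen for readability:

```python
import numpy as np

log_ratio = np.arange(10, dtype=float)  # stand-in series where value == index
lookback, t = 4, 7

honest = log_ratio[t - lookback:t]      # indices 3..6, stops before t
leaky = log_ratio[t - lookback:t + 1]   # indices 3..7, includes bar t itself

print(honest)
print(leaky)
```

The first slice prints [3. 4. 5. 6.] and the second prints [3. 4. 5. 6. 7.]. Python slices exclude the end index, so [t - lookback:t] is already "everything strictly before t" with no extra arithmetic needed.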

How to verify your backtest doesn't have lookahead bias

Most subtle bug. Easy to miss in code review. Easy to test for explicitly.

Here's the test: run your backtest twice. Second run, replace all prices after some midpoint with garbage values. Then compare trades before the midpoint between the two runs. They must be identical.

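def test_no_lookahead(backtest_fn, price_a, price_b, midpoint=700):
    """If trades before midpoint change when future data is corrupted,
    the backtest has lookahead bias. midpoint must fall after the lookback
    warm-up and before the end of the data, or the check is vacuous."""

    # Work on float arrays so the slice assignments below behave as intended
    price_a = np.asarray(price_a, dtype=float)
    price_b = np.asarray(price_b, dtype=float)
    assert midpoint < len(price_a), "midpoint beyond end of data"

    pa_corrupted = price_a.copy()
    pb_corrupted = price_b.copy()
    pa_corrupted[midpoint:] = 999999.0
    pb_corrupted[midpoint:] = 0.0001

    _, _, trades_clean = backtest_fn(price_a, price_b)
    _, _, trades_corrupted = backtest_fn(pa_corrupted, pb_corrupted)

    # Each trade is (direction, t, position); t is at index 1
    trades_clean_before = [t for t in trades_clean if t[1] < midpoint]
    trades_corr_before = [t for t in trades_corrupted if t[1] < midpoint]

    assert trades_clean_before == trades_corr_before, "LOOKAHEAD BIAS"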

If this test passes, decisions made before the midpoint did not depend on any data after it. If it fails, you have a bug somewhere that's leaking future information into past decisions.
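It is worth confirming that the check actually fires on a known-bad implementation. The sketch below is a self-contained toy, not the backtest from this post: signals is a hypothetical, stripped-down signal generator with a leak switch, has_lookahead is the same corruption idea returning a boolean, and the prices are a synthetic sine-shaped ratio.

```python
import numpy as np

def signals(price_a, price_b, lookback=50, leak=False):
    """Toy signal generator (hypothetical, much simpler than the real backtest).
    Returns (t, side) tuples where the log-ratio leaves the 5-95 percentile band."""
    log_ratio = np.log(np.asarray(price_a, dtype=float) / np.asarray(price_b, dtype=float))
    out = []
    for t in range(lookback, len(log_ratio)):
        # leak=True uses the full sample, future included; leak=False uses only the past
        window = log_ratio if leak else log_ratio[t - lookback:t]
        p5, p95 = np.percentile(window, [5, 95])
        if log_ratio[t] < p5:
            out.append((t, 'low'))
        elif log_ratio[t] > p95:
            out.append((t, 'high'))
    return out

def has_lookahead(signal_fn, price_a, price_b, midpoint):
    """True if corrupting prices from midpoint onward changes earlier signals."""
    pa, pb = np.array(price_a, dtype=float), np.array(price_b, dtype=float)
    pa_bad, pb_bad = pa.copy(), pb.copy()
    pa_bad[midpoint:] = 999999.0
    pb_bad[midpoint:] = 0.0001
    clean = [s for s in signal_fn(pa, pb) if s[0] < midpoint]
    bad = [s for s in signal_fn(pa_bad, pb_bad) if s[0] < midpoint]
    return clean != bad

# Synthetic pair: the log-ratio is a sine wave, leg B is flat
price_a = np.exp(np.sin(np.arange(400) / 10.0))
price_b = np.ones(400)

honest_flag = has_lookahead(lambda a, b: signals(a, b, leak=False), price_a, price_b, 200)
leaky_flag = has_lookahead(lambda a, b: signals(a, b, leak=True), price_a, price_b, 200)
print(honest_flag, leaky_flag)
```

This prints False True. The trailing-window version is unaffected by the corrupted future, while the full-sample version changes its early signals because the garbage values shift its percentile band.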

I've open-sourced the full implementation including this exact test at github.com/pairscan/ratio-mean-reversion under MIT license. The tests/test_no_lookahead.py file is the one that matters most: it's the gate that prevents subtle bias bugs from making it to production.

Putting it all together: real example

Let's run the full pipeline on ETH/USDT and BTC/USDT, 540 days of daily candles from Binance:

import ccxt
import numpy as np

binance = ccxt.binance()
since = binance.parse8601('2024-11-01T00:00:00Z')

eth = binance.fetch_ohlcv('ETH/USDT', '1d', since=since, limit=540)
btc = binance.fetch_ohlcv('BTC/USDT', '1d', since=since, limit=540)

eth_close = np.array([c[4] for c in eth])
btc_close = np.array([c[4] for c in btc])

# Step 1: Does this pair mean-revert?
passed, reason = is_mean_reverting(eth_close, btc_close)
print(f"Pair ETH/BTC: {passed}, {reason}")

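# Step 2: If yes, walk-forward backtest.
# With exactly 540 candles the default lookback=540 leaves no bars to trade,
# so use a shorter trailing window here (180 is an illustrative choice).
if passed:
    a, b, trades = walk_forward_backtest(eth_close, btc_close, lookback=180)
    print(f"Final ETH: {a:.2f}, BTC: {b:.5f}")
    print(f"Trades: {len(trades)}")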

A typical result for ETH/BTC over 540 days might show H ≈ 0.43, ADF p ≈ 0.45, range width ≈ 50%, 4 alternating touches, passing the filter, and 2-3 trades during the walk-forward backtest. That's not high-frequency. That's "swap once every couple of months when the ratio extends."

Where this approach fails

A few honest limitations:

Sample size. Below 200 days you're getting noise, not signal. Below 540 you have to use very loose Hurst/ADF thresholds. Above 3 years you start running into regime changes that make the test itself less meaningful.

Selection bias. Pairs that pass the filter today are not necessarily pairs that have always passed. Some pairs that used to pass have now broken (delistings, narrative collapses). Survivorship bias is real.

Tests are descriptive, not predictive. A pair that was mean-reverting in the last 540 days might not be in the next 540. Walk-forward addresses this only partially: markets do regime-shift.

Hurst estimates have variance. Different max_lag choices give different Hurst values on the same series. The estimate is robust to lag choice when the pair is strongly mean-reverting and sensitive when it is borderline.

Real execution differs from backtest. Slippage, exchange downtime, partial fills, withdrawal limits: none of these are in the simple backtest. Real returns are typically 10-30% lower than backtest claims.

What's next

If you want to go deeper, the references below are where I learned this material:

  • Lo, A.W. & MacKinlay, A.C. (1988). Stock Market Prices Do Not Follow Random Walks. Foundational random-walk vs mean-reversion paper.
  • Gatev, Goetzmann & Rouwenhorst (2006). Pairs Trading: Performance of a Relative-Value Arbitrage Rule. The classic empirical study on equity pairs.
  • Krauss, C. (2017). Statistical Arbitrage Pairs Trading Strategies: Review and Outlook. Comprehensive review of the field including modern approaches.
  • For a deeper treatment of cointegration (which we don't use here but is closely related), Engle & Granger (1987) is the foundational paper.

If you want to skip the implementation work: the screener at pairscan.io runs the full four-filter pipeline plus walk-forward backtest across 170+ pairs every 6 hours, including cross-asset pairs with tokenized US equities. Free tier shows top 3 pairs daily.

But the math is fully open. Anyone can implement this in an afternoon with the code above. Run it on your own data. Run it on synthetic Ornstein-Uhlenbeck and Geometric Brownian Motion series to verify the filter classifies them correctly. The point of this post is to show that the methodology isn't proprietary: it's seventy years of public statistics applied to a relatively new asset class.