2026-05-11 · 10 min read · #methodology #cointegration #adf #mean-reversion #pair-trading #python

Cointegration vs correlation: the trap that kills pair trading

Why high correlation doesn't make a profitable pair trade — and what cointegration actually tests instead. Engle-Granger walkthrough with runnable Python.

If you've correlated 50 crypto pairs over 540 days, sorted by Pearson r, picked the top entries and tried to trade their ratio — you've probably already learned that high correlation doesn't make a profitable pair trade. This is the most common mistake in pair trading, and it kills more retail attempts at the strategy than slippage and fees combined.

Correlation tells you that two series tend to move in the same direction at the same time. It does not tell you that their ratio returns to a stable value. Those are different statistical properties, and pair trading on a ratio depends on the second one, not the first.

The right test is cointegration. This post walks through what cointegration actually means, how to test for it in Python, and how it relates to the Hurst + ADF approach we cover in our walk-forward methodology guide. By the end, you should be able to tell a tradable pair from a correlated-but-doomed pair without writing a single backtest.

What correlation actually measures

Pearson's correlation coefficient r measures the linear relationship between two series after each is centered on its own mean and scaled by its own standard deviation. In NumPy, on two return series:

r = np.corrcoef(returns_a, returns_b)[0, 1]

What that gives you: a number between −1 and +1. High positive r means that when one series is above its mean, the other tends to be above its mean too. That's it.

Three things r does not tell you:

Whether the ratio is stable. Two series can both be trending upward at similar rates with r near 1.0 while their ratio drifts monotonically in one direction.
Whether the level relationship is consistent. Correlation is scale-invariant. A series that's 100× the magnitude of another can have r = 0.99 if their daily moves line up.
Whether mean-reversion exists in the spread. Correlation is a property of returns; mean-reversion is a property of levels (or of the log-ratio of levels). Different domains.
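To make the "centered and scaled" definition concrete, here is a minimal sketch that computes r by hand and checks it against np.corrcoef. The helper name pearson_r and the synthetic moves are illustrative assumptions, not part of any library or of the PairScan codebase. It also shows the scale-invariance point: multiplying one series by 100 and shifting it leaves r unchanged.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson r: center each series on its mean, scale by its std, average the product."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return float(np.mean(zx * zy))


# Hypothetical data: b mostly follows a, plus some independent noise.
rng = np.random.default_rng(0)
moves_a = rng.normal(0.0, 0.02, 500)
moves_b = moves_a + rng.normal(0.0, 0.005, 500)

r_plain = pearson_r(moves_a, moves_b)
r_numpy = float(np.corrcoef(moves_a, moves_b)[0, 1])
r_scaled = pearson_r(moves_a, 100.0 * moves_b + 5.0)  # rescale and shift b

print(f"manual r:            {r_plain:.4f}")
print(f"np.corrcoef r:       {r_numpy:.4f}")
print(f"r after 100x and +5: {r_scaled:.4f}")
```

Nothing in this calculation looks at the level of either series or at their ratio, which is why r cannot answer the three questions above.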

This last point is where the confusion comes from. People run np.corrcoef on prices (not returns), see a high number, and conclude the pair is suitable for trading. Correlation on prices of two trending series is almost always high — not because the pair mean-reverts, but because both series share an underlying trend component.

The killer counter-example

Here's a synthetic case that shows why correlation alone is misleading. Two geometric Brownian motion series with similar drift and similar volatility:

import numpy as np
import pandas as pd

np.random.seed(42)
n = 540
dt = 1.0 / 365.0

drift_a = 0.6   # ~60% annualized drift
drift_b = 0.5
sigma = 0.8

shocks_a = np.random.normal(0, sigma * np.sqrt(dt), n)
shocks_b = np.random.normal(0, sigma * np.sqrt(dt), n)

log_price_a = np.cumsum((drift_a - 0.5 * sigma**2) * dt + shocks_a)
log_price_b = np.cumsum((drift_b - 0.5 * sigma**2) * dt + shocks_b)

price_a = 100 * np.exp(log_price_a)
price_b = 100 * np.exp(log_price_b)

# Correlation on prices
r_prices = np.corrcoef(price_a, price_b)[0, 1]
print(f"Correlation on prices: {r_prices:.3f}")

# Correlation on returns
returns_a = np.diff(log_price_a)
returns_b = np.diff(log_price_b)
r_returns = np.corrcoef(returns_a, returns_b)[0, 1]
print(f"Correlation on returns: {r_returns:.3f}")

# Now the log-ratio
log_ratio = np.log(price_a / price_b)
print(f"Log-ratio drift: {log_ratio[-1] - log_ratio[0]:.3f}")

Typical output:

Correlation on prices: 0.972
Correlation on returns: -0.018
Log-ratio drift: 1.840

Correlation on prices is 0.97. Looks great. Correlation on returns is essentially zero: there's no shared shock structure between the two series. And the log-ratio drifts by 1.84 over the window, which in price terms means the ratio grew roughly sixfold (e^1.84 ≈ 6.3). This is a series that's nowhere near mean-reverting, and any pair-trade entry based on "high correlation" would have bled money the entire time.

The lesson: high correlation on prices is mostly a statement about shared trend, not about tradable spread behavior.
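For contrast, here is a sketch of the opposite case: a pair built so that both legs share one random-walk trend and differ only by a mean-reverting spread. The AR(1) coefficient of 0.9 and the noise scale are arbitrary assumptions chosen for illustration. By construction the log-ratio is the spread itself, so it stays in a narrow band while both prices wander.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 540
dt = 1.0 / 365.0
sigma = 0.8

# Shared stochastic trend: both legs ride the same random walk in log space.
common = np.cumsum(rng.normal(0.0, sigma * np.sqrt(dt), n))

# Mean-reverting spread: AR(1) with coefficient 0.9 (hypothetical choice).
phi = 0.9
noise = rng.normal(0.0, 0.01, n)
spread = np.zeros(n)
for t in range(1, n):
    spread[t] = phi * spread[t - 1] + noise[t]

log_price_a = common + spread
log_price_b = common

r_returns = float(np.corrcoef(np.diff(log_price_a), np.diff(log_price_b))[0, 1])
log_ratio = log_price_a - log_price_b

print(f"Correlation on returns: {r_returns:.3f}")
print(f"Log-ratio range: {log_ratio.min():.3f} to {log_ratio.max():.3f}")
print(f"Log-price range of B: {log_price_b.min():.3f} to {log_price_b.max():.3f}")
```

The thing to compare across the two examples is the behavior of the log-ratio, not the correlation number: here the ratio is bounded because the two legs share their shocks, not merely their drift.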

What cointegration actually tests

Cointegration (Engle & Granger, 1987) tests something different. The intuition: if two non-stationary series A and B are cointegrated, there exists a linear combination A − β·B whose residuals are stationary. Stationary means the residual series has a stable mean and bounded variance — exactly the property pair trading needs.

In plainer terms: A and B can each wander off on their own (both are "I(1)" — integrated of order one), but if you combine them with the right hedge ratio β, the combination stays in a stable band.

For pair trading on a log-ratio with equal weights, this reduces to: is log(A/B) stationary? That's what the standard Augmented Dickey-Fuller (ADF) test on the log-ratio answers. It's a special case of cointegration where β is fixed at 1 in log-space.

For a more general cointegration test, you fit β from the data first, then ADF the residuals. This is the Engle-Granger 2-step.

The Engle-Granger 2-step in code

Here's the standard procedure. We use statsmodels for both the OLS regression and the ADF test:

import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller


def engle_granger(price_a, price_b):
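    """Engle-Granger 2-step cointegration test.

    Returns (beta, adf_pvalue). As a rough screen, p < 0.05 suggests
    cointegration; see the caveat at step 2."""

    log_a = np.log(np.asarray(price_a, dtype=float))
    log_b = np.log(np.asarray(price_b, dtype=float))

    # Step 1: regress log(A) on log(B) to find hedge ratio
    X = sm.add_constant(log_b)          # columns: [intercept, log_b]
    model = sm.OLS(log_a, X).fit()
    beta = model.params[1]              # slope on log_b
    residuals = model.resid

    # Step 2: ADF test on residuals.
    # Caveat: adfuller's p-value assumes an observed series, not residuals
    # from a fitted regression, so it overstates significance here.
    # statsmodels.tsa.stattools.coint uses Engle-Granger critical values.
    adf_stat, p_value, *_ = adfuller(residuals, autolag='AIC')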

    return beta, p_value

Run it on the synthetic series from above:

beta, p = engle_granger(price_a, price_b)
print(f"Hedge ratio β: {beta:.3f}")
print(f"ADF p-value on residuals: {p:.3f}")

For the GBM example, p-value will typically come back well above 0.05 — the ADF test correctly fails to reject the unit-root hypothesis on the residuals. The series are not cointegrated, even though their price correlation was 0.97. This is exactly what we want the test to do.

A practical caveat worth knowing: the Engle-Granger test is asymmetric. If you regress A on B, you get one β; if you regress B on A, you get a different one (not just the reciprocal). The residuals and the p-value can also differ. Convention is to use the more liquid leg as the regressor B, but in crypto where liquidity is similar across the pair, you should run both directions and take the more conservative p-value.

For a more symmetric and statistically robust test, the Johansen test handles both directions simultaneously and extends naturally to more than two assets. We don't use it in production because it's overkill for two-asset pair trading on liquid crypto, but it's worth knowing about for portfolio-level work.

Hurst on log-ratio vs ADF on residuals

Three different tests come up repeatedly in pair-trading literature, and they're easy to confuse:

Hurst exponent on the log-ratio. Tests for long-term memory. H < 0.5 = anti-persistent / mean-reverting.
ADF on the log-ratio directly. Tests stationarity assuming fixed β=1 in log-space. Equivalent to "are equal-weighted swaps appropriate?"
ADF on residuals from log(A) ~ log(B). Tests cointegration with a fitted hedge ratio. More general.

The first two test the same object (log-ratio) using different statistical frameworks. They tend to agree on strongly mean-reverting pairs and disagree on borderline cases. We've found Hurst more forgiving on noisy crypto data — strict ADF (p < 0.05) on the log-ratio rejects too many pairs that visually look fine.
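For readers who have not computed a Hurst exponent before, here is one common estimator, sketched with NumPy only. It is an assumption that this matches what any given screener runs; several estimators exist (rescaled range, DFA, this lagged-difference scaling), and they do not always agree. The idea: for a random walk the standard deviation of x[t+lag] − x[t] grows like lag^0.5, and for a mean-reverting series it grows more slowly.

```python
import numpy as np


def hurst_exponent(series, max_lag=50):
    """Estimate H from how the std of lagged differences scales with the lag.

    Fits log(std) against log(lag); the slope is the estimate of H.
    """
    x = np.asarray(series, dtype=float)
    lags = np.arange(2, max_lag + 1)
    stds = np.array([np.std(x[lag:] - x[:-lag]) for lag in lags])
    slope, _ = np.polyfit(np.log(lags), np.log(stds), 1)
    return float(slope)


# Hypothetical demo: a pure random walk vs. a mean-reverting AR(1) series.
# 5000 points are used so the estimates are stable; on a 540-point window
# the same estimator is noticeably noisier.
rng = np.random.default_rng(11)
n = 5000
random_walk = np.cumsum(rng.normal(0.0, 1.0, n))

shocks = rng.normal(0.0, 1.0, n)
mean_reverting = np.zeros(n)
for t in range(1, n):
    mean_reverting[t] = 0.5 * mean_reverting[t - 1] + shocks[t]

h_walk = hurst_exponent(random_walk)
h_revert = hurst_exponent(mean_reverting)
print(f"H (random walk):    {h_walk:.3f}")
print(f"H (mean-reverting): {h_revert:.3f}")
```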

The third test is doing something fundamentally different: it lets β absorb any persistent drift in the level relationship between A and B, so the residuals just have to be stationary around zero. A pair that fails ADF on the log-ratio (because the ratio is drifting slowly) can still pass Engle-Granger if the drift is captured by β.

In our experience running these on crypto pairs over 540-day windows:

The two log-ratio tests (Hurst and ADF) agree about 80% of the time on whether a pair is mean-reverting. Disagreements concentrate around H ∈ [0.45, 0.55] and ADF p ∈ [0.05, 0.4].
Engle-Granger flags ~15-20% more pairs as "cointegrated" than the log-ratio ADF flags as "stationary", because Engle-Granger absorbs drift into β. Whether those extra pairs are tradable in practice depends on whether the fitted β is stable over time — usually it isn't.

When you actually need full cointegration

For most crypto-vs-crypto pairs, equal-weight log-ratio (β=1 in log-space) is the right choice and the simpler test is sufficient. Reasons:

Liquidity is roughly comparable across major crypto pairs, so swapping equal log-units is operationally clean.
A fitted β that's not exactly 1 implies a position with non-equal exposure to each leg, which complicates hedging and rebalancing.
A fitted β estimated from 540 days of data is not stable. Re-fit monthly and you'll see β drift by 10-30% on most pairs. That drift is unaccounted-for risk in a backtest.
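One way to check β stability on your own pairs is to re-fit the regression on rolling windows and look at how far the slope moves. The sketch below is an assumption about how such a check could be set up, not a description of PairScan's code: the 180-observation window, the 30-observation step, and the helper names rolling_beta and beta_drift are all hypothetical.

```python
import numpy as np


def rolling_beta(log_a, log_b, window=180, step=30):
    """OLS slope of log_a on log_b, re-fitted on windows advanced by `step` observations."""
    log_a = np.asarray(log_a, dtype=float)
    log_b = np.asarray(log_b, dtype=float)
    betas = []
    for start in range(0, len(log_a) - window + 1, step):
        sl = slice(start, start + window)
        slope, _ = np.polyfit(log_b[sl], log_a[sl], 1)
        betas.append(slope)
    return np.array(betas)


def beta_drift(betas):
    """Relative spread of the re-fitted betas: (max - min) / |median|."""
    return float((betas.max() - betas.min()) / abs(np.median(betas)))


# Hypothetical demo data with a constant true beta of 1.5, so the measured
# drift here reflects estimation noise only.
rng = np.random.default_rng(5)
n = 540
demo_log_b = np.cumsum(rng.normal(0.0, 0.04, n))
demo_log_a = 1.5 * demo_log_b + rng.normal(0.0, 0.02, n)

betas = rolling_beta(demo_log_a, demo_log_b)
print(f"re-fits: {len(betas)}")
print(f"beta range: {betas.min():.3f} to {betas.max():.3f}")
print(f"relative drift: {beta_drift(betas):.1%}")
```

On this idealized pair the drift is small because the true relationship never changes. Running the same check on real pairs is how you would see whether the fitted β holds still long enough to trade on.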

You start needing the full cointegration framework when:

The two assets have very different volatility profiles (e.g., a high-cap and a small-cap), where an equal-weight log-ratio position lets the more volatile leg dominate the spread's risk. A difference in price levels alone is not a reason: in log-space it only shifts the constant.
You're trading a basket (3+ assets) and need a Johansen-style estimate of multiple cointegrating relationships.
You're doing index arbitrage or basis trading where the hedge ratio is a known construct (e.g., delta from an options model) rather than a fitted statistical artifact.

For the standard "two liquid crypto assets, log-ratio, mean-reversion on a 540-day window" setup that PairScan focuses on, log-ratio with β=1 is the cleaner and more honest choice. The hedge ratio you can't estimate stably is a hedge ratio you shouldn't be relying on.

What PairScan uses and why

The screener at pairscan.io does not run a cointegration filter as such. It runs:

Hurst exponent on the log-ratio (H < 0.5)
ADF on the log-ratio (p < 0.7 — looser than textbook because combined with other filters)
Range width filter (≥ 40% of historical range)
Alternating boundary touches (≥ 2 per side)
Volume filter ($1M+ daily spot per leg)

That's a four-filter combination on the log-ratio plus a liquidity gate. Engle-Granger ADF on residuals would add a sixth test. We've experimented with adding it and the marginal benefit was small — pairs that pass our four filters mostly pass Engle-Granger too, and the ones that pass Engle-Granger but fail our filters tend to be trending pairs with a fitted β absorbing the trend (which is exactly the failure mode we want to avoid).

The honest answer: cointegration as a separate filter is rigorous in theory but adds compute cost and false positives in our specific domain. We may add it as an optional advanced filter in a future release for users who want to overlay it manually.

If you want to run cointegration tests on your own data alongside the PairScan filters, the open-source utility at github.com/pairscan/ratio-mean-reversion includes the Hurst + ADF + walk-forward stack under MIT license. The Engle-Granger 2-step from this post drops in cleanly as an additional function — about 10 lines on top of statsmodels.

Where this analysis fails

A few honest limitations of cointegration as a tool:

β is not stable over time. The hedge ratio you estimate from the last 540 days is not the hedge ratio that will hold for the next 540. Markets regime-shift, volatility profiles change, narratives reweight underlying drivers. A "cointegrated" pair last year can be just two correlated trends this year.

ADF has low power on short crypto histories. Standard econometrics literature uses 30+ years of monthly data. We have 1-3 years of daily data on most crypto pairs, less on cross-asset. The ADF test simply can't distinguish a slow mean-reverting series from a slow random walk on samples this short. In practice, p-values in the neighborhood of 0.3 carry little information: a pair at p = 0.25 and a pair at p = 0.35 should not be ranked against each other on that basis.

Cointegration is descriptive, not predictive. A pair that was cointegrated in the last window might not be in the next one — same as the failure modes we cover for mean-reversion in general. Regime changes break cointegration just like they break Hurst-based filters.

The choice of regressor matters. Engle-Granger asymmetry is real. On crypto, where neither leg is obviously the "right" regressor, running both directions and being conservative is the responsible move, but requiring both directions to pass also raises the chance of discarding a genuinely cointegrated pair.

References

Engle, R.F. & Granger, C.W.J. (1987). "Co-integration and Error Correction: Representation, Estimation, and Testing". Econometrica, 55(2), 251-276.
Dickey, D.A. & Fuller, W.A. (1979). "Distribution of the Estimators for Autoregressive Time Series with a Unit Root". Journal of the American Statistical Association, 74(366), 427-431.
Lo, A.W. & MacKinlay, A.C. (1988). "Stock Market Prices Do Not Follow Random Walks: Evidence from a Simple Specification Test". The Review of Financial Studies, 1(1), 41-66.
Krauss, C. (2017). "Statistical Arbitrage Pairs Trading Strategies: Review and Outlook". Journal of Economic Surveys, 31(2), 513-545.
Gatev, E., Goetzmann, W.N., Rouwenhorst, K.G. (2006). "Pairs Trading: Performance of a Relative-Value Arbitrage Rule". The Review of Financial Studies, 19(3), 797-827.

Try it

The screener at pairscan.io runs the four-filter pipeline plus walk-forward backtest across 170+ crypto and cross-asset pairs every 6 hours. Free tier shows the top 3 pairs each day. Personal at $19/mo gives you the full backtest with trade markers and Telegram alerts on zone entries.

If you'd rather implement and verify the methodology yourself, github.com/pairscan/ratio-mean-reversion has the whole stack (filters, backtest, no-lookahead test) under MIT license. The Engle-Granger code from this post sits in about 10 lines on top: drop it in and run cointegration tests alongside the existing filters on your own data.

The point of this post isn't that cointegration is the One True Test. It's that correlation isn't, and that the difference matters more than most pair-trading content acknowledges. If you're going to pick a pair based on a single number, pick a stationarity test on the spread, not a correlation on the levels.