Backtesting Trend Following Systems: Key Metrics for Reliable Results

This is an amateur website, not a professional publication. Pages are written on an occasional basis and are free to read. The contents do not predict economic scenarios or financial outcomes. To the best of the author’s knowledge they represent the current consensus in technical and academic research. They are presented for educational purposes only and under no circumstances are they financial advice or a solicitation to trade. Pages contain paid links. The content of this website is not intended for residents of Chile, Andorra, Italy, Spain, France, Germany, Turkey, Greenland or any individual under legal age.

The Anatomy of a Robust Backtest: Metrics That Matter for Trend Followers

Backtesting a trend-following system is an exercise in statistical skepticism. A strategy that looks like a straight line on a historical chart often fails in live markets due to overfitting, survivorship bias, or poor risk management. For trend followers, the core challenge is distinguishing between a genuinely robust system and one that merely memorized past volatility. This requires a deep dive into specific, non-obvious metrics that measure not just profitability, but resilience.

1. Profit Factor and the ‘Skinny Tail’ Problem

Profit Factor (PF) is the gross profit divided by the gross loss. A PF of 1.5 means you earn $1.50 for every $1.00 lost. While simple, its interpretation for trend followers is nuanced. Most trend systems have a low win rate (30-45%) but high reward-to-risk ratios. A PF above 2.0 is often considered excellent in this domain, but the distribution of the wins matters more.

The Skinny Tail Danger: A system can achieve a high PF through a single, massive outlier trade (e.g., a 2008 crash short). In a robust backtest, the PF should remain above 1.5 even when you remove the top 10% of winning trades. If the PF collapses below 1.0 after that removal, your system is fragile—it is a “call option” on black swans, not a reliable trend strategy. Recalculate PF on a rolling 12-month basis to see if the system has periods of sustained non-performance indicative of a strategy that no longer fits market regimes.
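The outlier check above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the trade list is hypothetical, and "top 10% of winning trades" is assumed to mean the largest 10% of winners by count.

```python
import math

def profit_factor(trades):
    """Gross profit divided by gross loss for a list of per-trade P&L values."""
    gross_profit = sum(t for t in trades if t > 0)
    gross_loss = -sum(t for t in trades if t < 0)
    return math.inf if gross_loss == 0 else gross_profit / gross_loss

def trimmed_profit_factor(trades, top_fraction=0.10):
    """Profit factor after dropping the largest `top_fraction` of winners (by count)."""
    winners = sorted((t for t in trades if t > 0), reverse=True)
    others = [t for t in trades if t <= 0]
    # Assumption: always drop at least one winner when any exist.
    n_drop = max(1, round(len(winners) * top_fraction)) if winners else 0
    return profit_factor(winners[n_drop:] + others)

# Hypothetical P&L per trade in dollars: one outlier dominates the winners.
trades = [5000, 400, 300, 250, 200, 200, 150, 150, 100, 100,
          -300, -300, -250, -250, -200, -200, -200, -150, -150, -100]

pf_full = profit_factor(trades)
pf_trimmed = trimmed_profit_factor(trades)
print(f"full PF = {pf_full:.2f}, PF without top 10% of winners = {pf_trimmed:.2f}")
```

In this made-up sample the headline PF looks excellent, but it falls below 1.0 once the single largest winner is removed, which is the fragile profile described above.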

2. Average Trade Net of Slippage & Commissions (AT)

AT is the arithmetic mean of all trade returns after costs. However, the crucial variable is the Slippage Model. For trend following, which often trades breakouts or moving average crossovers on daily or weekly timeframes, slippage is asymmetric. It is typically higher on the entry (chasing momentum) than on the exit, although stop orders can also slip badly in fast markets.

The Half-Bar Rule: A robust backtest must use a conservative slippage model. For a daily system, assume you cannot get filled better than the average of the open, high, low, and close of the entry bar. This “Half-Bar” slippage instantly kills many overfitted systems. If your AT drops by more than 30% under this model compared to a simple close-to-close model, the strategy is likely exploiting microstructure artifacts or unrealistic order filling. For ETFs or liquid futures, use a fixed slippage of $10-$20 per contract per side; for less liquid markets, scale this to 0.5% of the instrument’s average daily range.
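One possible reading of the Half-Bar Rule, sketched in Python. The interpretation is an assumption: a long entry is filled no lower than the OHLC average of the entry bar, and a short entry no higher. The trades, the field names and the helper functions are hypothetical.

```python
def half_bar_fill(intended_price, bar, direction):
    """Entry price no better than the mean of the bar's open, high, low and close.

    direction is +1 for a long entry and -1 for a short entry.
    """
    ohlc4 = (bar["open"] + bar["high"] + bar["low"] + bar["close"]) / 4
    # A long pays at least ohlc4; a short receives at most ohlc4.
    return max(intended_price, ohlc4) if direction > 0 else min(intended_price, ohlc4)

def average_trade(trades, conservative):
    """Mean per-unit P&L; each trade is (intended_entry, entry_bar, exit_price, direction)."""
    pnl = []
    for intended, bar, exit_price, direction in trades:
        entry = half_bar_fill(intended, bar, direction) if conservative else intended
        pnl.append(direction * (exit_price - entry))
    return sum(pnl) / len(pnl)

# Hypothetical trades; the naive model assumes a fill at the intended price.
trades = [
    (100.0, {"open": 100.0, "high": 104.0, "low": 99.0, "close": 103.0}, 112.0, +1),
    (50.0, {"open": 50.0, "high": 50.5, "low": 47.5, "close": 48.0}, 45.0, -1),
    (80.0, {"open": 80.0, "high": 82.0, "low": 79.0, "close": 81.0}, 78.5, +1),
]

at_naive = average_trade(trades, conservative=False)
at_half_bar = average_trade(trades, conservative=True)
drop = 1 - at_half_bar / at_naive
print(f"AT naive = {at_naive:.2f}, AT half-bar = {at_half_bar:.2f}, drop = {drop:.0%}")
```

Here the average trade shrinks by roughly a fifth under the conservative fill, which would stay inside the 30% tolerance mentioned above. Commissions and the fixed per-contract slippage are left out to keep the sketch short.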

3. Maximum Drawdown (MDD) and the ‘Time Under Water’

MDD is the peak-to-trough decline of the equity curve. For trend followers, MDDs of 25-40% are normal, even for the best systems. The metric that separates the durable from the lucky is Time Under Water (TUW): the number of calendar days (or trades) the system takes to recover from its maximum drawdown.

Why TUW is a System Killer: A system that recovers from a 30% drawdown in 3 months has very different risk characteristics than one taking 18 months. Long TUW periods indicate the strategy is not mean-reverting; it relies on a regime change that may never come. Calculate the ratio of MDD to TUW. A healthy trend system should recover its MDD within 2x the length of the drawdown period. Backtest across different market cycles (2008, 2015, 2022). If the TUW in choppy sideways markets (e.g., 2015-2017) exceeds that of volatile trends, the system has poor market regime adaptability.
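A short Python sketch of how MDD, the length of the decline and the recovery time can be read off an equity curve. The equity values are hypothetical, and time is counted in bars rather than calendar days for simplicity.

```python
def max_drawdown_and_recovery(equity):
    """Return (max drawdown fraction, bars from peak to trough,
    bars from trough back to the prior peak, or None if never recovered)."""
    peak_val, peak_idx = equity[0], 0
    mdd, mdd_peak_idx, mdd_trough_idx = 0.0, 0, 0
    for i, value in enumerate(equity):
        if value > peak_val:
            peak_val, peak_idx = value, i
        dd = 1 - value / peak_val
        if dd > mdd:
            mdd, mdd_peak_idx, mdd_trough_idx = dd, peak_idx, i
    target = equity[mdd_peak_idx]
    recovery = None
    for j in range(mdd_trough_idx, len(equity)):
        if equity[j] >= target:
            recovery = j - mdd_trough_idx
            break
    return mdd, mdd_trough_idx - mdd_peak_idx, recovery

# Hypothetical equity curve, one value per bar.
equity = [100, 110, 120, 108, 96, 84, 90, 99, 110, 121, 130]
mdd, decline_bars, recovery_bars = max_drawdown_and_recovery(equity)

print(f"MDD = {mdd:.0%}, decline = {decline_bars} bars, recovery = {recovery_bars} bars")
if recovery_bars is not None:
    print("recovers within 2x the decline:", recovery_bars <= 2 * decline_bars)
```

Total time under water is the sum of the decline and recovery legs. Running the same function on sub-periods (for example 2008, 2015 and 2022) gives the per-regime comparison suggested above.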

4. Sharpe Ratio and the ‘Distributional Dependency’

The Sharpe Ratio (SR) measures risk-adjusted return by dividing the excess return by the standard deviation of returns. An SR above 1.0 is desirable; above 2.0 is exceptional. However, standard deviation treats upside and downside swings alike and is most informative when returns are close to normally distributed. Trend-following returns are positively skewed and leptokurtic: a few extreme wins sit in the right tail while many small losses cluster near the center.

Modified Sharpe (Sortino & Calmar): Ignore the standard Sharpe for trend following. Use the Sortino Ratio, which only penalizes downside deviation (negative returns). A high Sortino (above 2.0) indicates the system’s volatility comes from large wins, not random losses. The Calmar Ratio (annualized return / maximum drawdown) is arguably more predictive. A Calmar Ratio above 1.0 is strong. For a 10-year backtest, a Calmar of 1.5 means the annualized return is 1.5x the worst drawdown, for example a 30% annualized return against a 20% maximum drawdown. A system with a high Sharpe but low Calmar is likely over-optimizing for small daily moves, not the structural trends you want.
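The three ratios can be compared side by side with a small Python sketch. Assumptions: returns are monthly, the risk-free rate and the Sortino target are zero, and the return series is hypothetical (many small losses, a few large wins).

```python
import math
import statistics

def sharpe(returns, periods_per_year=12, rf=0.0):
    """Annualized mean excess return over the sample standard deviation."""
    excess = [r - rf for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)

def sortino(returns, periods_per_year=12, target=0.0):
    """Like Sharpe, but the denominator only counts returns below the target."""
    downside_sq = [min(0.0, r - target) ** 2 for r in returns]
    downside_dev = math.sqrt(sum(downside_sq) / len(returns))
    return (statistics.mean(returns) - target) / downside_dev * math.sqrt(periods_per_year)

def calmar(returns, periods_per_year=12):
    """Compound annual growth rate divided by maximum drawdown."""
    equity, peak, mdd = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1 + r
        peak = max(peak, equity)
        mdd = max(mdd, 1 - equity / peak)
    cagr = equity ** (periods_per_year / len(returns)) - 1
    return math.inf if mdd == 0 else cagr / mdd

# Hypothetical monthly returns with a trend-following shape.
monthly = [-0.01, -0.02, 0.12, -0.01, -0.015, -0.01,
           0.09, -0.02, -0.01, 0.15, -0.015, -0.01]

sr, so, ca = sharpe(monthly), sortino(monthly), calmar(monthly)
print(f"Sharpe = {sr:.2f}, Sortino = {so:.2f}, Calmar = {ca:.2f}")
```

Because the volatility in this sample comes almost entirely from the winning months, Sortino comes out well above Sharpe, which is the pattern the text describes. Twelve months is far too short for a real assessment; the sample only illustrates the arithmetic.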

5. Percent Profitable (Win Rate) vs. Payoff Ratio

Trend systems win infrequently. A win rate above 50% is suspicious: it often signals overfitting to noise or a system that takes profits too early, missing the big trends. The Payoff Ratio (average win size / average loss size) should be at least 2:1, ideally 3:1 or higher.

The ‘Expected Value’ Check: The formula (Win% * Avg Win) - (Loss% * Avg Loss) must be positive. More importantly, plot the distribution of trade durations. A robust trend system should have a few trades held for 100-200 days (the big winners) and many held for 2-5 days (failed breakouts). If the holding period is uniform, the system is not capturing trends; it is scalping or overtrading. A high win rate (>55%) combined with a low payoff ratio (<1.5) is a red flag for a system that will fail in a strong directional move because it exits too early.
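The expected-value formula above is easy to verify numerically. In this Python sketch the trade results are hypothetical and expressed in multiples of initial risk; break-even trades are counted as losses so that Win% and Loss% sum to one.

```python
def trade_stats(trades):
    """Win rate, payoff ratio and expectancy for a list of per-trade results."""
    wins = [t for t in trades if t > 0]
    losses = [-t for t in trades if t <= 0]  # break-even trades count as losses
    win_rate = len(wins) / len(trades)
    avg_win = sum(wins) / len(wins)
    avg_loss = sum(losses) / len(losses)
    return {
        "win_rate": win_rate,
        "payoff_ratio": avg_win / avg_loss,
        # (Win% * Avg Win) - (Loss% * Avg Loss)
        "expectancy": win_rate * avg_win - (1 - win_rate) * avg_loss,
    }

# Hypothetical results in multiples of initial risk (R).
trades = [-1, -1, 4, -1, -0.8, 6, -1, -1, 3, -1]
stats = trade_stats(trades)

print(f"win rate = {stats['win_rate']:.0%}, "
      f"payoff = {stats['payoff_ratio']:.2f}, "
      f"expectancy = {stats['expectancy']:.2f} R per trade")
```

With a 30% win rate and a payoff ratio above 4, the expectancy is still clearly positive. The function assumes the sample contains at least one win and one loss.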

6. MAR Ratio (Managed Account Reports) and Consistency

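The MAR Ratio is closely related to the Calmar Ratio: both divide compound annual return by maximum drawdown. By convention MAR is measured over the full track record with returns reinvested, while Calmar is usually quoted over the trailing 36 months. For a system that compounds, the MAR is the ultimate measure of long-term stability. A MAR above 3.0 is institutional-grade, but rare.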

Segment Analysis: Do not calculate one MAR for the entire test. Divide the backtest into 3-5 distinct market regimes (rising interest rates, falling rates, high volatility, low volatility). A robust system should have a positive MAR in at least 70% of regimes. It can have negative periods (e.g., in 2017 for a long-only volatility system), but it must recover. If the MAR is positive only because of a single, massive regime (e.g., 2020 pandemic), the system is regime-specific, not robust.
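A rough Python sketch of the segment analysis. The regime labels, the equity values and the three-year length of each segment are all hypothetical; how regimes are actually delimited is left to the reader.

```python
import math

def mar_ratio(equity, years):
    """Compound annual growth rate divided by maximum drawdown."""
    cagr = (equity[-1] / equity[0]) ** (1 / years) - 1
    peak, mdd = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        mdd = max(mdd, 1 - value / peak)
    if mdd == 0:
        return math.inf if cagr > 0 else 0.0
    return cagr / mdd

# Hypothetical equity snapshots per regime, each spanning three years.
segments = {
    "rising rates": ([100, 95, 110, 125], 3.0),
    "falling rates": ([125, 120, 118, 122], 3.0),
    "high volatility": ([122, 110, 150, 170], 3.0),
    "low volatility": ([170, 165, 172, 180], 3.0),
}

mars = {name: mar_ratio(eq, yrs) for name, (eq, yrs) in segments.items()}
share_positive = sum(m > 0 for m in mars.values()) / len(mars)

for name, value in mars.items():
    print(f"{name:16s} MAR = {value:+.2f}")
print(f"positive in {share_positive:.0%} of regimes")
```

In this example one regime out of four has a negative MAR, so the system would pass the 70% threshold mentioned above.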

7. Correlation to the Underlying Asset and Benchmarks

Trend followers often perversely benefit from crashes. A system that shorts or goes long based on momentum should show a low or negative correlation to the buy-and-hold return of the traded asset during sell-offs, measured over short windows (e.g., 1-month rolling correlation). This is its edge: it can be contrarian to position-holding when it matters most.

Beta Decay Analysis: Calculate the rolling 3-month beta of the system’s equity curve to the S&P 500 and the Bloomberg Commodity Index (BCOM). A robust trend system should have a beta near zero to the S&P 500 (acting as a diversifier) but a positive beta (0.3-0.6) to BCOM, because trends often appear in commodities. If the beta to the S&P 500 is consistently above 0.5, you are not trend-following; you are long-biased momentum investing. This metric alone filters out 60% of amateur systems.
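Rolling beta is the rolling covariance of system and benchmark returns divided by the rolling variance of the benchmark. A minimal Python sketch follows, assuming daily returns and a 63-day window as a stand-in for three months. The return series are synthetic: the "system" is built as exactly half the benchmark so that the expected beta is known in advance.

```python
import random

def beta(system, benchmark):
    """OLS beta of system returns on benchmark returns."""
    n = len(system)
    mean_s, mean_b = sum(system) / n, sum(benchmark) / n
    cov = sum((s - mean_s) * (b - mean_b) for s, b in zip(system, benchmark))
    var = sum((b - mean_b) ** 2 for b in benchmark)
    return cov / var

def rolling_beta(system, benchmark, window=63):
    """Beta over each trailing window of `window` observations."""
    return [
        beta(system[i - window:i], benchmark[i - window:i])
        for i in range(window, len(system) + 1)
    ]

# Synthetic daily returns: the system moves exactly half as much as the benchmark.
rng = random.Random(0)
benchmark_returns = [rng.gauss(0.0, 0.01) for _ in range(250)]
system_returns = [0.5 * b for b in benchmark_returns]

betas = rolling_beta(system_returns, benchmark_returns)
print(f"{len(betas)} windows, beta range {min(betas):.2f} to {max(betas):.2f}")
```

With real data the same function would be run once against S&P 500 returns and once against BCOM returns, and the two rolling series compared with the thresholds given above.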

8. Monte Carlo Simulation: The 95th Percentile Path

Historical backtests give one path. Monte Carlo simulation scrambles the order of trade returns (or their sequence with correlation intact) to generate thousands of potential equity curves. The key metric is the 95th Percentile Drawdown.

The ‘Worst 5%’ Test: Look at the worst 5% of Monte Carlo paths. A robust system should survive these trade sequences with a drawdown of less than 50% and an equity curve that does not go permanently underwater. If any of those paths shows a drawdown exceeding 60% or goes to zero, the system is not risk-controlled, regardless of its historical performance. Also check the Median Monte Carlo Path. This check only makes sense when trades are resampled with replacement, because simply reshuffling the same trades leaves the compounded end value unchanged. If the median path shows a lower total return than the historical path by more than 20%, the historical result depends on a handful of trades that many resampled paths miss, a dangerous property for a trend follower.
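A compact Python sketch of the drawdown distribution. Two assumptions are worth flagging: trades are resampled with replacement (a bootstrap) rather than merely reshuffled, and each trade return is expressed as a fraction of current equity. The trade list is hypothetical.

```python
import random

def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve."""
    equity, peak, mdd = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1 + r
        peak = max(peak, equity)
        mdd = max(mdd, 1 - equity / peak)
    return mdd

def monte_carlo_drawdowns(trade_returns, n_paths=2000, seed=42):
    """Sorted max drawdowns of bootstrapped trade sequences."""
    rng = random.Random(seed)
    n = len(trade_returns)
    return sorted(
        max_drawdown(rng.choices(trade_returns, k=n))  # sample with replacement
        for _ in range(n_paths)
    )

def percentile(sorted_values, q):
    """Simple nearest-rank style percentile on an already sorted list."""
    idx = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[idx]

# Hypothetical per-trade returns: many small losses, a few large wins.
trade_returns = ([-0.01] * 6 + [-0.02] * 2 + [0.05, 0.12]) * 10

drawdowns = monte_carlo_drawdowns(trade_returns)
dd_median = percentile(drawdowns, 0.50)
dd_95 = percentile(drawdowns, 0.95)
print(f"median drawdown = {dd_median:.1%}, 95th percentile drawdown = {dd_95:.1%}")
```

The 95th percentile drawdown is the figure to compare against the thresholds above. Sampling trades independently discards any serial correlation between them; a block bootstrap would be one way to keep it.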

9. Parameter Stability Heatmaps

Overfitting is the cardinal sin. If your system uses moving averages (e.g., 50-day and 200-day), run a heatmap of all possible combinations (e.g., 10-100 and 150-300). Plot the Profit Factor or Sharpe Ratio.

The ‘Plateau’ Not the ‘Peak’: You are looking for a plateau of high performance, not a single peak. A robust system will have a contiguous zone (e.g., moving averages from 40-60 and 180-220) where performance is within 20% of the optimum. If the optimal parameter is a sharp spike (e.g., exactly 49 and 197), the system is overfitted to historical noise. Furthermore, test on out-of-sample data (e.g., the last 20% of the dataset). The parameter set that works best in-sample should also be within the top 10% of performance out-of-sample. If it falls to the bottom half, the system is statistically invalid.
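The plateau idea can be turned into a simple numeric check: what share of the best cell’s immediate grid neighbours score within 20% of it? The Python sketch below uses two hypothetical grids of Profit Factor values rather than a real backtest, and assumes the metric is positive.

```python
def plateau_share(grid, best, tolerance=0.20):
    """Share of the best cell's grid neighbours scoring within `tolerance` of it.

    `grid` maps (fast, slow) parameter pairs to a positive performance metric.
    """
    fasts = sorted({f for f, _ in grid})
    slows = sorted({s for _, s in grid})
    fi, si = fasts.index(best[0]), slows.index(best[1])
    neighbours = [
        (fasts[i], slows[j])
        for i in range(max(0, fi - 1), min(len(fasts), fi + 2))
        for j in range(max(0, si - 1), min(len(slows), si + 2))
        if (i, j) != (fi, si)
    ]
    close = sum(grid[n] >= (1 - tolerance) * grid[best] for n in neighbours)
    return close / len(neighbours)

fasts, slows = [40, 50, 60], [180, 200, 220]

# Hypothetical smooth surface: performance degrades gently away from (50, 200).
smooth = {(f, s): 1.8 - 0.005 * abs(f - 50) - 0.002 * abs(s - 200)
          for f in fasts for s in slows}

# Hypothetical spike: one cell far above an otherwise mediocre surface.
spiky = {(f, s): 0.9 for f in fasts for s in slows}
spiky[(50, 200)] = 2.4

share_smooth = plateau_share(smooth, max(smooth, key=smooth.get))
share_spiky = plateau_share(spiky, max(spiky, key=spiky.get))
print(f"smooth grid: {share_smooth:.0%} of neighbours within 20% of the best cell")
print(f"spiky grid:  {share_spiky:.0%} of neighbours within 20% of the best cell")
```

A share close to 100% suggests a plateau and a share near zero suggests an isolated peak. The one-cell neighbourhood is an arbitrary choice, and a finer grid would call for a wider one.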

10. K-Ratio and Equity Curve ‘Legs’

The K-Ratio measures the consistency of the equity curve’s growth. It is the slope of a regression line fitted to the cumulative return curve, divided by the standard error of that slope. A high K-Ratio (>3.0) indicates a steady, linear equity curve. A low K-Ratio (<1.0) suggests a record of flat periods punctuated by dramatic spikes.

The ‘Equity Curve Acceleration’ Factor: For trend followers, a parabolic equity curve (slow growth, then rapid acceleration) is typical during a trend regime. The issue is deceleration. Use the ratio of the last 12 months’ return to the prior 12 months’ return. If this ratio is below 0.5, the system is losing steam. A robust system should have a K-Ratio that is stable (within 0.5 points) across the entire backtest, not declining in the most recent period. This flags when a regime has changed and the system is no longer adapting.
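A minimal Python sketch of the K-Ratio as defined above: regression slope over its standard error. Fitting the line to log equity is an assumption, and published variants of the K-Ratio apply additional scaling by the number of observations that is omitted here, so the absolute values are not directly comparable with the thresholds quoted in other sources. Both equity curves are synthetic.

```python
import math

def k_ratio(equity):
    """Slope of an OLS line through log equity divided by that slope's standard error."""
    y = [math.log(v) for v in equity]
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (yi - y_mean) for i, yi in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    sse = sum((yi - (intercept + slope * i)) ** 2 for i, yi in enumerate(y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope / se_slope

# Synthetic curves: steady 1% monthly growth with small noise,
# versus thirty flat months followed by a single 40% jump.
steady = [100 * 1.01 ** i * (1 + 0.002 * (-1) ** i) for i in range(36)]
spiky = [100.0] * 30 + [140.0] * 6

k_steady, k_spiky = k_ratio(steady), k_ratio(spiky)
print(f"steady curve K = {k_steady:.1f}, spiky curve K = {k_spiky:.1f}")
```

The steady curve scores far higher than the spiky one. Computing the same statistic on rolling sub-periods is one way to watch for the recent decline described above.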

11. Market Regime Adaptability (Volatility-Targeting)

Trend following inherently performs poorly in low-volatility, range-bound markets (e.g., 2017). A robust system must have a mechanism for adjusting position size based on volatility.

The Volatility-Adjusted Return: Compare the raw equity curve to a volatility-scaled version (where position size is inversely proportional to trailing 20-day volatility). The VAMI (Value-Added Monthly Index) Ratio—the ratio of the volatility-scaled end value to the raw end value—is critical. If the VAMI ratio is above 1.2, the raw system is underperforming a simple volatility target. If it is below 0.8, your volatility scaling is destroying returns in low-vol regimes. A robust system should have a VAMI ratio between 0.95 and 1.15, indicating that the inherent strategy (not the scaling) is generating alpha.
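One way to compute the ratio described above, sketched in Python. Several choices here are assumptions rather than part of the original method: the target volatility is the full-sample volatility of the raw returns (which peeks at the whole history), leverage is capped at 3x, and the trailing window excludes the current day to avoid look-ahead. The return series are synthetic.

```python
import random
import statistics

def vami_ratio(returns, window=20, max_leverage=3.0):
    """End value of the volatility-scaled curve divided by the raw end value."""
    target = statistics.pstdev(returns)  # simplification: full-sample volatility
    raw = scaled = 1.0
    for t in range(window, len(returns)):
        trailing = statistics.pstdev(returns[t - window:t])  # excludes day t
        leverage = min(max_leverage, target / trailing) if trailing > 0 else 1.0
        raw *= 1 + returns[t]
        scaled *= 1 + leverage * returns[t]
    return scaled / raw

# Constant-volatility returns: scaling should change nothing.
constant_vol = [0.01, -0.008] * 60
ratio_constant = vami_ratio(constant_vol)

# Synthetic noisy daily returns with a small positive drift.
rng = random.Random(1)
noisy = [rng.gauss(0.0005, 0.01) for _ in range(500)]
ratio_noisy = vami_ratio(noisy)

print(f"constant-vol ratio = {ratio_constant:.3f}, noisy ratio = {ratio_noisy:.3f}")
print("inside the 0.95-1.15 band:", 0.95 <= ratio_noisy <= 1.15)
```

When volatility never changes, the scaled and raw curves coincide and the ratio is 1.0, which is a useful sanity check before applying the function to real returns.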

12. Detection of Survivorship & Look-Ahead Bias

No backtest is complete without a behavioral audit. The key metrics here are not numerical but structural.

The ‘Rolling Start Date’ Test: A robust trend system should show a positive return for at least 80% of arbitrary start dates (e.g., one start per month from January 2000 to January 2010). If the system is profitable only when started in specific years (e.g., 2003, 2009), it is exploiting a specific market cycle. Calculate the Bootstrap Return Period: the percentage of random start dates (10,000 samples) that produce a positive return after 3 years. This number should be above 70%. Below 50% means that a trader starting on a random date is more likely than not to still be underwater three years later. Also, ensure you are using point-in-time data for index constituents, dividends, and corporate actions. A constituent list taken from today silently drops delisted names, which is survivorship bias. Using adjusted close prices from today for historical tests introduces look-ahead bias by correcting for future splits and dividends that were not known at the time.
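The start-date test can be sketched in Python as follows. With monthly data every possible start date can be enumerated, so this sketch does that instead of drawing 10,000 random samples. The return series are hypothetical.

```python
def share_of_profitable_starts(returns, horizon):
    """Share of start indices whose compounded return over `horizon` periods is positive."""
    wins = total = 0
    for start in range(len(returns) - horizon + 1):
        growth = 1.0
        for r in returns[start:start + horizon]:
            growth *= 1 + r
        wins += growth > 1.0
        total += 1
    return wins / total

# Hypothetical monthly returns over ten years: nine small losses and
# three large wins per year, repeated.
trending = ([-0.01] * 9 + [0.05, 0.06, 0.04]) * 10
# Hypothetical system that bleeds 1% every month.
bleeding = [-0.01] * 120

share_trending = share_of_profitable_starts(trending, horizon=36)
share_bleeding = share_of_profitable_starts(bleeding, horizon=36)
print(f"profitable 3-year windows: trending = {share_trending:.0%}, "
      f"bleeding = {share_bleeding:.0%}")
```

The repeating pattern makes every three-year window profitable, which real data will not. The point of the example is only to show the mechanics of sliding the start date across the history.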