The Ultimate Backtesting Checklist Every Trader Needs

Backtesting is the critical process of evaluating a trading strategy against historical data to gauge its viability. Without a rigorous, systematic approach, backtesting becomes a breeding ground for false confidence and confirmation bias. This checklist is designed to force discipline, uncover hidden flaws, and ensure your strategy has a statistical edge before risking real capital.

1. Data Integrity & Source Verification

The foundation of any backtest is the data. Garbage in, garbage out is the most expensive lesson in trading.

  • Clean for Survivorship Bias: Ensure your dataset includes stocks, ETFs, or futures that have been delisted, gone bankrupt, or been acquired. Backtesting only current securities inflates performance because it ignores the losers.
  • Adjust for Corporate Actions: Verify that your data provider has accurately applied stock splits, reverse splits, dividends, and spin-offs. A raw price chart without adjustments will show false gaps and invalidate entry/exit prices.
  • Check for Bad Ticks & Gaps: Look for single-day price spikes (e.g., a stock jumping from $50 to $5,000 incorrectly), missing data sessions, or stale prices. Run a “sanity check” on the % daily change distribution—any outlier beyond 15x the standard deviation warrants investigation.
  • Use Clean, Natively Adjusted Data: Prefer using “total return” or “adjusted close” data from reputable sources (Norgate, Polygon, Tiingo) over Yahoo Finance for serious testing. For forex and futures, ensure the data is continuous and correctly handles contract rollover (e.g., back-adjusted vs. proportional adjustment).
  • Time Zone & Session Alignment: For intraday strategies, confirm that your timestamps align with the exchange’s operating hours. A 9:30 AM ET entry on a stock lands hours away from the actual open if your data uses UTC timestamps without conversion, and the offset shifts with daylight saving time.
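
The bad-tick sanity check above can be sketched in a few lines. This is a minimal illustration, not a full data-cleaning pipeline: the function name flag_outlier_returns and the threshold parameter k are assumptions made for the example, and the price series is synthetic.

```python
import statistics

def flag_outlier_returns(closes, k=15.0):
    """Return indices of bars whose daily % change sits more than
    k standard deviations away from the mean daily change."""
    returns = [closes[i] / closes[i - 1] - 1.0 for i in range(1, len(closes))]
    if len(returns) < 2:
        return []
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    if std == 0:
        return []
    # returns[j] is the change into bar j + 1, so shift the index by one.
    return [j + 1 for j, r in enumerate(returns) if abs(r - mean) > k * std]

# Synthetic series: a flat $50 stock with one erroneous $5,000 print.
closes = [50.0] * 20 + [5000.0] + [50.0] * 20
suspects = flag_outlier_returns(closes, k=3.0)
```

One caveat worth knowing: a single bad tick inflates the standard deviation it is measured against. With n returns, no single value can sit more than sqrt(n - 1) population standard deviations from the mean (about 6.2 for the 40 returns here), so a 15x threshold only becomes reachable with a few hundred observations. On short samples, lower k or compute the deviation on a window that excludes the bar being tested.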

2. Transaction Cost Realism

Paper trading or backtesting without costs is a fantasy. Real slippage and commissions destroy fragile equity curves.

  • Commission Structure: Apply the exact commission scale you will trade (e.g., $0 per trade vs. $0.65 per contract for futures). For high-frequency strategies, even a single cent per share matters.
  • Slippage Model: Do not assume fill at the open or close price. Model slippage as a fixed percentage (e.g., 0.1%) or, better yet, as a function of volatility (e.g., 20% of the average true range). For low-liquidity stocks, simulate a delay of 1–5 minutes in execution.
  • Bid-Ask Spread Impact: For options and penny stocks, add half the average bid-ask spread to every buy price and subtract it from every sell price. For a long trade that means a higher entry and a lower exit. A strategy that wins on the mid-price can lose catastrophically if it pays the full spread on every round trip.
  • Margin & Interest Costs: For leveraged strategies (futures, margin accounts, or short puts), include stock borrow fees and margin interest. If holding overnight, deduct the risk-free rate (e.g., SOFR or T-bill yield) on capital used.
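
To see how quickly these costs add up, here is a rough sketch of a cost-adjusted P&L for one long round trip. The function name net_long_pnl and all the numbers are illustrative assumptions: a fixed-percentage slippage model, a per-share commission charged on both sides, and half the spread paid on each fill.

```python
def net_long_pnl(entry_mid, exit_mid, shares, spread,
                 slippage_pct, commission_per_share):
    """P&L of a long round trip after spread, slippage and commissions."""
    # Buying: pay half the spread above mid, plus slippage.
    buy_price = entry_mid + spread / 2 + entry_mid * slippage_pct
    # Selling: receive half the spread below mid, minus slippage.
    sell_price = exit_mid - spread / 2 - exit_mid * slippage_pct
    commissions = 2 * commission_per_share * shares  # entry and exit
    return (sell_price - buy_price) * shares - commissions

# Hypothetical trade: 100 shares, mid moves from $100 to $101.
gross = (101.0 - 100.0) * 100
net = net_long_pnl(100.0, 101.0, 100, spread=0.10,
                   slippage_pct=0.001, commission_per_share=0.01)
```

In this example a $100 gross gain shrinks to roughly $67.90, so about a third of the edge goes to friction. A strategy with a smaller average win per trade would be wiped out entirely under the same assumptions.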

3. Position Sizing & Risk Management Rules

A strategy is only as good as its risk controls. Your backtest must enforce the same rules your live account will.

  • Fixed Fractional Sizing (The Gold Standard): Backtest with a constant % risk per trade (e.g., 1% of account equity). Do not use fixed share sizes: a fixed share count makes dollar exposure swing with price, so you are overexposed on high-priced stocks and underexposed on low-priced ones, and the size never scales with account equity.
  • Maximum Drawdown Shut-Off: Incorporate a hard rule that halts trading if the account drawdown exceeds a threshold (e.g., 20%). Then, backtest the recovery period. A strategy that survives a 95% drawdown in the past is not robust—it simply had a high return before the crash.
  • Correlation & Portfolio Heat: If testing a multi-asset system, check that your backtest engine accounts for correlated drawdowns. A strategy that shorts bonds and buys stocks may appear great until a flight-to-safety event (e.g., March 2020) kills both sides.
  • Market Hours & Session Filters: Explicitly code when you are allowed to trade. A system that buys during the opening auction (high volatility) will perform differently than one that trades only in the last hour.
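
As a rough sketch of the first two rules, fixed fractional sizing divides the dollar amount you are willing to lose by the per-share distance to the stop, and the shut-off compares the running drawdown to a limit. The function names and the stop-based definition of risk are assumptions for illustration; your own risk definition may differ.

```python
import math

def position_size(equity, risk_fraction, entry_price, stop_price):
    """Shares to trade so that hitting the stop loses risk_fraction of equity."""
    risk_per_share = abs(entry_price - stop_price)
    if risk_per_share == 0:
        raise ValueError("entry and stop must differ")
    return math.floor(equity * risk_fraction / risk_per_share)

def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

def trading_halted(equity_curve, limit=0.20):
    """True once the drawdown has exceeded the shut-off threshold."""
    return max_drawdown(equity_curve) > limit

# $100,000 account, 1% risk, entry at $50 with a stop at $48.
shares = position_size(100_000, 0.01, 50.0, 48.0)
halted = trading_halted([100, 120, 90, 110])
```

Here the account risks $1,000 across a $2 stop distance, giving 500 shares, and the drop from 120 to 90 is a 25% drawdown, which trips the 20% shut-off.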

4. Statistical Rigor & Performance Metrics

Do not just look at “Total Return.” You must dissect the statistical DNA of your equity curve.

  • Check for Stationarity: Run an Augmented Dickey-Fuller test on the strategy’s daily returns. If the test cannot reject non-stationarity, the strategy’s edge may be tied to a regime that will not repeat.
  • Analyze the Sharpe & Sortino Ratios: Target a Sharpe ratio above 1.5 for a single-asset strategy (above 1.0 for multi-asset). The Sortino ratio is more important—it only penalizes downside volatility, which is what matters for drawdown.
  • Calculate the Profit Factor: A profit factor (gross profit / gross loss) below 1.5 is generally weak. However, be suspicious of a profit factor above 4.0—it often indicates overfitting or a “lottery ticket” strategy with few large wins.
  • Examine Win Rate vs. Risk-Reward: A 90% win rate is meaningless if the 10% losses wipe out months of gains. Calculate the average win / average loss ratio and the resulting expectancy per trade. A system with a 40% win rate and a 3:1 reward-to-risk ratio earns 0.4 × 3 − 0.6 × 1 = 0.6R per trade on average, while a 70% win rate with a 1:1 ratio earns only 0.7 − 0.3 = 0.4R.
  • Monte Carlo Analysis: Run your backtest through 1,000+ Monte Carlo simulations that randomize the order of trades. Look at the 5th and 95th percentile outcomes. If the worst-case simulation shows a 90% drawdown, the strategy is unsafely dependent on trade order.
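
Several of these metrics take only a few lines to compute. The sketch below covers profit factor, expectancy in units of risk (R), and a trade-order Monte Carlo on drawdown. It is a simplified illustration: the function names are made up for the example, the trade list is synthetic, and the simulation only reshuffles the order of trades, so the final equity is identical in every run and only the path (and therefore the drawdown) changes.

```python
import random

def profit_factor(pnls):
    """Gross profit divided by gross loss."""
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    return float("inf") if gross_loss == 0 else gross_profit / gross_loss

def expectancy_r(win_rate, reward_to_risk):
    """Average result per trade in units of risk (R), losing 1R per loss."""
    return win_rate * reward_to_risk - (1.0 - win_rate)

def max_drawdown(trade_returns):
    """Worst peak-to-trough decline of the compounded equity curve."""
    equity = peak = 1.0
    worst = 0.0
    for r in trade_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst

def monte_carlo_drawdowns(trade_returns, n_sims=1000, seed=0):
    """Max drawdown for n_sims random orderings of the same trades, sorted."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        shuffled = list(trade_returns)
        rng.shuffle(shuffled)
        results.append(max_drawdown(shuffled))
    return sorted(results)

def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list, q in [0, 1]."""
    return sorted_values[round(q * (len(sorted_values) - 1))]

# Synthetic trades: 40% winners at +3%, 60% losers at -1%.
trades = [0.03, -0.01, 0.03, -0.01, -0.01, 0.03, -0.01, -0.01, -0.01, 0.03]
drawdowns = monte_carlo_drawdowns(trades)
p5, p95 = percentile(drawdowns, 0.05), percentile(drawdowns, 0.95)
```

Reading the result: p95 is the drawdown that 95% of the reshuffled histories stayed under. If that number is far worse than the drawdown in your single historical run, the backtest got lucky with its trade sequence.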

5. Overfitting & Walk-Forward Analysis

The single biggest mistake in backtesting is curve-fitting. Your strategy should be robust, not optimal.

  • In-Sample vs. Out-of-Sample Split: Divide your data into two parts. Optimize parameters only on the first 70% (in-sample). Then, test the optimized parameters on the unseen final 30% (out-of-sample) without any retouching. A drop in performance of more than 30% signals overfitting.
  • Walk-Forward Optimization (WFO): For dynamic strategies, use a rolling WFO. Optimize on a 2-year window (e.g., 2020–2021), test on the next 6 months (2022 H1), then roll forward. The consistency of performance across multiple out-of-sample periods is the true test of robustness.
  • Limit Parameter Count: A strategy with 3–5 parameters is generally safe. Every additional parameter (e.g., “buy if RSI < 30 and MACD is above zero and volume is 50% above average”) increases the risk of fitting to noise. Use the “Degree of Overfitting” statistic—aim for < 0.3.
  • Test with Random Walk Data: Generate synthetic random price data (e.g., a GARCH model) and run your strategy on it. If it generates positive returns on noise, your strategy is simply data-mining.
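
The rolling walk-forward scheme is mostly index bookkeeping. Here is a minimal sketch that yields the train and test windows as half-open index ranges; the name walk_forward_windows and the choice to step forward by one test window at a time are assumptions for illustration.

```python
def walk_forward_windows(n_bars, train_len, test_len):
    """Yield (train_start, train_end, test_start, test_end) index tuples.

    Ranges are half-open, so bars[train_start:train_end] is the in-sample
    slice and bars[test_start:test_end] is the out-of-sample slice.
    """
    start = 0
    while start + train_len + test_len <= n_bars:
        train_end = start + train_len
        yield (start, train_end, train_end, train_end + test_len)
        start += test_len  # roll forward by one out-of-sample window

# 10 bars, optimize on 4, test on the next 2, then roll.
windows = list(walk_forward_windows(10, train_len=4, test_len=2))
```

Each test window starts exactly where its training window ends, so no out-of-sample bar is ever used for optimization in the same step. With daily data, a 2-year train and 6-month test would be roughly train_len=504 and test_len=126.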

6. Realism in Execution Assumptions

Markets do not exist in a vacuum. Your backtest must account for the mechanics of actual trading.

  • Look-Ahead Bias: The most common killer. Ensure your backtest engine does not use data from the future. For example, calculating a moving average that includes today’s closing price and then entering at today’s open is invalid. All indicators must be calculated only from data available before the entry bar.
  • Limit Order Rejection: If your strategy uses limit orders, model a fill probability based on order book depth. A limit order that sits outside the spread for hours will rarely fill during fast markets.
  • Market Impact (for Large Accounts): If your position size exceeds 5% of the average daily volume (ADV), your entry and exit will move the market. Simulate this by filling orders incrementally over time or by applying a linear slippage penalty proportional to trade size.
  • Holiday & Half-Day Handling: Code in U.S. holidays, early closes (e.g., Black Friday), and irregular futures settlement dates. A strategy that trades on a half-day session may see drastically different volatility.
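
Look-ahead bias is easiest to avoid when the indicator window physically excludes the bar you trade on. A minimal sketch, assuming a simple "previous close above its moving average" rule (the rule and the name sma_signals are only illustrative):

```python
def sma_signals(closes, window):
    """Signal for bar i uses only closes up to bar i - 1.

    Returns a list with None until enough history exists, then True when
    the previous close is above the SMA of the previous `window` closes.
    """
    signals = [None] * len(closes)
    for i in range(window, len(closes)):
        history = closes[i - window:i]  # bars i-window .. i-1, excludes bar i
        sma = sum(history) / window
        signals[i] = history[-1] > sma
    return signals

signals = sma_signals([1, 2, 3, 4, 5], window=3)
# Changing the final close must not change the signal for the final bar.
tampered = sma_signals([1, 2, 3, 4, 500], window=3)
```

The second call is a cheap regression test you can run on any indicator: alter the current bar's data and confirm the signal for that bar does not move. If it does, the signal is peeking at information it would not have had at entry time.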

7. Psychological & Behavioral Edge

The best strategy will fail if you cannot execute it. Your backtest must expose the psychological pitfalls.

  • Run the Stress Test (Max Consecutive Losses): Record the maximum number of consecutive losing trades in your backtest. If it is 10, and your average loss is 1% per trade, can you really take 10 losing trades in a row without abandoning the system? If not, the system is too risky for your personality.
  • Drawdown Duration: Measure the time (in calendar days) from the equity curve peak to the next new high. A system that takes 400 days to recover from a drawdown is likely to be abandoned around the 200-day mark in live trading, even if it eventually recovers.
  • Trade Frequency & “Boredom Risk”: A system that generates 5 trades per year appears statistically sound but will bore a human trader into making discretionary “improvements” that destroy the edge. Backtest the impact of missing the top 3 trades—if your edge relies on them, it is fragile.
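
All three of these checks can be pulled straight from the trade list and equity curve. The sketch below is a simplified illustration with made-up function names; it counts recovery time in bars rather than calendar days, so multiply by your bar length or swap in real dates.

```python
def max_consecutive_losses(pnls):
    """Longest run of losing trades."""
    longest = current = 0
    for p in pnls:
        current = current + 1 if p < 0 else 0
        longest = max(longest, current)
    return longest

def longest_recovery(equity):
    """Most bars from an equity peak to the next new high.

    A drawdown still open at the end of the data counts up to the last bar.
    """
    peak, peak_idx, longest = equity[0], 0, 0
    for i, value in enumerate(equity):
        if value > peak:
            if i - peak_idx > 1:  # at least one bar spent below the peak
                longest = max(longest, i - peak_idx)
            peak, peak_idx = value, i
    return max(longest, len(equity) - 1 - peak_idx)

def total_without_top(pnls, n=3):
    """Total P&L after removing the n most profitable trades."""
    return sum(sorted(pnls)[:max(len(pnls) - n, 0)])

losing_streak = max_consecutive_losses([1, -1, -1, 2, -1, -1, -1, 1])
recovery_bars = longest_recovery([100, 110, 105, 102, 111, 108, 112])
remaining = total_without_top([5, -1, 10, -2, 3, 8], n=3)
```

In the last example the six trades sum to 23, but removing the three best leaves exactly zero. That is the signature of an edge that depends on a handful of outliers.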

8. Benchmarking & Market Regime Sensitivity

A backtest is meaningless without a reference point.

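  • Compare Against Buy-and-Hold: Does your strategy simply capture beta? Calculate the correlation of your strategy’s returns to the S&P 500, along with the beta (the slope of strategy returns against index returns). A correlation above 0.7 means most of your return is market exposure, and a beta well above 1 on top of that means you are behaving like a leveraged ETF. A good strategy should have low or negative correlation to the broader market.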
  • Segment by Regime: Break your backtest into distinct market phases: strong bull (2017, 2023), bear (2022), low volatility (2017–2018), high volatility (2020, 2008). A strategy that works only during low-volatility bull markets is a bull market crutch.
  • Test on Multiple Assets/Timeframes: If your strategy is designed for Apple (AAPL), test it on Microsoft (MSFT), Google (GOOGL), and a random set of 50 stocks from different sectors. If it only works on one stock, it is not a strategy—it is a coincidence.
  • For FX & Futures: Curve Compatibility: Ensure the strategy performs similarly on different contract months (e.g., front-month vs. second-month crude oil). If it only works on the front month, liquidity and rollover costs will kill it.
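
For the buy-and-hold comparison, correlation tells you how closely the strategy tracks the index and beta tells you how much index exposure it carries. A minimal sketch with synthetic return series (the function name and the numbers are illustrative assumptions):

```python
import math

def correlation_and_beta(strategy, benchmark):
    """Pearson correlation and beta of strategy returns vs. benchmark returns."""
    n = len(strategy)
    mean_s = sum(strategy) / n
    mean_b = sum(benchmark) / n
    cov = sum((s - mean_s) * (b - mean_b)
              for s, b in zip(strategy, benchmark)) / n
    var_s = sum((s - mean_s) ** 2 for s in strategy) / n
    var_b = sum((b - mean_b) ** 2 for b in benchmark) / n
    corr = cov / math.sqrt(var_s * var_b)
    beta = cov / var_b
    return corr, beta

# A "strategy" that is just the index at 2x leverage.
benchmark = [0.01, -0.02, 0.015, 0.005]
strategy = [2 * r for r in benchmark]
corr, beta = correlation_and_beta(strategy, benchmark)
```

This toy strategy comes out with a correlation of 1 and a beta of 2: it adds nothing beyond what a 2x index product already offers. Running the same function separately on each market regime also shows whether the relationship holds in bear and high-volatility periods.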

9. Code Validation & Re-creation

Humans make errors. Your backtest code is not immune.

  • Manual Verification of 20 Trades: Randomly select 20 trades from your backtest. Manually calculate the entry, exit, and P&L using a spreadsheet and raw price data. If more than 1 trade is off by more than 1%, your code has a bug.
  • Portfolio-Level Consistency: For multi-asset backtests, compare the sum of per-asset P&L (or the capital-weighted sum of per-asset returns) to the portfolio-level report. A mismatch indicates a rebalancing or compounding error.
  • Reproduce on a Second Platform: If possible, re-code a simplified version of the strategy on a different platform (e.g., TradingView vs. Python). If the equity curves diverge significantly, investigate the data handling or indicator calculations.
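
The spot check can be semi-automated: sample trades at random, recompute each P&L from raw entry and exit prices, and flag anything that differs from the backtest's reported number by more than 1%. The trade record layout below (id, side, entry, exit, shares, reported_pnl) is a hypothetical format, not the output of any particular platform, and costs are ignored for simplicity.

```python
import random

def recompute_pnl(trade):
    """P&L from raw prices, ignoring costs."""
    direction = 1 if trade["side"] == "long" else -1
    return direction * (trade["exit"] - trade["entry"]) * trade["shares"]

def mismatched_trades(trades, sample_size=20, tolerance=0.01, seed=0):
    """IDs of sampled trades whose reported P&L is off by more than tolerance."""
    rng = random.Random(seed)
    sample = rng.sample(trades, min(sample_size, len(trades)))
    bad = []
    for trade in sample:
        expected = recompute_pnl(trade)
        if abs(trade["reported_pnl"] - expected) > tolerance * abs(expected):
            bad.append(trade["id"])
    return sorted(bad)

trades = [
    {"id": 1, "side": "long", "entry": 100.0, "exit": 105.0,
     "shares": 10, "reported_pnl": 50.0},
    {"id": 2, "side": "short", "entry": 50.0, "exit": 48.0,
     "shares": 20, "reported_pnl": 40.0},
    {"id": 3, "side": "long", "entry": 20.0, "exit": 21.0,
     "shares": 100, "reported_pnl": 90.0},  # should be 100.0
]
suspect_ids = mismatched_trades(trades)
```

The prices fed into a check like this should come from a source independent of the backtest engine; otherwise a data bug will pass the test because both sides share it.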

10. The “Will I Actually Trade This?” Audit

The final filter is brutally honest self-reflection.

  • Does the System Make Sense? Can you explain why it works in economic terms? A strategy that buys when the 50-day moving average crosses above the 200-day (the “Golden Cross”) has a logical basis in momentum. A strategy that buys when the moon is in Capricorn does not.
  • Is it Easy to Automate? If you need to manually check 15 different indicators and interpret a proprietary indicator, your backtest results will not match your live trading due to discretionary variation. Aim for 100% mechanical rules.
  • Liquidity & Capacity Check: Can you actually execute 100% of the backtested trades at the modeled prices? For futures, ensure contract liquidity (e.g., avoid backtesting E-mini S&P 500 trades on the most illiquid contract month). For stocks, check the average volume on the days the strategy traded.
  • Survivor’s Remorse: If your strategy has a catastrophic drawdown event that “never happened again” (e.g., a 2008-style crash), explicitly test it on 2008 data. If it fails catastrophically, you must either accept that risk or redesign the strategy to handle it.
