How to Choose the Right Benchmark
In backtesting, benchmark selection is not a reporting detail. It determines what the strategy is being compared against and, therefore, what question the results are actually answering.
At Backtested Strategies, a benchmark is not chosen because it makes a strategy look strong or weak. It is chosen because it is the fairest control portfolio for the strategy being tested. The goal is not to produce a flattering comparison. The goal is to make the strategy’s mechanism and tradeoff clear.
This article explains how BTS chooses benchmarks, what a benchmark is supposed to do, what it should not do, and when a strategy-specific benchmark is better than a generic market index.
Table of contents
- Why benchmark choice matters
- What a benchmark is at BTS
- How this article relates to Methodology
- What the benchmark is supposed to do
- What the benchmark should not do
- How BTS chooses the right benchmark
- Primary, diagnostic, and context benchmarks
- When a strategy-specific benchmark is the right answer
- When a standard benchmark is still appropriate
- A GTAA-style example
- How to read the tradeoff honestly
- The BTS benchmark rule
Why benchmark choice matters
Benchmark selection is not a cosmetic reporting choice. It defines the economic baseline a strategy is being measured against. Change the benchmark, and you may change the meaning of the entire comparison.
A weak benchmark can make an ordinary strategy look impressive. A misaligned benchmark can also make a good strategy look worse than it is by answering the wrong question. A diversified tactical allocation system, for example, may look brilliant against the wrong control and mediocre against the right one, or the reverse. In both cases, the problem is the same: the benchmark is not isolating the strategy’s actual contribution.
That is why BTS treats benchmark choice as part of the strategy’s published specification. The benchmark is not a decorative opponent. It is the control portfolio that tells readers what the active overlay changed in economic terms: what risk it reduced, what return it gave up, and in which market regimes those tradeoffs appeared.
What a benchmark is at BTS
At BTS, a benchmark is the fairest investable control portfolio for a strategy.
The core question is simple: if we keep the same opportunity set and the same broad economic assumptions, but remove the strategy’s active overlay, what remains?
That overlay may be timing, ranking, rotation, defensive routing, volatility targeting, leverage, security selection, a weighting scheme, or another rule that changes the passive baseline. The benchmark should remove that overlay without quietly changing the underlying investment problem.
In other words, the benchmark should preserve the passive objective while stripping away the active decision. That is what makes the comparison useful. It lets readers see what the strategy itself added, rather than comparing two different economic problems and pretending the gap is skill.
How this article relates to Methodology
This article governs benchmark selection. The Methodology page governs benchmark implementation unless a strategy explicitly states an override.
That distinction matters. This article answers questions such as: What is the right control portfolio? Should the benchmark be generic or strategy-specific? Is the SPY ETF the primary benchmark, a context line, or the wrong comparator altogether? The Methodology page answers different questions: when the benchmark starts, how it is priced, how dividends are handled, whether rebalances incur costs, and which execution and accounting conventions BTS applies by default. By default, benchmarks are reported as total return, with dividends reinvested, and any deviation is labeled explicitly.
In short, this article tells you which benchmark BTS believes is fair. The Methodology page tells you how that benchmark is modeled.
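To illustrate why the accounting default matters, the sketch below compares price return with total return for a short series. The prices and dividends are invented for illustration only; they are not real benchmark data.

```python
# Illustrative sketch: total return vs. price return for a benchmark series.
# The price and dividend figures below are hypothetical, not real data.

prices = [100.0, 102.0, 99.0, 104.0]   # close prices at period ends
dividends = [0.0, 0.5, 0.0, 0.5]       # cash dividend paid during each period

def price_return(prices):
    """Cumulative return from price change only."""
    return prices[-1] / prices[0] - 1

def total_return(prices, dividends):
    """Cumulative return with dividends reinvested at each period end."""
    growth = 1.0
    for i in range(1, len(prices)):
        growth *= (prices[i] + dividends[i]) / prices[i - 1]
    return growth - 1

print(f"price return: {price_return(prices):.4f}")  # 4.00% from price alone
print(f"total return: {total_return(prices, dividends):.4f}")  # ~5.01% with dividends
```

The gap between the two lines compounds over long windows, which is exactly why comparing a total-return strategy to a price-return benchmark blurs the comparison object.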
What the benchmark is supposed to do
A good benchmark has a demanding job. It shouldn’t make the strategy look better or worse than it is. It should simply make the mechanism unavoidable.
- Isolate the value of the overlay. The benchmark should hold constant as much as possible so the remaining difference mainly reflects timing value, selection value, weighting value, defensive-routing value, or another clearly identifiable source.
- Show the tradeoff honestly. The point is not just to declare a winner. The point is to show what the strategy bought, what it cost, and when those effects showed up.
- Provide a fair head-to-head comparison. A skeptical reader should be able to look at the benchmark and accept that it is a coherent and investable control.
- Expose opportunity cost. Defensive strategies should still show the upside they gave up. Participation-seeking strategies should still show the path risk they accepted.
- Support the scorecard and the equity-curve narrative. The benchmark should help explain drawdown tradeoffs, volatility tradeoffs, relative windows, whipsaw regimes, defensive wins, and offensive penalties.
When the benchmark is right, the comparison becomes harder to mischaracterize. The tradeoff is visible even when it is inconvenient.
What the benchmark should not do
A bad benchmark can fail in several ways, and most of them make a strategy look cleaner than it really is.
- It should not be a straw man. Do not choose a benchmark because it is easier to beat than the real passive alternative.
- It should not answer a different question. A benchmark can be economically interesting and still be wrong if it is not tied to the strategy’s actual decision problem.
- It should not smuggle in hidden overlays. Tactical cash rules, altered weighting schemes, or other embedded views can turn the benchmark into a second strategy instead of a control.
- It should not blur accounting categories. Comparing total return to price return, or mixing benchmark and strategy accounting without stating the comparison object clearly, creates confusion rather than insight.
- It should not hide opportunity cost. If the strategy’s benefit is defense, the benchmark must still show the foregone upside. If the strategy’s benefit is participation, the benchmark must still show the extra path risk accepted to earn it.
The simplest test is this: would the comparison still feel fair if the strategy underperformed? If the answer is no, the benchmark is probably wrong.
How BTS chooses the right benchmark
We use a simple framework. The benchmark should preserve the same opportunity set and broad realism standard while removing the strategy’s active overlay.
- Identify the active overlay. Start by naming exactly what the strategy is doing that a passive implementation would not do. That may be a trend filter, rotation rule, risk target, selection model, weighting rule, or defensive routing rule.
- Identify the passive economic baseline. Remove only that overlay and ask what portfolio still represents the same investment problem. In many cases, that portfolio is the best primary benchmark.
- Match the mechanics that materially matter. The benchmark should use the same realism standard: evaluation window, execution convention, cost realism, dividend treatment category, financing or idle-cash treatment where relevant, the benchmark start rule for multi-instrument baskets, and rebalancing when weight drift would otherwise change the question. That standard should still hold even when the overlay changes turnover or exposure. The goal is not to reproduce the strategy’s realized trade path. It is to pose the same investment problem under the same realism standard while removing the active overlay.
- Check whether the difference is now interpretable. If the remaining performance gap can be described mainly as timing value, selection value, weighting value, risk-management value, or defensive-routing value, the benchmark is probably close to right. If the difference still reflects a different opportunity set or a hidden second thesis, it is not.
- Add secondary diagnostics only when they answer a separate question. Some strategies benefit from an additional unrebalanced companion, a mechanics-matched companion, or a broad market reference for context. For strategies with leverage or volatility targeting, an exposure-matched diagnostic can be useful, but it should not replace the primary control portfolio.
This framework is intentionally conservative. It doesn’t ask which benchmark is easiest to explain or market. It asks which benchmark makes the strategy’s mechanism clearest and hardest to spin.
The benchmark also has to meet the same feasibility discipline as the strategy. If a benchmark depends on index membership, that membership should be point-in-time. If it references an index or asset class that cannot be traded directly, BTS should use a clearly stated tradable proxy. If the benchmark cannot be modeled under the same realism standard as the strategy, it may still be useful as context, but it is not a clean primary control.
Primary, diagnostic, and context benchmarks
Not every benchmark on a chart or in a table is doing the same job. BTS separates them into three categories.
- Primary benchmark. This is the single headline control portfolio for the strategy. It is the benchmark that should answer the main economic question and carry the main attribution burden.
- Diagnostic benchmark. This is a secondary comparator used to answer a narrower implementation question. It may help isolate the effect of rebalancing, dividend treatment, cash deployment, or another mechanics choice, but it is not the main control.
- Context benchmark. This is a broad market or category reference used for reader orientation. It may be useful for context, but it should not be mistaken for the benchmark that isolates the strategy’s overlay.
Only the primary benchmark should be the headline comparison. Diagnostic and context benchmarks may appear on charts or in supporting discussion, but they should not replace the primary benchmark in the main scorecard unless they are explicitly relabeled. This distinction matters because SPY can be useful as context even when it is the wrong primary benchmark. Readers often want to know how a strategy behaved relative to a familiar market line. That’s reasonable. It’s just a different question from the one the control portfolio is supposed to answer.
BTS also distinguishes between a headline benchmark and a mechanics-matched companion. The headline benchmark may use the standard benchmark convention described in Methodology, including total-return reporting. A mechanics-matched companion may be shown alongside it to answer a narrower implementation question, such as whole-share trading or dividend cash retention. It is diagnostic only and must not replace the fair primary control portfolio. Both can be useful, but they are not the same comparison object and should not be blurred together.
When a strategy-specific benchmark is the right answer
A strategy-specific benchmark is appropriate when the strategy’s overlay acts on a portfolio or sleeve structure that would be misrepresented by a generic market index.
This often applies to multi-asset allocation systems, sleeve-based tactical portfolios, defensive-routing systems, and fixed-basket ETF strategies. In those cases, the benchmark will often be the same underlying basket held continuously, with no timing overlay, under clearly stated implementation rules.
The reason is simple: if the strategy is acting on a defined basket, comparing it to a different basket often changes the economic problem instead of isolating the overlay. The result may still be interesting, but it is not the cleanest control.
There is an important nuance for stock-selection systems. If the strategy chooses securities from a larger eligible universe, the passive control may be the eligible universe itself, or a neutral-weight version of that universe, rather than the names selected after the fact. The passive baseline should reflect the opportunity set available before the active rule is applied.
When a standard benchmark is still appropriate
A standard benchmark can still be the correct primary control when the strategy is explicitly trying to improve on that standard baseline.
A single-asset SPY timing model, for example, may reasonably use SPY total return as its primary benchmark. A long-only stock-selection strategy operating within the S&P 500 may reasonably use an S&P 500 baseline if the strategy’s overlay is security selection within that universe. In cases like these, the familiar benchmark is not being used because it is familiar. It is being used because it is the correct passive baseline.
But standard does not mean easier. The benchmark still needs mechanics honesty. It should be investable, matched to the same realism standard, and clear about what is being compared. Familiarity is never a substitute for fit.
A GTAA-style example
Global tactical asset allocation, or GTAA, is a useful example because it shows why the most familiar benchmark is not always the right one.
If the benchmark for a diversified five-ETF tactical allocation system were SPY, the comparison would mostly answer a different question: how did the strategy compare with U.S. equities? That may be useful as context, but it does not isolate what the strategy’s overlay added over the same diversified opportunity set.
For a GTAA-style system, the cleaner primary benchmark is the same five-ETF risk basket held continuously with no timing overlay. In practice, that means a fixed 20% sleeve allocation to each ETF, rebalanced monthly, so the benchmark preserves the strategy’s underlying basket while removing the tactical rule.
The monthly rebalance matters. Without it, weight drift becomes part of the comparison, and the benchmark stops being a clean control for the intended equal-weight risk basket. An unrebalanced companion can still be useful as a diagnostic, but it answers a different question and should not replace the headline control portfolio.
The defensive asset matters too. If the strategy routes out-of-signal sleeves into a defensive instrument, that defensive route is part of the overlay. Putting that defensive instrument directly into the primary benchmark would blur the very mechanism the comparison is supposed to reveal.
That is why the right GTAA-style headline benchmark is not a generic equity line and not a benchmark that quietly includes the strategy’s defense. It is a mechanics-matched diversified control portfolio: equal-weight five-ETF benchmark, monthly rebalanced, no timing overlay.
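The mechanics of that headline control, and of its unrebalanced diagnostic companion, can be sketched in a few lines. The sleeve names and monthly returns below are hypothetical placeholders, not BTS data; the sketch only illustrates the monthly-rebalance arithmetic described above.

```python
# Illustrative sketch of the GTAA-style control portfolio: five sleeves,
# 20% each, rebalanced back to equal weight every month, no timing overlay.
# Sleeve names and monthly returns are hypothetical placeholders.

SLEEVES = ["US_EQ", "INTL_EQ", "BONDS", "REITS", "COMMODITIES"]

# hypothetical monthly returns per sleeve (rows = months, columns = sleeves)
monthly_returns = [
    [0.02, 0.01, 0.003, 0.015, -0.01],
    [-0.03, -0.02, 0.005, -0.04, 0.02],
    [0.04, 0.03, 0.002, 0.05, 0.01],
]

def rebalanced_benchmark(monthly_returns, n=5):
    """Equal-weight basket, reset to 1/n weights at each month end."""
    wealth = 1.0
    for month in monthly_returns:
        # with weights reset to equal each month, the portfolio return
        # is the simple average of the sleeve returns that month
        wealth *= 1 + sum(month) / n
    return wealth

def unrebalanced_benchmark(monthly_returns, n=5):
    """Diagnostic companion: equal start weights, then let them drift."""
    sleeve_wealth = [1.0 / n] * n
    for month in monthly_returns:
        sleeve_wealth = [w * (1 + r) for w, r in zip(sleeve_wealth, month)]
    return sum(sleeve_wealth)

print("rebalanced:  ", rebalanced_benchmark(monthly_returns))
print("unrebalanced:", unrebalanced_benchmark(monthly_returns))
```

Even over three hypothetical months the two lines diverge slightly; over long windows, that drift is what turns the unrebalanced companion into an answer to a different question.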
How to read the tradeoff honestly
The benchmark is not there to declare a winner. It is there to reveal the price and benefit of the strategy’s objective.
For defensive strategies, the benchmark should make it possible to see smaller crash drawdowns against rebound lag, lower volatility against lower terminal wealth, smoother path against less participation, and lower left-tail risk against higher tracking error. For more offensive strategies, it should make the opposite tradeoff just as clear.
This is why the benchmark must support not only the scorecard, but also the equity-curve narrative. It should help identify the biggest defensive win, the biggest offensive penalty, the clearest whipsaw regime, the worst relative window, and the opportunity cost of the strategy’s objective.
When the benchmark is fair, the interpretation becomes more honest. The reader can see not just whether the strategy outperformed, but how it earned that outcome and what it gave up to get there.
The BTS benchmark rule
The BTS benchmark rule is simple:
A benchmark is the fairest investable control portfolio for a strategy. It should preserve the same opportunity set and broad economic assumptions while removing the strategy’s active overlay. That way, the comparison reveals what the strategy changed in economic terms: what risk it removed, what return it gave up, and in which regimes those tradeoffs appeared.
Every strategy’s primary benchmark should also be specified precisely enough to be repeatable. That specification should state the benchmark instruments or basket, target weights, rebalance cadence, benchmark accounting convention, execution and cost convention where relevant, and the rule for when the benchmark starts. If any of those choices change, the economic interpretation of the strategy changes with them.
Because of that, benchmark choice is part of the strategy’s published specification. Changing the primary benchmark is not an editorial tweak. It changes the meaning of the published comparison. The primary benchmark should be selected as part of the strategy specification, not swapped after reviewing comparative results.
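One way to make that specification repeatable is a small structured record that fails loudly when it is incomplete or inconsistent. The sketch below is illustrative only; the field names, tickers, and values are hypothetical, not a published BTS schema.

```python
# Illustrative sketch of a repeatable primary-benchmark specification.
# Field names, tickers, and values are hypothetical, not a BTS schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkSpec:
    instruments: tuple[str, ...]  # benchmark basket
    weights: tuple[float, ...]    # target weights, must sum to 1
    rebalance: str                # rebalance cadence, e.g. "monthly"
    accounting: str               # e.g. total return, dividends reinvested
    costs: str                    # execution and cost convention
    start_rule: str               # rule for when the benchmark series begins

    def __post_init__(self):
        # reject specifications that cannot be interpreted unambiguously
        if len(self.instruments) != len(self.weights):
            raise ValueError("one weight per instrument")
        if abs(sum(self.weights) - 1.0) > 1e-9:
            raise ValueError("weights must sum to 1")

# Example: a GTAA-style equal-weight control (tickers hypothetical)
spec = BenchmarkSpec(
    instruments=("ETF_A", "ETF_B", "ETF_C", "ETF_D", "ETF_E"),
    weights=(0.2, 0.2, 0.2, 0.2, 0.2),
    rebalance="monthly",
    accounting="total_return_dividends_reinvested",
    costs="same_cost_model_as_strategy",
    start_rule="first_date_all_instruments_trade",
)
print(spec.rebalance)
```

Freezing the record mirrors the rule above: the primary benchmark is fixed as part of the specification, not adjusted after reviewing comparative results.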
As a practical check, ask three questions:
- Does this benchmark preserve the same underlying opportunity set?
- Does it remove the active overlay without adding a hidden second thesis?
- Would the comparison still feel fair if the strategy underperformed?
If the answer to those questions is yes, the benchmark is probably close to right. If not, the benchmark is probably answering the wrong question.
