When Noise Overwhelms Signal – Sorting out Sorts Review

In his 1998 paper, Jonathan Berk illustrated that by sorting stocks based on a variable (e.g. B/E ratio) correlated to a known variable (e.g. beta), the power of the known variable to predict expected return within each group diminishes when tested with cross-sectional regression. This is very likely why Fama and French found the explanatory power of beta disappeared (1992) and Daniel and Titman discovered that stock characteristics matter more than covariances (1997). For researchers and data analysts, this is a perfect example of how seemingly harmless manipulation of data can cause meaningful loss of information. If not careful, such loss can lead to confusing or even completely wrong conclusions.

The intuition behind this issue is rather simple: when data gets divided into smaller groups and tested separately, the error of beta estimation becomes “louder” as the sample size gets smaller. The error-minimizing advantage from using a large sample diminishes as the sample is divided into smaller groups, as the error of estimation overwhelms the useful information in each group.

Getting the intuition is one thing, identifying where exactly the issue occurs and tracing it through the proof is a different story.


Assume CAPM holds: E[R_{i}] = r + \beta_{i}(E[R_{m}]-r), in which the systematic risk of stock i is \beta_{i}\sim\mathcal{N}(1,\sigma^{2}). Realized return \hat{R_{i}} is the same as expected return E[R_{i}]

Scenario 1: CAPM is tested cross-sectionally with a full sample with infinite number of stocks and there’s no estimation error between theoretical beta and estimated beta. i.e., \hat{\beta_{i}} \equiv \beta_{i}. The coefficient of this regression is:

\frac{cov(\hat{R_{i}}-r, \hat{\beta_{i}})}{var(\hat{\beta_{i}})} = \frac{cov(E[R_{i}]-r, {\beta_{i}})}{var({\beta_{i}})} = \frac{\sigma^2}{\sigma^2 + 0}(E[R_{m}]-r) = 1* (E[R_{m}]-r)

Interpretation: stock returns are perfectly linear (coef = 1) to their exposure to the market risk premium; beta is the perfect predictor of stock returns.

Scenario 2: there’s error in estimated beta, i.e., \hat{\beta_{i}} = \beta_{i} + \epsilon_{i}, \epsilon_{i} \sim\mathcal{N}(1,\theta^2). This is where the trouble originates. The existence of \epsilon_{i} gave birth to the original noise \theta, which will get passed down through the rest of the test. As we can see, the coefficient of the same test is already contaminated:

\frac{cov(\hat{R_{i}}-r, \hat{\beta_{i}})}{var(\hat{\beta_{i}})} = \frac{cov(E[R_{i}]-r, {\beta_{i}}+\epsilon_{i})}{var({\beta_{i}}+\epsilon_{i})} = \frac{\sigma^2}{\sigma^2 + \theta^2}(E[R_{m}]-r)

* Assuming estimated and observed returns are the same for convenience.

Interpretation: \frac{\sigma^2}{\sigma^2 + \theta^2} < 1, stock returns are less sensitive to how much systematic risk they are bearing; beta is less of a perfect predictor of stock returns.

Scenario 3: now all stocks are sorted into N fractiles by a variable linearly correlated to beta. Within the jth fractile, the conditional variance of \beta is now redefined:

\sigma_{j}^{2} \equiv var(\beta_{j}|i\in j) = \sigma^{2}g(j),

where g(j) is a concave-up function that “shrinks” \sigma^{2} when all stocks are in the jth fractile (a partial integral of the full sample).

Run the regression test again the coefficient is now:

\frac{\sigma^{2}g(j)}{\sigma^{2}g(j) + \theta^2}(E[R_{m}]-r) = \frac{\sigma^{2}}{\sigma^{2} + \theta^{2}/g(j)}(E[R_{m}]-r)

Interpretation: g(j), a term born from the sorting process, is now serving as a “noise amplifier”. It enhances \theta^{2} when it gets smaller and dampens the coefficient as a result. As a concave-up function, it gets smaller when N is larger and/or j moves closer to the middle among groups. The graph below shows how the coefficient changes with g(j) when E[R_{m}-r is fixed at 1, \sigma^{2} = 0.10 and \theta^{2} = 0.05



To illustrate with actual data, 2,000 stock betas are randomly generated with mean 1 and standard deviation 0.50; 2,000 expected returns are calculated using these betas, market return 6.00% and risk-free rate 1.00%; estimated betas are calculated by adding 2,000 random errors with mean 0 and standard deviation 0.05. All estimated returns are ranked from low to high and this will be used as the basis for sorting. In summary:

  • Number of stocks k = 2,000
  • \beta_{i}\sim\mathcal{N}(1,0.5)
  • E[R_{i}] = 0.01 + \beta_{i}(0.06 - 0.01) = 0.01 + \beta_{i}(0.05)
  • \epsilon_{i}\sim\mathcal{N}(0,0.05)
  • \hat{\beta_{i}} = \beta_{i} + \epsilon_{i}

Test for scenario 1. Run regression E[R_{i}] = \alpha + \lambda \beta_{i} + \varepsilon_{i}. We get \alpha = 0.0100, \lambda = 0.0500, R-squared = 1.00. Essentially perfect fit.

Test for scenario 2. Run regression E[R_{i}] = \alpha + \lambda \hat{\beta_{i}} + \varepsilon_{i}. We get \alpha = 0.01068, \lambda = 0.04943, R-squared = 0.9902.

Test for scenario 3. Run regression E[R_{i}|i \in j] = \alpha_{j} + \lambda_{j} \hat{\beta_{i}} + \varepsilon_{ij}; j \in [1, N].

By setting N = 5, 10, 20, 50, respectively, the coefficients in each group are as follows:



The results are consistent with Berk’s findings. The more groups the stocks are sorted into, the less predictive power beta has on expected returns; the further away j moves towards the center among all groups, the more pronouncing this effect gets.



A Quick Review – the Math Behind the Black Scholes Model


This is a high-level quick review on the derivation of the Black Scholes model, i.e., I will not spend time on putting down rigorous definitions or discussing the assumptions behind the equations.They are definitely important, but I’d rather focus on one thing at a time.

Suppose the change of a stock’s price dS_{t} follows a geometric Brownian motion process:

dS_{t} = \mu S_{t}dt + \sigma S_{t}dW_{t},        (1)

Where \mu is the drift constant and \sigma is the volatility constant.

According to Ito’s Lemma (Newtonian calculus doesn’t work here because of the existance of stochastic term W_{t}), the price of an option of this stock, which is a function C of S and t, must satisfy:

dC(S,t) = (\mu S_{t} \frac{\partial C}{\partial S} + \frac{\partial C}{\partial t} + \frac{1}{2} \sigma^2 S^2 \frac{\partial^2 C}{\partial S^2}) dt + \sigma S_{t} \frac{\partial C}{\partial S} dW_{t},        (2)

Now theoretically, instead of buying an option, we could replicate the payoff of an option by actively and seamlessly allocating our money between a risk-free asset B_{t} and the stock underlying said option. Given the no-arbitrage assumption, the value of this replicating portfolio, P_{t}, should be exactly the same as the option price. Therefore we have:

P_{t} = a_{t}B_{t} + b_{t}S_{t},        (3)

dP_{t} = a_{t}dB_{t} + b_{t}dS_{t}

= ra_{t}B_{t}dt + b_{t}(\mu S_{t}dt + \sigma S_{t}dW_{t})        replace dS_{t} with (1)

= (ra_{t}B_{t} + b_{t}\mu S_{t})dt + b_{t}\sigma S_{t}dW_{t}, and        (4)

dP_{t} = dC_{t},        (5)

where a and b represent the portions of money allocated in each asset; r is the risk-free rate.

Knowing (5), we can map some terms in (2) and (4) to get

b_{t} = \frac{\partial C}{\partial S}, and        (6)

ra_{t}B{t} = \frac{\partial C}{\partial t} + \frac{1}{2}\sigma^2 S^2_{t}\frac{\partial^2 C}{\partial S^2}        (7)

Feed (6) and (7) into (3), we get the Black Scholes partial differential equation:

rC_{t} = rS_{t}\frac{\partial C}{\partial S} + \frac{\partial C}{\partial t} + \frac{1}{2}\sigma^2 S^2_{t}\frac{\partial^2 C}{\partial S^2}        (8)

Apparently, a couple of Nobel Laureates solved this equation here, and now we have the Black Scholes pricing model for European options, which means if K is the strike price, C(S,T) = max(S-K, 0), C(0, t) = 0 for all t and C(S, t) approaches S as S approaches infinity.

C(S,t) = S_{t}\Phi(d_{1}) - e^{-r(T-t)}K\Phi(d_{2})        (9)


d_{1} = \frac{\ln{\frac{S_{t}}{K}} + (\frac{r-\sigma^2}{2}) (T-t)}{\sigma\sqrt{T-t}}

d_{2} = d_{1} - \sigma\sqrt{T-t}

\Phi{(.)} is the CDF of the standard normal distribution.

Using this general approach, we should be able to model any derivatives as long as we have some basic assumptions about the underlying process. Of course, it’s pointless to use these models in a dogmatic way because almost all of the assumptions behind them are not true. An investment professional’s job is not to follow the textbook and blindly apply the formula in the real world and hope it sticks, instead it’s to investigate the discrepancies between the theoretical model and empirical evidences and figure out which assumptions are violated and if so, can we translate these violations into trading opportunities.


A Case of Ambiguous Definition

“Managers of government pension plans counter that they have longer investment horizons and can take greater risks. But most financial economists believe that the risks of stock investments grow, not shrink, with time.” – WSJ

This statement mentioned “risks” twice but they actually mean different things. Therefore the second sentence is correct by itself but cannot be used to reject the first one.

The first “risk” is timeless. The way it’s calculated always scales it down to 1 time unit, which is the time interval between any two data points in the sample.

Risk_1 = \sigma^2 = \frac{1}{N} \sum_{i=1}^{N}(R_i - \bar{R})^2

When “risk” is defined this way, a risky investment A and a less risky investment B have their returns look like this:


The second “risk” is the same thing but gets scaled for N time units. It’s not how variance is defined but people use it because it has a practical interpolation (adjust for different time horizons).

Risk_2 = N * \sigma^2 = \sum_{i=1}^{N}(R_i - \bar{R})^2

Under this definition, the possible PnL paths for A and B look like this:


A’s Monte Carlo result is wider than B, but both A and B’s “risk” by the second definition increases through time, while by the first definition never changed.

I have intentionally avoided mentioning time diversification because doing so would probably make things more confusing. For more details on this please see Chung, Smith and Wu (2009).