# A New Post

First of all, apologies to anyone who was expecting new posts or left a comment here but didn’t get a reply from me. There have been quite a few changes in my life, and I simply had to shift my time and energy away from blogging. Now I’m trying to get back to it.

For various reasons I will avoid writing about specific investment strategies, factor descriptors or particular stocks. I will write more about my thoughts (or thoughts stolen from people smarter than me) on general techniques and theories. In an attempt to be rigorous (or, more realistically, less sloppy), I will try to stay on one main track: hypothesis -> logical explanations -> supporting data or observations. This time I will use math and programming to make sense of things instead of just getting results on paper.

Roy

# Neural Network Algorithm

This post is a token of gratitude to Andrew Ng at Stanford University. I’ve learned a lot of fascinating concepts and practical skills from his Machine Learning course on Coursera. Among its algorithms the neural network is my favourite, so I decided to convert this class assignment, originally built in Octave, into functions in R. I will post my code below, but if you’d like to play with an exported package with documentation, please contact me by email and I will send you the tarball. The code is also available in my GitHub repository.

Compared to the more comprehensive package “neuralnet” on CRAN, this is a lite version that contains all the basic components: a single hidden layer, random initialization, sigmoid activation, a regularized cost function, forward propagation and backward propagation. Originally it only worked as a classifier, but I modified it a bit to create a similar algorithm to handle absolute values. I’m not sure if it’s the best practice for this kind of task, but it worked as expected.

This is how the algorithm looks, along with its source of inspiration, an actual human neuron. By mimicking the way a human brain (the most powerful computer in the universe) adjusts itself to recognize patterns and estimate the future, an algorithm should be able to solve similar tasks.

For the classifier, the cost function that tracks the forward-propagation process is defined below.

$J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} \left [ -y_{k}^{(i)}\log((h_{\Theta}(x^{(i)}))_{k})-(1-y_{k}^{(i)})\log(1-(h_{\Theta}(x^{(i)}))_{k}) \right ]$

where $J$ is the total cost we want to minimize, $m$ is the number of training examples, and $K$ is the number of labels (for classification). When training on the $i$th example with label $k$ and weights $\Theta$, all corresponding correct output values are converted to 1 and the rest to 0, so $-y_{k}^{(i)}$ and $(1-y_{k}^{(i)})$ work as switches between the two scenarios. The hypothesis $(h_{\Theta}(x^{(i)}))_{k}=g(z)$ calculates the training output as $z=\Theta^{T}x^{(i)}$, then passes it through the sigmoid function $g(z)=\frac{1}{1+e^{-z}}$ to bound it between 0 and 1 as the likelihood of being 1 (which is equivalent to the probability of being true).
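As a minimal sketch in R (the function names here are my own, not necessarily the ones in the package), the sigmoid and the unregularized classification cost can be written as:

```r
# sigmoid activation: g(z) = 1 / (1 + exp(-z))
sigmoid <- function(z) 1 / (1 + exp(-z))

# unregularized classification cost for a prediction matrix H (m x K)
# against a 0/1 label matrix Y (m x K)
nn.cost <- function(H, Y) {
  m <- nrow(Y)
  sum(-Y * log(H) - (1 - Y) * log(1 - H)) / m
}

# sanity check: predicting 0.5 for every label costs K * log(2) per example
H <- matrix(0.5, nrow = 2, ncol = 2)
Y <- diag(2)
nn.cost(H, Y)  # 2 * log(2), about 1.386
```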

To avoid extreme bias caused by large weights, we add a regularization term to the cost function, defined as

$\frac{\lambda}{2m}\left [ \sum_{j=1}^{h}\sum_{k=1}^{n}(\Theta_{j,k}^{(1)})^{2}+\sum_{j=1}^{o}\sum_{k=1}^{h}(\Theta_{j,k}^{(2)})^{2} \right ]$

where $n$, $h$ and $o$ are the numbers of nodes in the input layer, hidden layer and output layer respectively.
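Since the sums above start at index 1, the bias weights (the first column of each $\Theta$) are left out of the penalty. A sketch of the term in R, under that assumption:

```r
# regularization term; Theta1 is h x (n+1), Theta2 is o x (h+1),
# and the first column of each (the bias weights) is excluded
reg.term <- function(Theta1, Theta2, lambda, m) {
  lambda / (2 * m) * (sum(Theta1[, -1]^2) + sum(Theta2[, -1]^2))
}

# quick check with all-ones weights, lambda = 2, m = 1
Theta1 <- matrix(1, 2, 3)  # h = 2, n = 2
Theta2 <- matrix(1, 1, 3)  # o = 1, h = 2
reg.term(Theta1, Theta2, 2, 1)  # (4 + 2) = 6
```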

During the backward-propagation process, we push the calculated outputs backwards through the algorithm and accumulate the gradient from each layer. Because there is only one hidden layer, we can calculate the error of the output layer as $\delta^{(3)}=a_{k}^{(3)}-y_{k}$ and of the hidden layer as $\delta^{(2)}=(\Theta^{(2)})^{T}\delta^{(3)}{\ast}g^{'}(z^{(2)})$, then accumulate $\Delta^{(l)}=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^{T}$ to finally get the total unregularized gradient $\frac{\partial }{\partial \Theta^{(l)}_{ij}}J(\Theta)=\frac{1}{m}\Delta^{(l)}_{ij}$. In the end, we just need to regularize it like we did the cost function, by adding $\frac{\lambda}{m}\Theta^{(l)}_{ij}$.
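One forward/backward pass for a single example can be sketched as follows (this is my own illustrative code, not the package implementation; it assumes bias terms are prepended to each layer and uses $g'(z)=g(z)(1-g(z))$):

```r
# one forward/backward pass for a single-hidden-layer network (sketch)
# Theta1: h x (n+1), Theta2: K x (h+1); x: length-n input; y: length-K 0/1 vector
backprop.step <- function(Theta1, Theta2, x, y) {
  sigmoid <- function(z) 1 / (1 + exp(-z))
  a1 <- c(1, x)                         # input plus bias unit
  z2 <- Theta1 %*% a1
  a2 <- c(1, sigmoid(z2))               # hidden activation plus bias unit
  z3 <- Theta2 %*% a2
  a3 <- sigmoid(z3)                     # output layer
  d3 <- a3 - y                          # delta^(3) = a^(3) - y
  # delta^(2), dropping the bias row; sigmoid gradient g'(z) = g(z)(1 - g(z))
  d2 <- (t(Theta2)[-1, , drop = FALSE] %*% d3) * sigmoid(z2) * (1 - sigmoid(z2))
  # accumulated (here: single-example) gradients Delta^(l) = delta^(l+1) a^(l)'
  list(grad1 = d2 %*% t(a1), grad2 = d3 %*% t(a2))
}
```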

The absolute-value algorithm is very similar. Instead of using a 0/1 (true/false) switch to train all examples with the same label at once, it trains one example at a time and calculates the cost as the squared error of each prediction, $(y^{(i)}-h_{\Theta}(x^{(i)}))^{2}$.
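The cost of this variant is just squared error; a sketch in R (names are illustrative):

```r
# cost for the absolute-value variant: mean squared error over the examples
# H is the vector of network outputs, Y the vector of targets
sq.cost <- function(H, Y) sum((Y - H)^2) / length(Y)

sq.cost(c(2.5, 1.0), c(3, 1))  # (0.25 + 0) / 2 = 0.125
```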

Now we can wrap up a cost function and a gradient function and test them with optim(), provided by the default R package “stats”. In this case I chose the L-BFGS-B method because it utilizes everything we have at hand and seems to learn faster than most of the others.
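The calling pattern looks roughly like this; here a toy quadratic stands in for the network's cost/gradient pair, just to show how fn, gr and method fit together:

```r
# sketch: minimizing a cost with optim()'s L-BFGS-B method,
# supplying the analytic gradient alongside the cost function
cost <- function(theta) sum((theta - c(1, 2))^2)
grad <- function(theta) 2 * (theta - c(1, 2))

fit <- optim(par = c(0, 0), fn = cost, gr = grad, method = "L-BFGS-B")
fit$par  # converges to (1, 2)
```

In the real case, par would be the unrolled vector of network weights and fn/gr the regularized cost and back-propagated gradient.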

The classifier first. Three blocks of data are randomly generated and labelled for training. Because we intentionally created them for testing purposes, it’s not necessary to split them for cross-validation and out-of-sample testing, in contrast to real-world use.
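The training set can be sketched like this (the seed, block centres and sizes here are illustrative, not the ones from my actual test):

```r
# three labelled Gaussian blocks as a toy classification set
set.seed(42)
X <- rbind(matrix(rnorm(40, mean = 0), 20, 2),
           matrix(rnorm(40, mean = 4), 20, 2),
           matrix(rnorm(40, mean = -4), 20, 2))
y <- rep(1:3, each = 20)  # one label per block
```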

Then the absolute values. I reset the y values for this test and tried several values for the regularization parameter $\lambda$ and the number of iterations to get the results below.

One day I got a question from an academic star with an almost perfect GPA at our university: “I did everything my professor asked us to do. I ran regressions on P/E, P/B ratios and the like for prediction, so why is my target price for Goldman Sachs still around $300? That’s too far off from everybody else in the market; I don’t understand.” I stared at him for more than ten seconds, speechless.

As Arthur Benjamin proposed in his talk (link below), calculus has sat at the top of the math pyramid for too long; now it’s time for statistics to take over. This matters because, in general, our minds are skewed too far towards the calculus-style, deterministic way of thinking. For one thing, it is much more intuitive for the human mind to understand things that are either right or wrong; for another, people who claim they know everything and can “prove” it by making predictions always draw a much larger audience than those who say “this might be true, I could be wrong, but it’s the closest we can get”. Also, from my own experience, our education systems are mostly designed to reward people who arrive at desired certainties, not people who comprehend things sensibly. Consequently, students pick up the habit of “sacrificing reality for elegance” (Paul Wilmott), and the line between doing scientific research and confirming collective bias blurs. Einstein was proven wrong about “God doesn’t play dice”, but regretfully that doesn’t stop ordinary people believing it’s possible to eliminate uncertainty in their own lives.

Statistics, on the other hand (not that it is better), does focus more on observation and self-evaluation. Its purpose is not to find the one perfect answer; it’s about seeing things from as many dimensions as possible (which is a very unnatural process for the human brain).
One accurate prediction doesn’t make a statistical model work; a long enough series of predictions under relatively bias-free conditions, with an acceptable level of error, does (now I kind of understand why people don’t like it…). Again, I don’t think it’s better than calculus, but I think this is the key to questions such as “if I did everything right, how come I’m still losing money?”

Recommended reading: Fooled by Randomness (N. Taleb)

Recommended video: Arthur Benjamin’s formula for changing math education

roy

# Optimizing Ivy Portfolio

Early on I posted a simple live version of the GTAA strategy. It demonstrated the effectiveness of the Ivy Portfolio (M. Faber, 2009) rationale in the recent market with a small sample. Again, the rationale is very simple and powerful: screen a wide range of asset classes each week or month, then invest in those that have shown the strongest momentum. Last time I tracked 39 ETFs’ 9-month SMAs and allocated portfolio assets equally to the top 8. Although I got pretty good results, the sample was relatively small, and those ETFs are quite different in terms of time of inception, liquidity, tracking error, etc. Above all, equal allocation seems a bit, for lack of a better word, boring. This time I want to use a more general sample and see how we can improve the strategy by implementing some of the optimization strategies I’ve shown in my previous post, Backtesting Portfolio Optimization Strategies.

Some equipment check before we launch the test.

Asset classes:

1. SPX Index: S&P 500 LargeCap Index
2. MID Index: S&P 400 MidCap Index
3. SML Index: S&P 600 SmallCap Index
4. MXEA Index: MSCI EAFE Index
5. MXEF Index: MSCI Emerging Markets Index
6. LBUSTRUU Index: Barclays US Agg Total Return Value Unhedged USD (U.S. investment grade bond)
7. XAU Curncy: Gold/USD Spot
8. SPGSCI Index: Goldman Sachs Commodity Index
9. DJUSRE Index: Dow Jones U.S. Real Estate Index
10. GBP Curncy: GBP/USD Spot
11. EUR Curncy: EUR/USD Spot
12. JPY Curncy: JPY/USD Spot
13. HKD Curncy: HKD/USD Spot

Rules:

1. Rebalance monthly
2. Rank by 12-month SMA; invest in the top 3
3. For each asset, minimum weight = 5%; maximum weight = 95%
4. Use CVaR optimization to construct the portfolio each month; confidence level = 1%

Fortunately, our test didn’t fall apart and crash into the Pacific Ocean. The CVaR model seems to have done a good job improving the original strategy. However, it has to be pointed out that not all optimization models beat an equal-weighted one. As demonstrated below, the minimum-variance and maximum-Sharpe-ratio models didn’t make much difference.

roy

# Backtesting Portfolio Optimization Strategies

Recently I’ve been trying to develop some handy tools to help backtest and analyse user-specified portfolio strategies. Now I’d like to do a quick demonstration by testing four portfolio strategies (minimum variance, maximum Sharpe ratio, minimum CVaR, and equal-weighted) that rebalance weekly from 1990 to 2012. Instead of specific stocks or ETFs, 10 S&P sector indices by GICS will be used as hypothetical assets. As aggregated equity trackers, they are similar in terms of market efficiency, liquidity, macro environment, etc., and by using them we eliminate factors that jeopardize the quality of the data, such as IPO effects or survivorship bias. Moreover, most ETFs only emerged several years ago; with equity indices one can go back much further for a bigger sample.

Firstly, I got the data from Bloomberg and loaded it from a .csv file. Please note that the dates in the .csv file are stored as integers (Excel day numbers), so we convert them back into dates before creating a time-series (zoo) object.
```r
### load packages
require(TTR)
require(tseries)
require(quadprog)
require(Rglpk)
require(quantmod)

raw_data <- read.csv(file.choose(), header = TRUE)             ## open the file directly
raw_data[, 1] <- as.Date(raw_data[, 1], origin = '1899-12-30') ## convert numbers into dates
raw_data <- zoo(raw_data[, -1], order.by = raw_data[, 1])      ## create a zoo object
```

With the data properly formatted, we can perform backtests based on our strategies.

```r
minvar_test <- backtest(raw_data, period = 'weeks', hist = 52, secs = 10,
                        model = 'minvar', reslow = .1 ^ 10, reshigh = 1)
sharpe_test <- backtest(raw_data, period = 'weeks', hist = 52, secs = 10, model = 'sharpe')
cvar_test   <- backtest(raw_data, period = 'weeks', hist = 52, secs = 10, model = 'cvar', alpha = .01)
equal_test  <- backtest(raw_data, period = 'weeks', hist = 52, secs = 10, model = 'eql')
```

I’m not posting the backtest function here because it’s quite bulky and has other optimization functions nested inside. But as you can see, it simply takes a zoo object, asks which strategy the user wants to test, and performs the backtest accordingly. By setting the argument daily.track = TRUE, you can also track the portfolio’s position shifts on a daily basis, but due to time and space constraints I won’t show that this time either. Here are the PnL curves of the tests. And to see the position transitions, we need another function.
```r
## returns a transitional map of a backtest strategy
transition <- function(allo, main = NA) {
  cols <- rainbow(ncol(allo))
  x <- rep(1, nrow(allo))
  plot(x, col = 'white', main = main, ylab = 'weight', ylim = c(0, 1),
       xlim = c(-nrow(allo) * .2, nrow(allo)))
  polygon(c(0, 0, 1:nrow(allo), nrow(allo)), c(0, 1, x, 0),
          col = cols[1], border = FALSE)
  for (i in 2:ncol(allo)) {
    polygon(c(0, 0, 1:nrow(allo), nrow(allo)),
            c(0, 1 - sum(allo[1, 1:i]), x - apply(allo[, 1:i], 1, sum), 0),
            col = cols[i], border = FALSE)
  }
  legend('topleft', colnames(allo), col = cols[1:ncol(allo)], pch = 15,
         text.col = cols[1:ncol(allo)], cex = 0.7, bty = 'n')
}

## visualize the transitions
par(mfrow = c(4, 1))
transition(cvar_allo, main = 'CVaR Portfolio Transition')
transition(minvar_allo, main = 'Min-Variance Portfolio Transition')
transition(sharpe_allo, main = 'Sharpe Portfolio Transition')
transition(equal_allo, main = 'Equally Weighted Portfolio Transition')
```

Scalability was taken into consideration when these functions were built. I look forward to nesting more strategies into the existing functions.

roy

# Exploring Optimal f

Recently I’ve been reading Ralph Vince’s Risk Opportunity Analysis. His exploration of Optimal f and the Leverage Space Model is quite fascinating, and coincidentally his theory generalizes the findings in my previous posts Monty Hall Paradox Test and Is Your Faith Fat-tailed? In a nutshell, Ralph’s theory incorporates the factor of time into the decision-making and bet-sizing process by using the probability-weighted median and Optimal f, as opposed to the probability-weighted mean and the Kelly Criterion. It’s a very powerful combination.
In a hypothetical game with a negative expected return, a high chance of winning a small amount and a very small chance of losing a lot (such as a 90% chance to win 1 and a 10% chance to lose 10), the probability-weighted median keeps the player from rejecting highly probable small profits, while Optimal f protects the player from remotely possible disasters. Here I’d like to mirror what he did in his book, using a 2:1 fair coin-tossing game to illustrate these ideas. Since the expected return is 2 * 0.5 + (-1) * 0.5 = 0.5, we are going to play this game anyway. But the expected return, which is linear in the fraction $f$ you bet, only tells half of the story. As shown below in R, as your time horizon expands, the relationship bends (well…).

```r
# sets up the function
HPR <- function(win = 2, loss = -1, hor = 2, lev = 1, try = 101) {
  # hor = time horizon
  # lev = max leverage
  f <- seq(0, lev, length.out = try)
  ghpr <- rep(0, length(f))
  for (i in 1:length(ghpr)) {
    win_ret <- 1 + f[i] * win
    loss_ret <- 1 + f[i] * loss
    mat <- matrix(c(win_ret, loss_ret))
    mat <- mat %*% t(mat)
    if (hor > 2) {
      for (j in 2:(hor - 1)) {
        mat <- cbind(mat * win_ret, mat * loss_ret)
      }
    }
    ghpr[i] <- sum((mat ^ (1 / hor) - 1)) / (2 ^ hor)
  }
  ghpr[ghpr < 0] <- 0
  data.frame(f = f, GHPR = 1 + ghpr, HPR = (ghpr + 1) ^ hor)
}

# play 2 times
plot(HPR(hor = 2)$f, HPR(hor = 2)$HPR, type = 'l', ylim = c(1, 1.8),
     main = 'Time Horizon Illustration', xlab = 'f', ylab = 'HPR')
# play 1 time
lines(c(0, 1), c(1, 1.5))
# and more
lines(HPR(hor = 3)$f, HPR(hor = 3)$HPR)
lines(HPR(hor = 4)$f, HPR(hor = 4)$HPR)
lines(HPR(hor = 5)$f, HPR(hor = 5)$HPR)
lines(HPR(hor = 6)$f, HPR(hor = 6)$HPR)
lines(HPR(hor = 7)$f, HPR(hor = 7)$HPR)
lines(HPR(hor = 8)$f, HPR(hor = 8)$HPR)
text(0.8, 1.2, '2')
text(0.65, 1.18, '3')
text(0.5, 1.1, '8')
text(1, 1.55, '1')
grid()
```

As we continue to play (whatever the time horizon), the optimal bet size $f$ keeps approaching 0.25, which can also be derived from the objective function of Optimal f, $\prod_{i=1}^{n}(1+f*(R_{i}/-W))^{P_{i}}$.
Now we can examine the compounding effect of using Optimal f.

```r
# this function simply implements Optimal f with 100 observations
Opt.f <- function(win = 2, loss = -1, p = .5, lev = 1, obs = 100, plays = 1) {
  f <- seq(0, lev, 1 / obs)
  rets <- rep(0, length(f))
  for (i in 1:length(rets)) {
    rets[i] <- (((1 + f[i] * win / -loss) ^ p) *
                ((1 + f[i] * loss / -loss) ^ (1 - p))) ^ plays
  }
  data.frame(f = f, rets = rets)
}

# the results of 40 plays
plot(Opt.f(plays = 40)$f, Opt.f(plays = 40)$rets, type = 'l',
     main = 'Optimal f Effect', xlab = 'f', ylab = 'returns')
# 20 and 5 plays
lines(Opt.f(plays = 20)$f, Opt.f(plays = 20)$rets)
lines(Opt.f(plays = 5)$f, Opt.f(plays = 5)$rets)
text(.25, 9, '40')
text(.25, 3.5, '20')
text(.25, 1, '5')
grid()
```

Finally, instead of tossing one coin, we now have two coins for the same game, and here’s the combined effect of 5 plays.

```r
# two-components function
TCOpt.f <- function(win = 2, loss = -1, outcome = 4, lev = 1, obs = 100, plays = 1) {
  f1 <- seq(0, lev, 1 / obs)
  f2 <- seq(0, lev, 1 / obs)
  rets <- matrix(0, length(f1), length(f2))
  for (i in 1:length(f1)) {
    for (j in 1:length(f2)) {
      s1 <- (1 + (f1[i] * win + f2[j] * loss) / -loss) ^ (1 / outcome)
      s2 <- (1 + (f1[i] * win + f2[j] * win) / -loss) ^ (1 / outcome)
      s3 <- (1 + (f1[i] * loss + f2[j] * win) / -loss) ^ (1 / outcome)
      s4 <- (1 + (f1[i] * loss + f2[j] * loss) / -loss) ^ (1 / outcome)
      rets[i, j] <- (s1 * s2 * s3 * s4) ^ plays
    }
  }
  rets[is.na(rets)] <- 0
  mat <- matrix(f1, nrow = length(f1), ncol = length(f1))
  list(xmat = mat, ymat = t(mat), rets = rets)
}

require(rgl)
plays5 <- TCOpt.f(plays = 5)
x <- plays5$xmat
y <- plays5$ymat
z <- plays5$rets
col <- c('white', 'blue')[z + 1]
persp3d(x, y, z, color = col, alpha = .6,
        xlab = 'coin 1', ylab = 'coin 2', zlab = 'returns')
grid3d(c('x', 'y', 'z'))
title3d('Two Components Coin Tossing', line = 5)
```


There are a lot of other interesting properties of this system, but I believe I’ve covered the gist of it. And apparently more work needs to be done before we can fully utilize it in trading or investment.

roy

# Testing Kelly Criterion and Optimal f in R

Kelly Criterion and Optimal f are very similar models for geometric growth optimization. Ralph Vince’s article Optimal f and the Kelly Criterion has explained their differences in detail; here are the main takeaways.

1. The Kelly Criterion does not yield the optimal fraction to risk in trading; Optimal f does
2. The Kelly Criterion only generates a leverage factor, which can grow infinitely large; Optimal f is bounded between 0 and 1
3. The two models can be reconciled as Optimal f = Kelly * (-W/Px), where W is the maximum possible loss on each trade and Px is the price per share, both in dollar amounts
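Point 3 can be checked directly on the coin game used later in this post, where the worst loss is -$0.5 per $1 bet:

```r
# reconciling the two models: Optimal f = Kelly * (-W / Px)
kelly <- 0.75  # Kelly fraction for the 2:1, 50/50 game below
W <- -0.5      # worst-case loss per $1 bet
Px <- 1        # price per unit
opt.f <- kelly * (-W / Px)
opt.f  # 0.375
```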

Inspired by Ralph’s article, I did a test in R to compare these two models. Consider a 50/50 coin-flipping game that pays $2 on heads and -$0.5 on tails for every $1 you bet. Optimizing Kelly’s objective function $\sum_{i=1}^{n}(\ln(1+R_i*f)*P_i)$ gives optimal f = 0.75, while Optimal f, with objective function $\prod_{i=1}^{n}(1+f*(Px_i/-W))^{P_i}$, gives a different optimum, f = 0.375. Let’s see if they are consistent with observations in R.

```r
# kelly formula test
kelly.opt <- function(win, loss, p, obs, lev) {
  # win = payout for a win
  # loss = payout for a loss
  # p = probability of winning
  # obs = number of observations
  # lev = maximum leverage allowed
  # set up different bet sizes for test
  f <- seq(0, lev, 1 / (obs - 1))
  # generate trading results according to given win, loss and p
  ret <- rep(win, length(f))
  ret[which(rbinom(length(f), 1, 1 - p) == 1)] <- loss
  # calculate accumulative pnl for different bet sizes respectively
  pnl <- f
  for (i in 1:length(f)) {
    pnl[i] <- sum(log(1 + ret * f[i]))
  }
  # find the optimal point of f
  results <- cbind(f, pnl)
  opt.kelly <- results[which(results[, 2] == max(results[, 2])), 1]
  # wrap up
  output <- list()
  output$opt.kelly <- opt.kelly
  output$results <- results
  output
}

# optimal f test
opt.f <- function(win, loss, p, obs, lev) {
  # similar to Kelly except for a different objective function
  f <- seq(0, lev, 1 / (obs - 1))
  ret <- rep(win, length(f))
  ret[which(rbinom(length(f), 1, 1 - p) == 1)] <- loss
  pnl <- f
  for (i in 1:length(f)) {
    pnl[i] <- prod(1 + ret / (-loss / f[i]))
  }
  results <- cbind(f, pnl)
  opt.f <- results[which(results[, 2] == max(results[, 2])), 1]
  output <- list()
  output$opt.f <- opt.f
  output$results <- results
  output
}

# get statistics for kelly
compare <- data.frame(kelly = 1:5000, opt.f = 1:5000)
for (i in 1:5000) {
  compare$kelly[i] <- kelly.opt(2, -.5, .5, 500, 1)$opt.kelly
}
# get statistics for optimal f
for (i in 1:5000) {
  compare$opt.f[i] <- opt.f(2, -.5, .5, 500, 1)$opt.f
}

# generate graph
require(ggplot2)
m <- ggplot(compare) + xlim(c(0, 1)) + xlab('distribution')
m + stat_density(aes(x = kelly, colour = 'kelly'), adjust = 2, fill = NA) +
  stat_density(aes(x = opt.f, colour = 'opt.f'), adjust = 2, fill = NA)
```


It’s always a beauty to see test results and mathematical models show consistency, and here we are happy to see the two distributions nice and tightly surrounding their means: 0.75068 for the Kelly Criterion and 0.37558 for Optimal f. And that wraps things up.

roy