This post is a token of gratitude to Andrew Ng at Stanford University. I’ve learned a lot of fascinating concepts and practical skills from his Machine Learning course at Coursera. Among these algorithms neural network is my favourite so I decided to convert this class assignment built in Octave to functions in R. I will post my code below but if anyone who’d like to play with an exported package with documentations, please contact me by email, I will send you the tarball file. The code is also available on my github repository.
Comparing to the more comprehensive package “neuralnet” on CRAN, this is a lite version which consists all the basic components: single hidden layer, random initialization, sigmoid activation, regularized cost function, forward-propagation and backward-propagation. Originally it only works as a classifier, but I modified it a bit to create a similar algorithm to handle absolute values. I’m not sure if it’s the best practice for this kind of task but it worked as expected.
This is how the algorithm looks like, along with its source of inspiration, an actual human neural neuron. By mimicking the way a human brain (which is the most powerful computer in the universe) adjusts itself to recognize patterns and estimates the future, an algorithm should be able to solve similar tasks.


For the classifier, the cost function tracks the forward propagation process is defined below.
![J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} \left [ -y_{k}^{(i)}log((h_{\Theta}(x^{(i)}))_{k})-(1-y_{k}^{(i)})log(1-(h(x^{(i)}))_{k}) \right ] J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} \left [ -y_{k}^{(i)}log((h_{\Theta}(x^{(i)}))_{k})-(1-y_{k}^{(i)})log(1-(h(x^{(i)}))_{k}) \right ]](http://s0.wp.com/latex.php?latex=J%28%5CTheta%29+%3D+%5Cfrac%7B1%7D%7Bm%7D%5Csum_%7Bi%3D1%7D%5E%7Bm%7D%5Csum_%7Bk%3D1%7D%5E%7BK%7D+%5Cleft+%5B+-y_%7Bk%7D%5E%7B%28i%29%7Dlog%28%28h_%7B%5CTheta%7D%28x%5E%7B%28i%29%7D%29%29_%7Bk%7D%29-%281-y_%7Bk%7D%5E%7B%28i%29%7D%29log%281-%28h%28x%5E%7B%28i%29%7D%29%29_%7Bk%7D%29+%5Cright+%5D&bg=ffffff&fg=444444&s=0)
Where
is the total cost we want to minimize,
represents number of training examples,
represents number of labels (for classification). When training for the ith example with label
and weights
, all corresponding correct output values will be converted to 1 and the rest will become 0, so
and
works as switches for different scenarios. Function
calculates the training output as
, then processes it with sigmoid function
in order to bound it between 0 to 1 as likelihood of being 1 (which is equivalent to probability of being true).
To avoid extreme bias caused by large weights, we add parameter regulation to the cost function. Defined as
![\frac{\lambda}{2m}\left [ \sum_{j=1}^{h}\sum_{k=1}^{n}(\Theta_{j,k}^{(1)})^{2}+\sum_{j=1}^{o}\sum_{k=1}^{h}(\Theta_{j,k}^{(2)})^{2} \right ] \frac{\lambda}{2m}\left [ \sum_{j=1}^{h}\sum_{k=1}^{n}(\Theta_{j,k}^{(1)})^{2}+\sum_{j=1}^{o}\sum_{k=1}^{h}(\Theta_{j,k}^{(2)})^{2} \right ]](http://s0.wp.com/latex.php?latex=%5Cfrac%7B%5Clambda%7D%7B2m%7D%5Cleft+%5B+%5Csum_%7Bj%3D1%7D%5E%7Bh%7D%5Csum_%7Bk%3D1%7D%5E%7Bn%7D%28%5CTheta_%7Bj%2Ck%7D%5E%7B%281%29%7D%29%5E%7B2%7D%2B%5Csum_%7Bj%3D1%7D%5E%7Bo%7D%5Csum_%7Bk%3D1%7D%5E%7Bh%7D%28%5CTheta_%7Bj%2Ck%7D%5E%7B%282%29%7D%29%5E%7B2%7D+%5Cright+%5D&bg=ffffff&fg=444444&s=0)
where
,
and
equal to numbers of nodes in the input layer, hidden layer and output layer respectively.
During the backward propagation process, we push the calculated outputs backwards through the algorithm and accumulate the gradient from each layer. Because there are only one hidden layer, we can calculate the gradient of the output layer as
and hidden layer as
, then accumulate it
to finally get the total unregularized gradient
. In the end, we just need to regularize it like we did to the cost function by adding
.
The absolute value algorithm is very similar. Instead of using a 0/1 (True/False) switch to train examples with same labels all at once, it trains one example at a time and calculate the cost by squared error of each prediction
.
Now we can wrap up a cost function and a gradient function to test it by running optim() provided by default R package “stat”. In this case I choose L-BFGS-B method because it utilizes everything that we’ve got in hand and seems learns faster than most of the others.
The classifier first. Three blocks of data are randomly generated and labelled for training. Because we intentionally created them for testing purpose, it’s not necessary to divide them for cross-validation and out-of-sample testing contrast to using it in real world.
X <- rbind(matrix(rnorm(100, 10, 1), 20, 5),
matrix(rnorm(200, 5, 2), 40, 5),
matrix(rnorm(150, 1, 0.1), 30, 5));
y_cla <- c(rep(3, 20), rep(1, 40), rep(2, 30));
test_cla <- nnet.train.cla(X, y_cla, hidden_layer_size = 25,
lambda = 5, maxit = 50);
pred_cla <- nnet.predict.cla(test_cla$Theta1, test_cla$Theta2, X);
plot(y_cla, col = "green", pch = 15);
points(pred_cla, col = "red", pch = 3);
legend("topright", legend = c("Original", "Predicted"), pch = c(15, 3), col = c("green", "red"));

Then for absolute values. I reset y values for this test and tried several values for learning speed
and number of iterations to get results below.
y_abs <- c(rep(-3, 20), rep(1, 40), rep(2, 30));
test_abs <- nnet.train.abs(X, y_abs, hidden_layer_size = 25,
lambda = 0.0001, maxit = 400);
pred_abs <- nnet.predict.abs(test_abs$Theta1, test_abs$Theta2, X);
plot(sigmoid(y_abs), type = 'l', col = "green", ylim = c(0, 1));
lines(pred_abs, col = "red");
legend("topright", legend = c("Original", "Predicted"), col = c("green", "red"), lwd = c(1, 1));

Source code of the algorithm. Apparently it’s not very concise. Like said, package is provided upon request.
## =============== random initialization ===============
randInitializeWeights <- function(L_in, L_out) {
theta <- matrix(0, L_out, 1 + L_in);
epsilon_init <- 0.12;
theta <- matrix(rnorm(L_out*(L_in + 1)),
L_out, (L_in + 1))*2*epsilon_init - epsilon_init;
return(theta);
}
## =============== sigmoid function ===============
sigmoid <- function(z) {
g <- 1 / (1 + exp(-z));
return(g);
}
## =============== sigmoid gradient function ===============
sigmoidGradient <- function(z) {
g <- matrix(0, dim(z));
g <- sigmoid(z);
g <- g * (1 -g);
return(g);
}
# ++++++++++++++++++ absolute value predictor functions ++++++++++++++++++
## =============== cost J and gradient ===============
nnCostFuncion.abs <- function(nn_params,
input_layer_size,
hidden_layer_size,
X, y, lambda) {
Theta1 <- matrix(nn_params[1:(hidden_layer_size *
(input_layer_size + 1))],
hidden_layer_size,
(input_layer_size + 1));
Theta2 <- matrix(nn_params[-1:-(hidden_layer_size *
(input_layer_size + 1))],
1,
(hidden_layer_size + 1));
m <- dim(X)[1];
J <- 0;
Theta1_grad <- matrix(0, nrow(Theta1), ncol(Theta1));
Theta2_grad <- matrix(0, nrow(Theta2), ncol(Theta2));
for (i in 1:m) {
y_sig <- sigmoid(y[i]);
## forward propagation
# first feed
a1 <- X[i, ];
a1 <- c(1, a1);
# first hidden layer
z2 <- Theta1 %*% a1;
a2 <- sigmoid(z2);
a2 <- c(1, a2);
# output layer
z3 <- Theta2 %*% a2;
a3 <- sigmoid(z3);
# add to cost function
J <- J + (a3 - y_sig)^2 / 2;
## backward propagation
delta3 <- a3 * (1 - a3) * (a3 - y_sig);
delta2 <- (t(Theta2) %*% delta3)[-1] * sigmoidGradient(z2);
Theta1_grad <- Theta1_grad + (delta2 %*% a1);
Theta2_grad <- Theta2_grad + (delta3 %*% a2);
}
J <- J / m;
Theta1_grad <- Theta1_grad / m;
Theta2_grad <- Theta2_grad / m;
# J regulization
reg_theta1 <- Theta1[, -1];
reg_theta2 <- Theta2[, -1];
J <- J + (lambda/(2*m)) * (sum(reg_theta1^2) + sum(reg_theta2^2));
# gradient regulization
Theta1_grad[, -1] <- Theta1_grad[, -1] + (lambda/m) * Theta1[, -1];
Theta2_grad[, -1] <- Theta2_grad[, -1] + (lambda/m) * Theta2[, -1];
# unroll gradients
grad <- c(c(Theta1_grad), c(Theta2_grad));
return(list(J = J, grad = grad));
}
## =============== cost J function for optimization ===============
costFunction.abs <- function(nn_params,
input_layer_size,
hidden_layer_size,
X, y, lambda) {
costJ <- nnCostFuncion.abs(nn_params = nn_params,
input_layer_size = input_layer_size,
hidden_layer_size = hidden_layer_size,
X = X, y = y, lambda = lambda)$J;
return(costJ);
}
## ========== cost J gradient function for optimization ==========
gradFunction.abs <- function(nn_params,
input_layer_size,
hidden_layer_size,
X, y, lambda) {
grad <- nnCostFuncion.abs(nn_params = nn_params,
input_layer_size = input_layer_size,
hidden_layer_size = hidden_layer_size,
X = X, y = y, lambda = lambda)$grad;
return(grad);
}
#################### end of utility functions ####################
## ================ train nerual network ===============
nnet.train.abs <- function(X, y, hidden_layer_size = 25,
lambda = 1, maxit = 50) {
m <- nrow(X);
input_layer_size <- ncol(X);
# ================ Initializing Pameters ================
initial_Theta1 <- randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 <- randInitializeWeights(hidden_layer_size, 1);
initial_nn_params = c(c(initial_Theta1), c(initial_Theta2));
# =================== Training NN ===================
train_results <- optim(par = initial_nn_params,
fn = costFunction.abs,
input_layer_size = input_layer_size,
hidden_layer_size = hidden_layer_size,
X = X, y = y, lambda = lambda,
gr = gradFunction.abs,
method = "L-BFGS-B",
control = list(maxit = maxit, trace = TRUE, REPORT = 1));
nn_params <- train_results$par;
Theta1 <- matrix(nn_params[1:(hidden_layer_size*(input_layer_size+1))],
hidden_layer_size, (input_layer_size + 1));
Theta2 <- matrix(nn_params[-1:-(hidden_layer_size*(input_layer_size+1))],
1, (hidden_layer_size + 1));
## ================= show accuracy =================
pred = nnet.predict.abs(Theta1, Theta2, X);
cat("\nLogistic Prediction Error: ", var(pred - sigmoid(y)), "\n", sep = "");
## =============== return thetas ===============
return(list(Theta1 = Theta1, Theta2 = Theta2));
}
## =============== nerual network predict ===============
nnet.predict.abs <- function(Theta1, Theta2, X) {
m <- nrow(X);
if (is.null(m)) {
m <- 1; X <- t(X);
}
p = rep(0, m);
h1 <- sigmoid(cbind(rep(1, m), X) %*% t(Theta1));
h2 <- sigmoid(cbind(rep(1, m), h1) %*% t(Theta2));
return(h2);
}
# ++++++++++++++++++ classifier functions ++++++++++++++++++
## =============== cost J and gradient ===============
nnCostFunction.cla <- function(nn_params,
input_layer_size,
hidden_layer_size,
num_labels,
X, y, lambda) {
Theta1 <- matrix(nn_params[1:(hidden_layer_size *
(input_layer_size + 1))],
hidden_layer_size,
(input_layer_size + 1));
Theta2 <- matrix(nn_params[-1:-(hidden_layer_size *
(input_layer_size + 1))],
num_labels,
(hidden_layer_size + 1));
m <- dim(X)[1];
J <- 0;
Theta1_grad <- matrix(0, nrow(Theta1), ncol(Theta1));
Theta2_grad <- matrix(0, nrow(Theta2), ncol(Theta2));
for (i in 1:m) {
y_label <- rep(0, num_labels);
y_label[y[i] == as.integer(levels(factor(y)))] <- 1;
## forward propagation
# first feed
a1 <- X[i, ];
a1 <- c(1, a1);
# first hidden layer
z2 <- Theta1 %*% a1;
a2 <- sigmoid(z2);
a2 <- c(1, a2);
# output layer
z3 <- Theta2 %*% a2;
a3 <- sigmoid(z3);
# add to cost function
J <- J + sum(-y_label * log(a3) - (1 - y_label) * log(1 - a3));
## backward propagation
delta3 <- a3 - y_label;
delta2 <- (t(Theta2) %*% delta3)[-1] * sigmoidGradient(z2);
Theta1_grad <- Theta1_grad + (delta2 %*% a1);
Theta2_grad <- Theta2_grad + (delta3 %*% a2);
}
J <- J / m;
Theta1_grad <- Theta1_grad / m;
Theta2_grad <- Theta2_grad / m;
# J regulization
reg_theta1 <- Theta1[, -1];
reg_theta2 <- Theta2[, -1];
J <- J + (lambda/(2*m)) * (sum(reg_theta1^2) + sum(reg_theta2^2));
# gradient regulization
Theta1_grad[, -1] <- Theta1_grad[, -1] + (lambda/m) * Theta1[, -1];
Theta2_grad[, -1] <- Theta2_grad[, -1] + (lambda/m) * Theta2[, -1];
# unroll gradients
grad <- c(c(Theta1_grad), c(Theta2_grad));
return(list(J = J, grad = grad));
}
## =============== cost J function for optimization ===============
costFunction.cla <- function(nn_params,
input_layer_size,
hidden_layer_size,
num_labels,
X, y, lambda) {
costJ <- nnCostFunction.cla(nn_params = nn_params,
input_layer_size = input_layer_size,
hidden_layer_size = hidden_layer_size,
num_labels = num_labels,
X = X, y = y, lambda = lambda)$J;
return(costJ);
}
## ========== cost J gradient function for optimization ==========
gradFunction.cla <- function(nn_params,
input_layer_size,
hidden_layer_size,
num_labels,
X, y, lambda) {
grad <- nnCostFunction.cla(nn_params = nn_params,
input_layer_size = input_layer_size,
hidden_layer_size = hidden_layer_size,
num_labels = num_labels,
X = X, y = y, lambda = lambda)$grad;
return(grad);
}
#################### end of utility functions ####################
## ================ train nerual network ===============
nnet.train.cla <- function(X, y, hidden_layer_size = 25,
lambda = 1, maxit = 50) {
m <- nrow(X);
input_layer_size <- ncol(X);
num_labels <- length(levels(factor(y)));
# ================ Initializing Pameters ================
initial_Theta1 <- randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 <- randInitializeWeights(hidden_layer_size, num_labels);
initial_nn_params = c(c(initial_Theta1), c(initial_Theta2));
# =================== Training NN ===================
train_results <- optim(par = initial_nn_params,
fn = costFunction.cla,
input_layer_size = input_layer_size,
hidden_layer_size = hidden_layer_size,
num_labels = num_labels,
X = X, y = y, lambda = lambda,
gr = gradFunction.cla,
method = "L-BFGS-B",
control = list(maxit = maxit, trace = TRUE, REPORT = 1));
nn_params <- train_results$par;
Theta1 <- matrix(nn_params[1:(hidden_layer_size*(input_layer_size+1))],
hidden_layer_size, (input_layer_size + 1));
Theta2 <- matrix(nn_params[-1:-(hidden_layer_size*(input_layer_size+1))],
num_labels, (hidden_layer_size + 1));
## ================= show accuracy =================
pred = nnet.predict.cla(Theta1, Theta2, X);
cat("\nTraining Set Accuracy: ", sum(pred == y)/length(pred)*100,
"%\n", sep = "");
## =============== return thetas ===============
return(list(Theta1 = Theta1, Theta2 = Theta2));
}
## =============== nerual network predict ===============
nnet.predict.cla <- function(Theta1, Theta2, X) {
num_labels <- nrow(Theta2);
m <- nrow(X);
if (is.null(m)) {
m <- 1; X <- t(X);
}
p = rep(0, m);
h1 <- sigmoid(cbind(rep(1, m), X) %*% t(Theta1));
h2 <- sigmoid(cbind(rep(1, m), h1) %*% t(Theta2));
p = apply(h2, 1, which.max);
return(p);
}
Like this:
Like Loading...