# Neural Network Algorithm

This post is a token of gratitude to Andrew Ng at Stanford University. I've learned a lot of fascinating concepts and practical skills from his Machine Learning course on Coursera. Among these algorithms, the neural network is my favourite, so I decided to convert this class assignment, originally built in Octave, into functions in R. I will post my code below, but if anyone would like to play with an exported package with documentation, please contact me by email and I will send you the tarball. The code is also available in my GitHub repository.

Compared to the more comprehensive "neuralnet" package on CRAN, this is a lite version that contains all the basic components: a single hidden layer, random initialization, sigmoid activation, a regularized cost function, forward propagation and backward propagation. Originally it only worked as a classifier, but I modified it a bit to create a similar algorithm that handles absolute (continuous) values. I'm not sure whether this is best practice for that kind of task, but it worked as expected.

This is what the algorithm looks like, along with its source of inspiration, an actual human neuron. By mimicking the way a human brain (arguably the most powerful computer in the universe) adjusts itself to recognize patterns and estimate future outcomes, an algorithm should be able to solve similar tasks.

For the classifier, the cost function that evaluates the output of the forward propagation process is defined below.

$J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} \left [ -y_{k}^{(i)}\log((h_{\Theta}(x^{(i)}))_{k})-(1-y_{k}^{(i)})\log(1-(h_{\Theta}(x^{(i)}))_{k}) \right ]$

Here $J$ is the total cost we want to minimize, $m$ is the number of training examples, and $K$ is the number of labels (for classification). When training on the $i$th example with label $k$ and weights $\Theta$, all corresponding correct output values are converted to 1 and the rest become 0, so $-y_{k}^{(i)}$ and $(1-y_{k}^{(i)})$ work as switches between the two scenarios. The hypothesis $(h_{\Theta}(x^{(i)}))_{k}=g(z)$ computes the training output as $z=\Theta^{T}x^{(i)}$, then passes it through the sigmoid function $g(z)=\frac{1}{1+e^{-z}}$ to bound it between 0 and 1, interpreted as the likelihood of being 1 (which is equivalent to the probability of being true).
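As a rough sketch, the sigmoid and the forward pass for a single hidden layer can be written in R like this (the function and variable names here are illustrative, not necessarily the ones in my package):

```r
# Sigmoid activation: bounds any real value between 0 and 1.
sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward propagation through one hidden layer.
# X: m x n matrix of examples; Theta1: h x (n+1); Theta2: K x (h+1).
forward_prop <- function(X, Theta1, Theta2) {
  a1 <- cbind(1, X)            # add bias column to the input layer
  z2 <- a1 %*% t(Theta1)
  a2 <- cbind(1, sigmoid(z2))  # hidden-layer activations, plus bias
  z3 <- a2 %*% t(Theta2)
  sigmoid(z3)                  # h_Theta(x): one row of K likelihoods per example
}
```

Each row of the result holds the $K$ per-label likelihoods for one training example.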

To avoid extreme bias caused by large weights, we add a regularization term to the cost function, defined as

$\frac{\lambda}{2m}\left [ \sum_{j=1}^{h}\sum_{k=1}^{n}(\Theta_{j,k}^{(1)})^{2}+\sum_{j=1}^{o}\sum_{k=1}^{h}(\Theta_{j,k}^{(2)})^{2} \right ]$

where $n$, $h$ and $o$ are the numbers of nodes in the input layer, hidden layer and output layer respectively. Note that the sums start from the second column of each weight matrix, so the bias units are not regularized.
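The regularization term above is a simple sum of squared weights; a minimal sketch in R, assuming the first column of each weight matrix holds the bias weights and is therefore skipped, could look like:

```r
# Regularization term: (lambda / 2m) * sum of squared non-bias weights.
# Theta1: h x (n+1), Theta2: o x (h+1); column 1 of each is the bias.
regularization <- function(Theta1, Theta2, lambda, m) {
  lambda / (2 * m) * (sum(Theta1[, -1]^2) + sum(Theta2[, -1]^2))
}
```

This value is simply added to the unregularized cost $J(\Theta)$.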

During the backward propagation process, we push the calculated outputs backwards through the network and accumulate the gradient from each layer. Because there is only one hidden layer, we can calculate the error of the output layer as $\delta^{(3)}=a_{k}^{(3)}-y_{k}$ and of the hidden layer as $\delta^{(2)}=(\Theta^{(2)})^{T}\delta^{(3)}{\ast}g^{'}(z^{(2)})$, then accumulate them as $\Delta^{(l)}=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^{T}$ to finally get the total unregularized gradient $\frac{\partial }{\partial \Theta^{(l)}_{ij}}J(\Theta)=\frac{1}{m}\Delta^{(l)}_{ij}$. In the end, we just need to regularize it as we did with the cost function, by adding $\frac{\lambda}{m}\Theta^{(l)}_{ij}$.
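The steps above can be sketched in vectorized R as follows (again a sketch with my own names, assuming `Y` is an m x K matrix of 0/1 labels; the vectorized matrix products replace the per-example accumulation of $\Delta^{(l)}$):

```r
sigmoid      <- function(z) 1 / (1 + exp(-z))
sigmoid_grad <- function(z) sigmoid(z) * (1 - sigmoid(z))  # g'(z)

# One pass of backpropagation for a single hidden layer.
backward_prop <- function(X, Y, Theta1, Theta2, lambda) {
  m  <- nrow(X)
  a1 <- cbind(1, X)
  z2 <- a1 %*% t(Theta1)
  a2 <- cbind(1, sigmoid(z2))
  a3 <- sigmoid(a2 %*% t(Theta2))
  d3 <- a3 - Y                                    # delta^(3)
  d2 <- (d3 %*% Theta2[, -1]) * sigmoid_grad(z2)  # delta^(2), bias column dropped
  Grad1 <- t(d2) %*% a1 / m                       # (1/m) * Delta^(1)
  Grad2 <- t(d3) %*% a2 / m                       # (1/m) * Delta^(2)
  # Regularize all non-bias weights.
  Grad1[, -1] <- Grad1[, -1] + lambda / m * Theta1[, -1]
  Grad2[, -1] <- Grad2[, -1] + lambda / m * Theta2[, -1]
  list(Grad1 = Grad1, Grad2 = Grad2)
}
```

The two gradient matrices have the same shapes as $\Theta^{(1)}$ and $\Theta^{(2)}$, which is what an optimizer needs.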

The absolute value algorithm is very similar. Instead of using a 0/1 (true/false) switch to train examples with the same label all at once, it trains one example at a time and calculates the cost as the squared error of each prediction, $(y^{(i)}-h_{\Theta}(x^{(i)}))^{2}$.

Now we can wrap the cost function and the gradient function together and test them with optim() from the default R package "stats". In this case I chose the L-BFGS-B method because it makes use of everything we have at hand and seems to learn faster than most of the alternatives.
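The wiring looks roughly like the sketch below. To keep it self-contained I use a placeholder quadratic cost and gradient standing in for the network's cost_fn/grad_fn; in the real version both take the unrolled weight vector as their first argument:

```r
# Placeholder cost/gradient: stand-ins for the network's unrolled cost functions.
cost_fn <- function(theta) sum((theta - 1)^2)
grad_fn <- function(theta) 2 * (theta - 1)

set.seed(3)
n_params <- 4 * (2 + 1) + 3 * (4 + 1)            # unrolled Theta1 and Theta2 sizes
init <- runif(n_params, min = -0.12, max = 0.12)  # random initialization

fit <- optim(par = init, fn = cost_fn, gr = grad_fn,
             method = "L-BFGS-B", control = list(maxit = 200))
trained <- fit$par   # reshape back into Theta1 and Theta2 after training
```

L-BFGS-B accepts both the objective and its analytic gradient, which is exactly what forward and backward propagation give us.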

The classifier comes first. Three blocks of data are randomly generated and labelled for training. Because we intentionally created them for testing purposes, it is not necessary to divide them into cross-validation and out-of-sample test sets, in contrast to real-world use.
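Generating three labelled clusters takes only a few lines; the cluster centres below are illustrative, not the exact values from the original test:

```r
set.seed(42)
m_per   <- 50   # examples per cluster
centres <- matrix(c(0, 0,  4, 4,  0, 4), ncol = 2, byrow = TRUE)

# Draw m_per Gaussian points around each of the three centres.
X <- do.call(rbind, lapply(1:3, function(k)
  cbind(rnorm(m_per, centres[k, 1]), rnorm(m_per, centres[k, 2]))))
y <- rep(1:3, each = m_per)   # labels 1, 2, 3
```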

Then for absolute values. I reset the y values for this test and tried several values of the regularization parameter $\lambda$ and the number of iterations to get the results below.