Overview
The idea in this tutorial is to create a very simple (and very inefficient) neural network. There are a number of exercises sprinkled throughout this tutorial. They are there to make it easy to try things out while you are going through this tutorial. Ultimately, I would like you to create an RMarkdown file that contains your exercises and whatever setup code is necessary to make things work correctly (copy-and-paste from this document to your own is certainly acceptable). Be sure to submit both your RMarkdown and the knitted version (to HTML please).
In this tutorial we are going to produce a simple neural network that can calculate a fairly straightforward formula:
\[ f(x,y,z) = \left[\begin{array}{c} 2x+z^2 \\ xy - z\end{array}\right] \]
Since there are no upper bounds on the outputs, we are going to use RELU as our activation function.
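For reference, the activation function we will be using is
\[ \textrm{RELU}(x) = \max(0, x) \]
which places no upper bound on its outputs.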
Obviously our input layer will require 3 nodes and our output layer will require 2. We will have one hidden layer with 4 nodes. The layers will be fully connected.
Neural Net Architecture
The Calculation
The calculation occurs layer by layer, starting with the input layer.
If we treat each layer as a column vector then the weights connecting two layers can be realized as a matrix. Moving from one layer to the next consists of matrix multiplication followed by coordinate-wise application of the activation function.
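In symbols: if \(X_{i-1}\) holds the values in one layer, \(W_i\) is the matrix of weights, and \(A\) is the activation function (applied coordinate-wise), then the values in the next layer are
\[ X_i = A(W_i X_{i-1}) \]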
Below is the (unexpanded) R code to assign some random values to the weights and perform the calculations that occur in a simple feed-forward neural network.
Create Neural Network (Click to expand R code)
set.seed(1233)
X=matrix(c(1,2,2),nrow=3)
W1=matrix(runif(12,-3,3),ncol=3)
W2=matrix(runif(8,-3,3),ncol=4)
RELU=function(x){ifelse(x>=0,x,0)} #Note, this function is vectorized
Now we’re ready to perform a calculation:
A=RELU # Our activation function
A1=W1%*%X # Accumulation phase of the hidden layer
X_1=A(A1) # Hidden-layer values after activation
(Y=W2%*%X_1) # Output values
## [,1]
## [1,] 9.902427
## [2,] -3.827655
Exercise I: Feed Forward Calculation
Now you’re ready to do one of your own:
Replicate the example from the previous section but make your hidden layer contain 5 nodes instead of 4 and adjust the code accordingly. Please note that you will get much more out of this exercise if you type the commands yourself instead of using copy-and-paste. You can (and should) use RELU and A from the previous code example without typing them in yourself. Also note, at this stage, you aren't doing anything other than calculating an output… we'll work on training later in this tutorial. When done, copy-and-paste your work into the RMarkdown document that you will be submitting. Be sure to clearly indicate the exercise in your document, and start from the following seed:
set.seed(1234)
Closures
I don’t want to encourage bad habits when programming, so I think it would be better to create neural networks in a more flexible and self-contained fashion. So we are going to take advantage of the way that R handles “name spaces”. You don’t need to completely understand the code that I’m going to show below… but it wouldn’t hurt. The key idea is that a function in R is really something known as a closure. That means functions come equipped with their own name space. Let’s see a simple example before using it to encapsulate the neural network calculation:
iterator=function(){
i<-0
set=function(n=0){i<<-n}
val=function(){i}
inc=function(){i<<-i+1}
dec=function(){i<<-i-1}
return(list(set=set,val=val,inc=inc,dec=dec))
}
The variable \(i\) defined in the function iterator() is part of the namespace of the function. The internal functions, such as set() and val(), are defined inside of that namespace. They are returned as list elements, but when they are executed, they are executed inside the namespace in which they were defined. These namespaces form a tree structure. The namespace of cnt$inc() (for the counter cnt that we create below) is inside the namespace associated to cnt (almost). So cnt$inc() can “see” the value of \(i\) as it exists inside of cnt. If we tried to assign a value to \(i\) inside of cnt$inc() it would be in the namespace of cnt$inc() and NOT change the value of \(i\) inside of cnt. However the <<- operator recursively ascends the namespace tree until it finds a match (or reaches the root… ie the global namespace). If it finds a match, it makes the desired change. So i<<-i+1 changes the value of \(i\) inside of cnt… making it visible to all the other internal functions… most importantly to cnt$val().
Let's see how this works in practice:
cnt=iterator()
cnt2=iterator()
cnt$val()
## [1] 0
cnt$inc()
cnt$val()
## [1] 1
cnt2$set(10)
cnt2$val()
## [1] 10
cnt$val() #Note that the output of cnt$val() was NOT changed by cnt2
## [1] 1
Exercise II: Closure
Create a closure that contains an internal variable of type list called vars. (A template is provided below.)
Your closure should return two functions (defined internally):
- a single setter called set()
- a single getter called get()
The setter should take two arguments: 1. A character string containing the name of the variable to set 2. The value to assign to that variable
The getter should take one argument: 1. The name of the variable to return
The getter should return NULL if the variable name is not found.
You should create two instances of your closure and verify that setting and getting on one of the two instances has no influence on the getting and setting of the second instance (that code is included below). Again, when you're comfortable with what you've done, create a section in your Assignment RMarkdown and add your work there.
varSet=function(){
vars=list()
set=function(){ # Fix this line
# Add the function body
}
get=function(){ # Fix this line
# Add the function body
}
return(list(set=set)) #Fix this line
}
A=varSet()
B=varSet()
A$set("a",10)
B$set("a",15)
A$get("a") # shoudl return 10
B$get("a") # should return 15
B$get("badvar") #should return NULL
Generic Functions
Generic functions are part of R’s first object oriented system (there are at least three of them as of the writing of these notes). They’re pretty informal and pretty easy to understand.
Hadley Wickham, Chief Scientist of RStudio and creator of the tidyverse and ggplot2 (amongst many other things), has made a number of books freely available online. You might find the object-oriented chapters in his online book Advanced R to be useful.
The key thing about generic functions is that variables can be given a class attribute, e.g.:
myObj = "Meaningless String"
class(myObj)="myClass"
and certain functions, such as summary and print, decide how to treat their input based upon the class of the arguments. Functions like these are known as generic functions. Typically, a generic function will look at the class of an object and then use that class to make a good guess as to the name of a function that is designed to work with that class. If the generic function finds that name… then it calls the function for that class.
An example is helpful. Consider the R expression print(myObj). If myObj has a class attribute of myClass then print() will look to see if there is a function known as print.myClass(). If such a function exists then print(myObj) will actually call print.myClass(myObj):
print.myClass=function(x){cat("I don't care about",x,"\n")}
print(myObj)
## I don't care about Meaningless String
Note that, by default, an object entered into the console will invoke its print method:
> myObj
I don't care about Meaningless String
However, when invoked in RMarkdown that won’t always be the case.
We can use the function unclass() to reveal the fundamental object:
x=myObj
unclass(x) # Returns x stripped of its class attribute
## [1] "Meaningless String"
This is particularly useful since just about every language construction is secretly a generic function. For example, by the use of backticks we can override the meaning of the subset operation:
`[.myClass`=function(x,...){nchar(x)}
print(myObj)
## I don't care about Meaningless String
myObj[10] #The value between [ and ] is meaningless in this example
## [1] 18
Exercise III: Generic Functions
Create a function new_badInt() that creates an object of class badInt. Such a function is analogous to a generator (constructor) in more traditional object-oriented languages.
Your generator should take a numeric value to use as the integer. There are no internal variables; however, you should override + (I have included the outline of how to do this in the sample code below) to return the product of the two badInts. You should also create print.badInt to make the output of your terrible integers look like any other integer. WARNING: you may need to run your code in the console to ensure the objects behave correctly.
new_badInt=function(n){
# Fill in these details
return(tmp)
}
`+.badInt`=function() #fix this line
`print.badInt`=function(){} #fix this line
new_badInt(3) # should show [1] 3 in the console
new_badInt(3)+new_badInt(2) # should show [1] 6 in the console
Improving the Neural Net
I'm going to take advantage of closures to define a function nn() that will generate closures that encapsulate the parameters of a neural network. This way we can store the intermediate results of the calculations and make the steps of the calculations (and training) accessible to our observation.
I could choose to pass all the initialization parameters into the function nn(), but I'm going to forego that option (at least for the moment) in order to make the architecture of the neural networks a bit clearer. So we'll define nn() to take a single parameter:
nn(din)
(din is the number of nodes in the input layer; the remaining layers, including the output layer, are added afterwards)
In order to have complete access to the intermediate stages of the calculations, we will store accumulation-phase and final-values for the nodes in the other layers. We’re also going to have to be a bit clever about how we deal with functions because we need to capture forward calculations and transpose-direction error propagations.
I'd like to make the interface as simple as possible, but since I'd also like to be able to deal with things (at a later date) like convolution and softmax calculations, I'll need something fairly flexible… so we're going to create the “extendedFunction” class as a closure. The idea is that an extended function (usually) exposes:
- A parameterized function
- The total derivative of that function
- An update function for modifying the parameters during training
- An initialization function for randomly setting the parameters during setup
I waffled back and forth on whether or not I wanted to use extended functions for both layer-to-layer functions and for activation functions, but decided the extra abstraction probably wasn’t worth the effort.
new_ef() will create the extended functions. It requires as arguments:
- f: Function taking an argument \(x\) and a collection of params
- df: The total derivative of \(f\) at \(x\)
- update: Function used during training to update params (need not exist)
- init: Sets initial values for params (need not exist)
R code for extended function generator (click to expand)
new_ef = function(f,df,update=NULL,init=NULL){
params = NULL # Needed for training
df = df # Assigning the arguments to local names captures them in this closure's namespace
init = init
update = update
.update = function(x,delta,eta=0.01){
if(is.null(update)){return(NULL)}
params<<-update(x,delta,params,eta)
}
.init = function(){
if(is.null(init)){return(NULL)}
params<<-init()
}
.init() #Initialize params
.f = function(x){f(x,params)} # Call f using the stored params
.df = function(x){df(x,params)} # df is supposed to return the total derivative of f at x
getParams = function(){params}
setParams = function(new){params<<-new}
results = list(
f = .f,
df = .df,
init = .init,
update = .update,
params = getParams,
setParams = setParams
)
class(results)="extendedFunction"
return(results)
}
So most of the components we'll need for a straightforward calculation are already in place. Here's the idea for the closure nn():
Internal Variables:
- din: The number of nodes in the input layer
- \(X\): A LIST of \(n_i \times 2\) matrices representing all the layers other than the input
- \(W\): A LIST of “layer functions”. W[[i]] takes the node values from layer \(i-1\) and moves them to layer \(i\). We'll use extended functions (as defined previously)
- \(A\): A LIST of activation functions. These are not extended functions (since we're not updating them), but they are a list of a function and its derivative. If \(f\) acts coordinate-wise then so should \(df\). If f acts a bit more generally (say softmax), then df needs to reflect this as well.
We will use the $addLayer() function to add a new layer. For the moment, the only type of layer that we will allow is “fully connected” (coded as “fc”). We won't create a separate function for adding the output layer… the key is that we'll use the identity function (“id”) (or, perhaps, “softmax”) as the activation function for the final layer.
Definition of neural network generator (click to expand)
set.seed(1234)
nn=function(din){
din = din
X = list() #Entries are n x 2 matrices
# (col 1 is accumulation)
# (col 2 is value after activation function)
A = list() # List of activation functions to apply, coordinate-wise, to each level
# by default, this function is the identity for input and output layers
#
W = list() # The list of the layer-to-layer functions
.activationHoard=list( # Holder of all the activation functions and their derivatives:
id=list(
f=function(x){x},
df=function(x){x*0+1} # The derivative of the identity is 1, coordinate-wise (keeps the shape of x)
),
RELU=list(
f=function(x){ifelse(x>=0,x,0)},
df=function(X){
results=as.numeric(X>=0)
dim(results)=dim(X)
return(results)
}
)
)
addLayer=function(m,A.f="RELU",type="fc"){
if(type=="fc"){ # Add a new layer that is fully connected to the previous layer
i = length(X) # Number of hidden layers
if(i==0){ # Determine dimension of previous layer
n=din
} else {
n=nrow(X[[i]])
}
layer=matrix(0,nrow=m,ncol=2)
colnames(layer)=c("a","x")
X[[i+1]]<<-layer
W[[i+1]]<<-new_ef(
f = function(x,params){params%*%x}, # matrix multiplication
df = function(x,params){params}, # total derivative
update = NULL,
init = function(){matrix(runif(m*n,-1,1),ncol=n,nrow=m)}
)
A[[i+1]]<<-.activationHoard[[A.f]]
return()
}
}
.calc=function(X0){
if(nrow(X0) != din){stop("Wrong input dimension")}
V=X0
for(i in 1:length(X)){
tmp=W[[i]]$f(V) #W[[i]] is an extended function
X[[i]][,"a"]<<-tmp
X[[i]][,"x"]<<-A[[i]]$f(tmp)
V=X[[i]][,"x"]
}
return(matrix(X[[i]][,"x"],nrow=nrow(X[[i]])))
}
result=list(calc=.calc,addLayer=addLayer)
class(result)="nn"
return(result)
}
N=nn(3)
N$addLayer(4)
## NULL
N$addLayer(2,"id")
## NULL
N$calc(matrix(1:3,nrow=3))
## [,1]
## [1,] -1.094852
## [2,] 2.013506
Reproducing the Details
At this point our neural net is nothing more than a random calculation. Let’s use a little bit of R-magic to peer inside at the details, and reproduce the calculations by hand (just to make sure things work):
A=RELU
X0=matrix(1:3,nrow=3)
(W1=get("W",environment(N$calc))[[1]]$params())
## [,1] [,2] [,3]
## [1,] -0.7725932 0.7218308 0.33216752
## [2,] 0.2445988 0.2806212 0.02850228
## [3,] 0.2185495 -0.9810085 0.38718258
## [4,] 0.2467589 -0.5348990 0.08994967
(A1=W1%*%X0)
## [,1]
## [1,] 1.6675709
## [2,] 0.8913481
## [3,] -0.5819198
## [4,] -0.5531901
(X1=A(A1))
## [,1]
## [1,] 1.6675709
## [2,] 0.8913481
## [3,] 0.0000000
## [4,] 0.0000000
get("X",environment(N$calc))[[1]]
## a x
## [1,] 1.6675709 1.6675709
## [2,] 0.8913481 0.8913481
## [3,] -0.5819198 0.0000000
## [4,] -0.5531901 0.0000000
Well… the first layer works as planned; now let's check the output layer:
(W2=get("W",environment(N$calc))[[2]]$params())
## [,1] [,2] [,3] [,4]
## [1,] -0.4345328 -0.4153683 -0.4275534 -0.6265544
## [2,] 0.8468670 0.6745913 -0.4663584 -0.5355482
(A2=W2%*%X1) #A2 and X2 are equal in the output layer
## [,1]
## [1,] -1.094852
## [2,] 2.013506
get("X",environment(N$calc))[[2]]
## a x
## [1,] -1.094852 -1.094852
## [2,] 2.013506 2.013506
Okay… so that looks good. There’s a good chance we have the forward calculation phases working correctly.
Improving Performance (Training)
Ultimately, the neural network that we are developing is a function \(nn:\mathbb{R}^3 \to \mathbb{R}^2\). Let’s use the function we discussed at the beginning of this tutorial (and ensure we have a version that can take a column vector as an input):
\[ f(x,y,z) = \left[\begin{array}{c} 2x+z^2 \\ xy - z\end{array}\right] \]
We will measure the cost using the sum of the squared differences:
f=function(x,y,z){matrix(c(2*x+z^2,x*y - z),nrow=2)}
F=function(X){do.call("f",as.list(X))}
Cost=function(X,NN,Truth){
target=Truth(X)
output=NN$calc(X)
sum((target-output)^2)
}
Our neural network is already defined so let’s look at the output, the target, and the cost:
X0=matrix(1:3,nrow=3)
(target=F(X0))
## [,1]
## [1,] 11
## [2,] -1
(output=N$calc(X0))
## [,1]
## [1,] -1.094852
## [2,] 2.013506
Cost(X0,N,F)
## [1] 155.3667
Well, not surprisingly, that was not very good. So how do we adjust the neural network to perform better?
The key ideas come from multivariable calculus. The total derivative of a vector-valued multivariable function is a matrix. If you took linear algebra, then you'll know that matrices are linear functions. If \(f\) is our function then \(\textrm{D}_f\) will denote the total derivative. The value of its entries changes for different possible inputs to \(f\). In our example \(f:\mathbb{R}^3 \to \mathbb{R}^2\) so we write \(\textrm{D}_f(x,y,z)\). It is the linear function that approximates \(f\) as closely as possible for values near \((x,y,z)\).
This is the secret of total derivatives: they are linearizations of functions. The smaller the change in the inputs, the more accurate the approximation.
If \(f:\mathbb{R}^n \to \mathbb{R}^m\) then at \(X \in \mathbb{R}^n\) the matrix is an \(m \times n\) matrix. Which means it can also be interpreted as a function: \[ \textrm{D}_f(X):\mathbb{R}^n \to \mathbb{R}^m \]
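As a concrete example, the total derivative of the target function \(f\) from the start of this tutorial (computed entry-by-entry as partial derivatives) is
\[ \textrm{D}_f(x,y,z) = \left[\begin{array}{ccc} 2 & 0 & 2z \\ y & x & -1 \end{array}\right] \]
(We won't actually use this matrix during training… as noted below, we want the training process to be agnostic to \(f\).)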
If we apply the cost function (or loss function) to our toy network (in order to quantify how close our prediction is to the target) then our neural network becomes a function \(\mathbb{R}^3 \to \mathbb{R}\).
If we think of the data \(X\) as being fixed and instead consider the weights as adjustable, then we can view the neural-network as a function of the weights:
\[ L(W_1,W_2) = \textrm{Cost}(Y,f(X)) \]
We are, of course, leaving out some details… \(Y\) is calculated USING \(X\), \(W_1\), and \(W_2\). Let's leave \(f(X)\) outside the Cost function (since it doesn't depend upon \(W_1\) or \(W_2\)).
So it would also be appropriate to write this as:
\[ L(W_1, W_2) = \textrm{Cost}(Y) \]
More precisely:
\[ \begin{aligned} L(W_1,W_2) &= \textrm{Cost}(Y)\\ &=\textrm{Cost}(W_2(A(W_1\cdot X))) \end{aligned} \]
If we want to emphasize the dependence of the output \(Y\) and the target \(f(X)\) on \(X\) we can also write:
\[ L(W_1, W_2) = \textrm{Cost}(X) \]
As long as we are very clear about what we mean by the function \(\textrm{Cost}()\), these are all appropriate expressions.
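In R this viewpoint is literal. Here is a small sketch that treats the weights as the arguments and the data as fixed (it reuses RELU and F as defined in this tutorial; the name L_of_W is mine, not part of our engine):
L_of_W = function(W1, W2, X0){
  Y = W2 %*% RELU(W1 %*% X0) # forward pass with the supplied weights
  sum((Y - F(X0))^2)         # compare the output to the target f(X0)
}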
Let's look at a block diagram where I represent the activation function as \(A\) and LM means Left Multiplication (as matrices):
___ ______ ______ ____________
| | | | | | Y | |
X---->|LM |--->|apply | ---> | LM | -----> |calculate | ----> COST(X)
|by | | A | | by | | cost |
|W1 | | | | W2 | | |
|___| |______| |______| |____________|
In the diagram above I expressed the final cost as a function of \(X\). When I want to think of the cost in terms of the weights I'll use \(L(W_1,W_2)\).
Improving Performance (more about derivatives)
In any event, the total derivative of any function that has a derivative is a matrix. The cool thing is that function composition becomes matrix multiplication and so, expressed as a composition:
\[ \textrm{COST}(W2(A(W1(X)))) \]
We are guided to create a series of matrices that multiply together to create a row vector. Note that multiplication by a matrix is a linear function, so the best linear approximation to a linear function is, in fact, itself. Thus:
________ ____ ______________ _____
| | | | | | | |
| D_Cost | | W2 | |componentwise | | W1 |
| at | |____| | application | |_____|
| Y | | of A' at |
|________| | W1(x) |
|______________|
The componentwise application of \(A'\) will depend upon the activation function. We’ve chosen \(\textrm{RELU}\) so the matrix will have entries that are 0 or 1 exclusively, depending upon the values of the inputs.
It's the derivative of the cost function that's interesting. Let's let \(T=f(X)\) be the column vector output when \(f\) is applied to \(X\). We are NOT going to take derivatives of \(f\)… we want our training process to be agnostic to \(f\):
\[ \textrm{COST}(y_1,y_2) = (T_1-y_1)^2 + (T_2-y_2)^2 \]
This is a function \(\mathbb{R}^2 \to \mathbb{R}^1\) so it becomes a \(1 \times 2\) row vector:
\[ \begin{aligned} \left[\frac{\partial\textrm{COST}}{\partial y_1}\qquad \frac{\partial\textrm{COST}}{\partial y_2}\right] &= \left[-2\left(T_1-y_1\right) \quad -2\left(T_2-y_2\right)\right]\\ &=-2\ \textrm{err} \end{aligned} \]
This tells us, no surprise, that if we want to maximize the COST of \(Y\) we need to move AWAY from the target and to minimize we move TOWARDS the target.
Improving Performance (The gradient)
But look at what happens to the matrix equation:
____ ______________ _____
| | | | | |
[errors] | W2 | |componentwise | | W1 |
|____| | application | |_____|
| of A' at |
| W1(x) |
|______________|
If we treat the error as a column vector then we call it the gradient (terminology varies a bit on this, actually). That is equivalent to transposing what we see above:
\[ [W_1]^\textrm{t} [A']^\textrm{t} [W_2]^\textrm{t} [\textrm{col vector of errors}] \]
Notice that \(\delta_2=[A']^\textrm{t} [W_2]^\textrm{t} [\textrm{col vector of errors}]\) has as many elements as the hidden layer.
It’s not intuitively obvious (at least to me), but the gradient tells us the direction of greatest change.
So the gradient of the \(\textrm{COST}\) tells us that we can make the cost larger by moving away from the target. Conversely we can make it smaller by moving TOWARDS the target.
That’s probably not too surprising. What is interesting, is what happens if we use the transpose of \(W_2\) to move the errors back from the final layer to the hidden layer. This vector is \(\delta_2\). This is telling us how to perturb the values \([W_1][X]\) to change the total cost as rapidly as possible (again, it’s the gradient of the values of the hidden layer before they reach the activation function).
I was hung up for a long time on the idea that this perturbation would need to move the output \(W_2(A(\#))\) along the gradient defined by the errors. But it turns out this is not true. The error vector determines the line that will most efficiently move \(Y\) towards (or away from) the target. BUT the application of \(W_2\) has an influence on the change too. Here's how to view it:
Imagine a fixed value \(p\) (usually a vector or a point) and a function \(g\) that takes \(p\) into some other space on which we already know the gradient of the cost. We care about \(\textrm{COST}(g(p))\), and we already understand the gradient of \(\textrm{COST}\): it is \(-2\ \textrm{ERR}\). That is, we know \((\nabla \textrm{COST})(g(p))\).
Now imagine a tiny sphere centered on \(p\). The direction indicated by the gradient of the composition will tell us, approximately, which of those points will result in the greatest change in the total \(\textrm{COST}\).
The vector taking us from \(p\) to that point will, usually, not take \(g(p)\) in the direction of \(\textrm{ERR}\). This is because derivatives linearize: an angle OFF the “best” direction might still cause a bigger change in the \(\textrm{COST}\) if it moves \(g(p)\) far enough to make an overall improvement. Surprisingly (at least to me) it is the transpose of \(W_2\) that takes the error back to the previous level. We also need to adjust that value using the derivative of the activation function, but that's our final adjustment.
Gradients for the Layers
We still don't know how to update the weights, but we're getting close. Let's define a function to capture the derivative of \(\textrm{RELU}\):
DR = function(X){
results=as.numeric(X>=0)
dim(results)=dim(X)
return(results)
}
So, reproducing all the previous steps in the most basic way possible:
set.seed(1234)
X=matrix(c(1,2,2),nrow=3) # Our input
W1=matrix(runif(12,-3,3),ncol=3) # Random collection of weights
W2=matrix(runif(8,-3,3),ncol=4) # Next random collection
A=RELU # Use Relu for activation function
A1= W1%*%X # Accumulation phase of Layer 1
X_1=A(A1) # Node values of Layer 1
Y=W2%*%X_1 # Output values
f=function(x,y,z){matrix(c(2*x+z^2,x*y - z),nrow=2)} # The target function (as defined earlier)
F=function(X){do.call("f",as.list(X))}
(sum((Y-F(X))^2)) # Current Cost (loss function)
## [1] 446.4246
(ERR=-2*(Y-F(X))) # I'm doing the NEGATIVE of the gradient so I decrease the cost
## [,1]
## [1,] 28.89616
## [2,] -30.83359
(delta.2 = ERR)
## [,1]
## [1,] 28.89616
## [2,] -30.83359
(delta.1 = DR(X_1)*(t(W2)%*%delta.2))
## [,1]
## [1,] -116.004836
## [2,] -98.407857
## [3,] 6.074561
## [4,] -4.776426
These are the gradients (final and intermediate) that tell us how to adjust the node values (at the accumulation phase) in the various layers to improve the cost.
Gradient for the Edges
Now it turns out that the update rule for the edges is related to the \(\delta\)'s from the last section. The column vector \(\delta_i\) has \(n_i\) entries. The column vector \(X_{i-1}\) has \(n_{i-1}\) entries, so the matrix
\[ [\delta_i] [X_{i-1}]^\textrm{t} \]
will match the dimensions of \(W_i\):
X_0=X
Upd.2 = delta.2 %*% t(X_1)
Upd.1 = delta.1 %*% t(X_0)
Now we have to be a bit careful about how much we change the weights. So we’re going to normalize and make the steps fairly small:
Upd.2 = Upd.2/sqrt(sum(Upd.2^2))/100
Upd.1 = Upd.1/sqrt(sum(Upd.1^2))/100
Since the negative sign is already built into the errors, adding the updates will decrease the cost. Let's check our results, update, and check again:
sum((Y-F(X_0))^2) # Loss BEFORE update
## [1] 446.4246
W1 = W1 + Upd.1
W2 = W2 + Upd.2
Y=W2%*%A(W1%*%X_0)
sum((Y-F(X_0))^2) # Loss AFTER update
## [1] 439.8886
Nice to see it worked.
Exercise IV: Updating the Weights
Now you’re ready. Take YOUR neural network from Exercise I: Feed Forward (recreate all the steps). Calculate the target value and the output. Calculate the cost. Then apply the update procedure as shown in the previous section and finally see if your neural network showed improvement (hint… it should). Again, make certain to create a new section in the RMarkdown you’re creating for submission and copy your work from this exercise to the document.
set.seed(1234)
Training and the nn Class
Training requires a target so we can assess how good (or how bad) our calculation was. The results of that calculation will tell us how to update the parameters in the layer-to-layer calculations.
set.seed(1234)
nn=function(din){
din = din
X = list() #Entries are n x 2 matrices
# (col 1 is accumulation)
# (col 2 is value after activation function)
A = list() # List of activation functions to apply, coordinate-wise, to each level
# by default, this function is the identity for input and output layers
#
W = list() # The list of the layer-to-layer functions
delta = list() # The "error" at each accumulation phase
logistic=function(x){(1+exp(-x))^(-1)}
.activationHoard=list( # Holder of all the activation functions and their derivatives:
logistic=list(
f=logistic,
df=function(x){logistic(x)*(1-logistic(x))}
),
id=list(
f=function(x){x},
df=function(x){x*0+1} # The derivative of the identity is 1, coordinate-wise (keeps the shape of x)
),
RELU=list(
f=function(x){ifelse(x>=0,x,0)},
df=function(X){
results=as.numeric(X>=0)
dim(results)=dim(X)
return(results)
}
),
tanh=list(
f=function(x){tanh(x)},
df=function(x){1-tanh(x)^2}
)
)
addLayer=function(m,A.f="RELU",type="fc"){
if(type=="fc"){ # Add a new layer that is fully connected to the previous layer
i = length(X) # Number of hidden layers
if(i==0){ # Determine dimension of previous layer
n=din
} else {
n=nrow(X[[i]])
}
layer=matrix(0,nrow=m,ncol=2)
colnames(layer)=c("a","x")
X[[i+1]]<<-layer
W[[i+1]]<<-new_ef(
f = function(x,params){params%*%x}, # matrix multiplication
df = function(x,params){params}, # total derivative
update = function(x,delta,params,eta){ # params comes from extended function closure
upd=delta %*% t(x)
upd=upd*eta #eta is the learning rate
#upd=upd/sqrt(sum(upd^2))*eta
#if(any(is.nan(upd))){
if(any(is.infinite(upd))){
# save(delta,x,upd,eta,file="problematic.rdata")
#stop("Encountered UPD")
return(params)
} # Don't update if there are problems
return(params+upd)
}, # (dimension checking could be added here)
init = function(){matrix(runif(m*n,-1,1),ncol=n,nrow=m)}
)
A[[i+1]]<<-.activationHoard[[A.f]]
delta[[i+1]]<<-matrix(0,nrow=m)
return()
}
}
.calc=function(X0){
if(nrow(X0) != din){stop("Wrong input dimension")}
V=X0
for(i in 1:length(X)){
tmp=W[[i]]$f(V) #W[[i]] is an extended function
X[[i]][,"a"]<<-tmp
X[[i]][,"x"]<<-A[[i]]$f(tmp)
V=X[[i]][,"x"]
}
return(matrix(X[[i]][,"x"],nrow=nrow(X[[i]])))
}
.update=function(X0,T,eta=0.01){ #Need the raw input, and the target
n=length(X)
Y=.calc(X0)
i=n
while(i>0){
if(i==n){
delta[[i]]<<- (-2)*(Y-T)
}else {
df=A[[i]]$df #componentwise function
V=X[[i]][,"x"]
M=t(W[[i+1]]$df())
delta[[i]] <<- df(V)*(M%*%delta[[i+1]])
}
if(i==1){
V.prev=X0
} else {
V.prev=X[[i-1]][,"x"]
}
W[[i]]$update(V.prev,delta[[i]],eta)
i=i-1
}
}
result=list(calc=.calc,addLayer=addLayer,update=.update)
class(result)="nn"
return(result)
}
N=nn(3)
N$addLayer(4)
## NULL
N$addLayer(2,"id")
## NULL
X0=matrix(1:3,nrow=3)
Y=N$calc(X0);T=F(X0);sum((Y-T)^2)
## [1] 155.3667
N$update(X0,T)
Y=N$calc(X0);T=F(X0);sum((Y-T)^2)
## [1] 124.5157
Well… that was an improvement. Let's run through 100 cycles and see how we do.
How does our machine work on different values? Let's randomly select some vectors and try.
We see improvement, but reach something of an impasse.
Exercise V: Training Against a Single Value
- Select some value for \(X0\).
- Leave it fixed for 1000 iterations.
- Repeat the process (same value of \(X0\)) until some pattern arises in the plot.
- What are you seeing and why?
- Change the .update() function (in the generator nn()) to be:
update = function(x,delta,params,eta){
upd=delta %*% t(x)
upd=upd/sqrt(sum(upd^2))*eta
if(any(is.nan(upd))){
return(params)
} # Don't update if there are problems
return(params+upd)
}
- Repeat steps 1-4
- When a pattern becomes obvious try decreasing the value of \(\eta\). What happens and why?
- What's with that is.nan() in the .update() function? When does that happen? Why?
- Prepare a short writeup on your findings (with pictures). (Copy and paste from this document to your writeup is okay.) Eventually you will be asked to submit your RMarkdown and knitted version to Canvas after adding some more details from the next few exercises.
set.seed(1234)
N2=nn(3)
N2$addLayer(5)
N2$addLayer(2,A.f = "id")
X0=matrix(runif(3,-1,1),nrow=3)
T=F(X0)
results=replicate(1000,{
N2$update(X0,T,eta=0.01);Y=N2$calc(X0);sum((Y-T)^2)}
)
plot(results)
Bias
I'll presume you figured out why there is an is.nan() test in the modified .update() function. Not updating the weights in that situation is certainly a possible solution. Not normalizing the update direction is a solution too… but that will generate “run-away” values (which is why there is a check for infinity).
But there is a better solution to handle the situation. The idea is to give every layer (except the output layer) a “virtual node” whose value is always 1. Those virtual nodes are connected by edges to the next layer and have weights (just like a normal node). Since the node value never changes, we can view the layer-to-layer calculation as being of the form
\[ WX + b \]
where \(b\) is a column vector of biases (the weights connecting the virtual node from the previous layer to the nodes in the current layer).
We could certainly add biases to our nn() generator, but the code is already a bit more unwieldy than I had planned for the tutorial, so we'll leave it at that (a sketch of what the change might look like follows).
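For the curious, here is a minimal sketch of how the extended function built inside addLayer() might change to carry a bias. This is one possible implementation that I am assuming (params becomes a list holding both W and b), not code from our engine; m and n are the layer dimensions as in addLayer():
f = function(x,params){params$W %*% x + params$b}
df = function(x,params){params$W} # the derivative with respect to x is still W
init = function(){list(W=matrix(runif(m*n,-1,1),nrow=m,ncol=n),
                       b=matrix(runif(m,-1,1),nrow=m))}
update = function(x,delta,params,eta){
  params$W = params$W + eta*(delta %*% t(x)) # same rule as before
  params$b = params$b + eta*delta # the virtual node feeding b always has value 1
  return(params)
}
Notice that the bias update is just \(\eta\,\delta\), because the “virtual node” attached to \(b\) always has the value 1.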
Exercise VI: Multiple Architectures
Now let's try using multiple values for \(X0\) (notice that, in the code below, \(X0\) is randomly generated every iteration).
- See what the code below generates for 1000 iterations.
- Modify the limits on runif. What happens to the cost when that occurs?
- Now for the fun part: modify the architecture a bit (note: make sure your final layer is N2$addLayer(2,A.f = "id")):
  - Try more hidden layers
  - Try layers with different numbers of nodes
  - Try layers with different activation functions
- Pick a few (let's say 3) of your favorite experimentations. Add a section to your writeup where you produce (and discuss) your architectures and display their results. You are encouraged to perform more than one round of updates and to modify \(\eta\) when/if it helps. Do you ever see indications of the error rates rising? What's happening? Is there any way to prevent it or reverse the trend?
set.seed(1234)
N2=nn(3)
N2$addLayer(5)
N2$addLayer(5)
N2$addLayer(2,A.f = "id")
results=replicate(1000,{
X0=matrix(runif(3,-1,1),nrow=3)
T=F(X0)
N2$update(X0,T,eta=0.01);Y=N2$calc(X0);sum((Y-T)^2)}
)
plot(results)
Exercise: Putting it all together
Your final requirement for this assignment is, in the language of your choice, to create a simple neural network engine. Your engine should be able to:
- Create feed-forward neural networks of different architectures (ie, differing numbers of hidden layers and differing numbers of nodes in each layer; you aren't required to allow different activation functions on the hidden layers, but it would be nice)
- Perform forward-calculation
- Correctly update weights in response to training data
You can use my engine (from this document) as a sample to guide your work (or do it your own way). You are not required to implement bias or any processing on the output nodes.
You must submit
- The complete source code of your engine
- Directions for how to compile/utilize
- Some examples demonstrating the effectiveness of your engine
Some Final Words
Our approach of approximating a known function is unusual, but it effectively provides an unlimited number of training cases. Real-world data is finite and limited by what's actually available.
Training in Batches
We dealt with our cases one-by-one, updating the weights each time. We had definite issues arising from not including bias nodes. Perhaps our final accuracy was also suffering from the infinite number of cases (a case of chasing too many rabbits and catching none). Perhaps it would be better to pick 100 examples and train on JUST those until the neural network becomes better.
One approach is to produce mini-batches from a training set, calculate a global loss across all the examples in the mini-batch, and then do an update to the weights. The idea being that perhaps the average error will help us update more effectively.
Taking that idea to a bigger extreme is the use of batches (larger than mini-batches).
Regardless of the approach used, the neural network is usually iterated (eventually) over all the examples in the training set more than once. One complete pass through every example is known as an epoch.
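Here is a sketch of the mini-batch idea using the by-hand two-layer network (it reuses RELU, DR, and F from earlier; the name minibatch_update and the choice to average the updates are mine):
minibatch_update = function(W1, W2, Xs, eta=0.01){
  U1 = W1*0; U2 = W2*0                  # accumulators shaped like the weights
  for(X0 in Xs){                        # Xs is a list of input column vectors
    A1 = W1 %*% X0
    X1 = RELU(A1)
    Y  = W2 %*% X1
    delta2 = -2*(Y - F(X0))             # negative gradient at the output
    delta1 = DR(X1)*(t(W2) %*% delta2)  # error pushed back through W2
    U2 = U2 + delta2 %*% t(X1)
    U1 = U1 + delta1 %*% t(X0)
  }
  list(W1 = W1 + eta*U1/length(Xs),     # apply the AVERAGED update once
       W2 = W2 + eta*U2/length(Xs))
}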
The Gradient Descent Algorithm
We also suffered from having a fairly primitive gradient descent algorithm. All we had to work with was \(\eta\) (known in some contexts as the learning rate). We changed it by hand, and did not write an update algorithm to adjust it automatically.
Notice that when we chased the same example repeatedly we dramatically improved the loss (for that specific input).
Some approaches provide a type of momentum to the update… so that the update from the previous case is averaged (usually some form of weighted average) with the update suggested by the current case.
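A hedged sketch of that idea (the names momentum_update, state, and mu are mine): the previous update is carried along as a “velocity” and blended with the new one.
momentum_update = function(params, grad, state, eta=0.01, mu=0.9){
  v = mu*state + eta*grad    # mu weights the previous update (the momentum)
  list(params = params + v,  # take the blended step
       state = v)            # remember it for the next case
}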
Limitations of Approximation
Ultimately, neural networks are not designed to work well on unconstrained input spaces, particularly when the output is unconstrained and non-linear.
You saw how much better the network performed when the input nodes were constrained to be between -1 and 1. A common first step before training a neural network is to normalize (also called standardize) the training data. We've briefly discussed (in class) two common techniques, sketched in code after the list:
- min-max normalization
- Statistical normalization (based on calculating Z-scores)
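Both are one-liners in R (the function names are mine):
minmax = function(x, lo=-1, hi=1){lo + (hi-lo)*(x - min(x))/(max(x) - min(x))} # min-max normalization to [lo,hi]
zscore = function(x){(x - mean(x))/sd(x)} # statistical normalization (Z-scores)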
Packages in R
Premade packages, such as nnet or neuralnet, are much faster and produce far more accurate approximations.