# Neural Networks

### From Ufldl

## Revision as of 06:04, 26 February 2011

Consider a supervised learning problem where we have access to labeled training
examples $(x^{(i)}, y^{(i)})$. Neural networks give a way of defining a complex,
non-linear form of hypotheses $h_{W,b}(x)$, with parameters $W, b$ that we can
fit to our data.

To describe neural networks, we will begin by describing the simplest possible neural network, one which comprises a single "neuron." We will use the following diagram to denote a single neuron:

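This "neuron" is a computational unit that takes as input $x_1, x_2, x_3$ (and a +1 intercept term), and
outputs $h_{W,b}(x) = f(W^T x + b) = f\left(\sum_{i=1}^{3} W_i x_i + b\right)$, where $f : \Re \to \Re$ is
called the **activation function**. In these notes, we will choose
$f(\cdot)$ to be the sigmoid function:

$$f(z) = \frac{1}{1 + \exp(-z)}.$$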

Thus, our single neuron corresponds exactly to the input-output mapping defined by logistic regression.
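
As a minimal sketch of the single neuron described above, the snippet below computes the weighted sum of the inputs plus the intercept $b$ and passes it through the sigmoid. The function names `sigmoid` and `neuron_output` and the numeric values are illustrative assumptions, not part of the original notes.

```python
import math


def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(x, w, b):
    """Output of a single neuron: f(sum_i w[i] * x[i] + b)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sigmoid(z)


# Hypothetical inputs, weights, and intercept, chosen so the weighted sum is 0.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]
b = 0.0
h = neuron_output(x, w, b)
print(h)
```

With a weighted sum of zero the neuron sits at the midpoint of the sigmoid, which is the same decision boundary that logistic regression uses.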

Although these notes will use the sigmoid function, it is worth noting that
another common choice for $f$ is the hyperbolic tangent, or tanh, function:

$$f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}.$$

Here are plots of the sigmoid and tanh functions:

[Figure: Sigmoid activation function.]

[Figure: Tanh activation function.]

The $\tanh(z)$ function is a rescaled version of the sigmoid, and its output range is
$(-1, 1)$ instead of $(0, 1)$.
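
To make "rescaled version" concrete: one standard way to write the relationship is $\tanh(z) = 2\,\sigma(2z) - 1$, where $\sigma$ is the sigmoid. The sketch below checks this numerically at a few arbitrarily chosen points. The helper name `tanh_from_sigmoid` is an assumption made for illustration.

```python
import math


def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh_from_sigmoid(z):
    """tanh expressed through the sigmoid: 2 * sigmoid(2z) - 1."""
    return 2.0 * sigmoid(2.0 * z) - 1.0


# Arbitrary test points (an assumption; any real values would do).
zs = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_gap = max(abs(tanh_from_sigmoid(z) - math.tanh(z)) for z in zs)
print(max_gap)  # largest disagreement with math.tanh, expected to be tiny
```

Stretching the sigmoid's output by 2 and shifting it down by 1 is what moves the outputs from between 0 and 1 to between -1 and 1.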

Note that unlike CS221 and (parts of) CS229, we are not using the convention
here of $x_0 = 1$. Instead, the intercept term is handled separately by the parameter $b$.

Finally, one identity that'll be useful later: If $f(z) = 1/(1+\exp(-z))$ is the sigmoid
function, then its derivative is given by $f'(z) = f(z)(1-f(z))$.
(If $f$ is the tanh function, then its derivative is given by
$f'(z) = 1 - (f(z))^2$.) You can derive this yourself using the definition of
the sigmoid (or tanh) function.
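
For reference, here is one possible derivation of both identities, using only the chain and quotient rules. For the sigmoid, note that $1 - f(z) = \frac{e^{-z}}{1+e^{-z}}$, so

$$
f'(z) = \frac{d}{dz}\left(1+e^{-z}\right)^{-1}
      = \frac{e^{-z}}{\left(1+e^{-z}\right)^{2}}
      = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}}
      = f(z)\,(1-f(z)).
$$

For tanh, with $f(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$, the quotient rule gives

$$
f'(z) = \frac{\left(e^{z}+e^{-z}\right)^{2} - \left(e^{z}-e^{-z}\right)^{2}}{\left(e^{z}+e^{-z}\right)^{2}}
      = 1 - \left(\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}\right)^{2}
      = 1 - (f(z))^{2}.
$$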

## Neural Network formulation

A neural network is put together by hooking together many of our simple
"neurons," so that the output of a neuron can be the input of another. For
example, here is a small neural network:

In this figure, we have used circles to also denote the inputs to the network. The circles
labeled "+1" are called **bias units**, and correspond to the intercept term.
The leftmost layer of the network is called the **input layer**, and the
rightmost layer the **output layer** (which, in this example, has only one
node). The middle layer of nodes is called the **hidden layer**, because its
values are not observed in the training set. We also say that our example
neural network has 3 **input units** (not counting the bias unit), 3 **hidden units**,
and 1 **output unit**.

We will let $n_l$
denote the number of layers in our network; thus $n_l = 3$ in our example. We label layer $l$ as
$L_l$, so layer $L_1$ is the input layer, and layer $L_{n_l}$ the output layer.
Our neural network has parameters $(W,b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$, where
we write
$W^{(l)}_{ij}$ to denote the parameter (or weight) associated with the connection
between unit $j$ in layer $l$, and unit $i$ in layer $l+1$. (Note the order of the indices.)
Also, $b^{(l)}_i$ is the bias associated with unit $i$ in layer $l+1$.
Thus, in our example, we have $W^{(1)} \in \Re^{3\times 3}$, and $W^{(2)} \in \Re^{1\times 3}$.
Note that bias units don't have inputs or connections going into them, since they always output
the value +1. We also let $s_l$ denote the number of nodes in layer $l$ (not counting the bias unit).
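
The indexing convention above implies that the weight matrix between layer $l$ and layer $l+1$ has $s_{l+1}$ rows and $s_l$ columns, and the matching bias vector has $s_{l+1}$ entries. The sketch below builds parameters of those shapes for the 3-3-1 example. The function name `init_parameters` and the small random initial values are assumptions for illustration only, since the text here does not say how the parameters are initialized.

```python
import random


def init_parameters(layer_sizes, scale=0.01, seed=0):
    """Build weights and biases as nested lists.

    W[k] has s_{l+1} rows and s_l columns, and b[k] has s_{l+1} entries.
    Python lists are 0-indexed, so W[0] plays the role of W^{(1)}.
    """
    rng = random.Random(seed)
    W, b = [], []
    for s_in, s_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Row i holds the weights feeding unit i of the next layer.
        W.append([[rng.gauss(0.0, scale) for _ in range(s_in)]
                  for _ in range(s_out)])
        b.append([0.0] * s_out)
    return W, b


layer_sizes = [3, 3, 1]  # s_1, s_2, s_3 for the example network
W, b = init_parameters(layer_sizes)
shapes = [(len(W_l), len(W_l[0])) for W_l in W]
print(shapes)  # [(3, 3), (1, 3)]
```

Bias units do not appear in `layer_sizes`, in line with the convention that $s_l$ does not count them.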

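We will write $a^{(l)}_i$ to denote the **activation** (meaning output value) of
unit $i$ in layer $l$. For $l=1$, we also use $a^{(1)}_i = x_i$ to denote the $i$-th input.
Given a fixed setting of
the parameters $W,b$, our neural
network defines a hypothesis $h_{W,b}(x)$ that outputs a real number. Specifically, the
computation that this neural network represents is given by:

$$
\begin{aligned}
a_1^{(2)} &= f(W_{11}^{(1)}x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}) \\
a_2^{(2)} &= f(W_{21}^{(1)}x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}) \\
a_3^{(2)} &= f(W_{31}^{(1)}x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}) \\
h_{W,b}(x) &= a_1^{(3)} = f(W_{11}^{(2)}a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)})
\end{aligned}
$$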

In the sequel, we also let $z^{(l)}_i$ denote the total weighted sum of inputs to unit $i$ in layer $l$,
including the bias term (e.g., $z_i^{(2)} = \sum_{j=1}^n W^{(1)}_{ij} x_j + b^{(1)}_i$), so that
$a^{(l)}_i = f(z^{(l)}_i)$.
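
The computation described above can be sketched as a loop over layers that first forms each weighted sum and then applies $f$ to it. The function name `forward` and all parameter values below are hypothetical, chosen only to make the example runnable. The sketch uses the sigmoid for $f$ and plain Python lists instead of a matrix library.

```python
import math


def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W, b, f=sigmoid):
    """Compute weighted sums and activations layer by layer.

    Returns (zs, activations), where activations[0] is the input a^{(1)} = x,
    zs[k] holds the weighted sums z^{(k+2)}, and activations[-1] is the output.
    """
    activations = [list(x)]
    zs = []
    for W_l, b_l in zip(W, b):
        a_prev = activations[-1]
        # z_i = sum_j W_ij * a_j + b_i for each unit i in the next layer
        z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
             for row, b_i in zip(W_l, b_l)]
        zs.append(z)
        activations.append([f(z_i) for z_i in z])
    return zs, activations


# Hypothetical parameters for the 3-3-1 example network.
W1 = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.1], [0.5, 0.5, -0.5]]
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0, -1.0, 0.5]]
b2 = [0.2]
x = [1.0, 0.0, -1.0]

zs, activations = forward(x, [W1, W2], [b1, b2])
h = activations[-1][0]  # h_{W,b}(x) = a_1^{(3)}
print(h)
```

Keeping the weighted sums alongside the activations is a design choice made here for clarity: it makes the relation between each sum and its activation easy to inspect layer by layer.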