Neural Networks

Revision as of 05:29, 26 February 2011 (view source)

(Created page with "Consider a supervised learning problem where we have access to labeled training examples <math>(x^{(i)}, y^{(i)})</math>. Neural networks give a way of defining a complex, non-l...")

Latest revision as of 19:38, 6 April 2013 (view source)

Wikiroot (Talk | contribs)

Line 8:

diagram to denote a single neuron:

-

~~INSERTGRAPHICSHERE~~

+

[[Image:SingleNeuron.png|300px|center]]

-

This `neuron' is a computational unit that takes as input <math>x_1, x_2, x_3</math> (and a +1 intercept term), and

+

This "neuron" is a computational unit that takes as input <math>x_1, x_2, x_3</math> (and a +1 intercept term), and

-

outputs <math>h_{W,b}(x) = f(W^Tx) = f(\sum_{i=1}^3 W_{i}x_i +b)</math>, where <math>f : \Re \mapsto \Re</math> is

+

outputs <math>\textstyle h_{W,b}(x) = f(W^Tx + b) = f(\sum_{i=1}^3 W_{i}x_i +b)</math>, where <math>f : \Re \mapsto \Re</math> is

called the '''activation function'''. In these notes, we will choose

<math>f(\cdot)</math> to be the sigmoid function:

Line 28:

</math>

Here are plots of the sigmoid and <math>\tanh</math> functions:

+

<div align=center>

+

[[Image:Sigmoid_Function.png|400px|top|Sigmoid activation function.]]

+

[[Image:Tanh_Function.png|400px|top|Tanh activation function.]]

+

</div>

+

The <math>\tanh(z)</math> function is a rescaled version of the sigmoid, and its output range is

+

<math>(-1,1)</math> instead of <math>(0,1)</math>.
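
The rescaling can be written out explicitly as <math>\tanh(z) = 2f(2z) - 1</math>, where <math>f</math> is the sigmoid. The following is a minimal sketch that checks this numerically at a few points; the function names and sample points are our own choices for illustration and do not come from the notes.

```python
import math

def sigmoid(z):
    """Logistic sigmoid f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_sigmoid(z):
    """tanh expressed as a shifted and scaled sigmoid: 2 * f(2z) - 1."""
    return 2.0 * sigmoid(2.0 * z) - 1.0

# Arbitrary sample points (an assumption made for this illustration).
sample_points = [-3.0, -1.0, 0.0, 0.5, 2.0]

# Largest disagreement between the library tanh and the rescaled sigmoid.
max_gap = max(abs(math.tanh(z) - tanh_via_sigmoid(z)) for z in sample_points)
print("largest gap:", max_gap)
```

The printed gap should be at the level of floating-point rounding error.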

+

Note that unlike some other venues (including the OpenClassroom videos, and parts of CS229), we are not using the convention

+

here of <math>x_0=1</math>. Instead, the intercept term is handled separately by the parameter <math>b</math>.

+

Finally, one identity that'll be useful later: If <math>f(z) = 1/(1+\exp(-z))</math> is the sigmoid

+

function, then its derivative is given by <math>f'(z) = f(z) (1-f(z))</math>.

+

(If <math>f</math> is the tanh function, then its derivative is given by

+

<math>f'(z) = 1- (f(z))^2</math>.) You can derive this yourself using the definition of

+

the sigmoid (or tanh) function.
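
If you want a quick sanity check before deriving these identities by hand, one option is to compare them against a finite-difference approximation of the derivative. The sketch below does this; the helper `central_difference`, the step size, and the test points are assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Closed form from the text: f'(z) = f(z) * (1 - f(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    """Closed form from the text: f'(z) = 1 - f(z)^2 with f = tanh."""
    return 1.0 - math.tanh(z) ** 2

def central_difference(f, z, eps=1e-5):
    """Numerical derivative (f(z + eps) - f(z - eps)) / (2 * eps)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

# Arbitrary test points (an assumption made for this illustration).
points = [-2.0, -0.5, 0.0, 1.0, 3.0]

sigmoid_err = max(abs(sigmoid_prime(z) - central_difference(sigmoid, z)) for z in points)
tanh_err = max(abs(tanh_prime(z) - central_difference(math.tanh, z)) for z in points)
print("sigmoid error:", sigmoid_err)
print("tanh error:", tanh_err)
```

Both errors should be very small, which is consistent with the two identities but is of course not a proof.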

+

== Neural Network model ==

+

A neural network is built by connecting many of our simple

+

"neurons," so that the output of a neuron can be the input of another. For

+

example, here is a small neural network:

+

[[Image:Network331.png|400px|center]]

+

In this figure, we have also used circles to denote the inputs to the network. The circles

+

labeled "+1" are called '''bias units''', and correspond to the intercept term.

+

The leftmost layer of the network is called the '''input layer''', and the

+

rightmost layer the '''output layer''' (which, in this example, has only one

+

node). The middle layer of nodes is called the '''hidden layer''', because its

+

values are not observed in the training set. We also say that our example

+

neural network has 3 '''input units''' (not counting the bias unit), 3

+

'''hidden units''', and 1 '''output unit'''.

+

We will let <math>n_l</math>

+

denote the number of layers in our network; thus <math>n_l=3</math> in our example. We label layer <math>l</math> as

+

<math>L_l</math>, so layer <math>L_1</math> is the input layer, and layer <math>L_{n_l}</math> the output layer.

+

Our neural network has parameters <math>(W,b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})</math>, where

+

we write

+

<math>W^{(l)}_{ij}</math> to denote the parameter (or weight) associated with the connection

+

between unit <math>j</math> in layer <math>l</math>, and unit <math>i</math> in layer <math>l+1</math>. (Note the order of the indices.)

+

Also, <math>b^{(l)}_i</math> is the bias associated with unit <math>i</math> in layer <math>l+1</math>.

+

Thus, in our example, we have <math>W^{(1)} \in \Re^{3\times 3}</math>, and <math>W^{(2)} \in \Re^{1\times 3}</math>.

+

Note that bias units don't have inputs or connections going into them, since they always output

+

the value +1. We also let <math>s_l</math> denote the number of nodes in layer <math>l</math> (not counting the bias unit).
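
To make the indexing concrete, here is a small sketch that allocates parameters with the shapes described above, so that <math>W^{(l)}</math> has <math>s_{l+1}</math> rows and <math>s_l</math> columns. The function name `make_parameters` and the zero placeholder values are assumptions for illustration only; the notes do not prescribe this layout. Since Python lists are 0-indexed, `W[0]` plays the role of <math>W^{(1)}</math>.

```python
def make_parameters(layer_sizes):
    """Allocate placeholder parameters for a densely connected network.

    layer_sizes[k] is the number of units in layer k + 1, not counting
    the bias unit. Each weight matrix has one row per unit in the next
    layer and one column per unit in the current layer.
    """
    weights, biases = [], []
    for s_in, s_out in zip(layer_sizes, layer_sizes[1:]):
        weights.append([[0.0] * s_in for _ in range(s_out)])  # s_out x s_in
        biases.append([0.0] * s_out)                          # one bias per unit
    return weights, biases

layer_sizes = [3, 3, 1]        # s_1, s_2, s_3 for the example network
n_l = len(layer_sizes)         # number of layers
W, b = make_parameters(layer_sizes)

shapes = [(len(W_l), len(W_l[0])) for W_l in W]
print(n_l)      # 3
print(shapes)   # [(3, 3), (1, 3)]
```

Note that there are <math>n_l - 1</math> weight matrices, one for each pair of adjacent layers.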

+

We will write <math>a^{(l)}_i</math> to denote the '''activation''' (meaning output value) of

+

unit <math>i</math> in layer <math>l</math>. For <math>l=1</math>, we also use <math>a^{(1)}_i = x_i</math> to denote the <math>i</math>-th input.

+

Given a fixed setting of

+

the parameters <math>W,b</math>, our neural

+

network defines a hypothesis <math>h_{W,b}(x)</math> that outputs a real number. Specifically, the

+

computation that this neural network represents is given by:

+

:<math>

+

\begin{align}

+

a_1^{(2)} &= f(W_{11}^{(1)}x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}) \\

+

a_2^{(2)} &= f(W_{21}^{(1)}x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}) \\

+

a_3^{(2)} &= f(W_{31}^{(1)}x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}) \\

+

h_{W,b}(x) &= a_1^{(3)} = f(W_{11}^{(2)}a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)})

+

\end{align}

+

</math>
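
The four equations above can be sketched directly in code. All numeric values for the weights, biases, and input below are made up for illustration; only the structure of the computation follows the text. Python indices start at 0, so `W1[0][1]` stands for <math>W^{(1)}_{12}</math>.

```python
import math

def f(z):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameter values, chosen only for this example.
W1 = [[0.1, -0.2, 0.3],
      [0.0, 0.4, -0.1],
      [-0.3, 0.2, 0.5]]    # W^(1), shape 3 x 3
b1 = [0.1, -0.1, 0.0]      # b^(1)
W2 = [[0.7, -0.6, 0.2]]    # W^(2), shape 1 x 3
b2 = [0.05]                # b^(2)

x = [1.0, 0.5, -1.0]       # hypothetical input (x_1, x_2, x_3)

# Hidden layer: one line per equation for a_1^(2), a_2^(2), a_3^(2).
a2_1 = f(W1[0][0] * x[0] + W1[0][1] * x[1] + W1[0][2] * x[2] + b1[0])
a2_2 = f(W1[1][0] * x[0] + W1[1][1] * x[1] + W1[1][2] * x[2] + b1[1])
a2_3 = f(W1[2][0] * x[0] + W1[2][1] * x[1] + W1[2][2] * x[2] + b1[2])

# Output layer: h_{W,b}(x) = a_1^(3).
h = f(W2[0][0] * a2_1 + W2[0][1] * a2_2 + W2[0][2] * a2_3 + b2[0])
print("hidden activations:", a2_1, a2_2, a2_3)
print("h(x):", h)
```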

+

In the sequel, we also let <math>z^{(l)}_i</math> denote the total weighted sum of inputs to unit <math>i</math> in layer <math>l</math>,

+

including the bias term (e.g., <math>\textstyle z_i^{(2)} = \sum_{j=1}^{s_1} W^{(1)}_{ij} x_j + b^{(1)}_i</math>), so that

+

<math>a^{(l)}_i = f(z^{(l)}_i)</math>.

+

Note that this easily lends itself to a more compact notation. Specifically, if we extend the

+

activation function <math>f(\cdot)</math>

+

to apply to vectors in an element-wise fashion (i.e.,

+

<math>f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]</math>), then we can write

+

the equations above more

+

compactly as:

+

:<math>\begin{align}

+

z^{(2)} &= W^{(1)} x + b^{(1)} \\

+

a^{(2)} &= f(z^{(2)}) \\

+

z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)} \\

+

h_{W,b}(x) &= a^{(3)} = f(z^{(3)})

+

\end{align}</math>

+

We call this step '''forward propagation.''' More generally, recalling that we also use <math>a^{(1)} = x</math> to denote the values from the input layer,

+

then given layer <math>l</math>'s activations <math>a^{(l)}</math>, we can compute layer <math>l+1</math>'s activations <math>a^{(l+1)}</math> as:

+

:<math>\begin{align}

+

z^{(l+1)} &= W^{(l)} a^{(l)} + b^{(l)} \\

+

a^{(l+1)} &= f(z^{(l+1)})

+

\end{align}</math>

+

By organizing our parameters in matrices and using matrix-vector operations, we can take

+

advantage of fast linear algebra routines to quickly perform calculations in our network.

+

We have so far focused on one example neural network, but one can also build neural

+

networks with other '''architectures''' (meaning patterns of connectivity between neurons), including ones with multiple hidden layers.

+

The most common choice is an <math>\textstyle n_l</math>-layered network

+

where layer <math>\textstyle 1</math> is the input layer, layer <math>\textstyle n_l</math> is the output layer, and each

+

layer <math>\textstyle l</math> is densely connected to layer <math>\textstyle l+1</math>. In this setting, to compute the

+

output of the network, we can successively compute all the activations in layer

+

<math>\textstyle L_2</math>, then layer <math>\textstyle L_3</math>, and so on, up to layer <math>\textstyle L_{n_l}</math>, using the equations above that describe the forward propagation step. This is one

+

example of a '''feedforward''' neural network, since the connectivity graph

+

does not have any directed loops or cycles.
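
The layer-by-layer procedure just described can be sketched as a short loop. This is only an illustration under stated assumptions: the function name `forward_propagate`, the layer sizes, and every parameter value are hypothetical, and plain Python lists are used in place of a linear algebra library to keep the example self-contained.

```python
import math

def sigmoid(z):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def forward_propagate(x, weights, biases):
    """Return the activations [a^(1), ..., a^(n_l)] of a dense feedforward net.

    weights[k] and biases[k] connect layer k + 1 to layer k + 2
    (Python lists are 0-indexed, the text is 1-indexed).
    """
    activations = [list(x)]                        # a^(1) = x
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        # z^(l+1) = W^(l) a^(l) + b^(l), computed one row at a time
        z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
             for row, b_i in zip(W, b)]
        # a^(l+1) = f(z^(l+1)), applied element-wise
        activations.append([sigmoid(z_i) for z_i in z])
    return activations

# Hypothetical network with layer sizes 2, 3, 2, 1 (so n_l = 4).
weights = [
    [[0.2, -0.1], [0.0, 0.3], [-0.5, 0.1]],   # W^(1): 3 x 2
    [[0.3, -0.3, 0.2], [0.1, 0.1, -0.4]],     # W^(2): 2 x 3
    [[0.6, -0.2]],                            # W^(3): 1 x 2
]
biases = [[0.0, 0.1, -0.1], [0.05, -0.05], [0.2]]

activations = forward_propagate([1.0, -1.0], weights, biases)
output = activations[-1]
print("layer sizes:", [len(a) for a in activations])
print("network output:", output)
```

Because each layer only reads the activations of the layer before it, a single pass through the loop is enough; this is exactly what the absence of cycles buys us.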

+

Neural networks can also have multiple output units. For example, here is a network

+

with two hidden layers <math>L_2</math> and <math>L_3</math> and two output units in layer <math>L_4</math>:

+

[[Image:Network3322.png|500px|center]]

+

To train this network, we would need training examples <math>(x^{(i)}, y^{(i)})</math>

+

where <math>y^{(i)} \in \Re^2</math>. This sort of network is useful if there are multiple

+

outputs that you're interested in predicting. (For example, in a medical

+

diagnosis application, the vector <math>x</math> might give the input features of a

+

patient, and the different outputs <math>y_i</math> might indicate presence or absence

+

of different diseases.)

+

