# Neural Networks

### From Ufldl

Line 8: | Line 8: | ||

diagram to denote a single neuron: | diagram to denote a single neuron: | ||

- | [[Image:SingleNeuron.png| | + | [[Image:SingleNeuron.png|300px|center]] |

This "neuron" is a computational unit that takes as input <math>x_1, x_2, x_3</math> (and a +1 intercept term), and | This "neuron" is a computational unit that takes as input <math>x_1, x_2, x_3</math> (and a +1 intercept term), and | ||

- | outputs <math>h_{W,b}(x) = f(W^Tx) = f(\sum_{i=1}^3 W_{i}x_i +b)</math>, where <math>f : \Re \mapsto \Re</math> is | + | outputs <math>\textstyle h_{W,b}(x) = f(W^Tx) = f(\sum_{i=1}^3 W_{i}x_i +b)</math>, where <math>f : \Re \mapsto \Re</math> is |

called the '''activation function'''. In these notes, we will choose | called the '''activation function'''. In these notes, we will choose | ||

<math>f(\cdot)</math> to be the sigmoid function: | <math>f(\cdot)</math> to be the sigmoid function: | ||

Line 28: | Line 28: | ||

</math> | </math> | ||

Here are plots of the sigmoid and <math>\tanh</math> functions: | Here are plots of the sigmoid and <math>\tanh</math> functions: | ||

+ | |||

+ | |||

+ | |||

+ | <div align=center> | ||

+ | [[Image:Sigmoid_Function.png|400px|top|Sigmoid activation function.]] | ||

+ | [[Image:Tanh_Function.png|400px|top|Tanh activation function.]] | ||

+ | </div> | ||

+ | |||

+ | The <math>\tanh(z)</math> function is a rescaled version of the sigmoid, and its output range is | ||

+ | <math>(-1,1)</math> instead of <math>(0,1)</math>; neither function attains its limiting values for finite inputs. | ||

+ | |||

+ | Note that unlike some other venues (including the OpenClassroom videos, and parts of CS229), we are not using the convention | ||

+ | here of <math>x_0=1</math>. Instead, the intercept term is handled separately by the parameter <math>b</math>. | ||

+ | |||

+ | Finally, one identity that'll be useful later: If <math>f(z) = 1/(1+\exp(-z))</math> is the sigmoid | ||

+ | function, then its derivative is given by <math>f'(z) = f(z) (1-f(z))</math>. | ||

+ | (If <math>f</math> is the tanh function, then its derivative is given by | ||

+ | <math>f'(z) = 1- (f(z))^2</math>.) You can derive this yourself using the definition of | ||

+ | the sigmoid (or tanh) function. | ||
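
The two derivative identities above, and the claim that <math>\tanh</math> is a rescaled sigmoid, can be checked numerically. The following is only a minimal sketch: the helper names (`sigmoid_prime`, `tanh_prime`, `numerical_derivative`) and the sample points are illustrative assumptions, not part of the original notes. It compares each closed-form derivative with a central finite difference and also checks the rescaling <math>\tanh(z) = 2f(2z) - 1</math>, where <math>f</math> is the sigmoid.

```python
import math


def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid via the identity f'(z) = f(z) (1 - f(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_prime(z):
    """Derivative of tanh via the identity f'(z) = 1 - f(z)^2."""
    return 1.0 - math.tanh(z) ** 2


def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference approximation of f'(z) (hypothetical helper)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)


# Arbitrary sample points chosen for illustration.
for z in (-2.0, 0.0, 0.5, 3.0):
    # Closed-form derivatives agree with the finite-difference estimates.
    assert abs(sigmoid_prime(z) - numerical_derivative(sigmoid, z)) < 1e-7
    assert abs(tanh_prime(z) - numerical_derivative(math.tanh, z)) < 1e-7
    # tanh is a shifted and scaled sigmoid: tanh(z) = 2 * sigmoid(2z) - 1.
    assert abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) < 1e-12
```

A finite-difference check of this kind only confirms the identities at the sampled points; it does not replace the derivation.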

+ | |||

+ | |||

+ | |||

+ | == Neural Network model == | ||

+ | |||

+ | A neural network is put together by hooking together many of our simple | ||

+ | "neurons," so that the output of a neuron can be the input of another. For | ||

+ | example, here is a small neural network: | ||

+ | |||

+ | [[Image:Network331.png|400px|center]] | ||

+ | |||

+ | In this figure, we have used circles to also denote the inputs to the network. The circles | ||

+ | labeled "+1" are called '''bias units''', and correspond to the intercept term. | ||

+ | The leftmost layer of the network is called the '''input layer''', and the | ||

+ | rightmost layer the '''output layer''' (which, in this example, has only one | ||

+ | node). The middle layer of nodes is called the '''hidden layer''', because its | ||

+ | values are not observed in the training set. We also say that our example | ||

+ | neural network has 3 '''input units''' (not counting the bias unit), 3 | ||

+ | '''hidden units''', and 1 '''output unit'''. | ||

+ | |||

+ | We will let <math>n_l</math> | ||

+ | denote the number of layers in our network; thus <math>n_l=3</math> in our example. We label layer <math>l</math> as | ||

+ | <math>L_l</math>, so layer <math>L_1</math> is the input layer, and layer <math>L_{n_l}</math> the output layer. | ||

+ | Our neural network has parameters <math>(W,b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})</math>, where | ||

+ | we write | ||

+ | <math>W^{(l)}_{ij}</math> to denote the parameter (or weight) associated with the connection | ||

+ | between unit <math>j</math> in layer <math>l</math>, and unit <math>i</math> in layer <math>l+1</math>. (Note the order of the indices.) | ||

+ | Also, <math>b^{(l)}_i</math> is the bias associated with unit <math>i</math> in layer <math>l+1</math>. | ||

+ | Thus, in our example, we have <math>W^{(1)} \in \Re^{3\times 3}</math>, and <math>W^{(2)} \in \Re^{1\times 3}</math>. | ||

+ | Note that bias units don't have inputs or connections going into them, since they always output | ||

+ | the value +1. We also let <math>s_l</math> denote the number of nodes in layer <math>l</math> (not counting the bias unit). | ||

+ | |||

+ | We will write <math>a^{(l)}_i</math> to denote the '''activation''' (meaning output value) of | ||

+ | unit <math>i</math> in layer <math>l</math>. For <math>l=1</math>, we also use <math>a^{(1)}_i = x_i</math> to denote the <math>i</math>-th input. | ||

+ | Given a fixed setting of | ||

+ | the parameters <math>W,b</math>, our neural | ||

+ | network defines a hypothesis <math>h_{W,b}(x)</math> that outputs a real number. Specifically, the | ||

+ | computation that this neural network represents is given by: | ||

+ | :<math> | ||

+ | \begin{align} | ||

+ | a_1^{(2)} &= f(W_{11}^{(1)}x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}) \\ | ||

+ | a_2^{(2)} &= f(W_{21}^{(1)}x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}) \\ | ||

+ | a_3^{(2)} &= f(W_{31}^{(1)}x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}) \\ | ||

+ | h_{W,b}(x) &= a_1^{(3)} = f(W_{11}^{(2)}a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}) | ||

+ | \end{align} | ||

+ | </math> | ||

+ | |||

+ | In the sequel, we also let <math>z^{(l)}_i</math> denote the total weighted sum of inputs to unit <math>i</math> in layer <math>l</math>, | ||

+ | including the bias term (e.g., <math>\textstyle z_i^{(2)} = \sum_{j=1}^n W^{(1)}_{ij} x_j + b^{(1)}_i</math>), so that | ||

+ | <math>a^{(l)}_i = f(z^{(l)}_i)</math>. | ||

+ | |||

+ | Note that this easily lends itself to a more compact notation. Specifically, if we extend the | ||

+ | activation function <math>f(\cdot)</math> | ||

+ | to apply to vectors in an element-wise fashion (i.e., | ||

+ | <math>f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]</math>), then we can write | ||

+ | the equations above more | ||

+ | compactly as: | ||

+ | :<math>\begin{align} | ||

+ | z^{(2)} &= W^{(1)} x + b^{(1)} \\ | ||

+ | a^{(2)} &= f(z^{(2)}) \\ | ||

+ | z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)} \\ | ||

+ | h_{W,b}(x) &= a^{(3)} = f(z^{(3)}) | ||

+ | \end{align}</math> | ||

+ | We call this step '''forward propagation.''' More generally, recalling that we also use <math>a^{(1)} = x</math> to denote the values from the input layer, | ||

+ | then given layer <math>l</math>'s activations <math>a^{(l)}</math>, we can compute layer <math>l+1</math>'s activations <math>a^{(l+1)}</math> as: | ||

+ | :<math>\begin{align} | ||

+ | z^{(l+1)} &= W^{(l)} a^{(l)} + b^{(l)} \\ | ||

+ | a^{(l+1)} &= f(z^{(l+1)}) | ||

+ | \end{align}</math> | ||

+ | By organizing our parameters in matrices and using matrix-vector operations, we can take | ||

+ | advantage of fast linear algebra routines to quickly perform calculations in our network. | ||
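
The forward propagation equations can be sketched in a few lines of code. The example below is a sketch under stated assumptions, not a reference implementation: the function name `forward` and all numeric parameter values are hypothetical, chosen only to match the shapes of the example network (<math>W^{(1)} \in \Re^{3\times 3}</math>, <math>W^{(2)} \in \Re^{1\times 3}</math>). It is written in plain Python so that it is self-contained; in practice the matrix-vector product would normally be handed to a linear algebra library.

```python
import math


def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, weights, biases, f=sigmoid):
    """Forward propagation through a densely connected network.

    weights[k] holds W^(k+1) as a list of rows and biases[k] holds b^(k+1).
    Returns the list of activation vectors [a^(1), a^(2), ..., a^(n_l)].
    """
    activations = [list(x)]  # a^(1) = x
    for W, b in zip(weights, biases):
        a = activations[-1]
        # z^(l+1) = W^(l) a^(l) + b^(l), computed one output unit (row) at a time.
        z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
             for row, b_i in zip(W, b)]
        # a^(l+1) = f(z^(l+1)), with f applied element-wise.
        activations.append([f(z_i) for z_i in z])
    return activations


# Hypothetical parameter values for the 3-3-1 example network.
W1 = [[0.1, -0.2, 0.3],
      [0.0, 0.4, -0.1],
      [-0.3, 0.2, 0.5]]   # W^(1): 3 x 3
b1 = [0.0, 0.1, -0.1]     # b^(1): one bias per hidden unit
W2 = [[0.7, -0.5, 0.2]]   # W^(2): 1 x 3
b2 = [0.05]               # b^(2): one bias for the output unit

x = [1.0, 2.0, 3.0]
a1, a2, a3 = forward(x, [W1, W2], [b1, b2])
h = a3[0]  # h_{W,b}(x), a single real number
```

Note the index order: row <math>i</math> of `W1` holds the weights <math>W^{(1)}_{i1}, W^{(1)}_{i2}, W^{(1)}_{i3}</math> feeding unit <math>i</math> of layer 2, which is why each row is paired with one bias entry.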

+ | |||

+ | |||

+ | We have so far focused on one example neural network, but one can also build neural | ||

+ | networks with other '''architectures''' (meaning patterns of connectivity between neurons), including ones with multiple hidden layers. | ||

+ | The most common choice is an <math>\textstyle n_l</math>-layered network | ||

+ | where layer <math>\textstyle 1</math> is the input layer, layer <math>\textstyle n_l</math> is the output layer, and each | ||

+ | layer <math>\textstyle l</math> is densely connected to layer <math>\textstyle l+1</math>. In this setting, to compute the | ||

+ | output of the network, we can successively compute all the activations in layer | ||

+ | <math>\textstyle L_2</math>, then layer <math>\textstyle L_3</math>, and so on, up to layer <math>\textstyle L_{n_l}</math>, using the equations above that describe the forward propagation step. This is one | ||

+ | example of a '''feedforward''' neural network, since the connectivity graph | ||

+ | does not have any directed loops or cycles. | ||

+ | |||

+ | |||

+ | Neural networks can also have multiple output units. For example, here is a network | ||

+ | with two hidden layers, <math>L_2</math> and <math>L_3</math>, and two output units in layer <math>L_4</math>: | ||

+ | |||

+ | [[Image:Network3322.png|500px|center]] | ||

+ | |||

+ | To train this network, we would need training examples <math>(x^{(i)}, y^{(i)})</math> | ||

+ | where <math>y^{(i)} \in \Re^2</math>. This sort of network is useful if there are multiple | ||

+ | outputs that you're interested in predicting. (For example, in a medical | ||

+ | diagnosis application, the vector <math>x</math> might give the input features of a | ||

+ | patient, and the different outputs <math>y_i</math> might indicate the presence or absence | ||

+ | of different diseases.) | ||
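
As a rough illustration of a network with multiple outputs, the sketch below runs a forward pass through a network whose layer sizes are assumed to be 3, 3, 2, 2, as the figure name suggests. Everything specific here is an assumption made for the example: the helper names `init_params` and `predict`, the small random initial weights, and the input vector. The parameters are untrained, so the outputs carry no meaning beyond showing that <math>h_{W,b}(x) \in \Re^2</math>.

```python
import math
import random


def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def init_params(layer_sizes, seed=0):
    """Create small random weights and zero biases (illustrative choice only).

    For consecutive layer sizes s_l and s_{l+1}, W^(l) has s_{l+1} rows and
    s_l columns, and b^(l) has s_{l+1} entries.
    """
    rng = random.Random(seed)
    weights, biases = [], []
    for s_in, s_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights.append([[rng.gauss(0.0, 0.1) for _ in range(s_in)]
                        for _ in range(s_out)])
        biases.append([0.0] * s_out)
    return weights, biases


def predict(x, weights, biases):
    """Forward propagation; returns the activations of the output layer."""
    a = list(x)
    for W, b in zip(weights, biases):
        a = [sigmoid(sum(w * a_j for w, a_j in zip(row, a)) + b_i)
             for row, b_i in zip(W, b)]
    return a


# Assumed layer sizes: 3 inputs, hidden layers of 3 and 2 units, 2 outputs.
layer_sizes = [3, 3, 2, 2]
weights, biases = init_params(layer_sizes)

# Hypothetical input features for one example.
output = predict([0.5, -1.0, 2.0], weights, biases)

# One sigmoid output per target, e.g. one per disease in the medical example.
print(output)
```

Because each output unit uses a sigmoid, every component of `output` lies strictly between 0 and 1, which suits targets <math>y^{(i)}</math> that encode presence or absence.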

+ | |||

+ | |||

+ | {{Sparse_Autoencoder}} | ||

+ | |||

+ | |||

+ | {{Languages|神经网络|中文}} |