# Neural Networks

Here are plots of the sigmoid and $\tanh$ functions:

[[Image:Sigmoid_Function.png|400px|center|Sigmoid activation function.]]
[[Image:Tanh_Function.png|400px|center|Tanh activation function.]]

The $\tanh(z)$ function is a rescaled version of the sigmoid, and its output range is
$[-1,1]$ instead of $[0,1]$.

Note that unlike CS221 and (parts of) CS229, we are not using the convention
here of $x_0=1$.  Instead, the intercept term is handled separately by the parameter $b$.

Finally, one identity that'll be useful later: if $f(z) = 1/(1+\exp(-z))$ is the sigmoid
function, then its derivative is given by $f'(z) = f(z)(1-f(z))$.
(If $f$ is the $\tanh$ function, then its derivative is given by
$f'(z) = 1 - (f(z))^2$.)  You can derive this yourself using the definition of
the sigmoid (or $\tanh$) function.
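To make these identities concrete, here is a minimal Python/NumPy sketch (the helper names `sigmoid`, `sigmoid_prime`, and `tanh_prime` are ours, introduced only for illustration) that checks both derivative formulas against a centered finite-difference approximation:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative via the identity f'(z) = f(z) * (1 - f(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    """Derivative via the identity f'(z) = 1 - f(z)^2."""
    return 1.0 - np.tanh(z) ** 2

# Compare against centered finite differences; both gaps should be near zero
# (limited only by floating-point and discretization error).
z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
fd_sigmoid = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
fd_tanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)

print(np.max(np.abs(fd_sigmoid - sigmoid_prime(z))))
print(np.max(np.abs(fd_tanh - tanh_prime(z))))
```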
== Neural Network formulation ==

A neural network is put together by hooking together many of our simple
"neurons," so that the output of a neuron can be the input of another.  For
example, here is a small neural network:

[[Image:Network331.png|400px|center]]

In this figure, we have used circles to also denote the inputs to the network.  The circles
labeled "+1" are called {\bf bias units}, and correspond to the intercept term.
The leftmost layer of the network is called the {\bf input layer}, and the
rightmost layer the {\bf output layer} (which, in this example, has only one
node).  The middle layer of nodes is called the {\bf hidden layer}, because its
values are not observed in the training set.  We also say that our example
neural network has 3 {\bf input units} (not counting the bias unit), 3 {\bf hidden units}, and 1 {\bf output unit}.

We will let $n_l$ denote the number of layers in our network; thus $n_l=3$ in our example.  We label layer $l$ as
$L_l$, so layer $L_1$ is the input layer, and layer $L_{n_l}$ the output layer.
Our neural network has parameters $(W,b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$, where
we write $W^{(l)}_{ij}$ to denote the parameter (or weight) associated with the connection
between unit $j$ in layer $l$ and unit $i$ in layer $l+1$.  (Note the order of the indices.)
Also, $b^{(l)}_i$ is the bias associated with unit $i$ in layer $l+1$.
Thus, in our example, we have $W^{(1)} \in \Re^{3\times 3}$ and $W^{(2)} \in \Re^{1\times 3}$.
Note that bias units don't have inputs or connections going into them, since they always output
the value +1.  We also let $s_l$ denote the number of nodes in layer $l$ (not counting the bias unit).

We will write $a^{(l)}_i$ to denote the {\bf activation} (meaning output value) of
unit $i$ in layer $l$.  For $l=1$, we also use $a^{(1)}_i = x_i$ to denote the $i$-th input.
Given a fixed setting of the parameters $W,b$, our neural network defines a hypothesis $h_{W,b}(x)$ that outputs a real number.
Specifically, the computation that this neural network represents is given by:

\begin{align}
a_1^{(2)} &= f(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}) \\
a_2^{(2)} &= f(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}) \\
a_3^{(2)} &= f(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}) \\
h_{W,b}(x) &= a_1^{(3)} = f(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)})
\end{align}

In the sequel, we also let $z^{(l)}_i$ denote the total weighted sum of inputs to unit $i$ in layer $l$,
including the bias term (e.g., $z_i^{(2)} = \sum_{j=1}^n W^{(1)}_{ij} x_j + b^{(1)}_i$), so that
$a^{(l)}_i = f(z^{(l)}_i)$.
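As a minimal Python/NumPy sketch of the forward computation above (the parameter arrays are random placeholders with the shapes given in the text, $W^{(1)} \in \Re^{3\times 3}$ and $W^{(2)} \in \Re^{1\times 3}$, and the function names are ours, not part of the tutorial):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder parameters with the example network's shapes:
# W1 is 3x3, b1 has 3 entries, W2 is 1x3, b2 has 1 entry.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))   # W^{(1)}_{ij}: unit j in layer 1 -> unit i in layer 2
b1 = np.zeros(3)               # b^{(1)}_i
W2 = rng.normal(size=(1, 3))   # W^{(2)}_{ij}: unit j in layer 2 -> unit i in layer 3
b2 = np.zeros(1)               # b^{(2)}_i

def hypothesis(x, f=sigmoid):
    """Compute h_{W,b}(x) for a single input vector x of length 3."""
    z2 = W1 @ x + b1           # z^{(2)}_i = sum_j W^{(1)}_{ij} x_j + b^{(1)}_i
    a2 = f(z2)                 # a^{(2)}_i = f(z^{(2)}_i)
    z3 = W2 @ a2 + b2          # z^{(3)}_1 = sum_j W^{(2)}_{1j} a^{(2)}_j + b^{(2)}_1
    return f(z3)[0]            # h_{W,b}(x) = a^{(3)}_1

print(hypothesis(np.array([1.0, 2.0, 3.0])))
```

The two matrix-vector products are just a compact way of writing the four equations above: row $i$ of `W1` holds the weights $W^{(1)}_{i1}, W^{(1)}_{i2}, W^{(1)}_{i3}$ feeding unit $i$ of the hidden layer.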