Softmax回归 (Softmax Regression)

Revision as of 05:23, 16 March 2013, by Kandeng (→简介 Introduction)

Revision as of 05:42, 16 March 2013, by Kandeng (→代价函数 Cost Function)

From Ufldl

@@ Line 67: / Line 67: @@
 == 代价函数 Cost Function ==
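-'''Original text''':
+We now introduce the cost function for the softmax regression algorithm. In the equation below, <math>1\{\cdot\}</math> is the indicator function, whose value is defined as follows:
+<math>1\{</math> a true statement<math>\}=1</math>
-We now describe the cost function that we'll use for softmax regression.  In the equation below, <math>1\{\cdot\}</math> is
+, <math>1\{</math> a false statement<math>\}=0</math>. For example, <math>1\{2+2=4\}</math> evaluates to 1, whereas <math>1\{1+1=5\}</math> evaluates to 0. Our cost function is: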
-the '''indicator function,''' so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>.
-For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0. Our cost function will be:
 <math>
@@ Line 79: / Line 77: @@
 </math>
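-'''Translation''':
-In this section, we define the loss function for softmax regression. In the equation below, <math>1\{\cdot\}</math> is an indicator function: 1{a true statement}=1 and 1{a false statement}=0. For example, 1{2+2=4} evaluates to 1, and 1{1+1=5} evaluates to 0. Our loss function is: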
-<math>
-\begin{align}
-J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k}  1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
-\end{align}
-</math>
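-'''First review''':
+Notice that the formula above generalizes the logistic regression cost function. The logistic regression cost function can be rewritten as:
-We now introduce the cost function used for the softmax regression algorithm. In the equation below, <math>1\{\cdot\}</math> is the indicator function, whose value is defined as follows: 1{a true statement}=1 and 1{a false statement}=0. For example, 1{2+2=4} evaluates to 1, whereas 1{1+1=5} evaluates to 0. Our cost function is: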
-<math>
-\begin{align}
-J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k}  1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
-\end{align}
-</math>
-'''Original text''':
-Notice that this generalizes the logistic regression cost function, which could also have been written:
 <math>
@@ Line 107: / Line 87: @@
 </math>
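The two forms of the logistic regression cost named in this hunk (the usual cross-entropy form and the indicator-function form) can be checked against each other numerically. The following is a minimal Python sketch, not part of the wiki page. The function names and sample values are hypothetical, and it assumes labels in {0, 1} with <math>h_\theta(x)</math> taken as the predicted probability of label 1.

```python
import math

def indicator(statement):
    """1{statement}: 1 if the statement is true, 0 otherwise."""
    return 1 if statement else 0

def logistic_cost_standard(hs, ys):
    """-(1/m) * sum_i [(1 - y_i) log(1 - h_i) + y_i log h_i]."""
    m = len(ys)
    return -sum((1 - y) * math.log(1 - h) + y * math.log(h)
                for h, y in zip(hs, ys)) / m

def logistic_cost_indicator(hs, ys):
    """-(1/m) * sum_i sum_{j in {0,1}} 1{y_i = j} log p(y_i = j | x_i)."""
    m = len(ys)
    total = 0.0
    for h, y in zip(hs, ys):
        p = {0: 1 - h, 1: h}  # p(y = j | x) for j = 0 and j = 1
        total += sum(indicator(y == j) * math.log(p[j]) for j in (0, 1))
    return -total / m

# Hypothetical predictions h_theta(x^(i)) and labels y^(i).
hs = [0.9, 0.2, 0.6, 0.35]
ys = [1, 0, 1, 0]
standard = logistic_cost_standard(hs, ys)
with_indicator = logistic_cost_indicator(hs, ys)
```

Both functions should return the same value up to floating-point rounding, because for each example the indicator keeps only the log-probability of the observed label.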
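-'''Translation''':
-Notice that the formula above is a generalized version of the logistic regression loss function. The logistic regression loss function can be rewritten as follows:
+As we can see, the softmax cost function is very similar in form to the logistic cost function, except that the softmax cost function sums over the <math>k</math> possible values of the class label. Note that in softmax regression, the probability of classifying <math>x</math> as class <math>j</math> is: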
-<math>
-\begin{align}
-J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m   (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
-&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
-\end{align}
-</math>
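-'''First review''':
-Notice that the formula above is a generalized version of the logistic regression cost function. The logistic regression cost function can be rewritten as follows: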
-<math>
-\begin{align}
-J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m   (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
-&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
-\end{align}
-</math>
-'''Original text''':
-The softmax cost function is similar, except that we now sum over the <math>k</math> different possible values
-of the class label.  Note also that in softmax regression, we have that
 <math>
 p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
 </math>.
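As an illustration of the probability formula in the preceding context lines, here is a minimal Python sketch that evaluates <math>p(y = j | x ; \theta)</math> for every class. The name `softmax_probs` and the sample parameters are hypothetical. Subtracting the largest score before exponentiating is an added numerical-stability step that does not appear in the formula; it leaves the ratios unchanged.

```python
import math

def softmax_probs(theta, x):
    """Return [p(y = j | x; theta) for j = 1..k] for a single example x.

    theta is a list of k parameter vectors (one per class); x is a feature
    vector with the same length as each theta_j.
    """
    # Scores theta_j^T x for every class j.
    scores = [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in theta]
    # Assumption: shift by the largest score to avoid overflow in exp().
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters for k = 3 classes and 2 features.
theta = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
probs = softmax_probs(theta, [2.0, 1.0])
```

The returned values are positive and sum to 1, and the class with the largest score <math>\theta_j^T x</math> receives the largest probability.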
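-'''Translation''':
-As we can see, the softmax loss function is very similar in form to the logistic loss function, except that the softmax loss function sums over the k possible values of the class label. In addition,
-'''First review''':
+There is currently no known closed-form solution for minimizing <math>J(\theta)</math>. We therefore use an iterative optimization algorithm (for example, gradient descent or L-BFGS). Taking derivatives, we obtain the following gradient formula:
-Apart from summing over the probabilities of the k class labels, the softmax regression cost function is very similar to the equation above. Note that in softmax regression the probability is:
-'''Original text''':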
-There is no known closed-form way to solve for the minimum of <math>J(\theta)</math>, and thus as usual we'll resort to an iterative
-optimization algorithm such as gradient descent or L-BFGS.  Taking derivatives, one can show that the gradient is:
 <math>
@@ Line 154: / Line 103: @@
 </math>
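The gradient formula quoted in this hunk, <math>\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i} x^{(i)} ( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) )</math>, can be sanity-checked against a finite-difference estimate of the cost. The sketch below is one way to do that in plain Python. All function names and the toy data are hypothetical, labels are assumed to run from 1 to k, and no bias term or weight decay is included.

```python
import math

def softmax_probs(theta, x):
    """p(y = j | x; theta) for j = 1..k (max-shifted for stability)."""
    scores = [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in theta]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cost(theta, xs, ys):
    """J(theta) = -(1/m) sum_i log p(y_i | x_i; theta)."""
    m = len(xs)
    return -sum(math.log(softmax_probs(theta, x)[y - 1])
                for x, y in zip(xs, ys)) / m

def gradient(theta, xs, ys):
    """List of k vectors; entry j-1 holds grad_{theta_j} J(theta)."""
    m, k, n = len(xs), len(theta), len(xs[0])
    grad = [[0.0] * n for _ in range(k)]
    for x, y in zip(xs, ys):
        p = softmax_probs(theta, x)
        for j in range(k):
            ind = 1.0 if y == j + 1 else 0.0  # 1{y = j}
            for l in range(n):
                grad[j][l] -= x[l] * (ind - p[j]) / m
    return grad

# Hypothetical toy data: m = 3 examples, 2 features, k = 3 classes.
xs = [[1.0, 2.0], [2.0, 0.5], [0.0, 1.0]]
ys = [1, 2, 3]
theta = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.05]]
grad = gradient(theta, xs, ys)

# Central finite difference for one entry, d J / d theta_{jl}.
eps, j, l = 1e-6, 1, 0
plus = [row[:] for row in theta]
minus = [row[:] for row in theta]
plus[j][l] += eps
minus[j][l] -= eps
numeric = (cost(plus, xs, ys) - cost(minus, xs, ys)) / (2 * eps)
analytic = grad[j][l]
```

If the formula is implemented correctly, `numeric` and `analytic` should agree to several decimal places. Each `grad[j]` is a vector whose l-th entry is the partial derivative with respect to the l-th component of <math>\theta_j</math>.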
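-'''Translation''':
-For <math>J(\theta)</math>, there is as yet no closed-form method of solution, so we use an iterative optimization algorithm (for example, gradient descent or L-BFGS) to solve for <math>J(\theta)</math>. Taking derivatives, we obtain the following gradient formula: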
-<math>
-\begin{align}
-\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  }
-\end{align}
-</math>
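-'''First review''':
-For <math>J(\theta)</math>, there is as yet no closed-form method of solution, so we use an iterative optimization algorithm (for example, gradient descent or L-BFGS) to solve for <math>J(\theta)</math>. Taking derivatives, we obtain the following gradient formula: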
-<math>
-\begin{align}
-\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  }
-\end{align}
-</math>
-'''Original text''':
-Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation.  In particular, <math>\nabla_{\theta_j} J(\theta)</math>
-is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>
-the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.
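-'''Translation''':
-Let us recall the meaning of "<math>\nabla_{\theta_j}</math>": <math>\nabla_{\theta_j} J(\theta)</math> is a vector, so its l-th element <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math> is the value of the partial derivative of <math>J(\theta)</math> with respect to the l-th element of <math>\theta_j</math>.
-'''First review''':
-Let us recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation. In particular, <math>\nabla_{\theta_j} J(\theta)</math> is itself a vector, so its l-th element <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math> is the partial derivative of <math>J(\theta)</math> with respect to the l-th component of <math>\theta_j</math>.
-'''Original text''':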
-Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it
-minimize <math>J(\theta)</math>.  For example, with the standard implementation of gradient descent, on each iteration
-we would perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>).
-When implementing softmax regression, we will typically use a modified version of the cost function described above;
+Let us recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation. <math>\nabla_{\theta_j} J(\theta)</math> is itself a vector, whose <math>l</math>-th element <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math> is the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th component of <math>\theta_j</math>.
-specifically, one that incorporates weight decay.  We describe the motivation and details below.
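-'''Translation''':
-With the partial derivative formula above, we can plug it into an algorithm to minimize <math>J(\theta)</math>. For example, using standard gradient descent, on each iteration we update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math>.
-In practical softmax implementations, we usually use a modified loss function (one that adds weight decay), which is described in detail below.
-'''First review''':
+With the partial derivative formula above, we can plug it into an algorithm such as gradient descent to minimize <math>J(\theta)</math>. For example, in the standard implementation of gradient descent, each iteration performs the following update: <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (<math>j=1,\ldots,k</math>).
-With the partial derivative formula above, we can plug it into an algorithm such as gradient descent to make <math>J(\theta)</math> minimal. For example, on each iteration of the standard implementation of gradient descent, we need to perform the following update: <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>)
-When implementing the softmax regression algorithm, we usually use a modified version of the cost function above. Specifically, it is used together with weight decay. We will next describe the motivation for using it and the details.
+When implementing the softmax regression algorithm, we usually use a modified version of the cost function above. Specifically, it is used together with weight decay (权重衰减). We next introduce the motivation for using it and the details.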
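The update rule <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> can be sketched as a short batch gradient descent loop. This is an illustrative Python sketch, not the tutorial's implementation. The function names, toy data, step size `alpha=0.1` and iteration count are all assumptions, and the weight decay term mentioned above is deliberately left out.

```python
import math

def softmax_probs(theta, x):
    """p(y = j | x; theta) for j = 1..k (max-shifted for stability)."""
    scores = [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in theta]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cost(theta, xs, ys):
    """J(theta) without weight decay; labels run from 1 to k."""
    return -sum(math.log(softmax_probs(theta, x)[y - 1])
                for x, y in zip(xs, ys)) / len(xs)

def gradient(theta, xs, ys):
    """grad[j-1] = -(1/m) sum_i x_i * (1{y_i = j} - p(y_i = j | x_i))."""
    m, k, n = len(xs), len(theta), len(xs[0])
    grad = [[0.0] * n for _ in range(k)]
    for x, y in zip(xs, ys):
        p = softmax_probs(theta, x)
        for j in range(k):
            coeff = ((1.0 if y == j + 1 else 0.0) - p[j]) / m
            for l in range(n):
                grad[j][l] -= x[l] * coeff
    return grad

def gradient_descent(theta, xs, ys, alpha, iterations):
    """Repeat theta_j := theta_j - alpha * grad_{theta_j} J for all j."""
    for _ in range(iterations):
        grad = gradient(theta, xs, ys)
        # Every theta_j is updated from the same gradient evaluation.
        theta = [[t - alpha * g for t, g in zip(theta_j, grad_j)]
                 for theta_j, grad_j in zip(theta, grad)]
    return theta

# Hypothetical toy problem: 3 examples, 2 features, k = 3 classes.
xs = [[1.0, 2.0], [2.0, 0.5], [0.0, 1.0]]
ys = [1, 2, 3]
theta0 = [[0.0, 0.0] for _ in range(3)]
initial_cost = cost(theta0, xs, ys)
theta = gradient_descent(theta0, xs, ys, alpha=0.1, iterations=200)
final_cost = cost(theta, xs, ys)
```

Starting from all-zero parameters every class has probability 1/3, so the initial cost is log 3, and the cost after the updates should be lower. A quasi-Newton method such as L-BFGS would reuse the same `cost` and `gradient` functions in place of the fixed-step loop.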
 == softmax回归参数化的特性 Properties of softmax regression parameterization ==