SOFTMAX回归

Softmax回归(Softmax Regression)

'''初译''':@knighterzjy

'''一审''':@GuitarFang

==简介 Introduction==

'''原文''':

In these notes, we describe the '''Softmax regression''' model.  This model generalizes logistic regression to
classification problems where the class label <math>y</math> can take on more than two possible values.
This will be useful for such problems as MNIST digit classification, where the goal is to distinguish between 10 different
numerical digits.  Softmax regression is a supervised learning algorithm, but we will later be
using it in conjunction with our deep learning/unsupervised feature learning methods.


'''译文''':

在本节中，我们介绍Softmax回归模型，该模型是logistic回归模型在多分类问题上的泛化，在多分类问题中，类标签y可以取两个以上的值。 Softmax回归模型可以直接应用于 MNIST 手写数字分类问题等多分类问题。Softmax回归是有监督的，不过我们接下来也会介绍它与深度学习/无监督学习方法的结合。
（译者注： MNIST 是一个手写数字识别库，由 NYU 的Yann LeCun 等人维护。 http://yann.lecun.com/exdb/mnist/ ）

'''一审''':

在本章中，我们介绍Softmax回归模型。该模型将logistic回归模型一般化，以用来解决类型标签y的可能取值多于两种的分类问题。Softmax回归模型对于诸如MNIST手写数字分类等问题是十分有用的，该问题的目的是辨识10个不同的单个数字。Softmax回归是一种有监督学习算法，但是我们接下来要将它与我们的深度学习/无监督特征学习方法结合起来使用。
（译者注：MNIST是一个手写数字识别库，由NYU的Yann LeCun等人维护。http://yann.lecun.com/exdb/mnist/）

'''原文''':
Recall that in logistic regression, we had a training set
<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>
of <math>m</math> labeled examples, where the input features are <math>x^{(i)} \in \Re^{n+1}</math>.  
(In this set of notes, we will use the notational convention of letting the feature vectors <math>x</math> be
<math>n+1</math> dimensional, with <math>x_0 = 1</math> corresponding to the intercept term.) 
With logistic regression, we were in the binary classification setting, so the labels 
were <math>y^{(i)} \in \{0,1\}</math>.  Our hypothesis took the form:

<math>\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}</math>


'''译文''':
回顾一下 logistic 回归，我们的训练集为<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>
，其中 m为样本数，<math>x^{(i)} \in \Re^{n+1}</math>为特征。
由于 logistic 回归是针对二分类问题的，因此类标 <math>y^{(i)} \in \{0,1\}</math>。假设函数如下：

<math>\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}</math>

'''一审''':
回想一下在 logistic 回归中，我们拥有一个包含 <math>m</math> 个被标记的样本的训练集 <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>，其中输入特征值 <math>x^{(i)} \in \Re^{n+1}</math>。（在本章中，我们对出现的符号进行如下约定：特征向量 <math>x</math> 的维度为 <math>n+1</math>，其中 <math>x_0 = 1</math> 对应截距项。）因为在 logistic 回归中，我们要解决的是二元分类问题，因此类型标记<math>y^{(i)} \in \{0,1\}</math>。估值函数如下：

<math>\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}</math>


'''原文''':

and the model parameters <math>\theta</math> were trained to minimize
the cost function
<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>


'''译文''':
模型参数 <math>\theta</math> 用于最小化损失函数
<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>


'''一审''':
我们将训练模型参数 <math>\theta</math> ，使其能够最小化代价函数 ：

<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>
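
As a point of reference for the generalization that follows, here is a minimal Python sketch of the logistic regression hypothesis and cost function recalled above. The function names and the toy data are assumptions made for illustration; they do not come from the original notes.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x); x[0] is assumed to be 1 (the intercept term)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def logistic_cost(theta, X, y):
    """J(theta): average negative log-likelihood over the m training examples."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = h(theta, x_i)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return -total / m

# Toy data, made up for illustration; the first feature is the intercept x_0 = 1.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
y = [1, 0, 1]
theta = [0.0, 0.0]
print(logistic_cost(theta, X, y))  # equals log(2) when theta is all zeros
```

With all parameters at zero, every prediction is 0.5, so the cost is exactly <math>\log 2</math> regardless of the labels.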

'''原文''':
In the softmax regression setting, we are interested in multi-class
classification (as opposed to only binary classification), and so the label
<math>y</math> can take on <math>k</math> different values, rather than only
two.  Thus, in our training set
<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>,
we now have that <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>.  (Note that
our convention will be to index the classes starting from 1, rather than from 0.)  For example,
in the MNIST digit recognition task, we would have <math>k=10</math> different classes.

'''译文''':
在 softmax回归中，我们解决的是多分类问题（相对于 logistic 回归解决的二分类问题），类标 y 可以取 k个不同的值（而不是 2 个）。因此，对于训练集 <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>，我们有 <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>。（注意此处的类别下标从 1 开始，而不是 0）。例如，在 MNIST 数字识别任务中，我们有 k=10 个不同的类别。

'''一审''':
在 softmax回归中，我们感兴趣的是多元分类（相对于只能辨识两种类型的二元分类）， 所以类型标记y可以取k个不同的值（而不只限于2个）。 于是，对于我们的 训练集<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> 便有<math>y^{(i)} \in \{1, 2, \ldots, k\}</math>。（注意，我们约定类别下标从 1 开始，而不是 0）。例如，在 MNIST 数字识别任务中，我们有 k=10 个不同的类别。


'''原文''':
Given a test input <math>x</math>, we want our hypothesis to estimate
the probability <math>p(y=j | x)</math> for each value of <math>j = 1, \ldots, k</math>.
I.e., we want to estimate the probability of the class label taking
on each of the <math>k</math> different possible values.  Thus, our hypothesis
will output a <math>k</math> dimensional vector (whose elements sum to 1) giving
us our <math>k</math> estimated probabilities.  Concretely, our hypothesis
<math>h_{\theta}(x)</math> takes the form:

<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>

'''译文''':
给定一个测试样本 <math>x</math>，我们想让假设函数对每一个类别 <math>j = 1, \ldots, k</math> 估计出概率 <math>p(y=j | x)</math>。也就是说，我们想要估计类标取 <math>k</math> 个不同可能值中每一个的概率。因此，我们的假设函数会输出一个 <math>k</math> 维的向量（向量元素的和为1）来表示这 <math>k</math> 个估计的概率值。具体地说，我们的假设函数<math>h_{\theta}(x)</math> 形式如下：

<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>

'''一审''':
对于给定的测试输入 <math>x</math>，我们想让估值函数针对每一个类别 <math>j = 1, \ldots, k</math> 估算出概率值<math>p(y=j | x)</math>。也就是说，我们想估计出分类结果在每一个分类标记值上出现的概率 (一审注：而不是估算出具体是取哪一个值，这一点和基本神经网络估值函数输出最终值是有区别的) 。因此，我们的估值函数将要输出一个 <math>k</math> 维的向量（向量元素的和为1）来表示这 <math>k</math> 个被估计出的概率值。具体地说，我们的估值函数<math>h_{\theta}(x)</math> 形式如下：

<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>
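
The hypothesis above can be sketched in a few lines of Python. This is an illustrative sketch, not the notes' own implementation: the name <code>softmax_hypothesis</code> and the toy parameters are assumptions. Subtracting the largest score before exponentiating is a common numerical safeguard that is not part of the formula above; it cancels in the ratio and leaves the output unchanged.

```python
import math

def softmax_hypothesis(thetas, x):
    """Return [p(y=1|x), ..., p(y=k|x)] given k parameter vectors of length n+1."""
    scores = [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in thetas]
    shift = max(scores)  # added for numerical stability only; cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    normalizer = sum(exps)  # the sum in the denominator of the hypothesis
    return [e / normalizer for e in exps]

# Toy parameters for k = 3 classes and n + 1 = 2 features (made up for illustration).
thetas = [[0.0, 1.0], [0.0, 0.0], [0.0, -1.0]]
x = [1.0, 1.0]  # x[0] = 1 is the intercept feature
probs = softmax_hypothesis(thetas, x)
print(probs, sum(probs))
```

The returned list plays the role of the <math>k</math>-dimensional output vector; its entries sum to 1 up to floating-point rounding.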

'''原文''':
Here <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> are the
parameters of our model.  
Notice that
the term <math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } </math>
normalizes the distribution, so that it sums to one. 

'''译文''':
其中 <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> 均为模型参数，<math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } </math> 这一项是模型的归一化因子，使得向量的和为 1 。

'''一审''':
其中  <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math>是我们模型的参数。请注意<math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } </math>，这一项对概率分布进行归一化，使得所有概率之和为 1 。


'''原文''':
For convenience, we will also write 
<math>\theta</math> to denote all the
parameters of our model.  When you implement softmax regression, it is usually
convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by
stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>

'''译文''':
为了简便，我们使用<math>\theta</math>来表示模型参数。在实现Softmax回归的时候，往往使用一个<math>k</math>-by-<math>(n+1)</math>的矩阵来表示<math>\theta</math>。我们将 <math>\theta_1, \theta_2, \ldots, \theta_k</math>按行表示，得到
<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>

'''一审''':
为了方便起见，我们同样使用符号<math>\theta</math>来表示全部的模型参数。在实现Softmax回归时，你通常会发现，将θ用一个<math>k</math>-by-<math>(n+1)</math>的矩阵来表示会十分便利，该矩阵是将 <math>\theta_1, \theta_2, \ldots, \theta_k</math>按行罗列起来得到的，如下所示：
<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>
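
One way to picture the <math>k</math>-by-<math>(n+1)</math> layout is the following Python sketch. It assumes the parameters arrive as one flat list, as many general-purpose optimizers expect; that assumption, and the helper names <code>unflatten</code> and <code>class_scores</code>, are illustrative and not part of the original notes.

```python
def unflatten(flat, k, n_plus_1):
    """Reshape a flat list of k*(n+1) numbers into k rows; row j holds theta_j."""
    if len(flat) != k * n_plus_1:
        raise ValueError("expected k * (n + 1) parameters")
    return [flat[j * n_plus_1:(j + 1) * n_plus_1] for j in range(k)]

def class_scores(theta, x):
    """Matrix-vector product theta x; entry j is the inner product theta_j^T x."""
    return [sum(t * xi for t, xi in zip(row, x)) for row in theta]

flat = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # made-up values, k = 3 and n + 1 = 2
theta = unflatten(flat, k=3, n_plus_1=2)
print(theta)
print(class_scores(theta, [1.0, 2.0]))
```

With the parameters stacked in rows, all <math>k</math> inner products <math>\theta_j^T x</math> come from a single matrix-vector product.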

== 代价函数 Cost Function ==

'''原文''':

We now describe the cost function that we'll use for softmax regression.  In the equation below, <math>1\{\cdot\}</math> is
the '''indicator function,''' so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>.
For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0. Our cost function will be:

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k}  1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>

'''译文''':
在本节中，我们定义 softmax回归的损失函数。在下面的公式中，<math>1\{\cdot\}</math>是一个标识函数，1{值为真的表达式}=1，1{值为假的表达式}=0。例如，表达式 1{2+2=4}的值为1 ，1{1+1=5}的值为 0。我们的损失函数为：
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k}  1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>

'''一审''':
现在我们来介绍用于softmax回归算法的代价函数。在下面的公式中，<math>1\{\cdot\}</math>是示性函数，其取值规则为：1{值为真的表达式}=1，1{值为假的表达式}=0。举例来说，表达式1{2+2=4}的值为1 ，1{1+1=5}的值为 0。我们的代价函数为：
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k}  1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>
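
A minimal Python sketch of this cost function follows. The helper names and toy data are assumptions for illustration. Because the indicator <math>1\{y^{(i)} = j\}</math> is nonzero for exactly one <math>j</math>, the inner sum reduces to the log-probability of the true class. The log-probabilities are computed with the usual shift by the largest score, which is a numerical convenience and not part of the formula.

```python
import math

def log_class_probs(theta, x):
    """Return [log p(y=1|x), ..., log p(y=k|x)] for the rows theta_j of theta."""
    scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    shift = max(scores)  # stability shift; does not change the result
    log_normalizer = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return [s - log_normalizer for s in scores]

def softmax_cost(theta, X, y):
    """J(theta) for labels y[i] in {1, ..., k} (classes indexed from 1)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        # Only the term with j == y_i survives the indicator function.
        total += log_class_probs(theta, x_i)[y_i - 1]
    return -total / len(X)

# Toy data, made up for illustration: k = 3 classes, intercept feature first.
X = [[1.0, 2.0], [1.0, 0.0], [1.0, -2.0]]
y = [1, 2, 3]
zeros = [[0.0, 0.0] for _ in range(3)]
print(softmax_cost(zeros, X, y))  # log(3): every class gets probability 1/3
```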


'''原文''':

Notice that this generalizes the logistic regression cost function, which could also have been written:

<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m   (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>

'''译文''':

值得注意的是，上述公式是logistic回归损失函数的一个泛化版。 logistic回归损失函数可以改写如下：
 
<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m   (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>

'''一审''':
值得注意的是，上述公式是logistic回归代价函数的一个泛化版。 logistic回归代价函数 可以改写如下：

<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m   (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>

'''原文''':

The softmax cost function is similar, except that we now sum over the <math>k</math> different possible values
of the class label.  Note also that in softmax regression, we have that
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
</math>.
'''译文''':

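可以看到，Softmax损失函数与logistic损失函数在形式上非常类似，只是Softmax损失函数对类标的 <math>k</math> 个可能取值进行了累加。另外，在Softmax回归中有：
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
</math>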

'''一审''':

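除了我们是对类型标记的 <math>k</math> 个可能取值求和之外，Softmax回归的代价函数和上式是十分相似的。我们可以注意到在Softmax回归中概率值为：
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
</math>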

'''原文''':

There is no known closed-form way to solve for the minimum of <math>J(\theta)</math>, and thus as usual we'll resort to an iterative
optimization algorithm such as gradient descent or L-BFGS.  Taking derivatives, one can show that the gradient is:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  }
\end{align}
</math>

'''译文''':
目前还没有已知的闭式解法可以直接求出使<math>J(\theta)</math>最小的参数，因此，我们使用迭代的优化算法（例如梯度下降法或 L-BFGS）来最小化<math>J(\theta)</math>。经过求导，我们得到梯度公式如下：

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  }
\end{align}
</math>

'''一审''':
目前还没有已知的闭式解法可以直接求出使<math>J(\theta)</math>最小的参数，因此，和往常一样，我们使用迭代的优化算法（例如梯度下降法或 L-BFGS）来最小化<math>J(\theta)</math>。经过求导，我们得到梯度公式如下：

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  }
\end{align}
</math>
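
The gradient formula can be checked numerically. The sketch below implements the formula as written and compares it with a central finite-difference estimate of the cost. All function names, the toy data, and the step size <code>eps</code> are assumptions for illustration.

```python
import math

def class_probs(theta, x):
    scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    shift = max(scores)  # stability shift; cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cost(theta, X, y):
    return -sum(math.log(class_probs(theta, x_i)[y_i - 1])
                for x_i, y_i in zip(X, y)) / len(X)

def gradient(theta, X, y):
    """Row j of the result is the gradient of J with respect to theta_j."""
    k, d, m = len(theta), len(theta[0]), len(X)
    grad = [[0.0] * d for _ in range(k)]
    for x_i, y_i in zip(X, y):
        p = class_probs(theta, x_i)
        for j in range(k):
            residual = (1.0 if y_i == j + 1 else 0.0) - p[j]
            for l in range(d):
                grad[j][l] -= x_i[l] * residual / m
    return grad

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = [[0.0] * len(row) for row in theta]
    for j in range(len(theta)):
        for l in range(len(theta[j])):
            original = theta[j][l]
            theta[j][l] = original + eps
            plus = cost(theta, X, y)
            theta[j][l] = original - eps
            minus = cost(theta, X, y)
            theta[j][l] = original  # restore the parameter
            grad[j][l] = (plus - minus) / (2 * eps)
    return grad

# Toy data and parameters, made up for illustration.
X = [[1.0, 0.5, -1.0], [1.0, 2.0, 0.3], [1.0, -1.2, 0.8], [1.0, 0.1, 0.1]]
y = [1, 2, 3, 2]
theta = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1], [-0.3, 0.2, 0.5]]
analytic = gradient(theta, X, y)
numeric = numerical_gradient(theta, X, y)
max_diff = max(abs(a - n) for ra, rn in zip(analytic, numeric)
               for a, n in zip(ra, rn))
print(max_diff)  # should be tiny if the formula is implemented correctly
```

A check of this kind is a cheap way to catch sign or indexing mistakes before handing the gradient to an optimizer.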


'''原文''':

Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation.  In particular, <math>\nabla_{\theta_j} J(\theta)</math>
is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>,
the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>. 


'''译文''':
让我们来回顾一下 "<math>\nabla_{\theta_j}</math>" 的含义， <math>\nabla_{\theta_j} J(\theta)</math>是一个向量，因此，它的第 l个元素<math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>是<math>J(\theta)</math>对<math>\theta_j</math>的第l个元素求偏导后的值。

'''一审''':
让我们来回顾一下 符号 "<math>\nabla_{\theta_j}</math>" 的含义。特别地， <math>\nabla_{\theta_j} J(\theta)</math>本身是一个向量，因此它的第 l个元素<math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>是<math>J(\theta)</math>对<math>\theta_j</math>的第l个分量的偏导数。

'''原文''':

Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it
minimize <math>J(\theta)</math>.  For example, with the standard implementation of gradient descent, on each iteration
we would perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>).

When implementing softmax regression, we will typically use a modified version of the cost function described above;
specifically, one that incorporates weight decay.  We describe the motivation and details below.

'''译文''':
有了上面的偏导公式以后，我们可以将它代入到梯度下降等算法中来最小化 <math>J(\theta)</math>。例如，使用标准的梯度下降法，在每一次迭代过程中，我们更新 <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math>（对每一个 <math>j=1,\ldots,k</math>）。
在实际的 softmax 实现过程中，我们通常使用上述损失函数的一个改进版本（加入了权重衰减项），其动机和细节在下面会详细讲到。

'''一审''':
有了上面的偏导数公式以后，我们就可以将它代入到梯度下降法等算法中，来使<math>J(\theta)</math>最小化。例如，在梯度下降法标准实现的每一次迭代中，我们需要进行如下更新：<math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math>（对于每一个 <math>j=1,\ldots,k</math>）。
当实现 softmax 回归算法时，我们通常会使用上述代价函数的一个改进版本。具体来说，就是和权重衰减一起使用。我们接下来会描述使用它的动机和细节。
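
The update rule can be sketched as a plain batch gradient descent loop. The learning rate <code>alpha=0.1</code>, the iteration count, and the toy data are assumptions chosen for illustration, not values from the notes. The gradient is computed once per iteration, so all <math>\theta_j</math> are updated simultaneously.

```python
import math

def class_probs(theta, x):
    scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    shift = max(scores)  # stability shift; cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cost(theta, X, y):
    return -sum(math.log(class_probs(theta, x_i)[y_i - 1])
                for x_i, y_i in zip(X, y)) / len(X)

def gradient(theta, X, y):
    k, d, m = len(theta), len(theta[0]), len(X)
    grad = [[0.0] * d for _ in range(k)]
    for x_i, y_i in zip(X, y):
        p = class_probs(theta, x_i)
        for j in range(k):
            residual = (1.0 if y_i == j + 1 else 0.0) - p[j]
            for l in range(d):
                grad[j][l] -= x_i[l] * residual / m
    return grad

def gradient_descent(theta, X, y, alpha=0.1, iterations=200):
    """Repeat theta_j := theta_j - alpha * grad_j for every j; returns a new list."""
    theta = [row[:] for row in theta]  # leave the caller's parameters untouched
    for _ in range(iterations):
        grad = gradient(theta, X, y)
        for j in range(len(theta)):
            for l in range(len(theta[j])):
                theta[j][l] -= alpha * grad[j][l]
    return theta

# Toy data, made up for illustration: three small clusters, one per class.
X = [[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 3.0, 0.0],
     [1.0, 3.0, 1.0], [1.0, 0.0, 4.0], [1.0, 1.0, 4.0]]
y = [1, 1, 2, 2, 3, 3]
initial = [[0.0, 0.0, 0.0] for _ in range(3)]
trained = gradient_descent(initial, X, y)
initial_cost = cost(initial, X, y)
final_cost = cost(trained, X, y)
print(initial_cost, final_cost)
```

In practice one would monitor the cost across iterations and tune the step size; L-BFGS, mentioned above, is a common alternative that avoids choosing <math>\alpha</math> by hand.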


== softmax回归参数化的特性 Properties of softmax regression parameterization ==


'''原文''':

Softmax regression has an unusual property that it has a "redundant" set of parameters.  To explain what this means,
suppose we take each of our parameter vectors <math>\theta_j</math>, and subtract some fixed vector <math>\psi</math>
from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math>
(for every <math>j=1, \ldots, k</math>).  Our hypothesis
now estimates the class label probabilities as

<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}}  \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>

'''译文''':

Softmax回归有一个不寻常的特点：它有一个“冗余”的参数集。为了便于阐述这一特点，假设我们已经得到了参数向量<math>\theta_j</math>，并从中减去了向量 <math>\psi</math>，这时，每一个<math>\theta_j</math>都变成了<math>\theta_j - \psi</math>(对于每一个<math>j=1, \ldots, k</math> )。此时我们的假设变成了以下的式子：
<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}}  \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>

'''一审''':

Softmax回归算法 有一个不寻常的特性，就是它有一个“冗余”的参数集。为了便于阐述这一特点的意义，假设我们对每一个参数向量<math>\theta_j</math>进行操作，从中减去一个固定的向量<math>\psi</math>，于是每一个<math>\theta_j</math>就被<math>\theta_j - \psi</math>所替代(针对每一个  <math>j=1, \ldots, k</math>)。此时我们的估值函数对分类标记概率的估计为 ：

<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}}  \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>
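
The cancellation in the derivation above is easy to confirm numerically. In this sketch the parameter values and the vector <code>psi</code> are arbitrary numbers chosen for illustration.

```python
import math

def class_probs(theta, x):
    """Softmax probabilities computed directly from the definition."""
    exps = [math.exp(sum(t * xi for t, xi in zip(row, x))) for row in theta]
    z = sum(exps)
    return [e / z for e in exps]

def shift_parameters(theta, psi):
    """Replace every theta_j with theta_j - psi."""
    return [[t - p for t, p in zip(row, psi)] for row in theta]

theta = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]  # k = 3, n + 1 = 2
psi = [10.0, -7.0]                              # any fixed vector works
x = [1.0, 2.0]

before = class_probs(theta, x)
after = class_probs(shift_parameters(theta, psi), x)
print(before)
print(after)  # agrees with `before` up to floating-point rounding
```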

'''原文''':


In other words, subtracting <math>\psi</math> from every <math>\theta_j</math>
does not affect our hypothesis' predictions at all!  This shows that softmax
regression's parameters are "redundant."  More formally, we say that our
softmax model is '''overparameterized,''' meaning that for any hypothesis we might
fit to the data, there are multiple parameter settings that give rise to exactly
the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math>
to the predictions.

'''译文''':
换句话说，从每个<math>\theta_j</math>中减去<math>\psi</math>完全不影响假设函数的预测结果！这一现象表明，softmax回归中存在冗余的参数。更正式地说，我们的 Softmax 模型被过度参数化了：对于任意一个拟合数据的假设函数 <math>h_\theta</math>，都存在多组参数值，它们给出完全相同的从输入 <math>x</math> 到预测结果的映射。

'''一审''':

换句话说，从每个<math>\theta_j</math>中都减去<math>\psi</math>完全不会影响我们的估值函数的预测结果！这表明了softmax回归中的参数是“冗余”的。更正式一点来说，我们的Softmax模型被过度参数化了，这意味着对于任何我们用来与数据相拟合的估计值，都会存在多组参数集，它们能够生成完全相同的估值函数 <math>h_\theta</math> 将输入<math>x</math> 映射到预测值。

'''原文''':

Further, if the cost function <math>J(\theta)</math> is minimized by some
setting of the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>,
then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots,
\theta_k - \psi)</math> for any value of <math>\psi</math>.  Thus, the
minimizer of <math>J(\theta)</math> is not unique.  (Interestingly, 
<math>J(\theta)</math> is still convex, and thus gradient descent will
not run into local optima problems.  But the Hessian is singular/non-invertible,
which causes a straightforward implementation of Newton's method to run into
numerical problems.) 



'''译文''':
另外，如果参数<math>(\theta_1,\theta_2,\ldots,\theta_k)</math>使损失函数<math>J(\theta)</math>取得最小值，那么对于任意的<math>\psi</math>，参数<math>(\theta_1 - \psi, \theta_2 - \psi, \ldots, \theta_k - \psi)</math>同样使其取得最小值。因此，使<math>J(\theta)</math>最小化的解不是唯一的。（有趣的是，<math>J(\theta)</math> 仍然是一个凸函数，因此梯度下降时不会陷入局部最优。但是 Hessian矩阵是奇异的/不可逆的，这会导致直接实现的牛顿法出现数值计算的问题。）

'''一审''':
进一步而言，如果代价函数<math>J(\theta)</math>能够通过参数集<math>(\theta_1, \theta_2,\ldots, \theta_k)</math>得到最小值，那么它使用参数集<math>(\theta_1 - \psi, \theta_2 - \psi,\ldots, \theta_k - \psi)</math>同样也会得到最小值，其中<math>\psi</math>可以为任意向量。因此使<math>J(\theta)</math>最小化的解不是唯一的。（有趣的是，<math>J(\theta)</math>仍然是一个凸函数，因此梯度下降时不会遇到陷入局部最优解的问题。但是Hessian 矩阵是奇异的/不可逆的，这会导致直接实现的牛顿法出现数值计算的问题。）

'''原文''':


Notice also that by setting <math>\psi = \theta_1</math>, one can always
replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the vector of all
0's), without affecting the hypothesis.  Thus, one could "eliminate" the vector
of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for
any single value of <math>j</math>), without harming the representational power
of our hypothesis.  Indeed, rather than optimizing over the <math>k(n+1)</math>
parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where
<math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 =
\vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math>
remaining parameters, and this would work fine.


'''译文''':
我们注意到，当<math>\psi = \theta_1</math>时，我们总是可以将<math>\theta_1</math>替换为 <math>\theta_1 - \psi = \vec{0}</math>（全零向量），而这一替换不影响假设函数。因此，我们可以去掉参数向量<math>\theta_1</math>（或者其他任意一个<math>\theta_j</math>）而不影响假设函数的表达能力。实际上，我们可以不必优化全部的<math>k(n+1)</math>个参数<math>(\theta_1, \theta_2,\ldots, \theta_k)</math>（其中<math>\theta_j \in \Re^{n+1}</math>），而是令<math>\theta_1 = \vec{0}</math>，只优化剩余的<math>(k-1)(n+1)</math>个参数。

'''一审''':

我们注意到，当<math>\psi = \theta_1</math>时，我们总是可以将<math>\theta_1</math>替换为 <math>\theta_1 - \psi = \vec{0}</math>（替换为全零向量） ， 这并不会影响到估计值。因此我们可以“去除掉”参数向量<math>\theta_1</math>（或者任意其他 <math>\theta_j</math>中的其中一个）而不损害到我们估计值的实际功用。实际上，与其优化全部的<math>k(n+1)</math>个参数<math>(\theta_1, \theta_2,\ldots, \theta_k)</math>（其中<math>\theta_j \in \Re^{n+1}</math>），不如让我们令 <math>\theta_1 =
\vec{0}</math> ，之后只优化剩余的<math>(k-1)(n+1)</math>个参数，这样算法依然能够正常工作。
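
The elimination of <math>\theta_1</math> described above can be sketched as follows. The helper name <code>eliminate_first</code> and the parameter values are assumptions for illustration.

```python
import math

def class_probs(theta, x):
    exps = [math.exp(sum(t * xi for t, xi in zip(row, x))) for row in theta]
    z = sum(exps)
    return [e / z for e in exps]

def eliminate_first(theta):
    """Subtract psi = theta_1 from every row, so the first row becomes all zeros."""
    psi = theta[0]
    return [[t - p for t, p in zip(row, psi)] for row in theta]

theta = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]  # k = 3, n + 1 = 2
reduced = eliminate_first(theta)
x = [1.0, 2.0]

print(reduced[0])                # [0.0, 0.0]
print(class_probs(theta, x))
print(class_probs(reduced, x))   # same predictions as the full parameterization

k, n_plus_1 = len(theta), len(theta[0])
free_parameters = (k - 1) * n_plus_1  # only rows 2..k still need to be learned
print(free_parameters)
```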


'''原文''':

In practice, however, it is often cleaner and simpler to implement the version which keeps
all the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, without
arbitrarily setting one of them to zero.  But we will
make one change to the cost function: Adding weight decay.  This will take care of
the numerical problems associated with softmax regression's overparameterized representation.


'''译文''':
在实际过程中，实现一个保留所有参数<math>(\theta_1, \theta_2,\ldots, \theta_k)</math>、而不把其中某一个任意置为零的模型往往更简单清楚。但此时我们需要对损失函数做一个改动：增加权重衰减。权重衰减可以解决 softmax 回归的参数冗余所带来的数值问题。

'''一审''':

在实际过程中，实现一个保留所有参数<math>(\theta_1, \theta_2,\ldots, \theta_k)</math>、不去任意地将某一参数向量置0的模型往往更简单清楚。但是我们需要对代价函数做一个改动：增加权重衰减。这将有助于解决由Softmax回归算法的参数冗余形式所带来的数值问题。

==权重衰减  Weight Decay ==

'''原文''':

We will modify the cost function by adding a weight decay term
<math>\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2</math>
which penalizes large values of the parameters.  Our cost function is now

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}  \right]
              + \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2
\end{align}
</math>


'''译文''':

我们通过添加一个权重衰减项 <math>\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2</math>来修改损失函数，这个衰减项会惩罚过大的参数值，现在我们的损失函数变成：

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}  \right]
              + \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2
\end{align}
</math>

'''一审''':
我们通过添加一个权重衰减项 <math>\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2</math>来修改代价函数，这个衰减项会惩罚过大的参数值，现在我们的代价函数变为：

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}  \right]
              + \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2
\end{align}
</math>
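
The effect of the weight decay term on the redundancy can be seen in a small Python sketch. The function name <code>decayed_cost</code>, the value <code>lam = 0.1</code>, and the toy data are assumptions for illustration. Without the penalty, shifting every <math>\theta_j</math> by the same vector leaves the cost unchanged; with it, the two parameter settings no longer tie.

```python
import math

def class_probs(theta, x):
    scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    shift = max(scores)  # stability shift; cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def decayed_cost(theta, X, y, lam):
    """Softmax cost plus the weight decay term (lam / 2) * sum of squared parameters."""
    data_term = -sum(math.log(class_probs(theta, x_i)[y_i - 1])
                     for x_i, y_i in zip(X, y)) / len(X)
    penalty = 0.5 * lam * sum(t * t for row in theta for t in row)
    return data_term + penalty

# Toy data and parameters, made up for illustration.
X = [[1.0, 2.0], [1.0, 0.0], [1.0, -2.0]]
y = [1, 2, 3]
theta = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
psi = [1.0, 1.0]
shifted = [[t - p for t, p in zip(row, psi)] for row in theta]

print(decayed_cost(theta, X, y, 0.0), decayed_cost(shifted, X, y, 0.0))  # equal
print(decayed_cost(theta, X, y, 0.1), decayed_cost(shifted, X, y, 0.1))  # differ
```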


'''原文''':

With this weight decay term (for any <math>\lambda > 0</math>), the cost function
<math>J(\theta)</math> is now strictly convex, and is guaranteed to have a
unique solution.  The Hessian is now invertible, and because <math>J(\theta)</math> is
convex, algorithms such as gradient descent, L-BFGS, etc. are guaranteed
to converge to the global minimum.


'''译文''':
有了这个权重衰减项以后（对于任意的<math>\lambda > 0</math>），损失函数就变成了严格的凸函数，可以保证解唯一。此时的 Hessian 矩阵变为可逆矩阵，并且因为<math>J(\theta)</math>是凸的，梯度下降和 L-BFGS 之类的算法可以保证收敛到全局最优解。

'''一审''':
有了这个权重衰减项以后 (对于任意的<math>\lambda > 0</math>)，代价函数就变成了严格的凸函数，这样就可以保证得到唯一的解了。 此时的 Hessian矩阵 变为可逆矩阵 ， 并且因为<math>J(\theta)</math>是凸函数 ，梯度下降法和 L-BFGS 等算法可以保证收敛到全局最优解。

'''原文''':

To apply an optimization algorithm, we also need the derivative of this
new definition of <math>J(\theta)</math>.  One can show that the derivative is:
<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} ( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) ) \right]  } + \lambda \theta_j
\end{align}
</math>


'''译文''':
为了使用优化算法，我们还需要求得这个新定义的<math>J(\theta)</math>的导数，其形式如下：
<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} ( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) ) \right]  } + \lambda \theta_j
\end{align}
</math>


'''一审''':
为了使用优化算法，我们还需要求得这个新定义的<math>J(\theta)</math>的导数。可以证明，导数公式如下：

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} ( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) ) \right]  } + \lambda \theta_j
\end{align}
</math>
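
A sketch of the regularized gradient, with a one-coordinate finite-difference check, might look like this. The names <code>decayed_gradient</code> and <code>finite_difference</code>, the value of <code>lam</code>, and the data are illustrative assumptions.

```python
import math

def class_probs(theta, x):
    scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    shift = max(scores)  # stability shift; cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def decayed_cost(theta, X, y, lam):
    data_term = -sum(math.log(class_probs(theta, x_i)[y_i - 1])
                     for x_i, y_i in zip(X, y)) / len(X)
    return data_term + 0.5 * lam * sum(t * t for row in theta for t in row)

def decayed_gradient(theta, X, y, lam):
    """Gradient of the regularized cost; row j corresponds to theta_j."""
    k, d, m = len(theta), len(theta[0]), len(X)
    grad = [[lam * t for t in row] for row in theta]  # the lam * theta_j term
    for x_i, y_i in zip(X, y):
        p = class_probs(theta, x_i)
        for j in range(k):
            residual = (1.0 if y_i == j + 1 else 0.0) - p[j]
            for l in range(d):
                grad[j][l] -= x_i[l] * residual / m
    return grad

def finite_difference(theta, X, y, lam, j, l, eps=1e-6):
    """Central-difference estimate of the partial derivative w.r.t. theta[j][l]."""
    plus = [row[:] for row in theta]
    minus = [row[:] for row in theta]
    plus[j][l] += eps
    minus[j][l] -= eps
    return (decayed_cost(plus, X, y, lam) - decayed_cost(minus, X, y, lam)) / (2 * eps)

# Toy data and parameters, made up for illustration.
X = [[1.0, 0.5, -1.0], [1.0, 2.0, 0.3], [1.0, -1.2, 0.8], [1.0, 0.1, 0.1]]
y = [1, 2, 3, 2]
theta = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1], [-0.3, 0.2, 0.5]]
lam = 0.1
grad = decayed_gradient(theta, X, y, lam)
print(grad[1][2], finite_difference(theta, X, y, lam, 1, 2))  # should nearly agree
```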

'''原文''':

By minimizing <math>J(\theta)</math> with respect to <math>\theta</math>, we will have a working implementation of softmax regression.


'''译文''':
通过最小化<math>J(\theta)</math> ，我们就能实现一个可用的softmax回归模型。


'''一审''':
通过对参数 <math>\theta</math>进行函数<math>J(\theta)</math> 的最小化求解，我们就得到了一个可用的 softmax 回归的实现。

==Softmax回归与Logistic 回归的关系 Relationship to Logistic Regression ==

'''原文''':

In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression.
This shows that softmax regression is a generalization of logistic regression.  Concretely, when <math>k=2</math>,
the softmax regression hypothesis outputs

<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^Tx}  + e^{ \theta_2^T x } }
\begin{bmatrix}
e^{ \theta_1^T x } \\
e^{ \theta_2^T x }
\end{bmatrix}
\end{align}
</math>

'''译文''':

当类别数<math>k = 2</math>时，softmax回归退化为logistic回归。这一点表明了softmax回归是logistic回归的推广形式。具体地说，当<math>k = 2</math>时，softmax 回归的假设函数：

<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^Tx}  + e^{ \theta_2^T x } }
\begin{bmatrix}
e^{ \theta_1^T x } \\
e^{ \theta_2^T x }
\end{bmatrix}
\end{align}
</math>

'''一审''':
在类别数<math>k = 2</math>的特例中 ，我们会看到softmax回归退化成了logistic 回归。这一点表明了softmax回归是logistic 回归的 一般化形式。具体地说，当<math>k = 2</math>时，softmax回归的估值函数为 ：

<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^Tx}  + e^{ \theta_2^T x } }
\begin{bmatrix}
e^{ \theta_1^T x } \\
e^{ \theta_2^T x }
\end{bmatrix}
\end{align}
</math>

'''原文''':

Taking advantage of the fact that this hypothesis
is overparameterized and setting <math>\psi = \theta_1</math>,
we can subtract <math>\theta_1</math> from each of the two parameters, giving us

<math>
\begin{align}
h(x) &=
\frac{1}{ e^{\vec{0}^Tx}  + e^{ (\theta_2-\theta_1)^T x } }
\begin{bmatrix}
e^{ \vec{0}^T x } \\
e^{ (\theta_2-\theta_1)^T x }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\frac{e^{ (\theta_2-\theta_1)^T x }}{ 1 + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1  + e^{ (\theta_2-\theta_1)^T x } } \\
1 - \frac{1}{ 1  + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix}
\end{align}
</math>

'''译文''':
利用 softmax 回归参数冗余的特点，我们令 <math>\psi = \theta_1</math>，再将<math>\theta_1</math>分别从两个参数向量中减掉，得到：

<math>
\begin{align}
h(x) &=
\frac{1}{ e^{\vec{0}^Tx}  + e^{ (\theta_2-\theta_1)^T x } }
\begin{bmatrix}
e^{ \vec{0}^T x } \\
e^{ (\theta_2-\theta_1)^T x }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\frac{e^{ (\theta_2-\theta_1)^T x }}{ 1 + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1  + e^{ (\theta_2-\theta_1)^T x } } \\
1 - \frac{1}{ 1  + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix}
\end{align}
</math>

'''一审''':

利用估值函数参数冗余的优势，我们令<math>\psi = \theta_1</math>，并且从两个参数向量中都减去向量<math>\theta_1</math>，得到:

<math>
\begin{align}
h(x) &=
\frac{1}{ e^{\vec{0}^Tx}  + e^{ (\theta_2-\theta_1)^T x } }
\begin{bmatrix}
e^{ \vec{0}^T x } \\
e^{ (\theta_2-\theta_1)^T x }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\frac{e^{ (\theta_2-\theta_1)^T x }}{ 1 + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1  + e^{ (\theta_2-\theta_1)^T x } } \\
1 - \frac{1}{ 1  + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix}
\end{align}
</math>

'''原文''':

Thus, replacing <math>\theta_2-\theta_1</math> with a single parameter vector <math>\theta'</math>, we find
that softmax regression predicts the probability of one of the classes as
<math>\frac{1}{ 1  + e^{ (\theta')^T x } }</math>,
and that of the other class as
<math>1 - \frac{1}{ 1 + e^{ (\theta')^T x } }</math>,
same as logistic regression.


'''译文''':
于是，将<math>\theta_2-\theta_1</math>用单个参数向量<math>\theta'</math>来表示，我们发现softmax回归预测其中一个类别的概率为 <math>\frac{1}{ 1  + e^{ (\theta')^T x } }</math>，另一个类别的概率为<math>1 - \frac{1}{ 1 + e^{ (\theta')^T x } }</math>，这与 logistic回归是一致的。

'''一审''':
于是，将<math>\theta_2-\theta_1</math>用单个参数向量<math>\theta'</math>来表示，我们发现softmax回归预测其中一个类别的概率为 <math>\frac{1}{ 1  + e^{ (\theta')^T x } }</math>，另一个类别的概率为<math>1 - \frac{1}{ 1 + e^{ (\theta')^T x } }</math>，这与 logistic回归是一致的。
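
The reduction to logistic regression can be confirmed numerically for <math>k = 2</math>. In this sketch the parameter values are arbitrary and the function names are assumptions for illustration. Note the sign: the first class gets probability <math>1/(1 + e^{(\theta')^T x})</math>, which is the logistic function evaluated at <math>-(\theta')^T x</math>.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def softmax_two_class(theta_1, theta_2, x):
    """Softmax hypothesis for k = 2, computed directly from the definition."""
    e1 = math.exp(dot(theta_1, x))
    e2 = math.exp(dot(theta_2, x))
    return [e1 / (e1 + e2), e2 / (e1 + e2)]

def logistic_form(theta_prime, x):
    """The same prediction written with the single vector theta' = theta_2 - theta_1."""
    p_first = 1.0 / (1.0 + math.exp(dot(theta_prime, x)))
    return [p_first, 1.0 - p_first]

theta_1 = [0.4, -1.2, 0.7]   # arbitrary illustrative values
theta_2 = [-0.9, 0.3, 1.5]
theta_prime = [b - a for a, b in zip(theta_1, theta_2)]
x = [1.0, 0.8, -0.5]

print(softmax_two_class(theta_1, theta_2, x))
print(logistic_form(theta_prime, x))  # agrees up to floating-point rounding
```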

==Softmax 回归 vs. k 个二元分类器 Softmax Regression vs. k Binary Classifiers ==

'''原文''':

Suppose you are working on a music classification application, and there are
<math>k</math> types of music that you are trying to recognize.  Should you use a
softmax classifier, or should you build <math>k</math> separate binary classifiers using
logistic regression?

'''译文''':


如果你在开发一个音乐分类的应用，需要对<math>k</math>种类型的音乐进行分类，那么是选择softmax回归直接进行多分类，还是使用 logistic回归进行二分类再进行组合呢？


'''一审''':

如果你在开发一个音乐分类的应用，需要对<math>k</math>种类型的音乐进行识别，那么是选择使用softmax分类器呢，还是使用 logistic回归算法去建立 <math>k</math>个分离的二元分类器呢？

'''原文''':


This will depend on whether your classes are ''mutually exclusive.''  For example,
if your four classes are classical, country, rock, and jazz, then assuming each
of your training examples is labeled with exactly one of these four class labels,
you should build a softmax classifier with <math>k=4</math>.
(If there're also some examples that are none of the above four classes,
then you can set <math>k=5</math> in softmax regression, and also have a fifth, "none of the above," class.)


'''译文''':

这一选择取决于你的类别之间是否互斥，例如，如果你有四个类别的音乐，分别为：古典音乐、乡村音乐、摇滚乐和爵士乐，那么你可以假设每个训练样本只会被打上一个标签（即：一首歌只能属于这四种音乐类型的其中一种），此时你应该使用类别数 <math>k=4</math>的softmax回归。（如果在你的数据集中，有的歌曲不属于以上四类的其中任何一类，那么你可以设置一个类别叫做“其他”，并将类别数 <math>k</math>设为5。）

'''一审''':

这一选择取决于你的类别之间是否互斥。例如，如果你有四个音乐类别，分别为：古典音乐、乡村音乐、摇滚乐和爵士乐，那么在每个训练样本都恰好只带有其中一个标签的前提下，你应该使用类别数<math>k=4</math>的softmax分类器。（如果在你的数据集中，有的歌曲不属于以上四类的其中任何一类，那么你可以将类别数<math>k</math>设为5，并且设置第五个类别叫做“以上皆否”。）

'''原文''':


If however your categories are has_vocals, dance, soundtrack, pop, then the
classes are not mutually exclusive; for example, there can be a piece of pop
music that comes from a soundtrack and in addition has vocals.  In this case, it
would be more appropriate to build 4 binary logistic regression classifiers.
This way, for each new musical piece, your algorithm can separately decide whether
it falls into each of the four categories.

'''译文''':
如果你的四个类别如下：声乐作品、舞曲、影视原声带、流行歌曲。我们可以看出这些类别之间并不是互斥的：一首歌曲可以是影视原声带，同时也是声乐作品。这种情况下，使用4个二分类的logistic 回归更为合适。这样，对于每一首歌，我们的算法可以分别判断它是否属于各个类别。

'''一审''':
如果你的四个类别如下：人声音乐、舞曲、影视原声、流行歌曲，那么这些类别之间并不是互斥的。例如：一首歌曲可以来源于影视原声，同时也包含人声 。这种情况下，使用4个二分类的logistic回归分类器更为合适。这样，对于每个新的音乐作品 ，我们的算法可以分别判断它是否属于各个类别。
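
The difference between the two designs shows up in the outputs. In the sketch below, the four scores stand in for the inner products <math>\theta_j^T x</math> of some already-trained model; they are made-up numbers, and the 0.5 decision threshold is likewise an assumption. Softmax forces the four probabilities to compete and sum to 1, whereas four separate logistic classifiers each give an independent yes/no probability.

```python
import math

def softmax(scores):
    shift = max(scores)  # stability shift; cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

categories = ["has_vocals", "dance", "soundtrack", "pop"]
scores = [2.0, 1.0, -1.0, 0.5]  # hypothetical per-category scores for one song

# Mutually exclusive reading: exactly one label, the probabilities sum to 1.
exclusive = softmax(scores)
single_label = categories[exclusive.index(max(exclusive))]

# Non-exclusive reading: one independent binary decision per category.
independent = [sigmoid(s) for s in scores]
multi_labels = [c for c, p in zip(categories, independent) if p > 0.5]

print(single_label, sum(exclusive))
print(multi_labels, sum(independent))
```

With these scores the softmax reading picks a single category, while the independent classifiers assign three labels to the same song, which is the behavior wanted when categories overlap.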


'''原文''':


Now, consider a computer vision example, where you're trying to classify images into
three different classes.  (i) Suppose that your classes are indoor_scene,
outdoor_urban_scene, and outdoor_wilderness_scene.  Would you use softmax regression
or three logistic regression classifiers?  (ii) Now suppose your classes are
indoor_scene, black_and_white_image, and image_has_people.  Would you use softmax
regression or multiple logistic regression classifiers?

'''译文''':

现在我们来看一个计算机视觉领域的例子，你的任务是将图像分到三个类别中。 (i) 假设这三个类别分别是：室内场景、户外城区场景、户外荒野场景。你会使用 softmax回归还是3 个logistic回归分类器呢？ (ii) 假设这三个类别分别是室内场景、黑白图片、包含人物的图片，你又会选择softmax回归还是多个logistic回归分类器呢？

'''一审''':

现在我们来看一个计算机视觉领域的例子，你的任务是将图像分到三个不同类别中。(i) 假设这三个类别分别是：室内场景、户外城区场景、户外荒野场景。你会使用softmax回归还是 3个logistic 回归分类器呢？ (ii) 现在假设这三个类别分别是室内场景、黑白图片、包含人物的图片，你又会选择softmax回归还是多个logistic回归分类器呢？

'''原文''':

In the first case, the classes are mutually exclusive, so a softmax regression
classifier would be appropriate.  In the second case, it would be more appropriate to build
three separate logistic regression classifiers.

'''译文''':

在第一个例子中，三个类别是互斥的，因此选择softmax回归更合适。而在第二个例子则应该选择 logistic回归。

'''一审''':

在第一个例子中，三个类别是互斥的，因此更适于选择softmax回归分类 。而在第二个例子中，建立三个独立的 logistic回归分类器更加合适。