Softmax Regression
From Ufldl
'''First translation''': @knighterzjy '''Review''': @GuitarFang

== Introduction ==

In these notes, we describe the '''Softmax regression''' model. This model generalizes logistic regression to classification problems where the class label <math>y</math> can take on more than two possible values. This is useful for problems such as MNIST digit classification, where the goal is to distinguish between 10 different numerical digits. Softmax regression is a supervised learning algorithm, but we will later be using it in conjunction with our deep learning/unsupervised feature learning methods.

(Translator's note: MNIST is a handwritten digit recognition database, maintained by Yann LeCun and others at NYU. http://yann.lecun.com/exdb/mnist/)

Recall that in logistic regression, we had a training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> of <math>m</math> labeled examples, where the input features are <math>x^{(i)} \in \Re^{n+1}</math>. (In this set of notes, we will use the notational convention of letting the feature vectors <math>x</math> be <math>n+1</math> dimensional, with <math>x_0 = 1</math> corresponding to the intercept term.) With logistic regression, we were in the binary classification setting, so the labels were <math>y^{(i)} \in \{0,1\}</math>.
Our hypothesis took the form:

<math>\begin{align} h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)}, \end{align}</math>

and the model parameters <math>\theta</math> were trained to minimize the cost function

<math>\begin{align} J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right] \end{align}</math>

In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label <math>y</math> can take on <math>k</math> different values, rather than only two. Thus, in our training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>, we now have <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>. (Note that our convention will be to index the classes starting from 1, rather than from 0.) For example, in the MNIST digit recognition task, we would have <math>k=10</math> different classes.

Given a test input <math>x</math>, we want our hypothesis to estimate the probability <math>p(y=j | x)</math> for each value of <math>j = 1, \ldots, k</math>. That is, we want to estimate the probability of the class label taking on each of the <math>k</math> different possible values.
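As a rough sketch (not part of the original notes), the binary logistic regression hypothesis and cost function recalled above can be computed in plain Python. The function names and toy data below are illustrative, not from the tutorial:

```python
import math

def sigmoid(z):
    # Logistic function: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + math.exp(-z))

def h_theta(theta, x):
    # Hypothesis h_theta(x) = sigmoid(theta^T x).
    # theta and x are both (n+1)-dimensional lists, with x[0] == 1 (intercept term).
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = h_theta(theta, x_i)
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -total / m

# Toy data: each example is [1, feature]; the leading 1 is the intercept term x_0.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
y = [1, 0, 1]
theta = [0.1, 1.0]
print(cost(theta, X, y))
```

With all-zero parameters the hypothesis outputs 0.5 for every input, which is the symmetric starting point gradient descent would typically be run from.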
Thus, our hypothesis will output a <math>k</math>-dimensional vector (whose elements sum to 1) giving us our <math>k</math> estimated probabilities. Concretely, our hypothesis <math>h_{\theta}(x)</math> takes the form:

<math>\begin{align} h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)} = 1 | x^{(i)}; \theta) \\ p(y^{(i)} = 2 | x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)} = k | x^{(i)}; \theta) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } \begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_k^T x^{(i)} } \\ \end{bmatrix} \end{align}</math>

Here <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> are the parameters of our model. Notice that the term <math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }</math> normalizes the distribution, so that it sums to one.

For convenience, we will also write <math>\theta</math> to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math> \theta = \begin{bmatrix} \mbox{---} \theta_1^T \mbox{---} \\ \mbox{---} \theta_2^T \mbox{---} \\ \vdots \\ \mbox{---} \theta_k^T \mbox{---} \\ \end{bmatrix} </math>
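A minimal sketch of the softmax hypothesis, using the row-stacked <math>k</math>-by-<math>(n+1)</math> parameter matrix described above. The max-subtraction step is a standard numerical-stability trick that is not in the text (it leaves the probabilities unchanged, since it cancels in the ratio); names and toy values are illustrative:

```python
import math

def softmax_h(theta, x):
    # theta: k-by-(n+1) parameter matrix (theta_1, ..., theta_k stacked in rows);
    # x: (n+1)-dimensional input with x[0] == 1 (intercept term).
    # Returns the k-dimensional vector of estimated probabilities p(y = j | x; theta).
    scores = [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in theta]
    # Subtract the max score before exponentiating, to avoid overflow;
    # the common factor cancels in the normalization below.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)  # the normalizing term sum_j exp(theta_j^T x)
    return [e / z for e in exps]

# k = 3 classes, n = 1 feature (plus intercept), so theta is 3-by-2.
theta = [[0.0, 1.0],
         [0.0, -1.0],
         [0.5, 0.0]]
x = [1.0, 2.0]
probs = softmax_h(theta, x)
print(probs)  # k probabilities, summing to 1
```

The predicted class is simply the index of the largest entry of the returned vector.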