Softmax回归 (Softmax Regression)

== Properties of softmax regression parameterization ==

Softmax regression has an unusual property: it has a "redundant" set of parameters. To explain what this means, suppose we take each of our parameter vectors <math>\theta_j</math> and subtract some fixed vector <math>\psi</math> from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math> (for every <math>j=1, \ldots, k</math>). Our hypothesis now estimates the class label probabilities as

<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{(\theta_l-\psi)^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^T x^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}}}.
\end{align}
</math>
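
To make this cancellation concrete, the following small numerical check (added here as an illustration, not part of the original tutorial text; the NumPy-based <code>softmax_probs</code> helper and the chosen dimensions are assumptions) verifies that the predicted probabilities do not change when a fixed <math>\psi</math> is subtracted from every <math>\theta_j</math>:

<pre>
import numpy as np

def softmax_probs(theta, x):
    """Class probabilities p(y = j | x; theta) for a k x (n+1) parameter matrix theta."""
    scores = theta @ x                    # theta_j^T x for each class j
    scores = scores - scores.max()        # stabilizes exp(); does not change the result
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(0)
k, n = 4, 5
theta = rng.normal(size=(k, n + 1))       # k parameter vectors theta_j in R^{n+1}
x = rng.normal(size=n + 1)                # one input x^{(i)}, intercept term included
psi = rng.normal(size=n + 1)              # arbitrary fixed vector psi

p_original = softmax_probs(theta, x)
p_shifted = softmax_probs(theta - psi, x)    # every theta_j replaced by theta_j - psi
print(np.allclose(p_original, p_shifted))    # True: the predictions are identical
</pre>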

In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect our hypothesis' predictions at all! This shows that softmax regression's parameters are "redundant." More formally, we say that our softmax model is '''overparameterized''': for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math> to the predictions.

Further, if the cost function <math>J(\theta)</math> is minimized by some setting of the parameters <math>(\theta_1, \theta_2, \ldots, \theta_k)</math>, then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi, \ldots, \theta_k - \psi)</math> for any value of <math>\psi</math>. Thus, the minimizer of <math>J(\theta)</math> is not unique. (Interestingly, <math>J(\theta)</math> is still convex, so gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)
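
The non-uniqueness of the minimizer can be checked the same way: shifting all parameter vectors by the same <math>\psi</math> leaves the value of the cost unchanged. The sketch below assumes a <code>softmax_cost</code> helper implementing the usual softmax negative log-likelihood; it is illustrative only and not code from this tutorial:

<pre>
import numpy as np

def softmax_cost(theta, X, y):
    """Average negative log-likelihood of labels y under softmax parameters theta (k x (n+1))."""
    scores = X @ theta.T                                  # m x k matrix of theta_j^T x^{(i)}
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability only
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

rng = np.random.default_rng(1)
m, k, n = 50, 4, 5
X = rng.normal(size=(m, n + 1))
y = rng.integers(0, k, size=m)
theta = rng.normal(size=(k, n + 1))
psi = rng.normal(size=n + 1)

# The whole family theta - psi attains exactly the same cost as theta.
print(np.isclose(softmax_cost(theta, X, y), softmax_cost(theta - psi, X, y)))   # True
</pre>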

Notice also that by setting <math>\psi = \theta_1</math>, one can always replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the vector of all 0's) without affecting the hypothesis. Thus, one could "eliminate" the parameter vector <math>\theta_1</math> (or any other single <math>\theta_j</math>) without harming the representational power of our hypothesis. Indeed, rather than optimizing over the <math>k(n+1)</math> parameters <math>(\theta_1, \theta_2, \ldots, \theta_k)</math> (where <math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 = \vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math> remaining parameters, and this would work fine.
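
One way to implement this idea is to store only <math>\theta_2, \ldots, \theta_k</math> and pin <math>\theta_1</math> to the zero vector when computing probabilities. The sketch below is illustrative only (the function name and dimensions are assumptions, not the tutorial's code):

<pre>
import numpy as np

def softmax_probs_fixed_theta1(theta_rest, x):
    """Class probabilities with theta_1 pinned to the zero vector.

    theta_rest holds only theta_2, ..., theta_k (shape (k-1, n+1)); the full
    k x (n+1) parameter matrix is recovered by stacking a zero row on top.
    """
    theta_full = np.vstack([np.zeros((1, theta_rest.shape[1])), theta_rest])
    scores = theta_full @ x
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(2)
k, n = 4, 5
theta_rest = rng.normal(size=(k - 1, n + 1))   # only (k-1)(n+1) numbers would be optimized
x = rng.normal(size=n + 1)
p = softmax_probs_fixed_theta1(theta_rest, x)
print(p.shape, p.sum())                        # (4,) and the probabilities sum to 1
</pre>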

In practice, however, it is often cleaner and simpler to implement the version which keeps all the parameters <math>(\theta_1, \theta_2, \ldots, \theta_k)</math>, without arbitrarily setting one of them to zero. But we will make one change to the cost function: adding weight decay. This will take care of the numerical problems associated with softmax regression's overparameterized representation.
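
As a rough preview of that change, weight decay simply adds an L2 penalty on the parameters to the data term of the cost. The sketch below is an assumption-laden illustration (the penalty strength <code>lam</code>, its default value, and whether intercept terms are penalized are not taken from the tutorial):

<pre>
import numpy as np

def weight_decayed_cost(theta, X, y, lam=1e-4):
    """Softmax negative log-likelihood plus an L2 weight-decay penalty.

    The extra (lam / 2) * sum(theta**2) term penalizes large parameters, so
    shifted parameter settings no longer give the same cost.
    """
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    data_term = -log_probs[np.arange(len(y)), y].mean()
    return data_term + (lam / 2.0) * np.sum(theta ** 2)
</pre>

Adding this term makes the Hessian of the cost positive definite (the data term's Hessian plus <math>\lambda I</math>), which is what resolves the numerical problems mentioned above.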

== 4 ==

== 5 ==
