Softmax回归 (Softmax Regression)

== Properties of softmax regression parameterization ==

Softmax regression has an unusual property: it has a "redundant" set of parameters. To explain what this means, suppose we take each of our parameter vectors <math>\theta_j</math> and subtract some fixed vector <math>\psi</math> from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math> (for every <math>j=1, \ldots, k</math>). Our hypothesis now estimates the class label probabilities as

<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{(\theta_l-\psi)^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^T x^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}}}.
\end{align}
</math>
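
To make this cancellation concrete, the following small numerical check (added here as an illustration, not part of the original tutorial text; the NumPy-based <code>softmax_probs</code> helper and the chosen dimensions are assumptions) verifies that the predicted probabilities do not change when a fixed <math>\psi</math> is subtracted from every <math>\theta_j</math>:

<pre>
import numpy as np

def softmax_probs(theta, x):
    """Class probabilities p(y = j | x; theta) for a k x (n+1) parameter matrix theta."""
    scores = theta @ x                    # theta_j^T x for each class j
    scores = scores - scores.max()        # stabilizes exp(); does not change the result
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(0)
k, n = 4, 5
theta = rng.normal(size=(k, n + 1))       # k parameter vectors theta_j in R^{n+1}
x = rng.normal(size=n + 1)                # one input x^{(i)}, intercept term included
psi = rng.normal(size=n + 1)              # arbitrary fixed vector psi

p_original = softmax_probs(theta, x)
p_shifted = softmax_probs(theta - psi, x)    # every theta_j replaced by theta_j - psi
print(np.allclose(p_original, p_shifted))    # True: the predictions are identical
</pre>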

In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect our hypothesis' predictions at all! This shows that softmax regression's parameters are "redundant." More formally, we say that our softmax model is '''overparameterized''': for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math> to the predictions.

Further, if the cost function <math>J(\theta)</math> is minimized by some setting of the parameters <math>(\theta_1, \theta_2, \ldots, \theta_k)</math>, then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi, \ldots, \theta_k - \psi)</math> for any value of <math>\psi</math>. Thus, the minimizer of <math>J(\theta)</math> is not unique. (Interestingly, <math>J(\theta)</math> is still convex, so gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)
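
The non-uniqueness of the minimizer can be checked the same way: shifting all parameter vectors by the same <math>\psi</math> leaves the value of the cost unchanged. The sketch below assumes a <code>softmax_cost</code> helper implementing the usual softmax negative log-likelihood; it is illustrative only and not code from this tutorial:

<pre>
import numpy as np

def softmax_cost(theta, X, y):
    """Average negative log-likelihood of labels y under softmax parameters theta (k x (n+1))."""
    scores = X @ theta.T                                  # m x k matrix of theta_j^T x^{(i)}
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability only
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

rng = np.random.default_rng(1)
m, k, n = 50, 4, 5
X = rng.normal(size=(m, n + 1))
y = rng.integers(0, k, size=m)
theta = rng.normal(size=(k, n + 1))
psi = rng.normal(size=n + 1)

# The whole family theta - psi attains exactly the same cost as theta.
print(np.isclose(softmax_cost(theta, X, y), softmax_cost(theta - psi, X, y)))   # True
</pre>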

Notice also that by setting <math>\psi = \theta_1</math>, one can always replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the vector of all 0's) without affecting the hypothesis. Thus, one could "eliminate" the parameter vector <math>\theta_1</math> (or any other single <math>\theta_j</math>) without harming the representational power of our hypothesis. Indeed, rather than optimizing over the <math>k(n+1)</math> parameters <math>(\theta_1, \theta_2, \ldots, \theta_k)</math> (where <math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 = \vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math> remaining parameters, and this would work fine.
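
One way to implement this idea is to store only <math>\theta_2, \ldots, \theta_k</math> and pin <math>\theta_1</math> to the zero vector when computing probabilities. The sketch below is illustrative only (the function name and dimensions are assumptions, not the tutorial's code):

<pre>
import numpy as np

def softmax_probs_fixed_theta1(theta_rest, x):
    """Class probabilities with theta_1 pinned to the zero vector.

    theta_rest holds only theta_2, ..., theta_k (shape (k-1, n+1)); the full
    k x (n+1) parameter matrix is recovered by stacking a zero row on top.
    """
    theta_full = np.vstack([np.zeros((1, theta_rest.shape[1])), theta_rest])
    scores = theta_full @ x
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(2)
k, n = 4, 5
theta_rest = rng.normal(size=(k - 1, n + 1))   # only (k-1)(n+1) numbers would be optimized
x = rng.normal(size=n + 1)
p = softmax_probs_fixed_theta1(theta_rest, x)
print(p.shape, p.sum())                        # (4,) and the probabilities sum to 1
</pre>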

In practice, however, it is often cleaner and simpler to implement the version which keeps all the parameters <math>(\theta_1, \theta_2, \ldots, \theta_k)</math>, without arbitrarily setting one of them to zero. But we will make one change to the cost function: adding weight decay. This will take care of the numerical problems associated with softmax regression's overparameterized representation.
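
As a rough preview of that change, weight decay simply adds an L2 penalty on the parameters to the data term of the cost. The sketch below is an assumption-laden illustration (the penalty strength <code>lam</code>, its default value, and whether intercept terms are penalized are not taken from the tutorial):

<pre>
import numpy as np

def weight_decayed_cost(theta, X, y, lam=1e-4):
    """Softmax negative log-likelihood plus an L2 weight-decay penalty.

    The extra (lam / 2) * sum(theta**2) term penalizes large parameters, so
    shifted parameter settings no longer give the same cost.
    """
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    data_term = -log_probs[np.arange(len(y)), y].mean()
    return data_term + (lam / 2.0) * np.sum(theta ** 2)
</pre>

Adding this term makes the Hessian of the cost positive definite (the data term's Hessian plus <math>\lambda I</math>), which is what resolves the numerical problems mentioned above.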

== 4 ==

== 5 ==
