Softmax Regression (Softmax回归)

Revision as of 05:42, 16 March 2013 (view source)

Kandeng (Talk | contribs)

(→Cost Function)

Revision as of 05:54, 16 March 2013 (view source)

Kandeng (Talk | contribs)

(→Properties of softmax regression parameterization)


From Ufldl

@@ Line 111: / Line 111: @@
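 When implementing softmax regression, we usually use a modified version of the cost function above. Specifically, we combine it with weight decay. We describe the motivation and the details next.
-== Properties of softmax regression parameterization ==
+== Properties of the softmax regression model parameterization ==
+Softmax regression has an unusual property: it has a "redundant" set of parameters. To illustrate, suppose we subtract a vector <math>\psi</math> from each parameter vector <math>\theta_j</math>, so that every <math>\theta_j</math> becomes <math>\theta_j - \psi</math> (<math>j=1, \ldots, k</math>). The hypothesis function then becomes:
-'''Original text''':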
-Softmax regression has an unusual property that it has a "redundant" set of parameters.  To explain what this means,
-suppose we take each of our parameter vectors <math>\theta_j</math>, and subtract some fixed vector <math>\psi</math>
-from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math>
-(for every <math>j=1, \ldots, k</math>).  Our hypothesis
-now estimates the class label probabilities as
 <math>
@@ Line 131: / Line 124: @@
 </math>
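-'''Translation''':
-Softmax regression has an unusual property: it has a "redundant" set of parameters. To illustrate, suppose we have already obtained the parameter vectors <math>\theta_j</math> and subtract a vector <math>\psi</math> from them, so that every <math>\theta_j</math> becomes <math>\theta_j - \psi</math> (for every <math>j=1, \ldots, k</math>). Our hypothesis then becomes: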
-<math>
-\begin{align}
-p(y^{(i)} = j | x^{(i)} ; \theta)
-&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}}  \\
-&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\
-&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
-\end{align}
-</math>
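-'''First review''':
-The softmax regression algorithm has an unusual property: it has a "redundant" set of parameters. To explain what this means, suppose we take each parameter vector <math>\theta_j</math> and subtract a fixed vector <math>\psi</math> from it, so that every <math>\theta_j</math> is replaced by <math>\theta_j - \psi</math> (for every <math>j=1, \ldots, k</math>). Our estimator then estimates the class label probabilities as: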
-<math>
-\begin{align}
-p(y^{(i)} = j | x^{(i)} ; \theta)
-&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}}  \\
-&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\
-&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
-\end{align}
-</math>
-'''Original text''':
-In other words, subtracting <math>\psi</math> from every <math>\theta_j</math>
-does not affect our hypothesis' predictions at all!  This shows that softmax
-regression's parameters are "redundant."  More formally, we say that our
-softmax model is '''overparameterized,''' meaning that for any hypothesis we might
-fit to the data, there are multiple parameter settings that give rise to exactly
-the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math>
-to the predictions.
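-'''Translation''':
-In other words, subtracting <math>\theta_j</math> from <math>\psi</math> does not affect the hypothesis function's predictions at all! This shows that softmax regression has redundant parameters. Put differently, our softmax model has more parameters than it needs: for any hypothesis function <math>h_\theta</math>, we can find multiple sets of parameter values.
-'''First review''':
-In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect our estimator's predictions at all! This shows that the parameters of softmax regression are "redundant." More formally, our softmax model is overparameterized, meaning that for any estimate we might fit to the data, there are multiple parameter sets that produce exactly the same estimator <math>h_\theta</math> mapping inputs <math>x</math> to predictions.
-'''Original text''':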
-Further, if the cost function <math>J(\theta)</math> is minimized by some
-setting of the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>,
-then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots,
-\theta_k - \psi)</math> for any value of <math>\psi</math>.  Thus, the
-minimizer of <math>J(\theta)</math> is not unique.  (Interestingly,
-<math>J(\theta)</math> is still convex, and thus gradient descent will
-not run into a local optima problems.  But the Hessian is singular/non-invertible,
-which causes a straightforward implementation of Newton's method to run into
-numerical problems.)
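-'''Translation''':
-Furthermore, if the loss function <math>J(\theta)</math> is minimized by <math>(\theta_1,\theta_2,\ldots,\theta_k)</math>, it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi, \ldots, \theta_k - \psi)</math>. Thus, the minimum of <math>J(\theta)</math> is not unique. (Interestingly, since <math>J(\theta)</math> is still a convex function, gradient descent will not get stuck in a local optimum. But the Hessian matrix is singular/non-invertible, which causes a Newton's-method implementation of softmax to run into numerical problems.)
-'''First review''':
-Furthermore, if the cost function <math>J(\theta)</math> attains its minimum at the parameter set <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, then it also attains its minimum at the parameter set <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots, \theta_k - \psi)</math>, where <math>\psi</math> can be any vector. Thus the solution that minimizes <math>J(\theta)</math> is not unique. (Interestingly, since <math>J(\theta)</math> is still a convex function, gradient descent will not run into the problem of getting stuck in a local optimum. But the Hessian matrix is singular/non-invertible, which directly causes a Newton's-method implementation of softmax to run into numerical problems.)
-'''Original text''':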
-Notice also that by setting <math>\psi = \theta_1</math>, one can always
-replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the vector of all
-0's), without affecting the hypothesis.  Thus, one could "eliminate" the vector
-of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for
-any single value of <math>j</math>), without harming the representational power
-of our hypothesis.  Indeed, rather than optimizing over the <math>k(n+1)</math>
-parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where
-<math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 =
-\vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math>
-remaining parameters, and this would work fine.
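-'''Translation''':
-We note that when <math>\psi = \theta_1</math>, we can transform <math>\theta_1</math> into <math>\theta_1 - \psi = \vec{0}</math>, and this transformation does not affect the model's results. Therefore, we can subtract away the parameter vector <math>\theta_1</math> (or any other <math>\theta_j</math>) without affecting the model's results. In fact, instead of optimizing the <math>k(n+1)</math> parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, we only need to optimize <math>(k-1)(n+1)</math> of them.
-'''First review''':
-We note that when <math>\psi = \theta_1</math>, we can always replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the all-zeros vector), and this does not affect the estimate. We can therefore "remove" the parameter vector <math>\theta_1</math> (or any single one of the other <math>\theta_j</math>) without harming the practical usefulness of our estimate. In fact, rather than optimizing all <math>k(n+1)</math> parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where <math>\theta_j \in \Re^{n+1}</math>), we might as well set <math>\theta_1 =
+In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect the hypothesis function's predictions at all! This shows that the softmax regression model above has redundant parameters. More formally, the softmax model is overparameterized: for any hypothesis function fitted to the data, there are multiple sets of parameter values that yield exactly the same hypothesis function <math>h_\theta</math>.
-\vec{0}</math> and then optimize only the remaining <math>(k-1)(n+1)</math> parameters; the algorithm still works fine.
-'''Original text''':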
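+Furthermore, if the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> are a minimizer of the cost function <math>J(\theta)</math>, then <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots,
+\theta_k - \psi)</math> is also a minimizer, where <math>\psi</math> can be any vector. Thus the solution that minimizes <math>J(\theta)</math> is not unique. (Interestingly, <math>J(\theta)</math> is still a convex function, so gradient descent does not run into local-optimum problems. But the Hessian matrix is singular/non-invertible, which directly causes optimization with Newton's method to run into numerical problems.)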
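
The invariance described in the added lines above can be checked numerically. The following is a minimal sketch, not part of the wiki page: the function names `softmax_probs` and `shift`, and the example values for `thetas`, `x`, and `psi`, are all made up for illustration. It evaluates the hypothesis before and after subtracting the same vector from every parameter vector and compares the resulting probabilities.

```python
import math

def softmax_probs(thetas, x):
    """Return p(y = j | x; theta) for j = 1..k.

    thetas is a list of k parameter vectors and x is one input vector
    (both names are hypothetical, chosen for this sketch).
    """
    scores = [sum(t_i * x_i for t_i, x_i in zip(theta, x)) for theta in thetas]
    # Subtracting the largest score from every score relies on the same
    # cancellation as the psi shift and keeps exp() from overflowing.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shift(thetas, psi):
    """Subtract the same vector psi from every theta_j."""
    return [[t_i - p_i for t_i, p_i in zip(theta, psi)] for theta in thetas]

# Arbitrary illustrative values: k = 3 classes, 3 features.
thetas = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7], [-0.2, 0.8, 0.1]]
x = [1.0, 2.0, -1.0]
psi = [3.0, -4.0, 0.25]

p_original = softmax_probs(thetas, x)
p_shifted = softmax_probs(shift(thetas, psi), x)

# The two probability vectors should agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(p_original, p_shifted))
print(max_gap < 1e-12)
```

Because every shifted parameter set produces the same probabilities, it also produces the same value of any cost built only from those probabilities, which is why the minimizer cannot be unique.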
-In practice, however, it is often cleaner and simpler to implement the version which keeps
-all the parameters <math>(\theta_1, \theta_2,\ldots, \theta_n)</math>, without
-arbitrarily setting one of them to zero.  But we will
-make one change to the cost function: Adding weight decay.  This will take care of
-the numerical problems associated with softmax regression's overparameterized representation.
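+Note that by setting <math>\psi = \theta_1</math>, we can always replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the all-zeros vector), and this transformation does not affect the hypothesis function. We can therefore eliminate the parameter vector <math>\theta_1</math> (or any single one of the other <math>\theta_j</math>) without reducing the representational power of the hypothesis function. In fact, rather than optimizing all <math>k(n+1)</math> parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where <math>\theta_j \in \Re^{n+1}</math>), we can set <math>\theta_1 =
+\vec{0}</math> and optimize only the remaining <math>(k-1)(n+1)</math> parameters; the algorithm still works fine.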
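
The reduction to <math>(k-1)(n+1)</math> free parameters can be sketched the same way. This is an illustrative example under assumed values, not code from the wiki page: `eliminate_first` is a hypothetical helper that applies the shift <math>\psi = \theta_1</math>, and the vectors are arbitrary.

```python
import math

def softmax_probs(thetas, x):
    """Return p(y = j | x; theta) for each of the k parameter vectors."""
    scores = [sum(t_i * x_i for t_i, x_i in zip(theta, x)) for theta in thetas]
    largest = max(scores)  # stabilizes exp() without changing the result
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def eliminate_first(thetas):
    """Apply the shift psi = theta_1, so the first vector becomes all zeros."""
    psi = thetas[0]
    return [[t_i - p_i for t_i, p_i in zip(theta, psi)] for theta in thetas]

# Arbitrary illustrative values: k = 3 classes, n + 1 = 3 entries per vector.
thetas = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7], [-0.2, 0.8, 0.1]]
x = [1.0, 2.0, -1.0]

reduced = eliminate_first(thetas)

# Only the last k - 1 vectors still carry free parameters.
k = len(thetas)
entries_per_vector = len(thetas[0])      # this is n + 1
free_params = (k - 1) * entries_per_vector

gap = max(abs(a - b) for a, b in
          zip(softmax_probs(thetas, x), softmax_probs(reduced, x)))
print(reduced[0], free_params, gap < 1e-12)
```

With two classes this reduced form has a single remaining parameter vector, which is the familiar logistic regression parameterization.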
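-'''Translation''':
-In practice, it is often simpler and cleaner to implement a model that keeps all the parameters <math>(\theta_1, \theta_2,\ldots, \theta_n)</math>. But in that case we need to make one change to the loss function: adding weight decay. Weight decay resolves the parameter redundancy problem of softmax regression.
-'''First review''':
-In practice, it is often simpler and cleaner to implement a model that keeps all the parameters <math>(\theta_1, \theta_2,\ldots, \theta_n)</math>, without arbitrarily setting one parameter vector to 0. But we need to make one change to the cost function: adding weight decay. This will help resolve the computational problems caused by the redundant parameterization of the softmax regression algorithm.
+In practice, to keep the implementation simple and clean, one usually keeps all the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> rather than arbitrarily setting one of them to 0. But in that case we need to make one change to the cost function: adding weight decay. Weight decay resolves the numerical problems caused by the parameter redundancy of softmax regression.
 == Weight Decay ==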