class LogisticGradient extends Gradient
Compute gradient and loss for a multinomial logistic loss function, as used in multiclass classification (it is also used in binary logistic regression).
In The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition
by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, which can be downloaded from
http://statweb.stanford.edu/~tibs/ElemStatLearn/, Eq. (4.17) on page 119 gives the formula of
the multinomial logistic regression model. A simple calculation shows that
$$
P(y=0|x, w) = 1 / (1 + \sum_i^{K-1} \exp(x w_i))\\
P(y=1|x, w) = \exp(x w_1) / (1 + \sum_i^{K-1} \exp(x w_i))\\
...\\
P(y=K-1|x, w) = \exp(x w_{K-1}) / (1 + \sum_i^{K-1} \exp(x w_i))
$$
for a K-class classification problem.
The model weights \(w = (w_1, w_2, ..., w_{K-1})^T\) become a matrix of dimension (K-1) * (N+1) if the intercepts are added. If the intercepts are not added, the dimension is (K-1) * N.
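As a rough illustration of the probabilities above, the following sketch evaluates them with plain Scala arrays. The object and method names are hypothetical, and the weight matrix is assumed to be given as K-1 rows, one per non-pivot class; this is not the MLlib implementation. It uses the direct formula, so it can overflow for large margins.

```scala
object MultinomialProbabilities {
  /** margins(i) = x . w(i), one value per row of the (K-1)-row weight matrix. */
  def margins(x: Array[Double], w: Array[Array[Double]]): Array[Double] =
    w.map(row => row.zip(x).map { case (wi, xi) => wi * xi }.sum)

  /** Returns P(y = 0 | x, w), ..., P(y = K-1 | x, w), with class 0 as the pivot class. */
  def probabilities(x: Array[Double], w: Array[Array[Double]]): Array[Double] = {
    val expMargins = margins(x, w).map(math.exp)
    val denom = 1.0 + expMargins.sum
    // The pivot class contributes exp(0) = 1 to the numerator.
    (1.0 +: expMargins).map(_ / denom)
  }
}
```

With all weights set to zero, every class receives the same probability 1 / K.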
As a result, the loss function for a single instance of data can be written as
$$
\begin{align}
l(w, x) &= -\log P(y|x, w) = -\alpha(y) \log P(y=0|x, w) - (1-\alpha(y)) \log P(y|x, w) \\
&= \log(1 + \sum_i^{K-1}\exp(x w_i)) - (1-\alpha(y)) x w_{y-1} \\
&= \log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1}
\end{align}
$$
where $\alpha(i) = 1$ if \(i == 0\), and $\alpha(i) = 0$ if \(i \ne 0\), so that the margin of the true class is subtracted only for a nonzero label, and \(margins_i = x w_i\). In these formulas the row index $i$ of $w$ and $margins$ counts from zero (class $y \ge 1$ uses index $y-1$), unlike the one-based $w_1, ..., w_{K-1}$ above.
For optimization, we have to calculate the first derivative of the loss function, and a simple calculation shows that
$$
\begin{align}
\frac{\partial l(w, x)}{\partial w_{ij}} &= (\exp(x w_i) / (1 + \sum_k^{K-1} \exp(x w_k)) - (1-\alpha(y))\delta_{y, i+1}) * x_j \\
&= multiplier_i * x_j
\end{align}
$$
where $\delta_{i, j} = 1$ if \(i == j\), $\delta_{i, j} = 0$ if \(i != j\), and $multiplier_i = \exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_k)) - (1-\alpha(y))\delta_{y, i+1}$.
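The loss and gradient formulas above can be sketched directly in plain Scala. The names below are hypothetical and the margins are assumed to be precomputed, with index i (counting from zero) belonging to class i + 1. The term for the true class is subtracted only when the label is nonzero. This direct version is for illustration only and is not protected against overflow.

```scala
object NaiveLogisticLoss {
  /** log(1 + sum_i exp(margins_i)), minus the margin of the true class when label != 0. */
  def loss(margins: Array[Double], label: Int): Double = {
    val logNorm = math.log1p(margins.map(math.exp).sum)
    if (label > 0) logNorm - margins(label - 1) else logNorm
  }

  /** multiplier_i = exp(margins_i) / (1 + sum_k exp(margins_k)) - [label != 0 and label == i + 1]. */
  def multipliers(margins: Array[Double], label: Int): Array[Double] = {
    val denom = 1.0 + margins.map(math.exp).sum
    margins.indices.map { i =>
      val indicator = if (label != 0 && label == i + 1) 1.0 else 0.0
      math.exp(margins(i)) / denom - indicator
    }.toArray
  }

  /** Gradient entry for weight w_ij is multiplier_i * x_j. */
  def gradient(x: Array[Double], margins: Array[Double], label: Int): Array[Array[Double]] =
    multipliers(margins, label).map(m => x.map(_ * m))
}
```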
If any of the margins is larger than 709.78, the numerical computation of the multiplier and the loss
function will suffer from arithmetic overflow. This happens when the data contain outliers
that lie far from the hyperplane, and it causes training to fail once
infinity / infinity is introduced. Note that this is only a concern when max(margins) > 0.
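The overflow described above is easy to reproduce with double-precision arithmetic. The small sketch below (object and value names are hypothetical) shows that exponentiating a margin just above the threshold yields infinity, and that the resulting ratio is NaN.

```scala
object OverflowDemo {
  // Just below the threshold: the result is still a finite Double.
  val safe: Double = math.exp(709.0)
  // Just above the threshold: the result overflows to positive infinity.
  val overflowed: Double = math.exp(710.0)
  // infinity / infinity, as it would appear in the multiplier, is NaN.
  val ratio: Double = overflowed / (1.0 + overflowed)
}
```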
Fortunately, when max(margins) = maxMargin > 0, the loss function and the multiplier
can easily be rewritten into the following equivalent, numerically stable formula.
$$
\begin{align}
l(w, x) &= \log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1} \\
&= \log(\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin)) + maxMargin - (1-\alpha(y)) margins_{y-1} \\
&= \log(1 + sum) + maxMargin - (1-\alpha(y)) margins_{y-1}
\end{align}
$$
where sum = $\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin) - 1$.
Note that each exponent $(margins_i - maxMargin)$ is at most zero, and $-maxMargin$ is negative; as a result, overflow cannot happen with this formula.
A similar trick can be applied to the multiplier:
$$
\begin{align}
multiplier_i &= \exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_k)) - (1-\alpha(y))\delta_{y, i+1} \\
&= \exp(margins_i - maxMargin) / (1 + sum) - (1-\alpha(y))\delta_{y, i+1}
\end{align}
$$
Here $(1-\alpha(y))\delta_{y, i+1}$ equals 1 only when the label $y$ is nonzero and $y = i+1$. Each exponent is again at most zero, so overflow is not a concern.
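A minimal sketch of the stabilized computation follows, assuming precomputed margins and an integer label in 0 until K. The names are hypothetical and this is an illustration of the rewriting above, not the library code. The value `onePlusSum` plays the role of 1 + sum, and the shift is applied only when maxMargin is positive.

```scala
object StableLogisticLoss {
  /** Returns (loss, multipliers) for one instance; margins(i) belongs to class i + 1. */
  def lossAndMultipliers(margins: Array[Double], label: Int): (Double, Array[Double]) = {
    val maxMargin = margins.max
    // Shift only when the largest margin is positive; otherwise exp cannot overflow.
    val shift = if (maxMargin > 0) maxMargin else 0.0
    // 1 + sum = exp(-shift) + sum_i exp(margins_i - shift); every exponent is <= 0.
    val onePlusSum = math.exp(-shift) + margins.map(m => math.exp(m - shift)).sum
    // The true-class margin is subtracted only for a nonzero label.
    val marginY = if (label > 0) margins(label - 1) else 0.0
    val loss = math.log(onePlusSum) + shift - marginY
    val multipliers = margins.indices.map { i =>
      val indicator = if (label != 0 && label == i + 1) 1.0 else 0.0
      math.exp(margins(i) - shift) / onePlusSum - indicator
    }.toArray
    (loss, multipliers)
  }
}
```

For a margin such as 1000, where the direct formula would produce infinity / infinity, this version returns finite values.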
For the detailed mathematical derivation, see the reference at http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297
 Source
 Gradient.scala
Instance Constructors
Value Members

final def !=(arg0: Any): Boolean
 Definition Classes
 AnyRef → Any

final def ##(): Int
 Definition Classes
 AnyRef → Any

final def ==(arg0: Any): Boolean
 Definition Classes
 AnyRef → Any

final def asInstanceOf[T0]: T0
 Definition Classes
 Any

def clone(): AnyRef
 Attributes
 protected[lang]
 Definition Classes
 AnyRef
 Annotations
 @throws( ... ) @native()

def compute(data: Vector, label: Double, weights: Vector, cumGradient: Vector): Double
Compute the gradient and loss given the features of a single data point, add the gradient to a provided vector to avoid creating new objects, and return loss.
 data
features for one data point
 label
label for this data point
 weights
weights/coefficients corresponding to features
 cumGradient
the computed gradient will be added to this vector
 returns
loss
 Definition Classes
 LogisticGradient → Gradient
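The in-place accumulation described for this overload can be illustrated with a simplified sketch. It assumes the binary case (K = 2) and uses plain arrays as stand-ins for `Vector`; the object name and body are hypothetical and only mirror the shape of the signature, not the actual MLlib code.

```scala
object CumGradientSketch {
  /** Adds the gradient of one data point to cumGradient in place and returns the loss. */
  def compute(data: Array[Double], label: Double, weights: Array[Double],
              cumGradient: Array[Double]): Double = {
    require(data.length == weights.length && data.length == cumGradient.length)
    var margin = 0.0
    var j = 0
    while (j < data.length) { margin += data(j) * weights(j); j += 1 }
    // multiplier = exp(margin) / (1 + exp(margin)) - label, for a label of 0.0 or 1.0
    val multiplier = 1.0 / (1.0 + math.exp(-margin)) - label
    j = 0
    while (j < data.length) { cumGradient(j) += multiplier * data(j); j += 1 }
    // loss = log(1 + exp(margin)) - label * margin, evaluated without overflow
    val logNorm =
      if (margin > 0) margin + math.log1p(math.exp(-margin))
      else math.log1p(math.exp(margin))
    logNorm - label * margin
  }
}
```

Calling the function once per data point with the same `cumGradient` buffer sums the per-point gradients without allocating a new array on each call, which is the purpose of this overload.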

def compute(data: Vector, label: Double, weights: Vector): (Vector, Double)
Compute the gradient and loss given the features of a single data point.
 data
features for one data point
 label
label for this data point
 weights
weights/coefficients corresponding to features
 returns
(gradient: Vector, loss: Double)
 Definition Classes
 Gradient

final def eq(arg0: AnyRef): Boolean
 Definition Classes
 AnyRef

def equals(arg0: Any): Boolean
 Definition Classes
 AnyRef → Any

def finalize(): Unit
 Attributes
 protected[lang]
 Definition Classes
 AnyRef
 Annotations
 @throws( classOf[java.lang.Throwable] )

final def getClass(): Class[_]
 Definition Classes
 AnyRef → Any
 Annotations
 @native()

def hashCode(): Int
 Definition Classes
 AnyRef → Any
 Annotations
 @native()

final def isInstanceOf[T0]: Boolean
 Definition Classes
 Any

final def ne(arg0: AnyRef): Boolean
 Definition Classes
 AnyRef

final def notify(): Unit
 Definition Classes
 AnyRef
 Annotations
 @native()

final def notifyAll(): Unit
 Definition Classes
 AnyRef
 Annotations
 @native()

final def synchronized[T0](arg0: ⇒ T0): T0
 Definition Classes
 AnyRef

def toString(): String
 Definition Classes
 AnyRef → Any

final def wait(): Unit
 Definition Classes
 AnyRef
 Annotations
 @throws( ... )

final def wait(arg0: Long, arg1: Int): Unit
 Definition Classes
 AnyRef
 Annotations
 @throws( ... )

final def wait(arg0: Long): Unit
 Definition Classes
 AnyRef
 Annotations
 @throws( ... ) @native()