org.apache.spark.ml.Predictor<FeaturesType,Learner,M>

org.apache.spark.ml.regression.Regressor<Vector,LinearRegression,LinearRegressionModel>

org.apache.spark.ml.regression.LinearRegression

All Implemented Interfaces: Serializable, org.apache.spark.internal.Logging, Params, HasAggregationDepth, HasElasticNetParam, HasFeaturesCol, HasFitIntercept, HasLabelCol, HasLoss, HasMaxBlockSizeInMB, HasMaxIter, HasPredictionCol, HasRegParam, HasSolver, HasStandardization, HasTol, HasWeightCol, PredictorParams, LinearRegressionParams, DefaultParamsWritable, Identifiable, MLWritable

public class LinearRegression extends Regressor<Vector,LinearRegression,LinearRegressionModel> implements LinearRegressionParams, DefaultParamsWritable, org.apache.spark.internal.Logging

Linear regression.

The learning objective is to minimize the specified loss function, with regularization. This supports two kinds of loss:

- squaredError (a.k.a. squared loss)
- huber (a hybrid of squared error for relatively small errors and absolute error for relatively large ones, and we estimate the scale parameter from training data)

This supports multiple types of regularization:

- none (a.k.a. ordinary least squares)
- L2 (ridge regression)
- L1 (Lasso)
- L2 + L1 (elastic net)

The squared error objective function is:

$$ \begin{align} \min_{w}\frac{1}{2n}{\sum_{i=1}^n(X_{i}w - y_{i})^{2} + \lambda\left[\frac{1-\alpha}{2}{||w||_{2}}^{2} + \alpha{||w||_{1}}\right]} \end{align} $$
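To make the terms of the squared error objective concrete, the following is a minimal sketch in plain Java that evaluates it for a fixed coefficient vector. It is not Spark's implementation: the class `SquaredErrorObjective` and its method are hypothetical names introduced here, the intercept is omitted to match the formula, and `lambda` and `alpha` stand for `regParam` and `elasticNetParam`.

```java
/** Illustrative only: evaluates the elastic-net squared error objective for a fixed w. */
final class SquaredErrorObjective {

    /**
     * @param x      n rows of d features
     * @param y      n labels
     * @param w      d coefficients (no intercept, to match the formula)
     * @param lambda regularization strength (regParam)
     * @param alpha  elastic-net mixing parameter in [0, 1] (elasticNetParam)
     */
    static double value(double[][] x, double[] y, double[] w, double lambda, double alpha) {
        int n = x.length;
        double sumSq = 0.0;
        for (int i = 0; i < n; i++) {
            double pred = 0.0;
            for (int j = 0; j < w.length; j++) {
                pred += x[i][j] * w[j];
            }
            double residual = pred - y[i];
            sumSq += residual * residual;
        }
        double l1 = 0.0;   // ||w||_1
        double l2sq = 0.0; // ||w||_2^2
        for (double wj : w) {
            l1 += Math.abs(wj);
            l2sq += wj * wj;
        }
        return sumSq / (2.0 * n) + lambda * ((1.0 - alpha) / 2.0 * l2sq + alpha * l1);
    }

    public static void main(String[] args) {
        double[][] x = {{1, 0}, {0, 1}};
        double[] y = {1, 2};
        double[] w = {1, 1};
        // Data term 0.25 plus penalty 0.75
        System.out.println(value(x, y, w, 0.5, 0.5));
    }
}
```

Setting `alpha` to 0 leaves only the ridge term and setting it to 1 leaves only the Lasso term, which matches the list of regularization types above.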

The huber objective function is:

$$ \begin{align} \min_{w, \sigma}\frac{1}{2n}{\sum_{i=1}^n\left(\sigma + H_m\left(\frac{X_{i}w - y_{i}}{\sigma}\right)\sigma\right) + \frac{1}{2}\lambda {||w||_2}^2} \end{align} $$

where

$$ \begin{align} H_m(z) = \begin{cases} z^2, & \text {if } |z| < \epsilon, \\ 2\epsilon|z| - \epsilon^2, & \text{otherwise} \end{cases} \end{align} $$
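The piecewise definition of H_m and the huber objective can be sketched in plain Java as follows. This is an illustration under stated assumptions rather than Spark's code: `HuberSketch` is a hypothetical class, the residuals X_i w - y_i are passed in precomputed, and both `w` and `sigma` are held fixed instead of being optimized.

```java
/** Illustrative only: the huber function H_m and the objective for fixed w and sigma. */
final class HuberSketch {

    /** H_m(z): quadratic when |z| < epsilon, linear otherwise. */
    static double h(double z, double epsilon) {
        double a = Math.abs(z);
        return a < epsilon ? z * z : 2.0 * epsilon * a - epsilon * epsilon;
    }

    /**
     * @param residuals X_i w - y_i for each instance
     * @param w         coefficients, used only for the L2 penalty
     * @param sigma     scale parameter (must be > 0)
     * @param epsilon   shape parameter
     * @param lambda    regularization strength (regParam)
     */
    static double objective(double[] residuals, double[] w,
                            double sigma, double epsilon, double lambda) {
        double sum = 0.0;
        for (double r : residuals) {
            sum += sigma + h(r / sigma, epsilon) * sigma;
        }
        double l2sq = 0.0;
        for (double wj : w) {
            l2sq += wj * wj;
        }
        return sum / (2.0 * residuals.length) + 0.5 * lambda * l2sq;
    }

    public static void main(String[] args) {
        // Small scaled residual: quadratic branch. Large one: linear branch.
        System.out.println(h(1.0, 1.35));
        System.out.println(h(2.0, 1.35));
    }
}
```

The two branches agree at |z| = epsilon (both give epsilon^2), so the loss is continuous where it switches from quadratic to linear growth.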

Since 3.1.0, it supports stacking instances into blocks and using GEMV for better performance. If param maxBlockSizeInMB is left at its default of 0.0, a block size of 1.0 MB is used.

Note: Fitting with huber loss only supports none and L2 regularization.

See Also:

Serialized Form

Nested Class Summary

Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| LinearRegression() | |
| LinearRegression(String uid) | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| final IntParam | aggregationDepth() | Param for suggested depth for treeAggregate (>= 2). |
| LinearRegression | copy(ParamMap extra) | Creates a copy of this instance with the same UID and some extra params. |
| final DoubleParam | elasticNetParam() | Param for the ElasticNet mixing parameter, in range [0, 1]. |
| final DoubleParam | epsilon() | The shape parameter to control the amount of robustness. |
| long | estimateModelSize(Dataset<?> dataset) | |
| final BooleanParam | fitIntercept() | Param for whether to fit an intercept term. |
| static LinearRegression | load(String path) | |
| final Param<String> | loss() | The loss function to be optimized. |
| static int | MAX_FEATURES_FOR_NORMAL_SOLVER() | When using LinearRegression.solver == "normal", the solver must limit the number of features to at most this number. |
| final DoubleParam | maxBlockSizeInMB() | Param for Maximum memory in MB for stacking input data into blocks. |
| final IntParam | maxIter() | Param for maximum number of iterations (>= 0). |
| static MLReader<T> | read() | |
| final DoubleParam | regParam() | Param for regularization parameter (>= 0). |
| LinearRegression | setAggregationDepth(int value) | Suggested depth for treeAggregate (greater than or equal to 2). |
| LinearRegression | setElasticNetParam(double value) | Set the ElasticNet mixing parameter. |
| LinearRegression | setEpsilon(double value) | Sets the value of param epsilon(). |
| LinearRegression | setFitIntercept(boolean value) | Set if we should fit the intercept. |
| LinearRegression | setLoss(String value) | Sets the value of param loss(). |
| LinearRegression | setMaxBlockSizeInMB(double value) | Sets the value of param maxBlockSizeInMB(). |
| LinearRegression | setMaxIter(int value) | Set the maximum number of iterations. |
| LinearRegression | setRegParam(double value) | Set the regularization parameter. |
| LinearRegression | setSolver(String value) | Set the solver algorithm used for optimization. |
| LinearRegression | setStandardization(boolean value) | Whether to standardize the training features before fitting the model. |
| LinearRegression | setTol(double value) | Set the convergence tolerance of iterations. |
| LinearRegression | setWeightCol(String value) | Whether to over-/under-sample training instances according to the given weights in weightCol. |
| final Param<String> | solver() | The solver algorithm for optimization. |
| final BooleanParam | standardization() | Param for whether to standardize the training features before fitting the model. |
| final DoubleParam | tol() | Param for the convergence tolerance for iterative algorithms (>= 0). |
| String | uid() | An immutable unique ID for the object and its derivatives. |
| final Param<String> | weightCol() | Param for weight column name. |

Methods inherited from class org.apache.spark.ml.Predictor
featuresCol, fit, labelCol, predictionCol, setFeaturesCol, setLabelCol, setPredictionCol, transformSchema

Methods inherited from class org.apache.spark.ml.Estimator
fit, fit, fit, fit

Methods inherited from class org.apache.spark.ml.PipelineStage
params

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.ml.util.DefaultParamsWritable
write

Methods inherited from interface org.apache.spark.ml.param.shared.HasAggregationDepth
getAggregationDepth

Methods inherited from interface org.apache.spark.ml.param.shared.HasElasticNetParam
getElasticNetParam

Methods inherited from interface org.apache.spark.ml.param.shared.HasFeaturesCol
featuresCol, getFeaturesCol

Methods inherited from interface org.apache.spark.ml.param.shared.HasFitIntercept
getFitIntercept

Methods inherited from interface org.apache.spark.ml.param.shared.HasLabelCol
getLabelCol, labelCol

Methods inherited from interface org.apache.spark.ml.param.shared.HasLoss
getLoss

Methods inherited from interface org.apache.spark.ml.param.shared.HasMaxBlockSizeInMB
getMaxBlockSizeInMB

Methods inherited from interface org.apache.spark.ml.param.shared.HasMaxIter
getMaxIter

Methods inherited from interface org.apache.spark.ml.param.shared.HasPredictionCol
getPredictionCol, predictionCol

Methods inherited from interface org.apache.spark.ml.param.shared.HasRegParam
getRegParam

Methods inherited from interface org.apache.spark.ml.param.shared.HasSolver
getSolver

Methods inherited from interface org.apache.spark.ml.param.shared.HasStandardization
getStandardization

Methods inherited from interface org.apache.spark.ml.param.shared.HasTol
getTol

Methods inherited from interface org.apache.spark.ml.param.shared.HasWeightCol
getWeightCol

Methods inherited from interface org.apache.spark.ml.util.Identifiable
toString

Methods inherited from interface org.apache.spark.ml.regression.LinearRegressionParams
getEpsilon, validateAndTransformSchema

Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logBasedOnLevel, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, MDC, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext

Methods inherited from interface org.apache.spark.ml.util.MLWritable
save

Methods inherited from interface org.apache.spark.ml.param.Params
clear, copyValues, defaultCopy, defaultParamMap, estimateMatadataSize, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn

Constructor Details
- LinearRegression
  
  public LinearRegression(String uid)
- LinearRegression
  
  public LinearRegression()
Method Details
- load
  
  public static LinearRegression load(String path)
- MAX_FEATURES_FOR_NORMAL_SOLVER
  
  public static int MAX_FEATURES_FOR_NORMAL_SOLVER()
  
  When using LinearRegression.solver == "normal", the solver must limit the number of features to at most this number. The entire covariance matrix X^T X will be collected to the driver. This limit helps prevent memory overflow errors.
  
  Returns:
  
  (undocumented)
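To see why a feature limit protects the driver, here is a rough back-of-the-envelope sketch in plain Java. It assumes the covariance matrix is held as a dense d x d array of 8-byte doubles; the representation Spark actually collects may be more compact, and `GramMatrixFootprint` is a hypothetical helper, not part of the API.

```java
/** Illustrative only: memory needed for a dense d x d matrix of doubles. */
final class GramMatrixFootprint {

    /** Bytes for d * d entries at 8 bytes each (assumes dense storage). */
    static long denseBytes(int numFeatures) {
        return 8L * numFeatures * numFeatures;
    }

    public static void main(String[] args) {
        // Example feature counts chosen for illustration only.
        for (int d : new int[] {100, 1_000, 10_000}) {
            System.out.println(d + " features -> " + denseBytes(d) + " bytes");
        }
    }
}
```

The footprint grows quadratically with the number of features, so a tenfold increase in features costs a hundredfold increase in driver memory.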
- read
  
  public static MLReader<T> read()
- solver
  
  public final Param<String> solver()
  
  Description copied from interface: LinearRegressionParams
  
  The solver algorithm for optimization. Supported options: "l-bfgs", "normal" and "auto". Default: "auto"
  
  Specified by:
  
  solver in interface HasSolver
  
  Specified by:
  
  solver in interface LinearRegressionParams
  
  Returns:
  
  (undocumented)
- loss
  
  public final Param<String> loss()
  
  Description copied from interface: LinearRegressionParams
  
  The loss function to be optimized. Supported options: "squaredError" and "huber". Default: "squaredError"
  
  Specified by:
  
  loss in interface HasLoss
  
  Specified by:
  
  loss in interface LinearRegressionParams
  
  Returns:
  
  (undocumented)
- epsilon
  
  public final DoubleParam epsilon()
  
  Description copied from interface: LinearRegressionParams
  
  The shape parameter to control the amount of robustness. Must be > 1.0. At larger values of epsilon, the huber criterion becomes more similar to least squares regression; for small values of epsilon, the criterion is more similar to L1 regression. Default is 1.35 to get as much robustness as possible while retaining 95% statistical efficiency for normally distributed data. It matches sklearn's HuberRegressor and corresponds to "M" in "A robust hybrid of lasso and ridge regression". Only valid when "loss" is "huber".
  
  Specified by:
  
  epsilon in interface LinearRegressionParams
  
  Returns:
  
  (undocumented)
- maxBlockSizeInMB
  
  public final DoubleParam maxBlockSizeInMB()
  
  Description copied from interface: HasMaxBlockSizeInMB
  
  Param for maximum memory in MB for stacking input data into blocks. Data is stacked within partitions. If this value exceeds the remaining data size in a partition, it is adjusted to that data size. The default of 0.0 means an optimal value is chosen, depending on the specific algorithm. Must be >= 0.
  
  Specified by:
  
  maxBlockSizeInMB in interface HasMaxBlockSizeInMB
  
  Returns:
  
  (undocumented)
- aggregationDepth
  
  public final IntParam aggregationDepth()
  
  Description copied from interface: HasAggregationDepth
  
  Param for suggested depth for treeAggregate (>= 2).
  
  Specified by:
  
  aggregationDepth in interface HasAggregationDepth
  
  Returns:
  
  (undocumented)
- weightCol
  
  public final Param<String> weightCol()
  
  Description copied from interface: HasWeightCol
  
  Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.
  
  Specified by:
  
  weightCol in interface HasWeightCol
  
  Returns:
  
  (undocumented)
- standardization
  
  public final BooleanParam standardization()
  
  Description copied from interface: HasStandardization
  
  Param for whether to standardize the training features before fitting the model.
  
  Specified by:
  
  standardization in interface HasStandardization
  
  Returns:
  
  (undocumented)
- fitIntercept
  
  public final BooleanParam fitIntercept()
  
  Description copied from interface: HasFitIntercept
  
  Param for whether to fit an intercept term.
  
  Specified by:
  
  fitIntercept in interface HasFitIntercept
  
  Returns:
  
  (undocumented)
- tol
  
  public final DoubleParam tol()
  
  Description copied from interface: HasTol
  
  Param for the convergence tolerance for iterative algorithms (>= 0).
  
  Specified by:
  
  tol in interface HasTol
  
  Returns:
  
  (undocumented)
- maxIter
  
  public final IntParam maxIter()
  
  Description copied from interface: HasMaxIter
  
  Param for maximum number of iterations (>= 0).
  
  Specified by:
  
  maxIter in interface HasMaxIter
  
  Returns:
  
  (undocumented)
- elasticNetParam
  
  public final DoubleParam elasticNetParam()
  
  Description copied from interface: HasElasticNetParam
  
  Param for the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
  
  Specified by:
  
  elasticNetParam in interface HasElasticNetParam
  
  Returns:
  
  (undocumented)
- regParam
  
  public final DoubleParam regParam()
  
  Description copied from interface: HasRegParam
  
  Param for regularization parameter (>= 0).
  
  Specified by:
  
  regParam in interface HasRegParam
  
  Returns:
  
  (undocumented)
- uid
  
  public String uid()
  
  Description copied from interface: Identifiable
  
  An immutable unique ID for the object and its derivatives.
  
  Specified by:
  
  uid in interface Identifiable
  
  Returns:
  
  (undocumented)
- setRegParam
  
  public LinearRegression setRegParam(double value)
  
  Set the regularization parameter. Default is 0.0.
  
  Parameters:
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
- setFitIntercept
  
  public LinearRegression setFitIntercept(boolean value)
  
  Set if we should fit the intercept. Default is true.
  
  Parameters:
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
- setStandardization
  
  public LinearRegression setStandardization(boolean value)
  
  Whether to standardize the training features before fitting the model. The coefficients of models will always be returned on the original scale, so standardization is transparent to users. Default is true.
  
  Parameters:
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Note:
  
  With or without standardization, the models should always converge to the same solution when no regularization is applied. In R's GLMNET package, the default behavior is true as well.
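The statement that coefficients are returned on the original scale can be illustrated with a small sketch in plain Java. It assumes features are only divided by their standard deviation (no centering) and that a zero standard deviation maps to a zero coefficient; both are simplifying assumptions made for this example, and `StandardizationSketch` is a hypothetical class.

```java
/** Illustrative only: mapping coefficients from scaled features back to raw features. */
final class StandardizationSketch {

    /** Divide each feature by its standard deviation (constant features map to 0). */
    static double[] scale(double[] features, double[] std) {
        double[] out = new double[features.length];
        for (int j = 0; j < features.length; j++) {
            out[j] = std[j] == 0.0 ? 0.0 : features[j] / std[j];
        }
        return out;
    }

    /** Coefficients fitted on scaled features, expressed on the original scale. */
    static double[] toOriginalScale(double[] scaledCoef, double[] std) {
        double[] out = new double[scaledCoef.length];
        for (int j = 0; j < scaledCoef.length; j++) {
            out[j] = std[j] == 0.0 ? 0.0 : scaledCoef[j] / std[j];
        }
        return out;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            sum += a[j] * b[j];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] std = {2.0, 5.0};
        double[] x = {2.0, 10.0};
        double[] scaledCoef = {3.0, 4.0};
        // Both lines compute the same prediction for the same instance.
        System.out.println(dot(scale(x, std), scaledCoef));
        System.out.println(dot(x, toOriginalScale(scaledCoef, std)));
    }
}
```

Because the two predictions agree, a caller can apply the returned coefficients directly to unscaled features.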
- setElasticNetParam
  
  public LinearRegression setElasticNetParam(double value)
  
  Set the ElasticNet mixing parameter. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. For alpha in (0,1), the penalty is a combination of L1 and L2. Default is 0.0, which is an L2 penalty.
  Note: Fitting with huber loss only supports none and L2 regularization, so an exception is thrown if this param is set to a non-zero value.
  
  Parameters:
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
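How the mixing parameter splits the regularization strength between the L1 and L2 terms can be sketched in plain Java as below. The split follows directly from the squared error objective (the L2 weight multiplies half the squared L2 norm). The `validate` method mirrors the note about huber loss; the class, method names, and exception type are hypothetical choices for this sketch, not Spark's API.

```java
/** Illustrative only: splitting regParam into L1 and L2 parts via elasticNetParam. */
final class ElasticNetSplit {

    /** Weight on ||w||_1: alpha * lambda. */
    static double l1Weight(double regParam, double elasticNetParam) {
        return elasticNetParam * regParam;
    }

    /** Weight on (1/2) * ||w||_2^2: (1 - alpha) * lambda. */
    static double l2Weight(double regParam, double elasticNetParam) {
        return (1.0 - elasticNetParam) * regParam;
    }

    /** Rejects combinations the documentation describes as unsupported. */
    static void validate(String loss, double elasticNetParam) {
        if (elasticNetParam < 0.0 || elasticNetParam > 1.0) {
            throw new IllegalArgumentException("elasticNetParam must be in [0, 1]");
        }
        if ("huber".equals(loss) && elasticNetParam != 0.0) {
            throw new IllegalArgumentException(
                "huber loss only supports none and L2 regularization");
        }
    }

    public static void main(String[] args) {
        validate("squaredError", 0.8);
        System.out.println("L1 weight: " + l1Weight(0.3, 0.8));
        System.out.println("L2 weight: " + l2Weight(0.3, 0.8));
    }
}
```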
- setMaxIter
  
  public LinearRegression setMaxIter(int value)
  
  Set the maximum number of iterations. Default is 100.
  
  Parameters:
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
- setTol
  
  public LinearRegression setTol(double value)
  
  Set the convergence tolerance of iterations. A smaller value leads to higher accuracy at the cost of more iterations. Default is 1E-6.
  
  Parameters:
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
- setWeightCol
  
  public LinearRegression setWeightCol(String value)
  
  Whether to over-/under-sample training instances according to the given weights in weightCol. If not set or empty, all instances are treated equally (weight 1.0). Default is not set.
  
  Parameters:
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
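The idea that weights over- or under-sample instances can be shown with a small sketch in plain Java: an instance with weight 2 contributes like two copies of the same instance. The normalization by the sum of weights is an assumption made for this example, and `WeightedLossSketch` is a hypothetical class rather than Spark's loss aggregator.

```java
/** Illustrative only: instance weights in a squared error average. */
final class WeightedLossSketch {

    /** Sum of w_i * r_i^2 divided by the sum of w_i. */
    static double weightedMeanSquaredError(double[] residuals, double[] weights) {
        double weightedSum = 0.0;
        double weightTotal = 0.0;
        for (int i = 0; i < residuals.length; i++) {
            weightedSum += weights[i] * residuals[i] * residuals[i];
            weightTotal += weights[i];
        }
        return weightedSum / weightTotal;
    }

    public static void main(String[] args) {
        // Weight 2 on the first instance ...
        System.out.println(weightedMeanSquaredError(
            new double[] {1.0, 2.0}, new double[] {2.0, 1.0}));
        // ... behaves like listing that instance twice with weight 1.
        System.out.println(weightedMeanSquaredError(
            new double[] {1.0, 1.0, 2.0}, new double[] {1.0, 1.0, 1.0}));
    }
}
```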
- setSolver
  
  public LinearRegression setSolver(String value)
  
  Set the solver algorithm used for optimization. In case of linear regression, this can be "l-bfgs", "normal" or "auto".

  - "l-bfgs" denotes Limited-memory BFGS, which is a limited-memory quasi-Newton optimization method.
  - "normal" denotes using Normal Equation as an analytical solution to the linear regression problem. This solver is limited to LinearRegression.MAX_FEATURES_FOR_NORMAL_SOLVER.
  - "auto" (default) means that the solver algorithm is selected automatically. The Normal Equations solver will be used when possible, but this will automatically fall back to iterative optimization methods when needed.

  Note: Fitting with huber loss doesn't support the normal solver, so an exception is thrown if this param is set to "normal".
  
  Parameters:
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
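The selection rules described for setSolver can be sketched as a small decision function in plain Java. This is a simplified reading of the description, not Spark's actual logic: it assumes "auto" picks the normal solver only for squared error loss with a feature count within the limit, and the real selection may take further conditions (such as the regularization settings) into account. `SolverChoiceSketch` and the `maxFeaturesForNormal` parameter are hypothetical.

```java
/** Illustrative only: a simplified solver selection based on the documented rules. */
final class SolverChoiceSketch {

    /** Returns the solver that would run, or throws for unsupported combinations. */
    static String resolve(String solver, String loss, int numFeatures, int maxFeaturesForNormal) {
        boolean huber = "huber".equals(loss);
        boolean fits = numFeatures <= maxFeaturesForNormal;
        switch (solver) {
            case "l-bfgs":
                return "l-bfgs";
            case "normal":
                if (huber) {
                    throw new IllegalArgumentException(
                        "huber loss does not support the normal solver");
                }
                if (!fits) {
                    throw new IllegalArgumentException(
                        "too many features for the normal solver: " + numFeatures);
                }
                return "normal";
            case "auto":
                // Assumption: prefer the analytical solution whenever it is allowed.
                return (!huber && fits) ? "normal" : "l-bfgs";
            default:
                throw new IllegalArgumentException("unsupported solver: " + solver);
        }
    }

    public static void main(String[] args) {
        // The limit of 100 is an arbitrary value for illustration.
        System.out.println(resolve("auto", "squaredError", 10, 100));
        System.out.println(resolve("auto", "huber", 10, 100));
    }
}
```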
- setAggregationDepth
  
  public LinearRegression setAggregationDepth(int value)
  
  Suggested depth for treeAggregate (greater than or equal to 2). If the dimensions of features or the number of partitions are large, this param could be adjusted to a larger size. Default is 2.
  
  Parameters:
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
- setLoss
  
  public LinearRegression setLoss(String value)
  
  Sets the value of param loss(). Default is "squaredError".
  
  Parameters:
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
- setEpsilon
  
  public LinearRegression setEpsilon(double value)
  
  Sets the value of param epsilon(). Default is 1.35.
  
  Parameters:
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
- setMaxBlockSizeInMB
  
  public LinearRegression setMaxBlockSizeInMB(double value)
  
  Sets the value of param maxBlockSizeInMB(). Default is 0.0, in which case 1.0 MB is chosen.
  
  Parameters:
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
- copy
  
  public LinearRegression copy(ParamMap extra)
  
  Description copied from interface: Params
  
  Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
  
  Specified by:
  
  copy in interface Params
  
  Specified by:
  
  copy in class Predictor<Vector,LinearRegression,LinearRegressionModel>
  
  Parameters:
  
  extra - (undocumented)
  
  Returns:
  
  (undocumented)
- estimateModelSize
  
  public long estimateModelSize(Dataset<?> dataset)
