Class LinearRegression
- All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, Params, HasAggregationDepth, HasElasticNetParam, HasFeaturesCol, HasFitIntercept, HasLabelCol, HasLoss, HasMaxBlockSizeInMB, HasMaxIter, HasPredictionCol, HasRegParam, HasSolver, HasStandardization, HasTol, HasWeightCol, PredictorParams, LinearRegressionParams, DefaultParamsWritable, Identifiable, MLWritable
The learning objective is to minimize the specified loss function, with regularization. This supports two kinds of loss:
- squaredError (a.k.a. squared loss)
- huber (a hybrid of squared error for relatively small errors and absolute error for relatively large ones, and we estimate the scale parameter from training data)
This supports multiple types of regularization:
- none (a.k.a. ordinary least squares)
- L2 (ridge regression)
- L1 (Lasso)
- L2 + L1 (elastic net)
The squared error objective function is:
$$ \begin{align} \min_{w}\frac{1}{2n}{\sum_{i=1}^n(X_{i}w - y_{i})^{2} + \lambda\left[\frac{1-\alpha}{2}{||w||_{2}}^{2} + \alpha{||w||_{1}}\right]} \end{align} $$
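As a minimal sketch of how the squared error objective can be evaluated for a fixed coefficient vector, the plain-Java snippet below computes the data term and the elastic net penalty separately. It is not Spark's implementation: the class and method names are hypothetical, there is no intercept (as in the formula), and it assumes the factor 1/(2n) applies to the sum of squared residuals only, with the penalty added unscaled.

```java
// Hypothetical helper, not part of the Spark API.
class SquaredErrorObjectiveSketch {

    /**
     * Evaluates (1 / (2n)) * sum_i (x_i . w - y_i)^2
     *         + lambda * ((1 - alpha) / 2 * ||w||_2^2 + alpha * ||w||_1).
     * Assumption: the penalty is not divided by 2n.
     */
    static double objective(double[][] x, double[] y, double[] w,
                            double lambda, double alpha) {
        int n = x.length;
        double squaredErrorSum = 0.0;
        for (int i = 0; i < n; i++) {
            double prediction = 0.0;
            for (int j = 0; j < w.length; j++) {
                prediction += x[i][j] * w[j];
            }
            double residual = prediction - y[i];
            squaredErrorSum += residual * residual;
        }

        double l1Norm = 0.0;
        double l2NormSquared = 0.0;
        for (double wj : w) {
            l1Norm += Math.abs(wj);
            l2NormSquared += wj * wj;
        }
        double penalty = lambda * ((1.0 - alpha) / 2.0 * l2NormSquared + alpha * l1Norm);

        return squaredErrorSum / (2.0 * n) + penalty;
    }

    public static void main(String[] args) {
        // Made-up toy data: two instances, two features.
        double[][] x = {{1.0, 0.0}, {0.0, 1.0}};
        double[] y = {1.0, 2.0};
        double[] w = {1.0, 1.0};

        System.out.println("no regularization: " + objective(x, y, w, 0.0, 0.0));
        System.out.println("ridge (alpha = 0): " + objective(x, y, w, 0.1, 0.0));
        System.out.println("lasso (alpha = 1): " + objective(x, y, w, 0.1, 1.0));
        System.out.println("elastic net:       " + objective(x, y, w, 0.1, 0.5));
    }
}
```

Here `lambda` plays the role of regParam and `alpha` the role of elasticNetParam.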
The huber objective function is:
$$ \begin{align} \min_{w, \sigma}\frac{1}{2n}{\sum_{i=1}^n\left(\sigma + H_m\left(\frac{X_{i}w - y_{i}}{\sigma}\right)\sigma\right) + \frac{1}{2}\lambda {||w||_2}^2} \end{align} $$
where
$$ \begin{align} H_m(z) = \begin{cases} z^2, & \text {if } |z| < \epsilon, \\ 2\epsilon|z| - \epsilon^2, & \text{otherwise} \end{cases} \end{align} $$
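The piecewise function H_m and the huber objective can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Spark's code: the class and method names are hypothetical, the scale `sigma` is passed in rather than optimized, there is no intercept, and the factor 1/(2n) is assumed to apply to the sum only.

```java
// Hypothetical helper, not part of the Spark API.
class HuberObjectiveSketch {

    /** H_m(z): quadratic for |z| < epsilon, linear in |z| otherwise. */
    static double huber(double z, double epsilon) {
        double absZ = Math.abs(z);
        return absZ < epsilon
                ? z * z
                : 2.0 * epsilon * absZ - epsilon * epsilon;
    }

    /**
     * Evaluates (1 / (2n)) * sum_i (sigma + H_m((x_i . w - y_i) / sigma) * sigma)
     *         + 0.5 * lambda * ||w||_2^2
     * for fixed w and sigma. Assumption: the L2 penalty is not divided by 2n.
     */
    static double objective(double[][] x, double[] y, double[] w,
                            double sigma, double lambda, double epsilon) {
        int n = x.length;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double prediction = 0.0;
            for (int j = 0; j < w.length; j++) {
                prediction += x[i][j] * w[j];
            }
            double scaledResidual = (prediction - y[i]) / sigma;
            sum += sigma + huber(scaledResidual, epsilon) * sigma;
        }
        double l2NormSquared = 0.0;
        for (double wj : w) {
            l2NormSquared += wj * wj;
        }
        return sum / (2.0 * n) + 0.5 * lambda * l2NormSquared;
    }

    public static void main(String[] args) {
        double epsilon = 1.35; // the documented default of the epsilon param
        System.out.println("H_m(0.5) = " + huber(0.5, epsilon)); // quadratic branch
        System.out.println("H_m(2.0) = " + huber(2.0, epsilon)); // linear branch

        // Made-up toy data: one feature, the second label is an outlier.
        double[][] x = {{1.0}, {1.0}};
        double[] y = {1.0, 3.0};
        double[] w = {1.0};
        System.out.println("objective = " + objective(x, y, w, 1.0, 0.0, epsilon));
    }
}
```

The two branches agree at |z| = epsilon (both give epsilon squared), so H_m is continuous, and beyond that point large residuals grow only linearly.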
Since 3.1.0, it supports stacking instances into blocks and using GEMV for better performance. If param maxBlockSizeInMB is left at its default of 0.0, a block size of 1.0 MB is used.
Note: Fitting with huber loss only supports none and L2 regularization.
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter -
Constructor Summary
Constructors -
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| final IntParam | aggregationDepth() | Param for suggested depth for treeAggregate (>= 2). |
| LinearRegression | copy(ParamMap extra) | Creates a copy of this instance with the same UID and some extra params. |
| final DoubleParam | elasticNetParam() | Param for the ElasticNet mixing parameter, in range [0, 1]. |
| final DoubleParam | epsilon() | The shape parameter to control the amount of robustness. |
| long | estimateModelSize(Dataset<?> dataset) | |
| final BooleanParam | fitIntercept() | Param for whether to fit an intercept term. |
| static LinearRegression | load(String path) | |
| Param<String> | loss() | The loss function to be optimized. |
| static int | MAX_FEATURES_FOR_NORMAL_SOLVER() | When using LinearRegression.solver == "normal", the solver must limit the number of features to at most this number. |
| final DoubleParam | maxBlockSizeInMB() | Param for Maximum memory in MB for stacking input data into blocks. |
| final IntParam | maxIter() | Param for maximum number of iterations (>= 0). |
| static MLReader<T> | read() | |
| final DoubleParam | regParam() | Param for regularization parameter (>= 0). |
| LinearRegression | setAggregationDepth(int value) | Suggested depth for treeAggregate (greater than or equal to 2). |
| LinearRegression | setElasticNetParam(double value) | Set the ElasticNet mixing parameter. |
| LinearRegression | setEpsilon(double value) | Sets the value of param epsilon(). |
| LinearRegression | setFitIntercept(boolean value) | Set if we should fit the intercept. |
| LinearRegression | setLoss(String value) | Sets the value of param loss(). |
| LinearRegression | setMaxBlockSizeInMB(double value) | Sets the value of param maxBlockSizeInMB(). |
| LinearRegression | setMaxIter(int value) | Set the maximum number of iterations. |
| LinearRegression | setRegParam(double value) | Set the regularization parameter. |
| LinearRegression | setSolver(String value) | Set the solver algorithm used for optimization. |
| LinearRegression | setStandardization(boolean value) | Whether to standardize the training features before fitting the model. |
| LinearRegression | setTol(double value) | Set the convergence tolerance of iterations. |
| LinearRegression | setWeightCol(String value) | Whether to over-/under-sample training instances according to the given weights in weightCol. |
| Param<String> | solver() | The solver algorithm for optimization. |
| final BooleanParam | standardization() | Param for whether to standardize the training features before fitting the model. |
| final DoubleParam | tol() | Param for the convergence tolerance for iterative algorithms (>= 0). |
| String | uid() | An immutable unique ID for the object and its derivatives. |
| Param<String> | weightCol() | Param for weight column name. |

Methods inherited from class org.apache.spark.ml.Predictor
featuresCol, fit, labelCol, predictionCol, setFeaturesCol, setLabelCol, setPredictionCol, transformSchema
Methods inherited from class org.apache.spark.ml.PipelineStage
params
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.ml.util.DefaultParamsWritable
write
Methods inherited from interface org.apache.spark.ml.param.shared.HasAggregationDepth
getAggregationDepth
Methods inherited from interface org.apache.spark.ml.param.shared.HasElasticNetParam
getElasticNetParam
Methods inherited from interface org.apache.spark.ml.param.shared.HasFeaturesCol
featuresCol, getFeaturesCol
Methods inherited from interface org.apache.spark.ml.param.shared.HasFitIntercept
getFitIntercept
Methods inherited from interface org.apache.spark.ml.param.shared.HasLabelCol
getLabelCol, labelCol
Methods inherited from interface org.apache.spark.ml.param.shared.HasMaxBlockSizeInMB
getMaxBlockSizeInMB
Methods inherited from interface org.apache.spark.ml.param.shared.HasMaxIter
getMaxIter
Methods inherited from interface org.apache.spark.ml.param.shared.HasPredictionCol
getPredictionCol, predictionCol
Methods inherited from interface org.apache.spark.ml.param.shared.HasRegParam
getRegParam
Methods inherited from interface org.apache.spark.ml.param.shared.HasStandardization
getStandardization
Methods inherited from interface org.apache.spark.ml.param.shared.HasWeightCol
getWeightCol
Methods inherited from interface org.apache.spark.ml.util.Identifiable
toString
Methods inherited from interface org.apache.spark.ml.regression.LinearRegressionParams
getEpsilon, validateAndTransformSchema
Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logBasedOnLevel, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, MDC, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
Methods inherited from interface org.apache.spark.ml.util.MLWritable
save
Methods inherited from interface org.apache.spark.ml.param.Params
clear, copyValues, defaultCopy, defaultParamMap, estimateMatadataSize, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn
-
Constructor Details
-
LinearRegression
-
LinearRegression
public LinearRegression()
-
-
Method Details
-
load
-
MAX_FEATURES_FOR_NORMAL_SOLVER
public static int MAX_FEATURES_FOR_NORMAL_SOLVER()
When using LinearRegression.solver == "normal", the solver must limit the number of features to at most this number. The entire covariance matrix X^T X will be collected to the driver. This limit helps prevent memory overflow errors.
- Returns:
- (undocumented)
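To see why a feature limit matters when X^T X is collected to the driver, the following back-of-the-envelope sketch estimates the size of a dense d x d matrix of doubles. The class name, the method name, and the feature counts are hypothetical, and dense storage is an assumption of this sketch; the actual implementation may store the matrix more compactly.

```java
// Hypothetical helper, not part of the Spark API.
class GramMatrixMemorySketch {

    /** Bytes needed for a dense numFeatures x numFeatures matrix of doubles. */
    static long denseGramMatrixBytes(int numFeatures) {
        return (long) numFeatures * numFeatures * Double.BYTES;
    }

    public static void main(String[] args) {
        // Illustrative feature counts only; none of these is the documented limit.
        int[] featureCounts = {1_000, 10_000, 100_000};
        for (int d : featureCounts) {
            double mebibytes = denseGramMatrixBytes(d) / (1024.0 * 1024.0);
            System.out.printf("d = %d -> about %.1f MiB%n", d, mebibytes);
        }
    }
}
```

Memory grows quadratically with the number of features, so a tenfold increase in d costs a hundredfold more driver memory.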
-
read
-
solver
Description copied from interface: LinearRegressionParams
The solver algorithm for optimization. Supported options: "l-bfgs", "normal" and "auto". Default: "auto".
- Specified by:
- solver in interface HasSolver
- Specified by:
- solver in interface LinearRegressionParams
- Returns:
- (undocumented)
-
loss
Description copied from interface: LinearRegressionParams
The loss function to be optimized. Supported options: "squaredError" and "huber". Default: "squaredError".
- Specified by:
- loss in interface HasLoss
- Specified by:
- loss in interface LinearRegressionParams
- Returns:
- (undocumented)
-
epsilon
Description copied from interface: LinearRegressionParams
The shape parameter to control the amount of robustness. Must be > 1.0. At larger values of epsilon, the huber criterion becomes more similar to least squares regression; for small values of epsilon, the criterion is more similar to L1 regression. Default is 1.35 to get as much robustness as possible while retaining 95% statistical efficiency for normally distributed data. It matches sklearn HuberRegressor and is "M" from "A robust hybrid of lasso and ridge regression". Only valid when "loss" is "huber".
- Specified by:
- epsilon in interface LinearRegressionParams
- Returns:
- (undocumented)
-
maxBlockSizeInMB
Description copied from interface: HasMaxBlockSizeInMB
Param for maximum memory in MB for stacking input data into blocks. Data is stacked within partitions. If the value exceeds the remaining data size in a partition, it is adjusted to the data size. The default of 0.0 means an optimal value is chosen, depending on the specific algorithm. Must be >= 0.
- Specified by:
- maxBlockSizeInMB in interface HasMaxBlockSizeInMB
- Returns:
- (undocumented)
-
aggregationDepth
Description copied from interface: HasAggregationDepth
Param for suggested depth for treeAggregate (>= 2).
- Specified by:
- aggregationDepth in interface HasAggregationDepth
- Returns:
- (undocumented)
-
weightCol
Description copied from interface: HasWeightCol
Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.
- Specified by:
- weightCol in interface HasWeightCol
- Returns:
- (undocumented)
-
standardization
Description copied from interface: HasStandardization
Param for whether to standardize the training features before fitting the model.
- Specified by:
- standardization in interface HasStandardization
- Returns:
- (undocumented)
-
fitIntercept
Description copied from interface: HasFitIntercept
Param for whether to fit an intercept term.
- Specified by:
- fitIntercept in interface HasFitIntercept
- Returns:
- (undocumented)
-
tol
Description copied from interface: HasTol
Param for the convergence tolerance for iterative algorithms (>= 0).
-
maxIter
Description copied from interface: HasMaxIter
Param for maximum number of iterations (>= 0).
- Specified by:
- maxIter in interface HasMaxIter
- Returns:
- (undocumented)
-
elasticNetParam
Description copied from interface: HasElasticNetParam
Param for the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
- Specified by:
- elasticNetParam in interface HasElasticNetParam
- Returns:
- (undocumented)
-
regParam
Description copied from interface: HasRegParam
Param for regularization parameter (>= 0).
- Specified by:
- regParam in interface HasRegParam
- Returns:
- (undocumented)
-
uid
Description copied from interface: Identifiable
An immutable unique ID for the object and its derivatives.
- Specified by:
- uid in interface Identifiable
- Returns:
- (undocumented)
-
setRegParam
Set the regularization parameter. Default is 0.0.
- Parameters:
- value - (undocumented)
- Returns:
- (undocumented)
-
setFitIntercept
Set if we should fit the intercept. Default is true.
- Parameters:
- value - (undocumented)
- Returns:
- (undocumented)
-
setStandardization
Whether to standardize the training features before fitting the model. The coefficients of models are always returned on the original scale, so standardization is transparent to users. Default is true.
- Parameters:
- value - (undocumented)
- Returns:
- (undocumented)
- Note:
- With or without standardization, the models should always converge to the same solution when no regularization is applied. In R's GLMNET package, the default behavior is true as well.
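The claim that coefficients come back on the original scale, and that unregularized fits agree with or without standardization, can be illustrated with a one-feature least squares fit. This is a minimal sketch, not Spark's procedure: the class and method names are hypothetical, the data is made up, and the feature is only divided by its sample standard deviation before fitting.

```java
// Hypothetical helper, not part of the Spark API.
class StandardizationSketch {

    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) sum += x;
        return sum / v.length;
    }

    /** Sample standard deviation (divides by n - 1). */
    static double sampleStd(double[] v) {
        double m = mean(v);
        double sumSq = 0.0;
        for (double x : v) sumSq += (x - m) * (x - m);
        return Math.sqrt(sumSq / (v.length - 1));
    }

    /** Ordinary least squares slope of y on x, with an intercept. */
    static double olsSlope(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double covariance = 0.0;
        double variance = 0.0;
        for (int i = 0; i < x.length; i++) {
            covariance += (x[i] - mx) * (y[i] - my);
            variance += (x[i] - mx) * (x[i] - mx);
        }
        return covariance / variance;
    }

    /** Fits on the scaled feature, then maps the slope back to the original scale. */
    static double slopeViaStandardization(double[] x, double[] y) {
        double std = sampleStd(x);
        double[] scaled = new double[x.length];
        for (int i = 0; i < x.length; i++) scaled[i] = x[i] / std;
        double slopeOnScaledFeature = olsSlope(scaled, y);
        return slopeOnScaledFeature / std; // undo the scaling
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 4.0};
        double[] y = {3.0, 5.0, 7.0, 9.0}; // y = 2x + 1
        System.out.println("direct slope:           " + olsSlope(x, y));
        System.out.println("via standardized input: " + slopeViaStandardization(x, y));
    }
}
```

With a penalty term the two fits would generally differ, because the penalty acts on the coefficient of the scaled feature.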
-
setElasticNetParam
Set the ElasticNet mixing parameter. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. For alpha in (0,1), the penalty is a combination of L1 and L2. Default is 0.0 which is an L2 penalty.
Note: Fitting with huber loss only supports none and L2 regularization, so an exception is thrown if this param is set to a non-zero value.
- Parameters:
- value - (undocumented)
- Returns:
- (undocumented)
-
setMaxIter
Set the maximum number of iterations. Default is 100.
- Parameters:
- value - (undocumented)
- Returns:
- (undocumented)
-
setTol
Set the convergence tolerance of iterations. A smaller value leads to higher accuracy at the cost of more iterations. Default is 1E-6.
- Parameters:
- value - (undocumented)
- Returns:
- (undocumented)
-
setWeightCol
Whether to over-/under-sample training instances according to the given weights in weightCol. If not set or empty, all instances are treated equally (weight 1.0). Default is not set.
- Parameters:
- value - (undocumented)
- Returns:
- (undocumented)
-
setSolver
Set the solver algorithm used for optimization. In case of linear regression, this can be "l-bfgs", "normal" and "auto".
- "l-bfgs" denotes Limited-memory BFGS which is a limited-memory quasi-Newton optimization method.
- "normal" denotes using Normal Equation as an analytical solution to the linear regression problem. This solver is limited to LinearRegression.MAX_FEATURES_FOR_NORMAL_SOLVER.
- "auto" (default) means that the solver algorithm is selected automatically. The Normal Equations solver will be used when possible, but this will automatically fall back to iterative optimization methods when needed.
Note: Fitting with huber loss doesn't support the normal solver, so an exception is thrown if this param is set to "normal".
- Parameters:
- value - (undocumented)
- Returns:
- (undocumented)
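For intuition about what an analytical Normal Equation solution looks like, the sketch below solves the unregularized one-feature case with an intercept in closed form using Cramer's rule on the 2 x 2 system. It is an illustration only: the class and method names are hypothetical, the data is made up, and Spark's "normal" solver is not claimed to work this way internally.

```java
// Hypothetical helper, not part of the Spark API.
class NormalEquationSketch {

    /**
     * Solves the normal equations for y ~ slope * x + intercept:
     *   sxx * slope + sx * intercept = sxy
     *   sx  * slope + n  * intercept = sy
     * Returns {slope, intercept}.
     */
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        double determinant = n * sxx - sx * sx;
        if (determinant == 0.0) {
            // All x values are identical, so the system has no unique solution.
            throw new IllegalArgumentException("x has zero variance");
        }
        double slope = (n * sxy - sx * sy) / determinant;
        double intercept = (sxx * sy - sx * sxy) / determinant;
        return new double[] {slope, intercept};
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 4.0};
        double[] y = {3.0, 5.0, 7.0, 9.0}; // y = 2x + 1
        double[] solution = fit(x, y);
        System.out.println("slope = " + solution[0] + ", intercept = " + solution[1]);
    }
}
```

No iterations are needed here, which is the appeal of a closed-form solve; an iterative method such as L-BFGS trades that for the ability to handle problems where a direct solve is not feasible.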
-
setAggregationDepth
Suggested depth for treeAggregate (greater than or equal to 2). If the dimensions of features or the number of partitions are large, this param could be adjusted to a larger size. Default is 2.
- Parameters:
- value - (undocumented)
- Returns:
- (undocumented)
-
setLoss
Sets the value of param loss(). Default is "squaredError".
- Parameters:
- value - (undocumented)
- Returns:
- (undocumented)
-
setEpsilon
Sets the value of param epsilon(). Default is 1.35.
- Parameters:
- value - (undocumented)
- Returns:
- (undocumented)
-
setMaxBlockSizeInMB
Sets the value of param maxBlockSizeInMB(). Default is 0.0, in which case 1.0 MB is used.
- Parameters:
- value - (undocumented)
- Returns:
- (undocumented)
-
copy
Description copied from interface: Params
Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
- Specified by:
- copy in interface Params
- Specified by:
- copy in class Predictor<Vector, LinearRegression, LinearRegressionModel>
- Parameters:
- extra - (undocumented)
- Returns:
- (undocumented)
-
estimateModelSize
-