spark.logit {SparkR}  R Documentation 
Fits a logistic regression model against a SparkDataFrame. It supports "binomial": Binary logistic regression with pivoting; "multinomial": Multinomial logistic (softmax) regression without pivoting, similar to glmnet. Users can print the produced model, make predictions with it, and save it to the input path.
spark.logit(data, formula, ...)

## S4 method for signature 'SparkDataFrame,formula'
spark.logit(
  data,
  formula,
  regParam = 0,
  elasticNetParam = 0,
  maxIter = 100,
  tol = 1e-06,
  family = "auto",
  standardization = TRUE,
  thresholds = 0.5,
  weightCol = NULL,
  aggregationDepth = 2,
  lowerBoundsOnCoefficients = NULL,
  upperBoundsOnCoefficients = NULL,
  lowerBoundsOnIntercepts = NULL,
  upperBoundsOnIntercepts = NULL,
  handleInvalid = c("error", "keep", "skip")
)

## S4 method for signature 'LogisticRegressionModel'
summary(object)

## S4 method for signature 'LogisticRegressionModel'
predict(object, newData)

## S4 method for signature 'LogisticRegressionModel,character'
write.ml(object, path, overwrite = FALSE)
data 
SparkDataFrame for training. 
formula 
A symbolic description of the model to be fitted. Currently only a few formula operators are supported, including '~', '.', ':', '+', and '-'. 
... 
additional arguments passed to the method. 
regParam 
the regularization parameter. 
elasticNetParam 
the ElasticNet mixing parameter, referred to here as alpha. For alpha = 0.0, the penalty is an L2 penalty. For alpha = 1.0, it is an L1 penalty. For 0.0 < alpha < 1.0, the penalty is a combination of L1 and L2. Default is 0.0, which is an L2 penalty. 
maxIter 
maximum iteration number. 
tol 
convergence tolerance of iterations. 
family 
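the name of the family, which describes the label distribution to be used in the model. Supported options: "auto" (the default; selects the family automatically based on the number of classes), "binomial" (binary logistic regression with pivoting), "multinomial" (multinomial logistic (softmax) regression without pivoting). 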
standardization 
whether to standardize the training features before fitting the model. The coefficients of the model are always returned on the original scale, so standardization is transparent to users. Note that, with or without standardization, the model should always converge to the same solution when no regularization is applied. Default is TRUE, same as glmnet. 
thresholds 
in binary classification, a single value in the range [0, 1]. If the estimated probability of class label 1 is > threshold, then predict 1, else 0. A high threshold encourages the model to predict 0 more often; a low threshold encourages the model to predict 1 more often. Note: Setting this with threshold p is equivalent to setting thresholds c(1-p, p). In multiclass (or binary) classification, an array used to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. 
weightCol 
The weight column name. 
aggregationDepth 
The depth for treeAggregate (greater than or equal to 2). If the dimensions of features or the number of partitions are large, this param could be adjusted to a larger size. This is an expert parameter. Default value should be good for most cases. 
lowerBoundsOnCoefficients 
The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. It is an R matrix. 
upperBoundsOnCoefficients 
The upper bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. It is an R matrix. 
lowerBoundsOnIntercepts 
The lower bounds on intercepts if fitting under bound constrained optimization. The bounds vector size must be equal to 1 for binomial regression, or the number of classes for multinomial regression. 
upperBoundsOnIntercepts 
The upper bounds on intercepts if fitting under bound constrained optimization. The bound vector size must be equal to 1 for binomial regression, or the number of classes for multinomial regression. 
handleInvalid 
How to handle invalid data (unseen labels or NULL values) in features and label column of string type. Supported options: "skip" (filter out rows with invalid data), "error" (throw an error), "keep" (put invalid data in a special additional bucket, at index numLabels). Default is "error". 
object 
a LogisticRegressionModel fitted by spark.logit. 
newData 
a SparkDataFrame for testing. 
path 
The directory where the model is saved. 
overwrite 
Whether to overwrite if the output path already exists. Default is FALSE, which means an exception is thrown if the output path exists. 
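The p/t rule described under thresholds can be illustrated in plain R, without a Spark session. The following is a minimal sketch only: predict_with_thresholds is a hypothetical helper written for this illustration, not a SparkR function, and returning a 0-based class index is a convention chosen here.

```r
# Hypothetical helper (not part of SparkR): apply the p/t rule to one row.
# probs      - numeric vector of class probabilities
# thresholds - numeric vector of per-class thresholds, same length as probs
# Returns the 0-based index of the class with the largest p/t.
predict_with_thresholds <- function(probs, thresholds) {
  stopifnot(length(probs) == length(thresholds), all(thresholds >= 0))
  scaled <- probs / thresholds   # p / t for every class
  which.max(scaled) - 1L         # which.max is 1-based; shift to 0-based
}

# Binary case: a single threshold of 0.7 corresponds to c(1 - 0.7, 0.7).
predict_with_thresholds(c(0.35, 0.65), c(0.3, 0.7))  # 0, since 0.65 is not > 0.7
predict_with_thresholds(c(0.25, 0.75), c(0.3, 0.7))  # 1, since 0.75 > 0.7

# Three classes: a small threshold on class 0 makes it win despite p = 0.2.
predict_with_thresholds(c(0.2, 0.5, 0.3), c(0.1, 0.5, 0.3))  # 0
```

The binary calls show why a scalar threshold p and the vector c(1-p, p) lead to the same prediction: class 1 has the larger p/t exactly when its probability exceeds p.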
spark.logit returns a fitted logistic regression model.
summary returns summary information of the fitted model, which is a list. The list includes coefficients (coefficients matrix of the fitted model).
predict returns the predicted values based on a LogisticRegressionModel.
spark.logit since 2.1.0
summary(LogisticRegressionModel) since 2.1.0
predict(LogisticRegressionModel) since 2.1.0
write.ml(LogisticRegression, character) since 2.1.0
## Not run:
sparkR.session()
# binary logistic regression
t <- as.data.frame(Titanic)
training <- createDataFrame(t)
model <- spark.logit(training, Survived ~ ., regParam = 0.5)
summary <- summary(model)

# fitted values on training data
fitted <- predict(model, training)

# save fitted model to input path
path <- "path/to/model"
write.ml(model, path)

# can also read back the saved model and predict
# Note that summary does not work on loaded model
savedModel <- read.ml(path)
summary(savedModel)

# binary logistic regression against two classes with
# upperBoundsOnCoefficients and upperBoundsOnIntercepts
ubc <- matrix(c(1.0, 0.0, 1.0, 0.0), nrow = 1, ncol = 4)
model <- spark.logit(training, Species ~ .,
                     upperBoundsOnCoefficients = ubc,
                     upperBoundsOnIntercepts = 1.0)

# multinomial logistic regression
model <- spark.logit(training, Class ~ ., regParam = 0.5)
summary <- summary(model)

# multinomial logistic regression with
# lowerBoundsOnCoefficients and lowerBoundsOnIntercepts
lbc <- matrix(c(0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0), nrow = 2, ncol = 4)
lbi <- as.array(c(0.0, 0.0))
model <- spark.logit(training, Species ~ ., family = "multinomial",
                     lowerBoundsOnCoefficients = lbc,
                     lowerBoundsOnIntercepts = lbi)
## End(Not run)
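The bound arguments used in the examples must match the shapes given in the argument descriptions: one row (and one intercept bound) for binomial regression, and one row (and one intercept bound) per class for multinomial regression. As a rough sketch, those shapes could be checked locally before calling spark.logit. check_bounds is a hypothetical helper for illustration, not part of SparkR, and it assumes the caller already knows the family, the number of classes, and the number of features.

```r
# Hypothetical helper (not part of SparkR): validate bound shapes locally.
# coef_bounds      - R matrix of lower or upper bounds on coefficients
# intercept_bounds - numeric vector (or array) of bounds on intercepts
# family           - "binomial" or "multinomial"
check_bounds <- function(coef_bounds, intercept_bounds, family,
                         num_classes, num_features) {
  # Binomial regression uses a single row; multinomial uses one row per class.
  rows <- if (family == "binomial") 1L else num_classes
  stopifnot(
    is.matrix(coef_bounds),
    nrow(coef_bounds) == rows,
    ncol(coef_bounds) == num_features,
    length(intercept_bounds) == rows
  )
  invisible(TRUE)
}

# Same shapes as the bound examples: 4 features, 2 classes.
ubc <- matrix(c(1.0, 0.0, 1.0, 0.0), nrow = 1, ncol = 4)
check_bounds(ubc, 1.0, "binomial", num_classes = 2, num_features = 4)

lbc <- matrix(c(0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0), nrow = 2, ncol = 4)
lbi <- as.array(c(0.0, 0.0))
check_bounds(lbc, lbi, "multinomial", num_classes = 2, num_features = 4)
```

A mismatch, such as passing the one-row matrix ubc with family = "multinomial", stops with an error from stopifnot instead of failing later inside the Spark job.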