spark.kmeans {SparkR} | R Documentation

Fits a k-means clustering model against a SparkDataFrame, similarly to R's kmeans(). Users can call `summary` to print a summary of the fitted model, `predict` to make predictions on new data, and `write.ml`/`read.ml` to save/load fitted models.

    spark.kmeans(data, formula, ...)

    ## S4 method for signature 'SparkDataFrame,formula'
    spark.kmeans(data, formula, k = 2, maxIter = 20,
      initMode = c("k-means||", "random"), seed = NULL, initSteps = 2,
      tol = 1e-04)

    ## S4 method for signature 'KMeansModel'
    summary(object)

    ## S4 method for signature 'KMeansModel'
    predict(object, newData)

    ## S4 method for signature 'KMeansModel,character'
    write.ml(object, path, overwrite = FALSE)

| Argument | Description |
|---|---|
| `data` | a SparkDataFrame for training. |
| `formula` | a symbolic description of the model to be fitted. Currently only a few formula operators are supported, including '~', '.', ':', '+', and '-'. Note that the response variable of the formula is empty in spark.kmeans. |
| `...` | additional argument(s) passed to the method. |
| `k` | number of centers. |
| `maxIter` | maximum number of iterations. |
| `initMode` | the initialization algorithm used to fit the model, either "k-means\|\|" (the default) or "random". |
| `seed` | the random seed for cluster initialization. |
| `initSteps` | the number of steps for the k-means\|\| initialization mode. This is an advanced setting; the default of 2 is almost always enough. Must be > 0. |
| `tol` | convergence tolerance of iterations. |
| `object` | a fitted k-means model. |
| `newData` | a SparkDataFrame for testing. |
| `path` | the directory where the model is saved. |
| `overwrite` | whether to overwrite the output path if it already exists. Default is FALSE, which means an exception is thrown if the output path exists. |
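
As a minimal sketch of how these arguments fit together, the call below fits a model with a one-sided formula (no response variable) and a fixed `seed`. The `iris` dataset, the chosen feature columns, and the values of `k`, `maxIter`, and `seed` are illustrative assumptions, not recommendations from this page.

```r
library(SparkR)
sparkR.session()

# iris ships with base R. SparkR replaces "." in column names with "_"
# (e.g. Sepal.Length becomes Sepal_Length) and warns about it.
df <- suppressWarnings(createDataFrame(iris))

# One-sided formula: only feature columns, no response variable.
model <- spark.kmeans(df, ~ Sepal_Length + Sepal_Width + Petal_Length + Petal_Width,
                      k = 3,                  # requested number of centers
                      maxIter = 50,           # upper bound on iterations
                      initMode = "k-means||", # same as the default
                      seed = 42,              # fixed seed for cluster initialization
                      initSteps = 2,
                      tol = 1e-4)
```

Passing a fixed `seed` is intended to make the cluster initialization repeatable across runs on the same data.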

`spark.kmeans` returns a fitted k-means model.

`summary` returns summary information of the fitted model, which is a list. The list includes the model's `k` (the configured number of cluster centers), `coefficients` (model cluster centers), `size` (number of data points in each cluster), `cluster` (cluster centers of the transformed data), `is.loaded` (whether the model is loaded from a saved file), and `clusterSize` (the actual number of cluster centers; when using initMode = "random", `clusterSize` may not equal `k`).
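
The elements of the summary list can be read with `$`, as in the sketch below. The dataset (`iris`), the formula, and the variable names `df`, `model`, and `s` are assumptions made for illustration only.

```r
library(SparkR)
sparkR.session()

# Illustrative data; "." in column names becomes "_" in the SparkDataFrame.
df <- suppressWarnings(createDataFrame(iris))
model <- spark.kmeans(df, ~ Sepal_Length + Sepal_Width, k = 3, seed = 1)

s <- summary(model)

s$k             # configured number of cluster centers
s$clusterSize   # actual number of cluster centers found
s$coefficients  # cluster centers, one row per cluster
s$size          # number of data points in each cluster
s$is.loaded     # FALSE here, because the model was fitted in this session

# `cluster` is inspected only for a model that was not loaded from disk.
if (!s$is.loaded) {
  head(s$cluster)
}
```

Comparing `s$clusterSize` against `s$k` is a quick way to notice when fewer centers were produced than requested.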

`predict` returns the predicted values based on a k-means model.

spark.kmeans since 2.0.0

summary(KMeansModel) since 2.0.0

predict(KMeansModel) since 2.0.0

write.ml(KMeansModel, character) since 2.0.0

```
## Not run:
sparkR.session()
t <- as.data.frame(Titanic)
df <- createDataFrame(t)
model <- spark.kmeans(df, Class ~ Survived, k = 4, initMode = "random")
summary(model)

# fitted values on training data
fitted <- predict(model, df)
head(select(fitted, "Class", "prediction"))

# save fitted model to input path
path <- "path/to/model"
write.ml(model, path)

# can also read back the saved model and print
savedModel <- read.ml(path)
summary(savedModel)
## End(Not run)
```

[Package *SparkR* version 2.4.2 Index]