PowerIterationClustering
-
class pyspark.ml.clustering.PowerIterationClustering(*, k: int = 2, maxIter: int = 20, initMode: str = 'random', srcCol: str = 'src', dstCol: str = 'dst', weightCol: Optional[str] = None)
- Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.
- This class is not yet an Estimator/Transformer; use the assignClusters() method to run the PowerIterationClustering algorithm.
- New in version 2.4.0.
- Notes
- See Wikipedia on Spectral clustering.
- Examples

>>> # Assumes an active SparkSession bound to spark and a writable directory path in temp_path.
>>> from pyspark.ml.clustering import PowerIterationClustering
>>> data = [(1, 0, 0.5),
...         (2, 0, 0.5), (2, 1, 0.7),
...         (3, 0, 0.5), (3, 1, 0.7), (3, 2, 0.9),
...         (4, 0, 0.5), (4, 1, 0.7), (4, 2, 0.9), (4, 3, 1.1),
...         (5, 0, 0.5), (5, 1, 0.7), (5, 2, 0.9), (5, 3, 1.1), (5, 4, 1.3)]
>>> df = spark.createDataFrame(data).toDF("src", "dst", "weight").repartition(1)
>>> pic = PowerIterationClustering(k=2, weightCol="weight")
>>> pic.setMaxIter(40)
PowerIterationClustering...
>>> assignments = pic.assignClusters(df)
>>> assignments.sort(assignments.id).show(truncate=False)
+---+-------+
|id |cluster|
+---+-------+
|0  |0      |
|1  |0      |
|2  |0      |
|3  |0      |
|4  |0      |
|5  |1      |
+---+-------+
...
>>> pic_path = temp_path + "/pic"
>>> pic.save(pic_path)
>>> pic2 = PowerIterationClustering.load(pic_path)
>>> pic2.getK()
2
>>> pic2.getMaxIter()
40
>>> pic2.assignClusters(df).take(6) == assignments.take(6)
True

- Methods
- assignClusters(dataset): Runs the PIC algorithm and returns a cluster assignment for each input vertex.
- clear(param): Clears a param from the param map if it has been explicitly set.
- copy([extra]): Creates a copy of this instance with the same uid and some extra params.
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap([extra]): Extracts the embedded default param values and user-supplied values, then merges them with extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
- getDstCol(): Gets the value of dstCol or its default value.
- getInitMode(): Gets the value of initMode or its default value.
- getK(): Gets the value of k or its default value.
- getMaxIter(): Gets the value of maxIter or its default value.
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value.
- getParam(paramName): Gets a param by its name.
- getSrcCol(): Gets the value of srcCol or its default value.
- getWeightCol(): Gets the value of weightCol or its default value.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- isDefined(param): Checks whether a param is explicitly set by the user or has a default value.
- isSet(param): Checks whether a param is explicitly set by the user.
- load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
- read(): Returns an MLReader instance for this class.
- save(path): Saves this ML instance to the given path, a shortcut of write().save(path).
- set(param, value): Sets a parameter in the embedded param map.
- setDstCol(value): Sets the value of dstCol.
- setInitMode(value): Sets the value of initMode.
- setK(value): Sets the value of k.
- setMaxIter(value): Sets the value of maxIter.
- setParams(self, *[, k, maxIter, initMode, …]): Sets params for PowerIterationClustering.
- setSrcCol(value): Sets the value of srcCol.
- setWeightCol(value): Sets the value of weightCol.
- write(): Returns an MLWriter instance for this ML instance.
- Attributes
- params: Returns all params ordered by name.
- Methods Documentation
-
assignClusters(dataset: pyspark.sql.dataframe.DataFrame) → pyspark.sql.dataframe.DataFrame
- Runs the PIC algorithm and returns a cluster assignment for each input vertex.
- Parameters
- dataset : pyspark.sql.DataFrame
- A dataset with columns src, dst, weight representing the affinity matrix, which is the matrix A in the PIC paper. Suppose the src column value is i, the dst column value is j, and the weight column value is the similarity s_ij, which must be nonnegative. The matrix is symmetric, hence s_ij = s_ji. For any (i, j) with nonzero similarity, there should be either (i, j, s_ij) or (j, i, s_ji) in the input. Rows with i = j are ignored, because we assume s_ij = 0.0 when i = j.
- Returns
- pyspark.sql.DataFrame
- A dataset that contains columns of vertex id and the corresponding cluster for the id. Its schema is:
  - id: Long
  - cluster: Int
- New in version 2.4.0.
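To make the input contract concrete, here is a minimal pure-Python sketch of the idea behind PIC. It is not Spark's implementation: the helper names `build_affinity` and `power_iteration_embedding` are hypothetical, the degree-based start vector is one possible choice, and the final k-means step that turns the embedding into cluster ids is omitted. The sketch expands one-directional (src, dst, weight) rows into a symmetric affinity map, drops self-loops, and runs a fixed number of power-iteration steps on the row-normalized matrix.

```python
from collections import defaultdict


def build_affinity(rows):
    """Expand (src, dst, weight) rows into a symmetric affinity map."""
    affinity = defaultdict(dict)
    for i, j, s in rows:
        if s < 0:
            raise ValueError("similarities must be nonnegative")
        if i == j or s == 0:
            continue  # self-loops are ignored; zero similarity means no edge
        affinity[i][j] = s
        affinity[j][i] = s  # symmetry: s_ij = s_ji
    return affinity


def power_iteration_embedding(rows, max_iter=20):
    """Return a one-dimensional embedding {vertex: value} after max_iter steps."""
    affinity = build_affinity(rows)
    degree = {i: sum(nbrs.values()) for i, nbrs in affinity.items()}
    total = sum(degree.values())
    # Degree-based start vector: each vertex's share of the total similarity.
    v = {i: d / total for i, d in degree.items()}
    for _ in range(max_iter):
        # One step of v <- W v, with W = D^-1 A (row-normalized affinity).
        nxt = {i: sum(s * v[j] for j, s in nbrs.items()) / degree[i]
               for i, nbrs in affinity.items()}
        norm = sum(abs(x) for x in nxt.values())
        v = {i: x / norm for i, x in nxt.items()}  # rescale to unit L1 norm
    return v


# The edge list from the class-level example: one row per unordered pair.
rows = [(1, 0, 0.5),
        (2, 0, 0.5), (2, 1, 0.7),
        (3, 0, 0.5), (3, 1, 0.7), (3, 2, 0.9),
        (4, 0, 0.5), (4, 1, 0.7), (4, 2, 0.9), (4, 3, 1.1),
        (5, 0, 0.5), (5, 1, 0.7), (5, 2, 0.9), (5, 3, 1.1), (5, 4, 1.3)]
embedding = power_iteration_embedding(rows, max_iter=40)
```

The iteration count is deliberately bounded: the method relies on truncated power iteration, and a clustering step such as k-means would then partition the embedding values into k groups.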
 
 - 
clear(param: pyspark.ml.param.Param) → None
- Clears a param from the param map if it has been explicitly set.
-
copy(extra: Optional[ParamMap] = None) → JP
- Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters
- extra : dict, optional
- Extra parameters to copy to the new instance.
- Returns
- JavaParams
- Copy of this instance.
 
 
 - 
explainParam(param: Union[str, pyspark.ml.param.Param]) → str
- Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
-
explainParams() → str
- Returns the documentation of all params with their optional default values and user-supplied values.
-
extractParamMap(extra: Optional[ParamMap] = None) → ParamMap
- Extracts the embedded default param values and user-supplied values, then merges them with extra values from the input into a flat param map. On conflict the later value wins, i.e., the ordering is: default param values < user-supplied values < extra.
- Parameters
- extra : dict, optional
- Extra param values.
- Returns
- dict
- Merged param map.
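The precedence rule can be pictured with plain dictionaries. This is only an analogy with hypothetical values, not the Params implementation (a real param map is keyed by Param objects rather than strings): later mappings win on conflict, so the merge order is defaults, then user-supplied values, then extra.

```python
# Hypothetical values standing in for the three sources of param values.
defaults = {"k": 2, "maxIter": 20, "initMode": "random"}  # default param values
user_set = {"maxIter": 40}                                # e.g. after setMaxIter(40)
extra = {"k": 3, "maxIter": 10}                           # passed as the extra argument

# Later dicts override earlier ones: default < user-supplied < extra.
merged = {**defaults, **user_set, **extra}
```

Here `maxIter` appears in all three sources and takes the value from `extra`, while `initMode` appears only in the defaults and is carried through unchanged.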
 
 
 - 
getMaxIter() → int
- Gets the value of maxIter or its default value.
-
getOrDefault(param: Union[str, pyspark.ml.param.Param[T]]) → Union[Any, T]
- Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
-
getParam(paramName: str) → pyspark.ml.param.Param
- Gets a param by its name.
-
getWeightCol() → str
- Gets the value of weightCol or its default value.
-
hasDefault(param: Union[str, pyspark.ml.param.Param[Any]]) → bool
- Checks whether a param has a default value.
-
hasParam(paramName: str) → bool
- Tests whether this instance contains a param with a given (string) name.
-
isDefined(param: Union[str, pyspark.ml.param.Param[Any]]) → bool
- Checks whether a param is explicitly set by the user or has a default value.
-
isSet(param: Union[str, pyspark.ml.param.Param[Any]]) → bool
- Checks whether a param is explicitly set by the user.
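The relationship between hasDefault, isSet, isDefined, getOrDefault, set, and clear can be sketched with a toy class that keeps defaults and user-supplied values in two separate dictionaries. `ToyParams` and its method names are hypothetical stand-ins written for this illustration, not part of pyspark; raising `KeyError` when neither value exists is this sketch's own choice.

```python
class ToyParams:
    """Toy model of param bookkeeping: defaults and user-set values kept apart."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._user = {}

    def set(self, name, value):
        self._user[name] = value

    def clear(self, name):
        # Removes only the user-supplied value; any default stays in place.
        self._user.pop(name, None)

    def has_default(self, name):
        return name in self._defaults

    def is_set(self, name):
        return name in self._user

    def is_defined(self, name):
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(name)  # neither a user-supplied value nor a default


toy = ToyParams({"k": 2, "maxIter": 20})
toy.set("maxIter", 40)
```

After `toy.clear("maxIter")`, `toy.get_or_default("maxIter")` falls back to the default again, which mirrors the description of clear as affecting only explicitly set values.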
 - 
classmethod load(path: str) → RL
- Reads an ML instance from the input path, a shortcut of read().load(path).
-
classmethod read() → pyspark.ml.util.JavaMLReader[RL]
- Returns an MLReader instance for this class.
-
save(path: str) → None
- Saves this ML instance to the given path, a shortcut of write().save(path).
-
set(param: pyspark.ml.param.Param, value: Any) → None
- Sets a parameter in the embedded param map.
-
setDstCol(value: str) → pyspark.ml.clustering.PowerIterationClustering
- Sets the value of dstCol.
- New in version 2.4.0.
-
setInitMode(value: str) → pyspark.ml.clustering.PowerIterationClustering
- Sets the value of initMode.
- New in version 2.4.0.
-
setK(value: int) → pyspark.ml.clustering.PowerIterationClustering
- Sets the value of k.
- New in version 2.4.0.
-
setMaxIter(value: int) → pyspark.ml.clustering.PowerIterationClustering
- Sets the value of maxIter.
- New in version 2.4.0.
-
setParams(self, *, k=2, maxIter=20, initMode="random", srcCol="src", dstCol="dst", weightCol=None)
- Sets params for PowerIterationClustering.
- New in version 2.4.0.
-
setSrcCol(value: str) → pyspark.ml.clustering.PowerIterationClustering
- Sets the value of srcCol.
- New in version 2.4.0.
-
setWeightCol(value: str) → pyspark.ml.clustering.PowerIterationClustering
- Sets the value of weightCol.
- New in version 2.4.0.
-
write() → pyspark.ml.util.JavaMLWriter
- Returns an MLWriter instance for this ML instance.
- Attributes Documentation
-
dstCol = Param(parent='undefined', name='dstCol', doc='Name of the input column for destination vertex IDs.')
-
initMode = Param(parent='undefined', name='initMode', doc="The initialization algorithm. This can be either 'random' to use a random vector as vertex properties, or 'degree' to use a normalized sum of similarities with other vertices. Supported options: 'random' and 'degree'.")
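The two initialization modes can be contrasted with a small pure-Python sketch. It is an illustration under stated assumptions, not Spark's code: `initial_vector` is a hypothetical helper, the affinity map holds made-up values, and the use of uniform random draws and L1 normalization is assumed here rather than taken from the implementation.

```python
import random


def initial_vector(affinity, mode="random", seed=0):
    """Build a start vector {vertex: value} scaled to unit L1 norm."""
    if mode == "degree":
        # Sum of similarities with other vertices, normalized below.
        raw = {i: sum(nbrs.values()) for i, nbrs in affinity.items()}
    elif mode == "random":
        rng = random.Random(seed)
        # Uniform draws are an assumption; the real distribution may differ.
        raw = {i: rng.random() for i in affinity}
    else:
        raise ValueError("initMode must be 'random' or 'degree'")
    norm = sum(abs(x) for x in raw.values())
    return {i: x / norm for i, x in raw.items()}


# Hypothetical symmetric affinity map: vertex -> {neighbor: similarity}.
affinity = {0: {1: 0.5, 2: 0.5}, 1: {0: 0.5, 2: 0.7}, 2: {0: 0.5, 1: 0.7}}
by_degree = initial_vector(affinity, mode="degree")
by_random = initial_vector(affinity, mode="random", seed=42)
```

The degree-based vector is deterministic for a given graph, whereas the random one depends on the seed, so repeated runs with 'random' can start from different points.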
 - 
k = Param(parent='undefined', name='k', doc='The number of clusters to create. Must be > 1.')
-
maxIter = Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0).')
-
params
- Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
-
srcCol = Param(parent='undefined', name='srcCol', doc='Name of the input column for source vertex IDs.')
-
weightCol = Param(parent='undefined', name='weightCol', doc='weight column name. If this is not set or empty, we treat all instance weights as 1.0.')
 
-