pyspark.sql.DataFrame.randomSplit

DataFrame.randomSplit(weights: List[float], seed: Optional[int] = None) → List[pyspark.sql.dataframe.DataFrame]

Randomly splits this DataFrame with the provided weights.

New in version 1.4.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
weights : list

List of doubles as weights with which to split the DataFrame. Weights will be normalized if they don’t sum up to 1.0.

seed : int, optional

The seed for sampling.

Returns
list

List of DataFrames.

Examples

>>> from pyspark.sql import Row
>>> df = spark.createDataFrame([
...     Row(age=10, height=80, name="Alice"),
...     Row(age=5, height=None, name="Bob"),
...     Row(age=None, height=None, name="Tom"),
...     Row(age=None, height=None, name=None),
... ])
>>> splits = df.randomSplit([1.0, 2.0], 24)
>>> splits[0].count()
2
>>> splits[1].count()
2