pyspark.RDD.takeSample

RDD.takeSample(withReplacement: bool, num: int, seed: Optional[int] = None) → List[T]

Return a fixed-size sampled subset of this RDD.

New in version 1.3.0.

Parameters
withReplacement : bool

whether to sample with replacement (if True, the same element can be selected more than once)

num : int

size of the returned sample

seed : int, optional

seed for the random number generator

Returns
list

a list containing the sampled elements of this RDD; when sampling without replacement, the list holds at most as many elements as the RDD contains

See also

RDD.sample()

Notes

Use this method only when the resulting list is expected to be small, because all of the sampled data is loaded into the driver’s memory.

Examples

>>> import sys
>>> rdd = sc.parallelize(range(0, 10))
>>> len(rdd.takeSample(True, 20, 1))
20
>>> len(rdd.takeSample(False, 5, 2))
5
>>> len(rdd.takeSample(False, 15, 3))
10
>>> sc.range(0, 10).takeSample(False, sys.maxsize)
Traceback (most recent call last):
    ...
ValueError: Sample size cannot be greater than ...
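
The result sizes in the examples above follow from the meaning of withReplacement. The sketch below mirrors that behavior on a plain Python list using only the standard library. It is a local analogy, not Spark's sampling algorithm: the helper name take_sample_local and its error message are hypothetical, and it makes no attempt to reproduce the elements Spark would pick for a given seed.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def take_sample_local(
    data: Sequence[T],
    with_replacement: bool,
    num: int,
    seed: Optional[int] = None,
) -> List[T]:
    """Hypothetical local analogue of takeSample (not Spark's implementation)."""
    if num < 0:
        raise ValueError("num must be non-negative")
    if num == 0 or len(data) == 0:
        return []
    rng = random.Random(seed)  # a private generator keeps the seed's effect local
    if with_replacement:
        # Draws are independent, so the sample may be larger than the data.
        return rng.choices(data, k=num)
    # Without replacement no element can repeat, so the size is capped.
    return rng.sample(data, min(num, len(data)))


data = list(range(10))
print(len(take_sample_local(data, True, 20, 1)))   # 20
print(len(take_sample_local(data, False, 5, 2)))   # 5
print(len(take_sample_local(data, False, 15, 3)))  # 10
```

In this sketch, asking for 15 elements without replacement returns all 10, which matches the third example above. Passing the same seed twice to the helper yields the same list.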