pyspark.RDD.takeSample
RDD.takeSample(withReplacement: bool, num: int, seed: Optional[int] = None) → List[T]
- Return a fixed-size sampled subset of this RDD.
- New in version 1.3.0.
- Parameters
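- withReplacement : bool
- whether sampling is done with replacement; if True, the same element can appear more than once in the sample
- num : int
- size of the returned sample; when sampling without replacement, at most as many elements as the RDD contains are returned
- seed : int, optional
- random seed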
 
- Returns
- list
- a fixed-size sampled subset of this RDD in an array
 
- Notes
- This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.
- Examples
>>> import sys
>>> rdd = sc.parallelize(range(0, 10))
>>> len(rdd.takeSample(True, 20, 1))
20
>>> len(rdd.takeSample(False, 5, 2))
5
>>> len(rdd.takeSample(False, 15, 3))
10
>>> sc.range(0, 10).takeSample(False, sys.maxsize)
Traceback (most recent call last):
    ...
ValueError: Sample size cannot be greater than ...
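
The sample sizes reported in the examples (20, 5, and 10 from a 10-element RDD) follow from how the two sampling modes treat `num`. The block below is a minimal local sketch of that behaviour using only Python's standard `random` module. It is not Spark's implementation and does not touch a cluster; `take_sample_local` and its parameter names are hypothetical, introduced here for illustration only.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def take_sample_local(data: Sequence[T], with_replacement: bool, num: int,
                      seed: Optional[int] = None) -> List[T]:
    """Hypothetical in-memory analogue of the sampling semantics (not Spark code)."""
    if num < 0:
        raise ValueError("num must be non-negative")
    if num == 0 or len(data) == 0:
        return []
    rng = random.Random(seed)  # a fixed seed makes this local draw repeatable
    if with_replacement:
        # Elements may repeat, so the sample can be larger than the population.
        return [rng.choice(data) for _ in range(num)]
    # Without replacement, the sample is capped at the population size.
    return rng.sample(list(data), min(num, len(data)))


data = range(0, 10)
print(len(take_sample_local(data, True, 20, 1)))   # 20
print(len(take_sample_local(data, False, 5, 2)))   # 5
print(len(take_sample_local(data, False, 15, 3)))  # 10
```

Only the lengths are shown because the sampled values depend on the random number generator. The values this sketch draws will not match what Spark returns for the same seed.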