pyspark.RDD.sampleByKey

RDD.sampleByKey(withReplacement: bool, fractions: Dict[K, Union[float, int]], seed: Optional[int] = None) → pyspark.rdd.RDD[Tuple[K, V]][source]

Return a subset of this RDD sampled by key (via stratified sampling), using variable sampling rates for different keys as specified by fractions, a map from key to sampling rate.

New in version 0.7.0.

Parameters
withReplacement : bool

whether to sample with or without replacement

fractions : dict

map of specific keys to sampling rates

seed : int, optional

seed for the random number generator

Returns
RDD

an RDD containing the stratified sampling result

See also

RDD.sample()

Examples

>>> fractions = {"a": 0.2, "b": 0.1}
>>> rdd = sc.parallelize(fractions.keys()).cartesian(sc.parallelize(range(0, 1000)))
>>> sample = dict(rdd.sampleByKey(False, fractions, 2).groupByKey().collect())
>>> 100 < len(sample["a"]) < 300 and 50 < len(sample["b"]) < 150
True
>>> max(sample["a"]) <= 999 and min(sample["a"]) >= 0
True
>>> max(sample["b"]) <= 999 and min(sample["b"]) >= 0
True