pyspark.RDD.meanApprox

RDD.meanApprox(timeout: int, confidence: float = 0.95) → pyspark.rdd.BoundedFloat

Approximate operation that returns the mean once the timeout expires or the requested confidence is met.

New in version 1.2.0.

Parameters
timeout : int

maximum time to wait for the job, in milliseconds

confidence : float, optional

the desired statistical confidence in the result (default 0.95)

Returns
BoundedFloat

a potentially incomplete result, with error bounds

See also

RDD.mean()

Examples

>>> rdd = sc.parallelize(range(1000), 10)
>>> r = sum(range(1000)) / 1000.0
>>> abs(rdd.meanApprox(1000) - r) / r < 0.05
True
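
The example above subtracts a plain float from the returned value, which works because BoundedFloat is a float subclass that also carries its bounds. The following is a minimal sketch of that behaviour using a stand-in class; `BoundedFloatSketch` and the numbers passed to it are hypothetical and are not taken from a real Spark job. The attribute names `confidence`, `low`, and `high` are assumed to mirror those of `pyspark.rdd.BoundedFloat`.

```python
class BoundedFloatSketch(float):
    """Illustrative stand-in: a float that also carries error bounds."""

    def __new__(cls, mean, confidence, low, high):
        # The float value itself is the estimated mean.
        obj = float.__new__(cls, mean)
        obj.confidence = confidence
        obj.low = low
        obj.high = high
        return obj


# Hypothetical values chosen for illustration only.
approx = BoundedFloatSketch(499.5, 0.95, 481.6, 517.4)
exact = sum(range(1000)) / 1000.0

# Ordinary float arithmetic works on the result.
relative_error = abs(approx - exact) / exact

# The bounds can be checked against a known reference value.
within_bounds = approx.low <= exact <= approx.high
```

On a real RDD, the same attributes would presumably be read from the value returned by rdd.meanApprox(1000), without constructing the object by hand.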