pyspark.RDD.countApprox

RDD.countApprox(timeout: int, confidence: float = 0.95) → int

Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.

New in version 1.2.0.

Parameters
timeout : int

maximum time to wait for the job, in milliseconds

confidence : float

the desired statistical confidence in the result, between 0 and 1 (default 0.95)

Returns
int

an estimate of the count, potentially incomplete if the timeout elapses before all tasks have finished

See also

RDD.count()

Examples

>>> rdd = sc.parallelize(range(1000), 10)
>>> rdd.countApprox(1000, 1.0)
1000
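
The timeout behaviour described above can be illustrated without a Spark cluster. The following is a minimal standard-library sketch, not PySpark's implementation: `count_with_timeout` and its `hold_last` parameter are hypothetical names introduced here, threads stand in for Spark tasks, and the sketch simply sums the partitions that finished in time. It does not model how Spark uses `confidence` or how it derives an estimate from partial results.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def count_with_timeout(partitions, timeout_ms, hold_last=None):
    """Count elements per partition in parallel; sum whatever finished in time.

    Hypothetical helper for illustration only (not part of PySpark).
    `hold_last` is an optional threading.Event that blocks the last
    partition's task to mimic a straggler.
    """
    def count_partition(index, part):
        if hold_last is not None and index == len(partitions) - 1:
            hold_last.wait()  # straggler: blocks until the caller releases it
        return sum(1 for _ in part)

    pool = ThreadPoolExecutor(max_workers=len(partitions))
    futures = [pool.submit(count_partition, i, p)
               for i, p in enumerate(partitions)]
    # Wait for all tasks, but give up once the timeout (in ms) has elapsed.
    done, _pending = wait(futures, timeout=timeout_ms / 1000.0)
    pool.shutdown(wait=False)  # do not block on tasks that are still running
    return sum(f.result() for f in done)


data = list(range(1000))
partitions = [data[i::10] for i in range(10)]  # 10 partitions of 100 elements

# All tasks are fast, so every partition is counted before the timeout.
exact = count_with_timeout(partitions, timeout_ms=1000)

# The last task is held back, so its partition is missing from the result.
straggler = threading.Event()
partial = count_with_timeout(partitions, timeout_ms=500, hold_last=straggler)
straggler.set()  # release the blocked worker so the program can exit
```

In this sketch `exact` covers all ten partitions, while `partial` omits the partition whose task was still running when the timeout expired. This mirrors why `countApprox` can return a value that differs from `count()` when the timeout is short relative to the job.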