pyspark.SparkContext.parallelize

SparkContext.parallelize(c: Iterable[T], numSlices: Optional[int] = None) → pyspark.rdd.RDD[T]

Distribute a local Python collection to form an RDD. If the input represents a range, passing a range object is recommended for performance.

New in version 0.7.0.

Parameters
c : collections.abc.Iterable

iterable collection to distribute

numSlices : int, optional

the number of partitions of the new RDD

Returns
RDD

RDD representing the distributed collection.

Examples

>>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()
[[0], [2], [3], [4], [6]]
>>> sc.parallelize(range(0, 6, 2), 5).glom().collect()
[[], [0], [], [2], [4]]
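
If numSlices is omitted, the number of partitions defaults to sc.defaultParallelism (a minimal sketch; the exact count depends on the cluster configuration):

>>> rdd = sc.parallelize(range(8))
>>> rdd.getNumPartitions() == sc.defaultParallelism
True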

Deal with a list of strings.

>>> strings = ["a", "b", "c"]
>>> sc.parallelize(strings, 2).glom().collect()
[['a'], ['b', 'c']]
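
The resulting RDD supports the usual RDD operations. As a brief sketch, the partition count and a simple aggregation can be checked directly (getNumPartitions and sum are standard RDD methods, used here only for illustration):

>>> rdd = sc.parallelize(range(100), 4)
>>> rdd.getNumPartitions()
4
>>> rdd.sum()
4950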