pyspark.RDD.saveAsTextFile

RDD.saveAsTextFile(path: str, compressionCodecClass: Optional[str] = None) → None

Save this RDD as a text file, using string representations of elements.

New in version 0.7.0.

Parameters

path : str
    path to text file

compressionCodecClass : str, optional
    fully qualified class name of the compression codec class,
    e.g. "org.apache.hadoop.io.compress.GzipCodec" (None by default)

Examples

>>> import os
>>> import tempfile
>>> from fileinput import input
>>> from glob import glob
>>> with tempfile.TemporaryDirectory() as d1:
...     path1 = os.path.join(d1, "text_file1")
...
...     # Write a temporary text file
...     sc.parallelize(range(10)).saveAsTextFile(path1)
...
...     # Load text file as an RDD
...     ''.join(sorted(input(glob(path1 + "/part-0000*"))))
'0\n1\n2\n3\n4\n5\n6\n7\n8\n9\n'

Empty lines are tolerated when saving to text files.

>>> with tempfile.TemporaryDirectory() as d2:
...     path2 = os.path.join(d2, "text2_file2")
...
...     # Write another temporary text file
...     sc.parallelize(['', 'foo', '', 'bar', '']).saveAsTextFile(path2)
...
...     # Load text file as an RDD
...     ''.join(sorted(input(glob(path2 + "/part-0000*"))))
'\n\n\nbar\nfoo\n'

Using compressionCodecClass

>>> from fileinput import input, hook_compressed
>>> with tempfile.TemporaryDirectory() as d3:
...     path3 = os.path.join(d3, "text3")
...     codec = "org.apache.hadoop.io.compress.GzipCodec"
...
...     # Write another temporary text file with specified codec
...     sc.parallelize(['foo', 'bar']).saveAsTextFile(path3, codec)
...
...     # Load text file as an RDD
...     result = sorted(input(glob(path3 + "/part*.gz"), openhook=hook_compressed))
...     ''.join([r.decode('utf-8') if isinstance(r, bytes) else r for r in result])
'bar\nfoo\n'
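
The saved output is an ordinary directory of text part-files, so it can be read back with SparkContext.textFile. A short round-trip sketch (directory and file names are again illustrative):

>>> with tempfile.TemporaryDirectory() as d4:
...     path4 = os.path.join(d4, "text_file4")
...
...     # Write a temporary text file
...     sc.parallelize(['foo', 'bar']).saveAsTextFile(path4)
...
...     # Read it back as an RDD of lines
...     sorted(sc.textFile(path4).collect())
['bar', 'foo']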