pyspark.pandas.DataFrame.to_json

DataFrame.to_json(path: Optional[str] = None, compression: str = 'uncompressed', num_files: Optional[int] = None, mode: str = 'w', orient: str = 'records', lines: bool = True, partition_cols: Union[str, List[str], None] = None, index_col: Union[str, List[str], None] = None, **options: Any) → Optional[str]

Convert the object to a JSON string.

Note

pandas-on-Spark to_json writes files to a path or URI. Unlike pandas, pandas-on-Spark respects HDFS properties such as ‘fs.default.name’.

Note

When path is specified, pandas-on-Spark treats path as a directory and writes multiple part-… files into it. This behavior is inherited from Apache Spark. The number of partitions can be controlled by num_files, but that parameter is deprecated; use DataFrame.spark.repartition instead.
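Because the output is a directory rather than a single file, code that consumes it without Spark has to read every part file. The following is a minimal standard-library sketch of that idea, not part of pandas-on-Spark. The helper name read_part_files, the concrete part file names, and the presence of a _SUCCESS marker file are assumptions made for illustration, and each part file is assumed to hold one JSON object per line.

```python
import json
import tempfile
from pathlib import Path


def read_part_files(directory):
    """Collect records from every part-* file in a Spark-style output directory."""
    records = []
    # sorted() keeps the read order deterministic across runs
    for part in sorted(Path(directory).glob("part-*")):
        with part.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records


# Simulate an output directory; the file names below are hypothetical.
out_dir = Path(tempfile.mkdtemp()) / "foo.json"
out_dir.mkdir()
(out_dir / "part-00000.json").write_text('{"col 1":"a","col 2":"b"}\n', encoding="utf-8")
(out_dir / "part-00001.json").write_text('{"col 1":"c","col 2":"d"}\n', encoding="utf-8")
(out_dir / "_SUCCESS").write_text("", encoding="utf-8")  # assumed marker file, not matched by the glob

rows = read_part_files(out_dir)
```

In a Spark environment, ps.read_json on the same directory (as in the Examples section) is the more direct route; the sketch only shows what the directory layout implies for other readers.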

Note

The output JSON format differs from that of pandas: it always uses orient=’records’. This behavior may change in a future release.
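To make the ‘records’ orientation concrete, here is a small sketch using only Python's json module. It does not call pandas-on-Spark; it reproduces the shape of the string shown in the Examples section and contrasts it with line-delimited JSON, where each record sits on its own line. The variable names are illustrative.

```python
import json

records = [{"col 1": "a", "col 2": "b"}, {"col 1": "c", "col 2": "d"}]

# 'records' orientation as one JSON array: a list of {column: value} objects
as_array = json.dumps(records, separators=(",", ":"))

# Line-delimited JSON: one object per line, with no enclosing array
as_lines = "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

# Line-delimited output is parsed back one line at a time
parsed_back = [json.loads(line) for line in as_lines.splitlines()]
```

Both forms carry the same records; only the framing differs, which is why the line-delimited form can be split across several part files.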

Note

Set the ignoreNullFields keyword argument to True to omit None or NaN values when writing JSON objects. This works only when path is provided.

Note that NaN and None values are converted to null, and datetime objects are converted to UNIX timestamps.
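The difference between writing null and omitting a field can be sketched with the standard library alone. This is an illustration of the two outcomes described in the notes, not the pandas-on-Spark implementation; normalize, record_to_json, and the ignore_null_fields argument are hypothetical names chosen to mirror the ignoreNullFields option.

```python
import json
import math


def normalize(value):
    """Map NaN to None so that json.dumps emits null for both."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def record_to_json(record, ignore_null_fields=False):
    """Serialize one record, optionally dropping fields whose value is null."""
    cleaned = {key: normalize(val) for key, val in record.items()}
    if ignore_null_fields:
        cleaned = {key: val for key, val in cleaned.items() if val is not None}
    return json.dumps(cleaned, separators=(",", ":"))


row = {"col 1": "a", "col 2": float("nan"), "col 3": None}
kept = record_to_json(row)                              # nulls written explicitly
dropped = record_to_json(row, ignore_null_fields=True)  # null fields omitted
```

Omitting null fields makes the output smaller, but a reader can then no longer tell a missing column from a null value without knowing the schema.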

Parameters
path: string, optional

File path. If not specified, the result is returned as a string.

lines: bool, default True

If orient is ‘records’, write out line-delimited JSON. Raises ValueError for any other orient, since the other formats are not list-like. For now, this should always be True.

orient: str, default ‘records’

For now, this should always be ‘records’.

compression: {‘gzip’, ‘bz2’, ‘xz’, None}

A string representing the compression to use in the output files. It is only used when path is specified. The default in the signature above is ‘uncompressed’.

num_files: int, optional

The number of partitions to be written in the `path` directory when path is specified. This is deprecated. Use DataFrame.spark.repartition instead.

mode: str, default ‘w’

Python write mode.

Note

mode also accepts the Spark writing mode strings: ‘append’, ‘overwrite’, ‘ignore’, ‘error’, and ‘errorifexists’.

  • ‘append’ (equivalent to ‘a’): Append the new data to existing data.

  • ‘overwrite’ (equivalent to ‘w’): Overwrite existing data.

  • ‘ignore’: Silently ignore this operation if data already exists.

  • ‘error’ or ‘errorifexists’: Throw an exception if data already exists.
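The equivalences listed above can be sketched as a small lookup. This is a hypothetical helper written for illustration, not a function exposed by pyspark; it covers only the two Python-style modes named in the list (‘a’ and ‘w’) and passes the Spark mode strings through unchanged.

```python
# Hypothetical helper, not part of pyspark: maps the two Python-style
# modes named in the list above onto Spark writing modes.
SPARK_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}
PYTHON_TO_SPARK = {"a": "append", "w": "overwrite"}


def normalize_mode(mode):
    """Return the Spark writing mode for a Python-style or Spark-style mode string."""
    spark_mode = PYTHON_TO_SPARK.get(mode, mode)
    if spark_mode not in SPARK_MODES:
        raise ValueError(f"Unsupported mode: {mode!r}")
    return spark_mode


default_mode = normalize_mode("w")  # the default 'w' means overwrite
append_mode = normalize_mode("a")   # 'a' means append
```

Since the default mode is ‘w’, calling to_json twice with the same path replaces the earlier output unless another mode is passed explicitly.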

partition_cols: str or list of str, optional, default None

Names of the partitioning columns.

index_col: str or list of str, optional, default None

Column names to be used in Spark to represent pandas-on-Spark’s index. The index name in pandas-on-Spark is ignored. By default, the index is always lost.

options: keyword arguments

Additional options specific to PySpark’s JSON writer. See the options for spark.write.json(…) in PySpark’s API documentation. These options take priority over and overwrite all other options. This parameter only works when path is specified.

Returns
str or None

Examples

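>>> import tempfile
>>> import pyspark.pandas as ps
>>> path = tempfile.mkdtemp()  # scratch directory assumed for the examples below
>>> df = ps.DataFrame([['a', 'b'], ['c', 'd']],
...                   columns=['col 1', 'col 2'])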
>>> df.to_json()
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
>>> df['col 1'].to_json()
'[{"col 1":"a"},{"col 1":"c"}]'
>>> df.to_json(path=r'%s/to_json/foo.json' % path, num_files=1)
>>> ps.read_json(
...     path=r'%s/to_json/foo.json' % path
... ).sort_values(by="col 1")
  col 1 col 2
0     a     b
1     c     d
>>> df['col 1'].to_json(path=r'%s/to_json/foo.json' % path, num_files=1, index_col="index")
>>> ps.read_json(
...     path=r'%s/to_json/foo.json' % path, index_col="index"
... ).sort_values(by="col 1")  
      col 1
index
0         a
1         c