pyspark.testing.assertPandasOnSparkEqual

pyspark.testing.assertPandasOnSparkEqual(actual: Union[pyspark.pandas.frame.DataFrame, pyspark.pandas.series.Series, pyspark.pandas.indexes.base.Index], expected: Union[pyspark.pandas.frame.DataFrame, pandas.core.frame.DataFrame, pyspark.pandas.series.Series, pandas.core.series.Series, pyspark.pandas.indexes.base.Index, pandas.core.indexes.base.Index], checkExact: bool = True, almost: bool = False, rtol: float = 1e-05, atol: float = 1e-08, checkRowOrder: bool = True)

A utility function that asserts equality between actual (a pandas-on-Spark object) and expected (a pandas-on-Spark or pandas object).

New in version 3.5.0.

Deprecated since version 3.5.1: assertPandasOnSparkEqual will be removed in Spark 4.0.0.

Parameters
actual: pandas-on-Spark DataFrame, Series, or Index

The object that is being compared or tested.

expected: pandas-on-Spark or pandas DataFrame, Series, or Index

The expected object, for comparison with the actual result.

checkExact: bool, optional

A flag indicating whether to compare for exact equality. If set to True (default), the data is compared exactly. If set to False, the data is compared approximately, following the approximate comparison of pandas assert_frame_equal (see the pandas documentation for more details).

almost: bool, optional

A flag indicating whether to use unittest assertAlmostEqual or assertEqual. If set to True, the comparison is delegated to unittest's assertAlmostEqual (see the unittest documentation for more details). If set to False (default), the data is compared exactly with unittest's assertEqual.

rtol: float, optional

The relative tolerance, used when asserting approximate equality of float values in actual and expected. Set to 1e-5 by default. (See Notes)

atol: float, optional

The absolute tolerance, used when asserting approximate equality of float values in actual and expected. Set to 1e-8 by default. (See Notes)

checkRowOrder: bool, optional

A flag indicating whether the order of rows should be considered in the comparison. If set to False, the row order is not taken into account. If set to True (default), the order of rows will be checked during comparison. (See Notes)

Notes

For checkRowOrder, note that pandas-on-Spark DataFrame ordering is non-deterministic, unless explicitly sorted.
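
The effect of ignoring row order can be sketched in plain Python, without Spark. This is an illustration only, not the library's implementation: the helper rows_equal, its check_row_order parameter, and the tuple-based rows are hypothetical stand-ins, and sorting both sides before comparing is an assumed way to make the check order-insensitive.

```python
from typing import List, Tuple

Row = Tuple[int, str]


def rows_equal(actual: List[Row], expected: List[Row], check_row_order: bool = True) -> bool:
    """Hypothetical helper: compare two lists of rows, optionally ignoring order."""
    if not check_row_order:
        # Sort both sides so that only the row contents matter, not their position.
        actual, expected = sorted(actual), sorted(expected)
    return actual == expected


actual = [(2, "b"), (1, "a"), (3, "c")]
expected = [(1, "a"), (2, "b"), (3, "c")]

print(rows_equal(actual, expected))                         # False: same rows, different order
print(rows_equal(actual, expected, check_row_order=False))  # True: order is ignored
```

When the data under test has not been explicitly sorted, an order-insensitive check of this kind avoids failures caused only by row position.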

When almost is set to True, approximate equality will be asserted, where two values a and b are approximately equal if they satisfy the following formula:

absolute(a - b) <= (atol + rtol * absolute(b)).
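
As a minimal sketch, the formula above can be evaluated directly in plain Python. The function name is_almost_equal is hypothetical and is not part of pyspark.testing; the defaults mirror the documented rtol and atol values.

```python
def is_almost_equal(a: float, b: float, rtol: float = 1e-5, atol: float = 1e-8) -> bool:
    """Hypothetical helper: apply absolute(a - b) <= (atol + rtol * absolute(b))."""
    return abs(a - b) <= (atol + rtol * abs(b))


# A difference of about 1e-6 is within the allowed 1e-8 + 1e-5 * 212.3 (about 2.1e-3).
print(is_almost_equal(212.300001, 212.3))  # True

# A difference of about 1e-2 exceeds the allowed 1e-8 + 1e-5 * 100.01 (about 1.0e-3).
print(is_almost_equal(100.0, 100.01))      # False

# Loosening rtol widens the allowed band to about 1.0e-1.
print(is_almost_equal(100.0, 100.01, rtol=1e-3))  # True
```

Because the relative term is scaled by absolute(b), the permitted difference grows with the magnitude of b, while atol sets the floor that applies when b is close to zero.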

Examples

>>> import pyspark.pandas as ps
>>> from pyspark.testing import assertPandasOnSparkEqual
>>> psdf1 = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> psdf2 = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> assertPandasOnSparkEqual(psdf1, psdf2)  # pass, ps.DataFrames are equal
>>> s1 = ps.Series([212.32, 100.0001])
>>> s2 = ps.Series([212.32, 100.0])
>>> assertPandasOnSparkEqual(s1, s2, checkExact=False)  # pass, ps.Series are approx equal
>>> idx1 = ps.Index([212.300001, 100.000])
>>> idx2 = ps.Index([212.3, 100.0001])
>>> assertPandasOnSparkEqual(idx1, idx2, almost=True)  # pass, ps.Index objects are almost equal