Spark Release 3.3.0

Apache Spark 3.3.0 is the fourth release of the 3.x line. With tremendous contribution from the open-source community, this release resolved more than 1,600 JIRA tickets.

This release improves join query performance via Bloom filters, increases the Pandas API coverage with support for popular Pandas features such as datetime.timedelta and merge_asof, simplifies migration from traditional data warehouses by improving ANSI compliance and supporting dozens of new built-in functions, and boosts development productivity with better error handling, autocompletion, performance, and profiling.

To download Apache Spark 3.3.0, visit the downloads page. You can consult JIRA for the detailed changes. We have curated a list of high-level changes here, grouped by major modules.

Highlights

  • Row-level Runtime Filtering (SPARK-32268)
  • ANSI enhancements (SPARK-38860)
  • Error Message Improvements (SPARK-38781)
  • Support complex types for Parquet vectorized reader (SPARK-34863)
  • Hidden File Metadata Support for Spark SQL (SPARK-37273); see the example after this list
  • Provide a profiler for Python/Pandas UDFs (SPARK-37443)
  • Introduce Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches (SPARK-36533)
  • More comprehensive DS V2 push down capabilities (SPARK-38788)
  • Executor Rolling in Kubernetes environment (SPARK-37810)
  • Support Customized Kubernetes Schedulers (SPARK-36057)
  • Migrating from log4j 1 to log4j 2 (SPARK-37814)
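
For example, the hidden file metadata support surfaces per-file information as a _metadata struct column on file-based sources. A minimal PySpark sketch, assuming an active SparkSession named spark (as in the pyspark shell); the input path is illustrative:

    # Hidden file metadata (SPARK-37273): file-based sources expose a hidden
    # `_metadata` struct column with per-file details. The path is illustrative.
    df = spark.read.parquet("/data/events")
    df.select(
        "*",
        "_metadata.file_name",
        "_metadata.file_size",
        "_metadata.file_modification_time",
    ).show()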

Spark SQL and Core

ANSI mode

  • New explicit cast syntax rules in ANSI mode (SPARK-33354)
  • Elt() should return null if index is null under ANSI mode (SPARK-38304)
  • Optionally return null result if the element does not exist in array/map (SPARK-37750)
  • Allow casting between numeric type and timestamp type (SPARK-37714)
  • Disable ANSI reserved keywords by default (SPARK-37724)
  • Use store assignment rules for resolving function invocation (SPARK-37438)
  • Add a config to allow casting between Datetime and Numeric (SPARK-37179)
  • Add a config to optionally enforce ANSI reserved keywords (SPARK-37133)
  • Disallow binary operations between Interval and String literal (SPARK-36508)
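
As a quick illustration of the behavior these changes govern (a minimal sketch assuming an active SparkSession named spark; ANSI mode remains off by default in 3.3.0):

    # With ANSI mode on, an invalid cast fails loudly instead of silently
    # returning NULL; TRY_CAST keeps the permissive NULL-on-failure behavior.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    spark.sql("SELECT TRY_CAST('abc' AS INT) AS lenient").show()   # NULL
    spark.sql("SELECT CAST('abc' AS INT) AS strict").show()        # raises an error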

Feature Enhancements

Performance enhancements

  • Whole-stage code generation
    • Add code-gen for sort aggregate without grouping keys (SPARK-37564)
    • Add code-gen for full outer sort merge join (SPARK-35352)
    • Add code-gen for full outer shuffled hash join (SPARK-32567)
    • Add code-gen for existence sort merge join (SPARK-37316)
  • Push down (filters)
    • Push down filters through RebalancePartitions (SPARK-37828)
    • Push down boolean column filter (SPARK-36644)
    • Push down limit 1 for right side of left semi/anti join if join condition is empty (SPARK-37917)
    • Support propagate empty relation through aggregate/union (SPARK-35442)
    • Row-level Runtime Filtering (SPARK-32268); see the example after this list
    • Support Left Semi join in row level runtime filters (SPARK-38565)
    • Support predicate pushdown and column pruning for de-duped CTEs (SPARK-37670)
  • Vectorization
    • Implement a ConstantColumnVector and improve performance of the hidden file metadata (SPARK-37896)
    • Enable vectorized read for VectorizedPlainValuesReader.readBooleans (SPARK-35867)
  • Combine/remove/replace nodes
    • Combine unions if there is a project between them (SPARK-37915)
    • Combine to one cast if we can safely up-cast two casts (SPARK-37922)
    • Remove the Sort if it is the child of RepartitionByExpression (SPARK-36703)
    • Remove outer join if it only has DISTINCT on the streamed side with alias (SPARK-37292)
    • Replace hash with sort aggregate if child is already sorted (SPARK-37455)
    • Replace object hash with sort aggregate if child is already sorted (SPARK-37557)
    • Only collapse projects if we don’t duplicate expensive expressions (SPARK-36718)
    • Remove redundant aliases after RewritePredicateSubquery (SPARK-36280)
    • Merge non-correlated scalar subqueries (SPARK-34079)
  • Partitioning
    • Do not add dynamic partition pruning if there exists static partition pruning (SPARK-38148)
    • Improve RebalancePartitions in rules of Optimizer (SPARK-37904)
    • Add small partition factor for rebalance partitions (SPARK-37357)
  • Join
    • Fine tune logic to demote Broadcast hash join in DynamicJoinSelection (SPARK-37753)
    • Ignore duplicated join keys when building relation for SEMI/ANTI shuffled hash join (SPARK-36794)
    • Support optimizing skewed joins even if it introduces an extra shuffle (SPARK-33832)
  • AQE
    • Support eliminating limits in the AQE Optimizer (SPARK-36424)
    • Optimize one row plan in normal and AQE Optimizer (SPARK-38162)
  • Aggregate.groupOnly supports foldable expressions (SPARK-38489)
  • ByteArrayMethods arrayEquals should fast-skip the alignment check on platforms that support unaligned access (SPARK-37796)
  • Add tree pattern pruning to CTESubstitution rule (SPARK-37379)
  • Add more Not operator simplifications (SPARK-36665)
  • Support BooleanType in UnwrapCastInBinaryComparison (SPARK-36607)
  • Coalesce drops all expressions after the first non-nullable expression (SPARK-36359)
  • Add a logical plan visitor to propagate the distinct attributes (SPARK-36194)
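
Row-level runtime filtering is opt-in in 3.3.0. A minimal sketch of turning it on, assuming an active SparkSession named spark, the spark.sql.optimizer.runtime.bloomFilter.enabled key introduced by SPARK-32268, and illustrative table names:

    # Enable Bloom-filter-based row-level runtime filtering (disabled by default
    # in 3.3.0). Table names below are illustrative.
    spark.conf.set("spark.sql.optimizer.runtime.bloomFilter.enabled", "true")
    joined = spark.table("sales").join(spark.table("dim_product"), "product_id")
    joined.explain()   # the plan may now include a bloom filter on the large side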

Built-in Connector Enhancements

  • General
    • Lenient serialization of datetime from datasource (SPARK-38437)
    • Treat the table location as absolute when its path starts with a slash in CREATE/ALTER TABLE (SPARK-38236)
    • Remove leading zeros from empty static number type partition (SPARK-35561)
    • Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options (SPARK-38767); see the example after this list
  • Parquet
    • Enable matching schema column names by field ids (SPARK-38094)
    • Remove the field name check when reading/writing data in Parquet (SPARK-27442)
    • Support vectorized reads of boolean values that use RLE encoding with Parquet DataPage V2 (SPARK-37864)
    • Support Parquet V2 data page encoding (DELTA_BINARY_PACKED) for the vectorized path (SPARK-36879)
    • Rebase timestamps in the session time zone saved in Parquet/Avro metadata (SPARK-37705)
    • Push down group by partition column for aggregate (SPARK-36646)
    • Aggregate (Min/Max/Count) push down for Parquet (SPARK-36645)
    • Reduce default page size by LONG_ARRAY_OFFSET if G1GC and ON_HEAP are used (SPARK-37593)
    • Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support (SPARK-37974)
    • Support complex types for Parquet vectorized reader (SPARK-34863)
  • ORC
    • Remove the field name check when reading/writing existing data in ORC (SPARK-37965)
    • Aggregate push down for ORC (SPARK-34960)
    • Support reading and writing ANSI intervals from/to ORC data sources (SPARK-36931)
    • Support number-only column names in ORC data sources (SPARK-36663)
  • JSON
    • Respect allowNonNumericNumbers when parsing quoted NaN and Infinity values in JSON reader (SPARK-38060)
    • Use CAST for datetime in CSV/JSON by default (SPARK-36536)
    • Align error message for unsupported key types in MapType in Json reader (SPARK-35320)
    • Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) (SPARK-35912)
  • CSV
    • Fix referring to the corrupt record column from CSV (SPARK-38534)
    • null values should be saved as nothing instead of quoted empty Strings “” by default (SPARK-37575)
    • Speed up Timestamp type inference of the default format in the JSON/CSV data sources (SPARK-39193)
  • JDBC
    • Support cascade mode for JDBC V2 (SPARK-37929)
    • Add the IMMEDIATE statement to the DB2 dialect truncate implementation (SPARK-30062)
    • Support aggregate functions of the built-in JDBC dialects (SPARK-37867)
    • Move compileAggregates from JDBCRDD to JdbcDialect (SPARK-37286)
    • Implement dropIndex and listIndexes in JDBC (MySQL dialect) (SPARK-36914)
    • Support listing namespaces in the JDBC V2 MySQL dialect (SPARK-38054)
    • Add factory method getConnection into JDBCDialect (SPARK-38361)
    • JDBC dialect should decide which functions can be pushed down (SPARK-39162)
    • Propagate correct JDBC properties in JDBC connector provider and add “connectionProvider” option (SPARK-36163)
    • Refactor the framework so a JDBC dialect can compile filters in its own way (SPARK-38432)
    • Refactor the framework so a JDBC dialect can compile expressions by itself (SPARK-38196)
    • Implement createIndex and IndexExists in DS V2 JDBC (MySQL dialect) (SPARK-36913)
  • Hive
    • Support writing Hive bucketed table (Parquet/ORC format with Hive hash) (SPARK-32709)
    • Support writing Hive bucketed table (Hive file formats with Hive hash) (SPARK-32712)
    • Use expressions to filter Hive partitions at client side (SPARK-35437)
    • Support Dynamic Partition pruning for HiveTableScanExec (SPARK-36876)
    • InsertIntoHiveDir should use data source if it’s convertible (SPARK-38215)
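
For instance, SPARK-38767 lets ignoreCorruptFiles and ignoreMissingFiles be set per read as data source options rather than only through the session-wide SQL configs. A minimal sketch assuming an active SparkSession named spark; the path is illustrative:

    # Per-read tolerance for corrupt or missing files (SPARK-38767).
    df = (spark.read
          .option("ignoreCorruptFiles", "true")
          .option("ignoreMissingFiles", "true")
          .parquet("/data/landing/2022-06"))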

Data Source V2 API

  • New interfaces
    • Introduce a new DataSource V2 interface HasPartitionKey (SPARK-37376)
    • Add interface SupportsPushDownV2Filters (SPARK-36760)
    • Support DataSource V2 CreateTempViewUsing (SPARK-35803)
    • Add a class to represent general aggregate functions in DS V2 (SPARK-37789)
    • A new framework to represent catalyst expressions in DS V2 APIs (SPARK-37960)
    • Add APIs for group-based row-level operations (SPARK-38625)
  • Migrate commands
    • Migrate SHOW CREATE TABLE to use V2 command by default (SPARK-37878)
    • Migrate CREATE NAMESPACE to use V2 command by default (SPARK-37636)
    • Migrate DESCRIBE NAMESPACE to use V2 command by default (SPARK-37150)
  • Push down (SPARK-38788)
    • Add DS V2 filters (SPARK-36556)
    • Push down boolean column filter for Data Source V2 (SPARK-36644)
    • Support push down top N to JDBC data source V2 (SPARK-37483)
    • DS V2 Sample Push Down (SPARK-37038)
    • DS V2 LIMIT push down (SPARK-37020)
    • DS V2 supports partial aggregate push-down AVG (SPARK-37839)
    • Support datasource V2 complete aggregate pushdown (SPARK-37644)
    • Do not do partial aggregate push-down if Sum, Count, or Any is accompanied by DISTINCT (SPARK-38560)
    • Translate more standard aggregate functions for pushdown (SPARK-37527)
    • DS V2 aggregate push-down supports project with alias (SPARK-38533)
    • DS V2 topN push-down supports project with alias (SPARK-38644)
    • DS V2 Top N push-down supports order by expressions (SPARK-39037)
    • Datasource V2 supports partial topN push-down (SPARK-38391)
    • Support push down Cast to JDBC data source V2 (SPARK-38633)
    • Remove Limit from the plan if the limit is completely pushed down to the data source (SPARK-38768)
    • DS V2 supports push down misc non-aggregate functions (SPARK-38761)
    • DS V2 supports push down math functions (SPARK-38855)
    • DS V2 aggregate push-down supports group by expressions (SPARK-38997)
    • DS V2 aggregate partial push-down should support group by without aggregate functions (SPARK-39135)
  • Support nested columns in ORC vectorized reader for data source V2 (SPARK-36404)
  • Update task metrics from ds V2 custom metrics (SPARK-37578)
  • Unify V1 and V2 options output of SHOW CREATE TABLE command (SPARK-37494)
  • Add command SHOW CATALOGS (SPARK-35973)

Kubernetes Enhancements

  • Executor Rolling in Kubernetes environment (SPARK-37810)
  • Support Customized Kubernetes Schedulers (SPARK-36057); see the example after this list
  • executorIdleTimeout is not working for pending pods on K8s (SPARK-37049)
  • Upgrade kubernetes-client to 5.12.2 (SPARK-38817)
  • Make memory overhead factor configurable (SPARK-38194)
  • Add Volcano built-in integration and PodGroup template support for Spark on Kubernetes (experimental) (SPARK-36061, SPARK-38455)
  • Add KubernetesCustom[Driver/Executor]FeatureConfigStep developer API (SPARK-37145)
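
A minimal sketch of requesting a customized scheduler such as Volcano, assuming the spark.kubernetes.scheduler.name setting added for this feature; the master URL and container image are illustrative:

    # Run driver and executor pods under a custom Kubernetes scheduler.
    # The master URL, image, and scheduler name below are illustrative.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("k8s://https://kubernetes.default.svc")
             .config("spark.kubernetes.container.image", "my-registry/spark:3.3.0")
             .config("spark.kubernetes.scheduler.name", "volcano")
             .getOrCreate())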

Node Decommission

  • FallbackStorage shouldn’t attempt to resolve arbitrary “remote” hostname (SPARK-38062)
  • ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as finished (SPARK-38023)

Push-based shuffle

  • Adaptive shuffle merge finalization for push-based shuffle (SPARK-33701)
  • Adaptive fetch of shuffle mergers for Push based shuffle (SPARK-34826)
  • Skip diagnosis of merged blocks from push-based shuffle (SPARK-37695)
  • PushBlockStreamCallback should check isTooLate first to avoid NPE (SPARK-37847)
  • Fix push-based merge finalization bugs in the RemoteBlockPushResolver (SPARK-37675)
  • Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry (SPARK-37023)

Other Notable Changes

  • Add fine grained locking to BlockInfoManager (SPARK-37356)
  • Support mapping Spark gpu/fpga resource types to custom YARN resource type (SPARK-37208)
  • Report accurate shuffle block size if it is skewed (SPARK-36967)
  • Support Netty logging at the network layer (SPARK-36719)

Structured Streaming

Major feature

  • Introduce Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches (SPARK-36533)
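
For example, in PySpark the new trigger is requested with availableNow=True; the query processes everything available at start (possibly across several micro-batches, honoring source rate limits) and then stops. A minimal sketch assuming an active SparkSession named spark; paths and schema are illustrative:

    # Trigger.AvailableNow: drain all currently available input, then stop.
    query = (spark.readStream
             .schema("id LONG, payload STRING")
             .json("/data/incoming")
             .writeStream
             .format("parquet")
             .option("path", "/data/bronze")
             .option("checkpointLocation", "/data/checkpoints/bronze")
             .trigger(availableNow=True)
             .start())
    query.awaitTermination()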

Other Notable Changes

  • Use StatefulOpClusteredDistribution for stateful operators while respecting backward compatibility (SPARK-38204)
  • Fix flatMapGroupsWithState timeout in batch with data for key (SPARK-38320)
  • Fix correctness issue on stream-stream outer join with RocksDB state store provider (SPARK-38684)
  • Upgrade Kafka to 3.1.0 (SPARK-36837)
  • Support Trigger.AvailableNow on Kafka data source (SPARK-36649)
  • Optimize write path on RocksDB state store provider (SPARK-37224)
  • Introduce a new data source for providing consistent set of rows per microbatch (SPARK-37062)
  • Make foreachBatch streaming query stop gracefully (SPARK-39218)

PySpark

Pandas API on Spark

  • Major feature
    • Implement SparkSQL native ps.merge_asof (SPARK-36813); see the sketch after this list
    • Support TimedeltaIndex in pandas API on Spark (SPARK-37525)
    • Support Python’s timedelta (SPARK-37275, SPARK-37510)
    • Implement functions in CategoricalAccessor/CategoricalIndex (SPARK-36185)
    • Uses Python’s standard string formatter for SQL API in pandas API on Spark (SPARK-37436)
    • Support basic operations of timedelta Series/Index (SPARK-37510)
    • Support ps.MultiIndex.dtypes (SPARK-36930)
    • Implement Index.map (SPARK-36469)
    • Implement Series.xor and Series.rxor (SPARK-36653)
    • Implement unary operator invert of integral ps.Series/Index (SPARK-36003)
    • Implement DataFrame.cov (SPARK-36396)
    • Support str and timestamp for (Series|DataFrame).describe() (SPARK-37657)
    • Support lambda column parameter of DataFrame.rename (SPARK-38763)
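
A brief, illustrative sketch of two of these additions, ps.merge_asof and datetime.timedelta support; the data is made up:

    # pandas API on Spark: as-of join and timedelta values (Spark 3.3).
    import datetime
    import pyspark.pandas as ps

    quotes = ps.DataFrame({
        "time": [datetime.datetime(2022, 6, 1, 12, 0, s) for s in (1, 3, 5)],
        "bid": [100.0, 101.5, 101.0],
    })
    trades = ps.DataFrame({
        "time": [datetime.datetime(2022, 6, 1, 12, 0, s) for s in (2, 4)],
        "qty": [10, 20],
    })

    # Each trade picks up the most recent quote at or before its timestamp.
    merged = ps.merge_asof(trades, quotes, on="time")

    # timedelta values are now first-class in Series/Index.
    delta = ps.Series([datetime.timedelta(days=1), datetime.timedelta(hours=6)])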

Other Notable Changes

  • Breaking changes
    • Drop references to Python 3.6 support in docs and python/docs (SPARK-36977)
    • Remove namedtuple hack by replacing built-in pickle with cloudpickle (SPARK-32079)
    • Bump minimum pandas version to 1.0.5 (SPARK-37465)
  • Major improvements
    • Provide a profiler for Python/Pandas UDFs (SPARK-37443)
    • Uses Python’s standard string formatter for SQL API in PySpark (SPARK-37516)
    • Expose SQL state and error class in PySpark exceptions (SPARK-36953)
    • Try to capture faulthandler when a Python worker crashes (SPARK-36062)
  • Major feature
    • Implement DataFrame.mapInArrow in Python (SPARK-37228)
    • Add df.withMetadata pyspark API (SPARK-36642)
    • Support Python’s timedelta (SPARK-37275)
    • Expose tableExists in pyspark.sql.catalog (SPARK-36176)
    • Expose databaseExists in pyspark.sql.catalog (SPARK-36207)
    • Expose functionExists in pyspark.sql.catalog (SPARK-36258)
    • Add DataFrame.observation to PySpark (SPARK-36263); see the sketch after this list
    • Add max_by/min_by API to PySpark (SPARK-36972)
    • Support to infer nested dict as a struct when creating a DataFrame (SPARK-35929)
    • Add bit/octet_length APIs to Scala, Python and R (SPARK-36751)
    • Support ILIKE API on Python (SPARK-36882)
    • Add isEmpty method for the Python DataFrame API (SPARK-37207)
    • Add support for adding multiple columns at once (SPARK-35173)
    • Add SparkContext.addArchive in PySpark (SPARK-38278)
    • Make sql type reprs eval-able (SPARK-18621)
    • Inline type hints for fpm.py in python/pyspark/mllib (SPARK-37396)
    • Implement dropna parameter of SeriesGroupBy.value_counts (SPARK-38837)
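
As one example from this list, the new Observation API collects named metrics as a side effect of an action. A minimal sketch assuming an active SparkSession named spark:

    # DataFrame.observe with the Observation helper (SPARK-36263).
    from pyspark.sql import Observation
    from pyspark.sql import functions as F

    observation = Observation("stats")
    observed = spark.range(100).observe(
        observation,
        F.count(F.lit(1)).alias("rows"),
        F.max("id").alias("max_id"),
    )
    observed.collect()        # metrics are filled in once an action runs
    print(observation.get)    # e.g. {'rows': 100, 'max_id': 99}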

MLlib

  • Major feature
    • Add distanceMeasure param to trainKMeansModel (SPARK-37118)
    • Expose LogisticRegression.setInitialModel, like KMeans et al do (SPARK-36481)
    • Support getting the standard deviation of metrics for each paramMap from CrossValidatorModel (SPARK-36425)
  • Major improvements
    • Optimize some treeAggregates in MLlib by delaying allocations (SPARK-35848)
    • Rewrite _shared_params_code_gen.py to inline type hints for ml/param/shared.py (SPARK-37419)

SparkR

UI

  • Speculation metrics summary at stage level (SPARK-36038)
  • Unified shuffle read block time to shuffle read fetch wait time in StagePage (SPARK-37469)
  • Add modified configs for SQL execution in UI (SPARK-34735)
  • Make ThriftServer recognize spark.sql.redaction.string.regex (SPARK-36400)
  • Attach and start handler after application started in UI (SPARK-36237)
  • Add commit duration to SQL tab’s graph node (SPARK-34399)
  • Support RocksDB backend in Spark History Server (SPARK-37680)
  • Show options for Pandas API on Spark in UI (SPARK-38656)
  • Rename ‘SQL’ to ‘SQL / DataFrame’ in SQL UI page (SPARK-38657)

Build

Credits

Last but not least, this release would not have been possible without the following contributors: Abhishek Somani, Adam Binford, Alex Balikov, Alex Ott, Alfonso Buono, Allison Wang, Almog Tavor, Amin Borjian, Andrew Liu, Andrew Olson, Andy Grove, Angerszhuuuu, Anish Shrigondekar, Ankur Dave, Anton Okolnychyi, Aravind Patnam, Attila Zsolt Piros, BOOTMGR, BelodengKlaus, Bessenyei Balázs Donát, Bjørn Jørgensen, Bo Zhang, Brian Fallik, Brian Yue, Bruce Robbins, Byron, Cary Lee, Cedric-Magnan, Chandni Singh, Chao Sun, Cheng Pan, Cheng Su, Chia-Ping Tsai, Chilaka Ramakrishna, Daniel Dai, Daniel Davies, Daniel Tenedorio, Daniel-Davies, Danny Guinther, Darek, David Christle, Denis Tarima, Dereck Li, Devesh Agrawal, Dhiren Navani, Diego Luis, Dmitriy Fishman, Dmytro Melnychenko, Dominik Gehl, Dongjoon Hyun, Emil Ejbyfeldt, Enrico Minack, Erik Krogen, Eugene Koifman, Fabian A.J. Thiele, Franck Thang, Fu Chen, Geek, Gengliang Wang, Gidon Gershinsky, H. Vetinari, Haejoon Lee, Harutaka Kawamura, Herman van Hovell, Holden Karau, Huaxin Gao, Hyukjin Kwon, Igor Dvorzhak, IonutBoicuAms, Itay Bittan, Ivan Karol, Ivan Sadikov, Jackey Lee, Jerry Peng, Jiaan Geng, Jie, Johan Nystrom, Josh Rosen, Junfan Zhang, Jungtaek Lim, Kamel Gazzaz, Karen Feng, Karthik Subramanian, Kazuyuki Tanimura, Ke Jia, Keith Holliday, Keith Massey, Kent Yao, Kevin Sewell, Kevin Su, Kevin Wallimann, Koert Kuipers, Kousuke Saruta, Kun Wan, Lei Peng, Leona, Leona Yoda, Liang Zhang, Liang-Chi Hsieh, Linhong Liu, Lorenzo Martini, Luca Canali, Ludovic Henry, Lukas Rytz, Luran He, Maciej Szymkiewicz, Manu Zhang, Martin Tzvetanov Grigorov, Maryann Xue, Matthew Jones, Max Gekk, Menelaos Karavelas, Michael Chen, Michał Słapek, Mick Jermsurawong, Microsoft Learn Student, Min Shen, Minchu Yang, Ming Li, Mohamadreza Rostami, Mridul Muralidharan, Nicholas Chammas, Nicolas Azrak, Ole Sasse, Pablo Langa, Parth Chandra, PengLei, Peter Toth, Philipp Dallig, Prashant Singh, Qian.Sun, RabbidHY, Radek Busz, Rahul Mahadev, Richard Chen, Rob Reeves, Robert (Bobby) Evans, RoryQi, Rui Wang, Ruifeng Zheng, Russell Spitzer, Sachin Tripathi, Sajith Ariyarathna, Samuel Moseley, Samuel Souza, Sathiya KUMAR, SaurabhChawla, Sean Owen, Senthil Kumar, Serge Rielau, Shardul Mahadik, Shixiong Zhu, Shockang, Shruti Gumma, Simeon Simeonov, Steve Loughran, Steven Aerts, Takuya UESHIN, Ted Yu, Tengfei Huang, Terry Kim, Thejdeep Gudivada, Thomas Graves, Tim Armstrong, Tom van Bussel, Tomas Pereira de Vasconcelos, TongWeii, Utkarsh, Vasily Malakhin, Venkata Sai Akhil Gudesa, Venkata krishnan Sowrirajan, Venki Korukanti, Vitalii Li, Wang, Warren Zhu, Weichen Xu, Weiwei Yang, Wenchen Fan, William Hyun, Wu, Xiaochang, Xianjin YE, Xiduo You, Xingbo Jiang, Xinrong Meng, Xinyi Yu, XiuLi Wei, Yang He, Yang Liu, YangJie, Yannis Sismanis, Ye Zhou, Yesheng Ma, Yihong He, Yikf, Yikun Jiang, Yimin, Yingyi Bu, Yuanjian Li, Yufei Gu, Yuming Wang, Yun Tang, Yuto Akutsu, Zhen Li, Zhenhua Wang, Zimo Li, alexander_holmes, beobest2, bjornjorgensen, chenzhx, copperybean, daugraph, dch nguyen, dchvn, dchvn nguyen, dgd-contributor, dgd_contributor, dohongdayi, erenavsarogullari, fhygh, flynn, gaoyajun02, gengjiaan, herman, hi-zir, huangmaoyang2, huaxingao, hujiahua, jackierwzhang, jackylee-ch, jiaoqb, jinhai, khalidmammadov, kuwii, leesf, mans2singh, mcdull-zhang, michaelzhang-db, minyyy, nyingping, pralabhkumar, qitao liu, remykarem, sandeepvinayak, senthilkumarb, shane knapp, skhandrikagmail, sperlingxx, sudoliyang, sweisdb, sychen, tan.vu, tanel.kiis@gmail.com, tenglei, tianhanhu, tianlzhang, 
timothy65535, tooptoop4, vadim, w00507315, wangguangxin.cn, wangshengjie3, wayneguow, wooplevip, wuyi, xiepengjie, xuyu, yangjie01, yaohua, yi.wu, yikaifei, yoda-mon, zhangxudong1, zhoubin11, zhouyifan279, zhuqi-lucas, zwangsheng

