Read and write modes in pySpark

By Niraj Zade | 2023-12-09 | Tags: theory reference file-io


Read modes

Every loader has modes to decide what to do when spark comes across a malformed row/file.

There are 3 read modes:

  1. PERMISSIVE - Default mode. When a corrupt record is read, it sets all column values to null, and places the malformed row's values in a string col called _corrupt_record (you can change the name of this column by setting the spark.sql.columnNameOfCorruptRecord configuration. I don't recommend changing it - to maintain conventions.)
  2. DROPMALFORMED - Removes malformed record. It basically skips over malformed records and just loads the healthy ones
  3. FAILFAST - raises exception, terminates immediately

The default mode is is PERMISSIVE

While using PERMISSIVE mode, you can override the name of the corrupt record column using the option columnNameOfCorruptRecord (changing the default column name is not recommended, to maintain common programming standard).

...
.option("mode", "PERMISSIVE")
.option("columnNameOfCorruptRecord","corrupt_record")
...

How to set read mode

There are 2 ways to set read mode

  1. Using the option() method to set read mode. Example:
sales_df = (
    spark.read
    .format("parquet") 
    .option("mode", "error")
    .load("data/sales/2022/01/*.parquet")
)
  1. Using the mode() method to set read mode. Example:
sales_df = (
    spark.read
    .format("parquet")
    .mode("error")
    .load("data/sales/2022/01/*.parquet")
)

Write modes

Every writer has modes to decide what to do when data already exists in the destination.

There are 4 write modes:

  1. append - Create new files without touching existing data (append to existing data)
  2. overwrite - If data files already exist, then remove existing data files and write new ones
  3. error or errorIfExists - If data files already exist, then stop execution and throw exception
  4. ignore - Write data if and only if target directory is empty. If data files already exist, silently ignore the write operation (it will do nothing)

The default write mode is errorIfExists

How to set write mode

There are 2 ways to set write mode

  1. Using the option() method to set wrote mode. Example:
(
    df.write
    .format("parquet")
    .option("mode", "append")
    .option("path", "/data/myparquetfiles/")
    .save()
)
  1. Using the mode() method to set write mode. Example:
(
    df.write
    .format("parquet")
    .mode("append")
    .option("path", "/data/myparquetfiles/")
    .save()
)