Read and write modes in pySpark

By Niraj Zade | 2023/12/09 | Tags: theory reference file io


Read modes

Every loader has a mode option that decides what Spark does when it comes across a malformed record. These parse modes apply to sources that Spark parses at read time, such as CSV and JSON.

There are 3 read modes:

  1. PERMISSIVE - Default mode. When a corrupt record is read, Spark sets the row's column values to null and places the raw malformed record in a string column called _corrupt_record (you can change the name of this column by setting the spark.sql.columnNameOfCorruptRecord configuration; changing it is not recommended, to maintain conventions).
  2. DROPMALFORMED - Drops malformed records. Spark skips over them and loads only the healthy ones.
  3. FAILFAST - Raises an exception and terminates the read as soon as a malformed record is encountered.
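Here is a minimal sketch of the three modes side by side. The path, schema, and malformed line are hypothetical; note that for CSV and JSON sources, the corrupt-record column only shows up if you declare it in an explicit schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Assume /tmp/people.json contains one malformed line:
#   {"name": "Alice", "age": 29}
#   {"name": "Bob", "age": }
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("_corrupt_record", StringType()),  # captures bad rows in PERMISSIVE mode
])

# PERMISSIVE (default): the bad row becomes nulls, its raw text goes to _corrupt_record
permissive_df = spark.read.schema(schema).option("mode", "PERMISSIVE").json("/tmp/people.json")

# DROPMALFORMED: the bad row is silently skipped
dropped_df = spark.read.schema(schema).option("mode", "DROPMALFORMED").json("/tmp/people.json")

# FAILFAST: an action such as .show() raises an exception on the bad row
failfast_df = spark.read.schema(schema).option("mode", "FAILFAST").json("/tmp/people.json")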

The default mode is PERMISSIVE.

While using PERMISSIVE mode, you can override the name of the corrupt-record column using the columnNameOfCorruptRecord option (changing the default column name is not recommended, to maintain a common convention):

...
.option("mode", "PERMISSIVE")
.option("columnNameOfCorruptRecord","corrupt_record")
...
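A runnable version of the snippet above, continuing the hypothetical people.json sketch. For CSV and JSON, the renamed column must also be declared in the schema you pass in:

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("corrupt_record", StringType()),  # must match columnNameOfCorruptRecord
])

df = (
    spark.read
    .schema(schema)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "corrupt_record")
    .json("/tmp/people.json")
)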

How to set read mode

There are 2 ways to set read mode

  1. Using the option() method to set read mode. Example:
sales_df = (
    spark.read
    .format("csv")
    .option("mode", "FAILFAST")
    .load("data/sales/2022/01/*.csv")
)
  2. Passing mode as a keyword argument to a format-specific reader like csv() or json(). (DataFrameReader has no mode() method - that only exists on DataFrameWriter.) Example:
sales_df = spark.read.csv(
    "data/sales/2022/01/*.csv",
    mode="FAILFAST",
)
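Keep in mind that Spark reads lazily: even in FAILFAST mode, the exception typically surfaces when an action scans the data, not when the read call returns. A small sketch:

sales_df = spark.read.csv("data/sales/2022/01/*.csv", mode="FAILFAST")
sales_df.show()  # a malformed row raises the exception here, during the scan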

Write modes

Every writer has a save mode that decides what Spark does when data already exists at the destination.

There are 4 write modes:

  1. append - Adds new files without touching existing data (appends to existing data).
  2. overwrite - Removes existing data files, then writes new ones.
  3. error or errorIfExists - Stops execution and throws an exception if data files already exist.
  4. ignore - Writes only if the destination is empty. If data files already exist, the write is silently skipped (it does nothing).

The default write mode is errorIfExists.
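A minimal sketch of the four modes in sequence, assuming sales_df exists and /tmp/sales_out is an initially empty scratch path:

# append: adds new files alongside any existing data
sales_df.write.mode("append").parquet("/tmp/sales_out")

# overwrite: removes the existing files, then writes fresh ones
sales_df.write.mode("overwrite").parquet("/tmp/sales_out")

# ignore: the path now holds data, so this call silently does nothing
sales_df.write.mode("ignore").parquet("/tmp/sales_out")

# error / errorIfExists (default): raises an AnalysisException, since data exists
sales_df.write.mode("error").parquet("/tmp/sales_out")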

How to set write mode

There are 2 ways to set write mode. Note that, unlike reads, the save mode is not set through the option() method - as far as the built-in file sources are concerned, .option("mode", "append") on a writer is silently ignored.

  1. Using the mode() method to set write mode. Example:
(
    df.write
    .format("parquet")
    .mode("append")
    .option("path", "/data/myparquetfiles/")
    .save()
)
  2. Passing mode as a keyword argument to save() (or to a format shortcut like parquet()). Example:
(
    df.write
    .format("parquet")
    .save("/data/myparquetfiles/", mode="append")
)
