Remove duplicate rows from a dataframe
Remove duplicate rows using distinct() and dropDuplicates()
2024-09-18   1 min   guide 12 cleaning 1 workflow 2

Cast string column to Date or DateTime using to_date() or to_timestamp()
Convert a string column to Date or DateTime using to_date() and to_timestamp()
2024-01-28   3 min   guide 12

GROUP BY in pySpark
SQL GROUP BY clause in the form of pySpark functions
2024-01-22   4 min   guide 12 sql 5

CASE in pySpark
SQL CASE clause in the form of pySpark functions
2024-01-17   5 min   guide 12 sql 5

Filter rows in pySpark
SQL WHERE clause in the form of pySpark functions
2024-01-17   3 min   guide 12 sql 5

Window functions with pySpark
How to use window functions in pySpark
2023-12-19   13 min   guide 12 sql 5

JSON - read, set schema and write with pySpark
Load data, set a schema, and save data in various file formats using the DataFrameReader & DataFrameWriter APIs.
2023-12-18   12 min   guide 12 file-io 4

Delta - read, set schema and write with pySpark
Reading and writing Delta tables
2023-12-11   7 min   guide 12 file-io 4

Run Spark locally with Jupyter notebook
Docker-based single-node pySpark notebook setup
2023-11-18   1 min   guide 12 local-pyspark 1 dev-env 1

Cache and persist - why and how
Improve execution speed by caching/persisting intermediate dataframes
2023-11-18   16 min   guide 12 performance 3

Style guide - Chaining functions in Python
Write better multiline pySpark code
2023-11-16   2 min   guide 12 style-guide 1

CSV - read, set schema and write with pySpark
Load data, set a schema, and save data in various file formats using the DataFrameReader & DataFrameWriter APIs.
2023-11-16   17 min   guide 12 file-io 4