Remove duplicate rows from a DataFrame
Remove duplicate rows using distinct() and dropDuplicates()
2024-09-18   1 min   guide cleaning

Cast string column to Date or DateTime using to_date() or to_timestamp()
Convert a string column to Date or DateTime using to_date() and to_timestamp()
2024-01-28   3 min   guide

GROUP BY in pySpark
SQL GROUP BY clause in the form of pySpark functions
2024-01-22   4 min   guide sql

CASE in pySpark
SQL CASE clause in the form of pySpark functions
2024-01-17   5 min   guide sql

Filter rows in pySpark
SQL WHERE clause in the form of pySpark functions
2024-01-17   3 min   guide sql

Window functions with pySpark
How to use window functions in pySpark
2023-12-19   13 min   guide sql

JSON - read, set schema and write with pySpark
Load data, set schema, save data using the DataFrameReader & DataFrameWriter APIs. Using various file formats.
2023-12-18   12 min   guide file-io

Delta - read, set schema and write with pySpark
Reading and writing Delta tables
2023-12-11   7 min   guide file-io

Single-node Spark notebook setup using Docker
Docker-based single-node pySpark notebook setup
2023-11-18   1 min   guide local-pyspark dev-env

Cache and persist - why and how
Improve execution speed by caching/persisting intermediate dataframes
2023-11-18   14 min   guide performance

Style guide - Chaining functions in Python
Write better multiline pySpark code
2023-11-16   2 min   guide style-guide

CSV - read, set schema and write with pySpark
Load data, set schema, save data using the DataFrameReader & DataFrameWriter APIs. Using various file formats.
2023-11-16   17 min   guide file-io