Remove duplicate rows from dataframe

By Niraj Zade | 2024-09-18 | Tags: guide cleaning workflow


Key takeaway

Remove duplicate rows using distinct() and dropDuplicates()

  • distinct() determines duplicates by comparing rows across at all columns
  • dropDuplicates() decides duplicates by comparing rows across at all columns, or across few columns (as specified)

Create Sample data

data = [
    ("Alice", "Ace", "Austin", 10),
    ("Alice", "Ace", "Austin", 10),
    ("Alice", "Ace", "London", 20),
    ("Chris", "Cola", "Toronto", 30),
    ("Chris", "Cola", "Toronto", 30),
]
schema= ["name", "surname", "city", "age"]

df = spark.createDataFrame(data = data, schema = schema)
df.show()

Output:

+-----+-------+-------+---+
| name|surname|   city|age|
+-----+-------+-------+---+
|Alice|    Ace| Austin| 10|
|Alice|    Ace| Austin| 10|
|Alice|    Ace| London| 20|
|Chris|   Cola|Toronto| 30|
|Chris|   Cola|Toronto| 30|
+-----+-------+-------+---+

De-duplicate on all columns

Using distinct():

fully_unique_rows_df = df.distinct()
fully_unique_rows_df.show()

Using dropDuplicates():

fully_unique_rows_df = df.dropDuplicates(subset=["name", "surname"])
fully_unique_rows_df.show()

Output:

+-----+-------+-------+---+
| name|surname|   city|age|
+-----+-------+-------+---+
|Alice|    Ace| Austin| 10|
|Alice|    Ace| London| 20|
|Chris|   Cola|Toronto| 30|
+-----+-------+-------+---+

De-duplicate on specific columns

Specify column names in an array, and pass it to dropDuplicates() as an argument.

unique_names_df = df.dropDuplicates(subset=["name", "surname"])
unique_names_df.show()

Output:

+-----+-------+-------+---+
| name|surname|   city|age|
+-----+-------+-------+---+
|Alice|    Ace| Austin| 10|
|Chris|   Cola|Toronto| 30|
+-----+-------+-------+---+

Performance impact

distinct() and dropDuplicates() will both cause shuffles. These are compute heavy operations and will take time to execute.

Documentation links

  • spark documentation - distinct()
  • [spark documentation - dropDuplicates()][https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropDuplicates.html]