Key takeaway
Remove duplicate rows using distinct() and dropDuplicates():
- distinct() identifies duplicates by comparing rows across all columns
- dropDuplicates() identifies duplicates by comparing rows across all columns, or across a subset of columns (when specified)
Create sample data
from pyspark.sql import SparkSession

# Create a local SparkSession (skip if one already exists in your environment).
spark = SparkSession.builder.appName("dedup-example").getOrCreate()

data = [
    ("Alice", "Ace", "Austin", 10),
    ("Alice", "Ace", "Austin", 10),
    ("Alice", "Ace", "London", 20),
    ("Chris", "Cola", "Toronto", 30),
    ("Chris", "Cola", "Toronto", 30),
]
schema = ["name", "surname", "city", "age"]
df = spark.createDataFrame(data=data, schema=schema)
df.show()
Output:
+-----+-------+-------+---+
| name|surname| city|age|
+-----+-------+-------+---+
|Alice| Ace| Austin| 10|
|Alice| Ace| Austin| 10|
|Alice| Ace| London| 20|
|Chris| Cola|Toronto| 30|
|Chris| Cola|Toronto| 30|
+-----+-------+-------+---+
De-duplicate on all columns
Using distinct():
fully_unique_rows_df = df.distinct()
fully_unique_rows_df.show()
Using dropDuplicates():
fully_unique_rows_df = df.dropDuplicates()
fully_unique_rows_df.show()
Output:
+-----+-------+-------+---+
| name|surname| city|age|
+-----+-------+-------+---+
|Alice| Ace| Austin| 10|
|Alice| Ace| London| 20|
|Chris| Cola|Toronto| 30|
+-----+-------+-------+---+
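Calling dropDuplicates() with no subset is equivalent to distinct(). PySpark also exposes drop_duplicates() as a snake_case alias of the same method, so all of the following produce the result shown above:
df.distinct().show()
df.dropDuplicates().show()
df.drop_duplicates().show()  # alias of dropDuplicates()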
De-duplicate on specific columns
Pass the column names as a list to the subset parameter of dropDuplicates(). Only those columns are compared when identifying duplicates; for each group of duplicates, one arbitrary row is kept.
unique_names_df = df.dropDuplicates(subset=["name", "surname"])
unique_names_df.show()
Output:
+-----+-------+-------+---+
| name|surname| city|age|
+-----+-------+-------+---+
|Alice| Ace| Austin| 10|
|Chris| Cola|Toronto| 30|
+-----+-------+-------+---+
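Because dropDuplicates() keeps an arbitrary row from each group, you have no say over which duplicate survives. If that matters (for example, keeping the highest age per name/surname pair), a window function does the job. A minimal sketch, with oldest_per_pair_df as an illustrative name:
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Rank rows within each (name, surname) group, highest age first,
# then keep only the top-ranked row per group.
w = Window.partitionBy("name", "surname").orderBy(col("age").desc())
oldest_per_pair_df = (
    df.withColumn("rn", row_number().over(w))
      .filter(col("rn") == 1)
      .drop("rn")
)
oldest_per_pair_df.show()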
Performance impact
Both distinct() and dropDuplicates() trigger a shuffle, which redistributes rows across the cluster so that identical rows land on the same partition. Shuffles are expensive in both network and compute terms, so expect these operations to be slow on large datasets.
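You can confirm the shuffle by inspecting the physical plan with explain(); the Exchange node marks where rows are repartitioned. The abbreviated plan in the comments below is illustrative and varies by Spark version:
df.distinct().explain()
# == Physical Plan ==
# ... HashAggregate(keys=[name, surname, city, age], ...)
#     +- Exchange hashpartitioning(name, surname, city, age, 200), ...
#        +- HashAggregate(keys=[name, surname, city, age], ...)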
Documentation links
- [spark documentation - distinct()](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.distinct.html)
- [spark documentation - dropDuplicates()](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropDuplicates.html)