To drop duplicate rows in a DataFrame or Series in pandas, the easiest way is to use the pandas drop_duplicates() function.
df.drop_duplicates()
When working with data, it's important to be able to find and fix problems. Duplicate records are one common issue we may need to clean up.
With Python, we can find and remove duplicate rows in data very easily using the pandas package and the pandas drop_duplicates() function.
Let’s say we have the following DataFrame:
import pandas as pd

df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Sally','Bob','Sue','Sue','Larry'],
                   'Weight': ['100','100','200','100','200','150','150','200']})
print(df)
# Output:
Name Weight
0 Jim 100
1 Jim 100
2 Jim 200
3 Sally 100
4 Bob 200
5 Sue 150
6 Sue 150
7 Larry 200
Let’s find the duplicate rows in this DataFrame. We can do this easily using the pandas duplicated() function. The duplicated() function returns a Series with boolean values denoting where we have duplicate rows. By default, it marks all duplicates as True except the first occurrence.
print(df.duplicated())
# Output:
0 False
1 True
2 False
3 False
4 False
5 False
6 True
7 False
dtype: bool
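If we want to see the duplicate rows themselves rather than just a boolean mask, we can use the mask to filter the DataFrame. Passing keep=False to duplicated() marks every member of a duplicate group as True, including the first occurrence, so the filter shows all of the duplicated rows:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Sally','Bob','Sue','Sue','Larry'],
                   'Weight': ['100','100','200','100','200','150','150','200']})

# keep=False marks every row in a duplicate group as True,
# including the first occurrence
print(df[df.duplicated(keep=False)])
```

This prints both Jim/100 rows and both Sue/150 rows, which makes it easy to inspect duplicates before deciding how to drop them.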
We see above that we have 2 duplicate rows. If we want to remove these duplicate rows, we can use the pandas drop_duplicates() function like in the following Python code:
print(df.drop_duplicates())
# Output:
Name Weight
0 Jim 100
2 Jim 200
3 Sally 100
4 Bob 200
5 Sue 150
7 Larry 200
The default setting for drop_duplicates() is to drop all duplicates except the first. We can keep the last occurrence instead, or drop all duplicates entirely, by passing keep="last" or keep=False respectively.
print(df.drop_duplicates(keep="last"))
print(df.drop_duplicates(keep=False))
# Output:
Name Weight
1 Jim 100
2 Jim 200
3 Sally 100
4 Bob 200
6 Sue 150
7 Larry 200
Name Weight
2 Jim 200
3 Sally 100
4 Bob 200
7 Larry 200
The pandas drop_duplicates() function returns a DataFrame. Notice above that the surviving rows keep their original index labels; if you want to reset the index, you can pass ignore_index=True. Additionally, like many other pandas functions, you can remove duplicates in place with the inplace parameter.
print(df.drop_duplicates(keep=False, ignore_index=True))
# Output:
Name Weight
0 Jim 200
1 Sally 100
2 Bob 200
3 Larry 200
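The inplace option mentioned above can be sketched as follows: with inplace=True, drop_duplicates() modifies the DataFrame directly and returns None, so there is no result to assign.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Sally','Bob','Sue','Sue','Larry'],
                   'Weight': ['100','100','200','100','200','150','150','200']})

# inplace=True mutates df directly and returns None;
# ignore_index=True renumbers the remaining rows from 0
df.drop_duplicates(inplace=True, ignore_index=True)
print(df)
```

After this call, df holds the six unique rows with a fresh 0-based index.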
Drop Duplicate Rows based on Column Using Pandas
By default, the drop_duplicates() function removes duplicates based on all columns of a DataFrame. We can remove duplicate rows based on just one column or multiple columns using the “subset” parameter.
Let's say we have the same DataFrame as above. We can drop duplicates based on just the "Name" column by passing subset=["Name"] to the drop_duplicates() function.
print(df.drop_duplicates(subset=["Name"]))
#Output:
Name Weight
0 Jim 100
3 Sally 100
4 Bob 200
5 Sue 150
7 Larry 200
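The subset parameter also accepts multiple columns, in which case rows are considered duplicates only when they match on every listed column. As a quick sketch, passing both columns of our example DataFrame reproduces the default (all-columns) behavior:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Sally','Bob','Sue','Sue','Larry'],
                   'Weight': ['100','100','200','100','200','150','150','200']})

# rows are dropped only when both Name and Weight match
print(df.drop_duplicates(subset=["Name", "Weight"]))
```

With a wider DataFrame, this lets you deduplicate on a meaningful combination of key columns while ignoring the rest.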
Hopefully this article has helped you understand how to use the pandas drop_duplicates() function to remove duplicate rows from your data in Python.