The easiest way to find duplicate rows in a pandas DataFrame, or duplicate values in a Series, is the pandas duplicated() method.
df.duplicated()
When working with data, it’s important to be able to spot problems in it. Duplicate records are one such problem, and we may need to take additional steps to clean them up.
With Python, we can find duplicate rows in data very easily using the pandas package and the pandas duplicated() function.
Let’s say we have the following DataFrame:
import pandas as pd
df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Sally','Bob','Sue','Sue','Larry'],
                   'Weight': ['100','100','200','100','200','150','150','200']})
print(df)
# Output:
Name Weight
0 Jim 100
1 Jim 100
2 Jim 200
3 Sally 100
4 Bob 200
5 Sue 150
6 Sue 150
7 Larry 200
Let’s find the duplicate rows in this DataFrame with duplicated(). It returns a boolean Series with one entry per row, where True means the row repeats an earlier one. By default (keep='first'), the first occurrence of each duplicated row is marked False and every later occurrence is marked True.
print(df.duplicated())
# Output:
0 False
1 True
2 False
3 False
4 False
5 False
6 True
7 False
dtype: bool
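Because the result is a boolean Series, it can also answer quick questions such as whether any duplicates exist and how many there are. The following is a minimal sketch using the same example DataFrame; the variable names mask, has_duplicates and num_duplicates are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jim", "Jim", "Jim", "Sally", "Bob", "Sue", "Sue", "Larry"],
    "Weight": ["100", "100", "200", "100", "200", "150", "150", "200"],
})

# Boolean Series: True for every row that repeats an earlier row
mask = df.duplicated()

# any() tells us whether at least one duplicate exists
has_duplicates = bool(mask.any())

# True counts as 1 when summed, so sum() gives the number of duplicate rows
num_duplicates = int(mask.sum())

print(has_duplicates, num_duplicates)
```

Note that the count only includes the repeated rows, not the first occurrence of each one.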
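To keep the last occurrence instead, pass keep='last' to duplicated(). Every occurrence except the last is then marked True. In our example each duplicated row appears only twice, so it is the first occurrence that gets flagged.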
print(df.duplicated(keep='last'))
# Output:
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 False
dtype: bool
To mark every occurrence of a duplicated row as True, including the first, pass keep=False to duplicated().
print(df.duplicated(keep=False))
# Output:
0 True
1 True
2 False
3 False
4 False
5 True
6 True
7 False
dtype: bool
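The boolean Series is most useful as a filter. As a small sketch, assuming the same example DataFrame, we can index df with the keep=False mask to see the duplicated rows themselves; the name dupes is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jim", "Jim", "Jim", "Sally", "Bob", "Sue", "Sue", "Larry"],
    "Weight": ["100", "100", "200", "100", "200", "150", "150", "200"],
})

# keep=False flags every copy, so the filter returns all rows involved in a duplicate
dupes = df[df.duplicated(keep=False)]

print(dupes)
```

The original index labels are preserved, which makes it easy to trace the rows back to their position in df.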
Depending on the way you want to handle these duplicates, you may want to keep or remove the duplicate rows.
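If removing the duplicates is the goal, one option is the related pandas drop_duplicates() method, which accepts the same keep argument. The sketch below assumes the same example DataFrame; deduped and unique_only are illustrative names, and resetting the index is optional.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jim", "Jim", "Jim", "Sally", "Bob", "Sue", "Sue", "Larry"],
    "Weight": ["100", "100", "200", "100", "200", "150", "150", "200"],
})

# Keep the first copy of each duplicated row and drop the rest
deduped = df.drop_duplicates()

# Drop every row that has a duplicate, keeping only rows that appear once
unique_only = df.drop_duplicates(keep=False)

# Optionally renumber the rows after dropping
deduped_reset = deduped.reset_index(drop=True)

print(deduped)
print(unique_only)
```

drop_duplicates() returns a new DataFrame by default, so df itself is left unchanged.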
Finding Duplicate Rows Based on a Column Using Pandas
By default, duplicated() compares all columns of a DataFrame. We can find duplicate rows based on just one column or several columns using the subset parameter.
Let’s say we have the same DataFrame as above. We can find the duplicates based on the "Name" column alone by passing subset=["Name"] to duplicated(). The default keep='first' still applies, so the first row for each name is marked False and later rows with the same name are marked True.
print(df.duplicated(subset=["Name"]))
# Output:
0 False
1 True
2 True
3 False
4 False
5 False
6 True
7 False
dtype: bool
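The subset parameter also accepts several column names. Since our example DataFrame has only two columns, the sketch below adds a hypothetical "Visit" column (not part of the original data) so that the difference between checking all columns and checking a subset becomes visible.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jim", "Jim", "Jim", "Sally", "Bob", "Sue", "Sue", "Larry"],
    "Weight": ["100", "100", "200", "100", "200", "150", "150", "200"],
})

# Hypothetical extra column with a unique value per row (illustration only)
df2 = df.assign(Visit=list(range(1, 9)))

# Across all three columns, no row is a duplicate any more
all_columns = df2.duplicated()

# Restricting the check to Name and Weight finds the repeated pairs again
by_name_weight = df2.duplicated(subset=["Name", "Weight"])

print(all_columns.any())
print(by_name_weight)
```

Here the subset check flags rows 1 and 6, the same rows the original two-column DataFrame reported as duplicates.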
Hopefully this article has helped you understand how to use the pandas duplicated() method to find duplicate rows during data analysis in Python.