To convert a pandas DataFrame column from the data type “object” to the data type “category”, use the astype() method.
import pandas as pd
df = pd.DataFrame({ "column": ["a","b","c","a","b","c","b","d"] })
print(df["column"].dtype)
df["column"] = df["column"].astype('category')
print(df["column"].dtype)
#Output:
object
category
When working with different types of data in pandas, being able to easily change the data type of a column is valuable.
One such case is converting a pandas column from the data type “object” to the data type “category”.
To do this, use astype(), which converts a pandas column to the data type you pass in.
Below is a simple example showing you how to convert the data type of a pandas column from “object” to “category”.
import pandas as pd
df = pd.DataFrame({ "column": ["a","b","c","a","b","c","b","d"] })
print(df["column"].dtype)
df["column"] = df["column"].astype('category')
print(df["column"].dtype)
#Output:
object
category
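If you need to convert more than one column, astype() on a DataFrame also accepts a dictionary that maps column names to data types. Below is a minimal sketch of that approach. The column names "color", "size" and "price" are made up for illustration.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": ["S", "M", "S", "L"],
    "price": [10, 12, 11, 15],
})

# Passing a dict converts only the named columns; "price" is left untouched.
converted = df.astype({"color": "category", "size": "category"})
print(converted.dtypes)

# The distinct labels pandas inferred for the "color" column.
print(list(converted["color"].cat.categories))  # ['blue', 'green', 'red']
```

Because astype() returns a new DataFrame, the original df keeps its old data types unless you assign the result back to it.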
Reducing Memory Usage with dtype Category Columns in pandas
One of the main benefits of using “category” columns in pandas is that they can reduce the amount of memory used in your process.
The reason for this is that pandas stores each unique value (i.e. each category) only once, and each row holds a small integer code that points to its category instead of a full copy of the value.
The example below shows how you can reduce memory usage with categorical data in pandas.
import pandas as pd
s = pd.Series(["a","b","c","a","b","c","b","d"] * 1000)
print(s.nbytes)
print(s.astype("category").nbytes)
#Output:
64000
8032
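To see where the savings come from, you can look at the categories and the integer codes directly through the .cat accessor. The sketch below reuses the Series from the example above. It also compares memory with memory_usage(deep=True), which counts the string values themselves and not just the references to them, so the exact byte counts will depend on your pandas version and platform.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a", "b", "c", "b", "d"] * 1000)
cat = s.astype("category")

# Each distinct label is stored once...
print(list(cat.cat.categories))        # ['a', 'b', 'c', 'd']
# ...and every row holds a small integer code pointing at its label.
print(cat.cat.codes.head(8).tolist())  # [0, 1, 2, 0, 1, 2, 1, 3]
print(cat.cat.codes.dtype)             # int8

# deep=True includes the memory taken by the string values themselves.
object_bytes = s.memory_usage(index=False, deep=True)
category_bytes = cat.memory_usage(index=False, deep=True)
print(object_bytes, category_bytes)
```

Keep in mind that the savings depend on having few unique values relative to the number of rows. A column where nearly every value is different will not get smaller as a “category” column.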
Using groupby() When Working With Columns with dtype Category in pandas
One last thing I want to add to this post is something I came across while doing some data analysis with pandas.
If you use groupby() on categorical columns, pass the “observed=True” option so that groupby() behaves the same way it does on columns with the data type “object”. With “observed=False”, which has long been the default, the result includes every combination of categories, even combinations that never appear in the data. pandas 2.1 and later warn that this default will change to “observed=True”.
The example below shows how the “observed=True” option changes the output of groupby() on categorical columns.
import pandas as pd
df = pd.DataFrame({"animal_type":["dog","cat","dog","cat","dog","dog","cat","cat","dog"],
                   "gender":["F","M","F","M","M","F","M","M","M"],
                   "age":[1,2,3,4,5,6,7,8,9],
                   "weight":[10,20,15,20,25,10,15,30,40]})
df["animal_type"] = df["animal_type"].astype('category')
df["gender"] = df["gender"].astype('category')
print(df.groupby(["animal_type","gender"])["age"].max())
print(df.groupby(["animal_type","gender"], observed=True)["age"].max())
#Output:
animal_type  gender
cat          F         NaN
             M         8.0
dog          F         6.0
             M         9.0
Name: age, dtype: float64
animal_type  gender
dog          F         6
             M         9
cat          M         8
Name: age, dtype: int64
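If you want to check this behavior yourself, one way is to run the same groupby() on the original “object” columns and on the converted “category” columns, and compare the results. Below is a small sketch using a shortened, made-up version of the animal data. Passing “observed” explicitly in both calls keeps the result independent of the default in your pandas version.

```python
import pandas as pd

# Shortened illustrative data: no row has animal_type "cat" with gender "F".
df = pd.DataFrame({
    "animal_type": ["dog", "cat", "dog", "cat"],
    "gender": ["F", "M", "M", "M"],
    "age": [1, 2, 3, 4],
})

# Baseline: grouping on the original string columns.
expected = df.groupby(["animal_type", "gender"])["age"].max()

cat_df = df.astype({"animal_type": "category", "gender": "category"})

# observed=True keeps only the combinations that appear in the data.
observed_only = cat_df.groupby(["animal_type", "gender"], observed=True)["age"].max()

# observed=False adds a row for the unseen ("cat", "F") combination.
all_combos = cat_df.groupby(["animal_type", "gender"], observed=False)["age"].max()

print(len(expected), len(observed_only), len(all_combos))  # 3 3 4
```

The extra row in all_combos holds NaN, which is also why the “age” values in that result are shown as floats instead of integers.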
Hopefully this article has been useful for you to learn how to convert a pandas column from object to category in Python.