To sample a DataFrame with pandas in Python, you can use the sample() function. Pass either the number of rows you want to extract with “n” or the fraction of rows to return with “frac”.
sampled_df = df.sample(n=100)
sampled_df = df.sample(frac=0.5)
In this article, you’ll learn how to get a random sample of data in Python with the pandas sample() function.
When working with data in Python, we often want to get a random sample of our data. For example, in modeling, we might take a random sample to create training and validation datasets, which helps us spot an overfit model.
With pandas, we can easily get random samples of data with the pandas sample() function.
You can use sample() to get a sample of a specific number of records, get a sample of a fraction of records, get a sample of the columns of a DataFrame, and sample with replacement.
Let’s say we have the following DataFrame in Python.
import pandas as pd
df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Sally','Bob','Sue','Sue','Larry'],
                   'Weight': ['100','100','200','100','200','150','150','200']})
print(df)
# Output:
Name Weight
0 Jim 100
1 Jim 100
2 Jim 200
3 Sally 100
4 Bob 200
5 Sue 150
6 Sue 150
7 Larry 200
If you want to generate a 50% sample of this dataset, you can pass “0.5” to the “frac” parameter.
print(df.sample(frac=0.5))
# Output:
Name Weight
0 Jim 100
1 Jim 100
4 Bob 200
7 Larry 200
If you instead want to extract 4 rows from the data at random, you can pass “4” to the “n” parameter.
print(df.sample(n=4))
# Output:
Name Weight
0 Jim 100
1 Jim 100
5 Sue 150
6 Sue 150
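Because the rows returned are random, your output will usually differ from the ones shown above. What stays fixed is the size of the sample. The short sketch below rebuilds the same example DataFrame and checks the row counts; the variable names “half” and “four” are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Sally','Bob','Sue','Sue','Larry'],
                   'Weight': ['100','100','200','100','200','150','150','200']})

half = df.sample(frac=0.5)  # 50% of the 8 rows
four = df.sample(n=4)       # exactly 4 rows

print(len(half), len(four))
# Output:
# 4 4

# Sampled rows keep their original index labels
print(four.index.isin(df.index).all())
# Output:
# True
```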
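You can also return a sample which has more records than the original dataset, but only if you sample with replacement. If you want to create a 200% sample of your data, you can pass “2” to the “frac” parameter together with “replace=True”. Without “replace=True”, pandas raises a ValueError. Our DataFrame has 8 rows, so the sample has 16.
print(df.sample(frac=2, replace=True).shape)
# Output:
(16, 2)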
Unlike many pandas functions, sample() does not have an “inplace” parameter. It always returns a new object, so assign the result to a variable if you want to keep it. You can also sample columns instead of rows by passing “1” to the parameter “axis”.
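As a quick sketch of column sampling, the example below picks one column at random from the same example DataFrame. The seed value and the name “one_column” are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Sally','Bob','Sue','Sue','Larry'],
                   'Weight': ['100','100','200','100','200','150','150','200']})

# axis=1 samples column labels instead of row labels
one_column = df.sample(n=1, axis=1, random_state=0)

# All 8 rows are kept, but only 1 of the 2 columns
print(one_column.shape)
# Output:
# (8, 1)
```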
Using Seed for the Random Number Generation with sample()
When creating a random sample, many times we want reproducibility. For example, if I’m validating someone else’s results, then I want to be able to reproduce every dataset in their process.
The “random_state” parameter of the sample() function allows us to pass a “seed” for the random number generator of sample().
Below is an example of how you can use the “random_state” parameter in sample().
sampled_df = df.sample(frac=0.5, random_state=5)
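To see what the seed buys you, here is a small check, assuming the same example DataFrame as above. Two calls with the same “random_state” should return the same rows in the same order; the names “first” and “second” are just for this sketch.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Sally','Bob','Sue','Sue','Larry'],
                   'Weight': ['100','100','200','100','200','150','150','200']})

# Same seed, same sample
first = df.sample(frac=0.5, random_state=5)
second = df.sample(frac=0.5, random_state=5)

same_rows = first.equals(second)
print(same_rows)
# Output:
# True
```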
Random Sampling with Replacement in pandas
If you want to get a random sample with replacement, you can also do that with the pandas sample() function.
The “replace” parameter allows you to perform sampling with replacement.
Sampling with replacement means that after each element is chosen via the sampling algorithm, instead of removing that element, it is put back into the population.
Below is an example of how you can use the “replace” parameter to get a random sample with replacement with the pandas sample() function. The parameter takes a boolean, so pass “True”.
sampled_df = df.sample(frac=0.5, replace=True)
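One practical consequence of sampling with replacement is that you can ask for more rows than the DataFrame contains. The sketch below, which assumes the same example DataFrame, draws 12 rows from 8 and then shows that the same request fails without replacement. The seed and the variable names are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Sally','Bob','Sue','Sue','Larry'],
                   'Weight': ['100','100','200','100','200','150','150','200']})

# With replacement, a row can be drawn more than once,
# so the sample can be larger than the 8-row DataFrame
oversampled = df.sample(n=12, replace=True, random_state=1)
print(len(oversampled))
# Output:
# 12

# 12 draws from 8 rows must repeat at least one index label
print(oversampled.index.has_duplicates)
# Output:
# True

# Without replacement, asking for more rows than exist raises a ValueError
try:
    df.sample(n=12)
    raised = False
except ValueError:
    raised = True
print(raised)
# Output:
# True
```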
Hopefully this article has been useful for you to learn how to use the pandas sample() function to generate random samples of your data in Python.