When working with data in pandas, you can fill NaN values with interpolation using the pandas interpolate() function.

df_withinterpolation = df["col_with_nan"].interpolate(method="linear")

There are many different interpolation methods you can use. In this post, you’ll learn how to use interpolate() to fill NaN values with pandas in Python.


When working with data, NaN values can be a problem. Depending on the situation, we might want to remove those NaN values or fill them in.

One way to deal with NaN values is interpolation. If you are working with time series data, interpolation allows you to fill missing values and create new data points.

When using pandas, the interpolate() function allows us to fill NaN values with different interpolation methods.

By default, interpolate() uses linear interpolation, filling each NaN value from the non-NaN values on either side of it.
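
As a quick illustration of the default behavior, here is a minimal sketch on a small made-up Series (the values are chosen only for illustration and are not from the data used in this post). When a gap contains more than one NaN value, linear interpolation splits it into equal steps.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two missing values between 1.0 and 4.0.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# method="linear" is the default, so it could be omitted here.
filled = s.interpolate(method="linear")

print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0]
```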

Let’s say we have the following data with some NaN values.

                 time  value
2022-05-01 00:00:00    1.0
2022-05-01 06:00:00    NaN
2022-05-01 12:00:00    7.0
2022-05-01 18:00:00    NaN
2022-05-02 00:00:00    9.0
2022-05-02 06:00:00    NaN
2022-05-02 12:00:00    8.0
2022-05-02 18:00:00    NaN
2022-05-03 00:00:00    9.0
2022-05-03 06:00:00    NaN
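
If you want to follow along, a DataFrame like the one above could be built as sketched below. This assumes the time column is used as the index and that the timestamps are evenly spaced 6 hours apart, which matches the table. The lowercase "6h" frequency string is used because recent pandas versions deprecate the uppercase "H" alias.

```python
import numpy as np
import pandas as pd

# Ten timestamps, 6 hours apart, starting at midnight on 2022-05-01.
times = pd.date_range(start="2022-05-01", periods=10, freq="6h")

# Every second value is missing, as in the table above.
values = [1.0, np.nan, 7.0, np.nan, 9.0, np.nan, 8.0, np.nan, 9.0, np.nan]

df = pd.DataFrame({"value": values}, index=times)
df.index.name = "time"

print(df)
```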

Below is an example of how to use interpolate() to perform linear interpolation. When a single NaN value sits between two known values, as in this data, it is filled with the midpoint of those two values.

print(df.interpolate(method="linear"))

#Output:
                     value
time
2022-05-01 00:00:00    1.0
2022-05-01 06:00:00    4.0
2022-05-01 12:00:00    7.0
2022-05-01 18:00:00    8.0
2022-05-02 00:00:00    9.0
2022-05-02 06:00:00    8.5
2022-05-02 12:00:00    8.0
2022-05-02 18:00:00    8.5
2022-05-03 00:00:00    9.0
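2022-05-03 06:00:00    9.0

As you can see, the NaN values have been filled using linear interpolation. The final NaN value has no later value to interpolate toward, so with the default settings pandas fills it with the last valid value, 9.0.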

There are many different interpolation methods (such as cubic, spline, polynomial, etc.) you can use for interpolation, which you can read about in the documentation. Some of these methods require the SciPy module.
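
To show how the choice of method can change the result, here is a small sketch comparing method="linear" with method="time", another method pandas provides that does not need SciPy. The data is made up for illustration: the timestamps are deliberately unevenly spaced, because "linear" treats the rows as equally spaced while "time" takes the length of each interval into account.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the missing point is 1 day after the first value
# and 3 days before the last value.
index = pd.to_datetime(["2022-05-01", "2022-05-02", "2022-05-05"])
s = pd.Series([0.0, np.nan, 10.0], index=index)

# "linear" ignores the index, so the gap is filled with the midpoint.
linear = s.interpolate(method="linear")

# "time" weights by elapsed time: 1 of 4 days has passed, so 10 * 1/4.
time_based = s.interpolate(method="time")

print(linear.iloc[1])      # 5.0
print(time_based.iloc[1])  # 2.5
```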

Interpolating Data After Resampling with pandas interpolate() Function

One common use of the pandas interpolate() function is after resampling. The pandas resample() function allows us to resample time series data.

One way we can use resample() is to increase the frequency of our time series data. Increasing the frequency of time series data is called upsampling. This is like taking monthly data and making it daily.

Let’s say we have the following data which has data points every 12 hours.

import pandas as pd
import numpy as np

df = pd.DataFrame({'time':pd.date_range(start='05-01-2022',end='05-31-2022', freq="12H"), 'value':np.random.randint(10,size=61)})

print(df.head(10))

#Output:
                 time  value
0 2022-05-01 00:00:00      5
1 2022-05-01 12:00:00      1
2 2022-05-02 00:00:00      9
3 2022-05-02 12:00:00      8
4 2022-05-03 00:00:00      9
5 2022-05-03 12:00:00      7
6 2022-05-04 00:00:00      7
7 2022-05-04 12:00:00      4
8 2022-05-05 00:00:00      6
9 2022-05-05 12:00:00      4

Let’s increase the frequency of our data to every 3 hours with resample(). First, we need to set the datetime column as the index. Then, we can increase the frequency of our data by passing "3H" to resample().

df.set_index('time', inplace=True)

resampled_df = df.resample("3H").mean()

print(resampled_df.head(10))

#Output:
                     value
time
2022-05-01 00:00:00    5.0
2022-05-01 03:00:00    NaN
2022-05-01 06:00:00    NaN
2022-05-01 09:00:00    NaN
2022-05-01 12:00:00    1.0
2022-05-01 15:00:00    NaN
2022-05-01 18:00:00    NaN
2022-05-01 21:00:00    NaN
2022-05-02 00:00:00    9.0
2022-05-02 03:00:00    NaN

As you can see, we’ve now added datapoints between the datapoints which previously existed, but the values for these datapoints are NaN.

To fill these NaN values, you can use interpolate(). Below is an example of how to use a polynomial of order 2 for interpolation to fill the NaN values in the time series data. The polynomial method relies on SciPy, so SciPy needs to be installed.

resampled_df = df.resample("3H").interpolate(method="polynomial", order=2)

print(resampled_df.head(10))

#Output:
                        value
time
2022-05-01 00:00:00  5.000000
2022-05-01 03:00:00  2.503992
2022-05-01 06:00:00  1.005323
2022-05-01 09:00:00  0.503992
2022-05-01 12:00:00  1.000000
2022-05-01 15:00:00  2.493346
2022-05-01 18:00:00  4.984031
2022-05-01 21:00:00  7.482700
2022-05-02 00:00:00  9.000000
2022-05-02 03:00:00  9.535930
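
If SciPy is not available, a similar result can be sketched in two explicit steps with linear interpolation instead of a polynomial. This is an alternative to the approach above, not a replacement for it. The seed, the `rng` generator, and the use of asfreq() are choices made for this sketch. The lowercase "h" in the frequency strings is used because recent pandas versions deprecate the uppercase "H" alias.

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for the random data (the seed is arbitrary).
rng = np.random.default_rng(0)
times = pd.date_range(start="2022-05-01", end="2022-05-31", freq="12h")
df = pd.DataFrame({"value": rng.integers(0, 10, size=len(times))}, index=times)
df.index.name = "time"

# Step 1: upsample to a 3-hour grid; the newly created rows are NaN.
upsampled = df.resample("3h").asfreq()

# Step 2: fill the gaps with linear interpolation (no SciPy needed).
filled = upsampled.interpolate(method="linear")

print(filled.head(10))
```

The original 12-hour values are left unchanged, and only the newly created rows are filled.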

Hopefully this article has been useful for you to learn about the pandas interpolate() function and how you can interpolate between datapoints and fill NaN values in your Python code.

Last Update: February 26, 2024