When working with data as a data science or data analyst, survival analysis is very common and something that many industries and companies utilize to understand the expected time and probabilities of some event occurring.
There are many major companies and industries which use SAS (banking, insurance, etc.), but with the rise of open source and the popularity of languages such as Python and R, these companies are exploring converting their code to Python.
A commonly used procedure for survival analysis in SAS is the PROC LIFETEST procedure. In this article, you’ll learn the Python equivalent of PROC LIFETEST.
PROC LIFETEST Equivalent in Python
In SAS, when we are looking at doing survival analysis on continuous variables, we use PROC LIFETEST. PROC LIFETEST computes nonparametric estimates of the survivor function using the Kaplan-Meier method.
Let’s say we have data such as the following:
In SAS, if we wanted to get the Kaplan-Meier estimates of on this data for the weight_class variable (weight_class = 1 if weight > 6.5, else weight_class = 0), we could do something like the following:
With this code, we would get output for the two strata, a plot and p values.
The output for stratum 1 is below:
The output for stratum 2 is below:
The KM curves plotted from this data is below:
And finally, we have the p values and other tests from this Kaplan Meier.
To get the equivalent of PROC LIFETEST in Python, we will use the KaplanMeierFitter class from the lifelines package. To chart, we will use matplotlib. For the p values, we need to import logrank_test from lifelines.statistics.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test
To get the survival curves and plots, we need to fit the Kaplan Meier. For each level of your categorical variable, you need to subset the data, and calculate a fit on that subset.
Here’s the first survival curve:
kmf_0 = KaplanMeierFitter()
ds_0 = example_data[example_data["weight_class"]==0]
kmf_0.fit(durations=ds_0["time"],event_observed=ds_0["event"],label="weight_class=0")
kmf_0.survival_function_
#output:
# weight_class=0
#timeline
#0.000000 1.000000
#0.013689 0.960000
#0.046543 0.920000
#0.106776 0.880000
#0.123203 0.840000
#0.336756 0.800000
#0.347707 0.760000
#0.372348 0.720000
#0.375086 0.680000
#0.388775 0.640000
#0.391513 0.600000
#0.424367 0.560000
#0.462697 0.520000
#0.470910 0.480000
#0.577687 0.440000
#0.791239 0.440000
#0.804928 0.396000
#0.878850 0.352000
#0.889802 0.352000
#0.944559 0.352000
#1.004791 0.293333
#1.034908 0.293333
#1.396304 0.293333
#1.557837 0.195556
#1.796030 0.097778
#2.521561 0.097778
For the second stratum, we do the same:
kmf_1 = KaplanMeierFitter()
ds_1 = example_data[example_data["weight_class"]==1]
kmf_1.fit(durations=ds_1["time"],event_observed=ds_1["event"],label="weight_class=1")
kmf_1.survival_function_
#output:
# weight_class=1
#timeline
#0.000000 1.000000
#0.019165 0.933333
#0.024641 0.866667
#0.027379 0.800000
#0.038330 0.733333
#0.062971 0.666667
#0.095825 0.600000
#0.153320 0.533333
#0.227242 0.466667
#0.312115 0.400000
#0.394251 0.333333
#0.462697 0.333333
#0.484600 0.250000
#0.591376 0.250000
#0.635181 0.250000
#0.651608 0.000000
In this case, the outsurv dataset you receive from SAS is made up of these two survival functions.
For the plot, we don’t have to do much. We can use the .plot() function which is part of the KaplanMeierFitter class and easily add these two curves to a plot.
kmf_0.plot()
kmf_1.plot()
plt.show()
Here’s the resulting image from this code:
If you don’t want to show the confidence intervals, then you can pass ci_show = False to the plot().
To get the p values, we need to do a log-rank test.
t0 = ds_0["time"]
t1 = ds_1["time"]
e0 = ds_0["event"]
e1 = ds_1["event"]
results = logrank_test(t0,t1,event_observed_A=e0,event_observed_B=e1)
print(results)
#output:
# t_0 = -1
# null_distribution = chi squared
#degrees_of_freedom = 1
# test_name = logrank_test
#
#---
# test_statistic p -log2(p)
# 4.52 0.03 4.90
print(results.p_value)
#output:
#0.033463339869510035
You can verify this is the same p value we received from SAS.
Finally, we want to get the 25%, 50% and 75% quantiles for the different survival curves.
The KaplanMeierFitter class has a median_survival_time_ function, but this is not the right median. This median is calculated using pandas – which uses a different algorithm thatn SAS when calculating a median.
To get the 25%, 50%, and 75% quantiles, I use the following function that I wrote:
def get_KM_times(survival_function,label):
s = survival_function.reset_index()
length = int(s.size / 2)
s.rename(columns={label:"pred"}, inplace=True)
below_75 = False
below_50 = False
below_25 = False
estimates = [".",".","."]
s["pred"] = s["pred"].apply(lambda x: round(x,6))
for i in range(1,length):
if (s["pred"][i] < 0.750000 and below_75 == False):
if (s["pred"][i-1] == 0.750000):
estimates[0] = (s["timeline"][i]+s["timeline"][i-1])/2
else:
estimates[0] = s["timeline"][i]
below_75 = True
if (s["pred"][i] < 0.500000 and below_50 == False):
if (s["pred"][i-1] == 0.500000):
estimates[1] = (s["timeline"][i]+s["timeline"][i-1])/2
else:
estimates[1] = s["timeline"][i]
below_50 = True
if (s["pred"][i] < 0.250000 and below_25 == False):
if (s["pred"][i-1] == 0.250000):
estimates[2] = (s["timeline"][i]+s["timeline"][i-1])/2
else:
estimates[2] = s["timeline"][i]
below_25 = True
return estimates
Using this on our two curves, we can get the same 25%, 50% and 75% times as SAS produces:
print(get_KM_times(kmf_0.survival_function_,"weight_class=0"))
#output:
#[0.3723477070499658, 0.47091033538672145, 1.5578370978781657]
print(get_KM_times(kmf_1.survival_function_,"weight_class=1"))
#output:
#[0.038329911019849415, 0.2272416153319644, 0.6433949349760438]
You can see that these 25%, 50%, and 75% survival times match up with the SAS output.
Thanks for reading this article on how to convert your PROC LIFETEST to get the same output in Python.
Here’s the full code from this article:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test
def get_KM_times(survival_function,label):
s = survival_function.reset_index()
length = int(s.size / 2)
s.rename(columns={label:"pred"}, inplace=True)
below_75 = False
below_50 = False
below_25 = False
estimates = [".",".","."]
s["pred"] = s["pred"].apply(lambda x: round(x,6))
for i in range(1,length):
if (s["pred"][i] < 0.750000 and below_75 == False):
if (s["pred"][i-1] == 0.750000):
estimates[0] = (s["timeline"][i]+s["timeline"][i-1])/2
else:
estimates[0] = s["timeline"][i]
below_75 = True
if (s["pred"][i] < 0.500000 and below_50 == False):
if (s["pred"][i-1] == 0.500000):
estimates[1] = (s["timeline"][i]+s["timeline"][i-1])/2
else:
estimates[1] = s["timeline"][i]
below_50 = True
if (s["pred"][i] < 0.250000 and below_25 == False):
if (s["pred"][i-1] == 0.250000):
estimates[2] = (s["timeline"][i]+s["timeline"][i-1])/2
else:
estimates[2] = s["timeline"][i]
below_25 = True
return estimates
#fitting the Kaplan Meiers
kmf_0 = KaplanMeierFitter()
ds_0 = example_data[example_data["weight_class"]==0]
kmf_0.fit(durations=ds_0["time"],event_observed=ds_0["event"],label="weight_class=0")
kmf_0.survival_function_
kmf_1 = KaplanMeierFitter()
ds_1 = example_data[example_data["weight_class"]==1]
kmf_1.fit(durations=ds_1["time"],event_observed=ds_1["event"],label="weight_class=1")
kmf_1.survival_function_
#plotting
kmf_0.plot()
kmf_1.plot()
plt.show()
#log-rank test
t0 = ds_0["time"]
t1 = ds_1["time"]
e0 = ds_0["event"]
e1 = ds_1["event"]
results = logrank_test(t0,t1,event_observed_A=e0,event_observed_B=e1)
print(results)
print(results.p_value)
#25%, 50% and 75% survival times
print(get_KM_times(kmf_0.survival_function_,"weight_class=0"))
print(get_KM_times(kmf_1.survival_function_,"weight_class=1"))