1 Introduction

This guide shows how to use penalized survival models for prediction with left-truncated right-censored (LTRC) data, as described in McGough et al. (2021). The analyses utilize a simulated dataset based on the Foundation Medicine and Flatiron Health Clinico-Genomic Database (CGDB), as described in the paper.

We will use glmnet to fit the models, rsample to split the data into training and test sets, and various features of the survival package. We will also need a script containing custom functions for calibrating survival predictions.

# Packages used in this script
library("glmnet")
library("rsample")
library("survival")

# Custom functions for calibration
library("data.table") # Dependency for calibration functions
library("ggplot2") # Required to visualize calibrations
source("R/calibrate.R") # Our custom calibration functions
theme_set(theme_bw()) # ggplot theme

# There is some randomness in the analysis, including splitting
# the data into train/test sets and cross-validation.
set.seed(7) 

2 LTRC data

While we are not permitted to share data from the CGDB, we can simulate data that is consistent with what we observe. Our simulated dataset is contained in the file simdata.rds. Details on how we simulated the data are described in the paper. If you would like to see the code used to simulate the data you can view/run the file simdata.R.

simdata <- as.data.frame(readRDS("simdata.rds"))
head(simdata)
##   patient_id   death_time entry_time   event_time dead hidden PracticeTypeCOMMUNITY index_date_year
## 1          1  0.005757188  1.4637014  0.005757188    1      1             0.3075328      -1.5808048
## 2          2  3.013345192  0.9717125  3.013345192    1      0            -0.5945161      -0.0919248
## 3          3 18.329333316  0.6524921 17.434915670    0      0            -0.5906585       0.1516768
## 4          4  0.648803003  0.1911216  0.648803003    1      0             0.3756330       1.1000916
## 5          5  6.431299023  2.3453534  1.993891491    0      1            -0.1602403      -0.5161476
## 6          6  3.297675146  0.7987339  1.540651933    0      0             1.0100120      -1.5804034
##   RaceBlack_or_African_American RaceOther_Race  RaceWhite  age_at_dx     RE_TP53    CN_TP53    SV_TP53    SV_KRAS
## 1                     0.2507992      0.1007697 -0.4459728 -1.0525652 -0.24861387  1.8761518  0.4010577  1.1605783
## 2                    -0.7034950     -0.5999176  0.9138291 -1.8536665  0.47891090 -1.0843865  1.1260018 -1.4744714
## 3                     1.7934429      1.9369430 -2.4369538  0.2949064 -0.05952971 -0.7465886 -2.0157914 -1.0248021
## 4                    -0.9150085     -0.6546389  1.0668138 -0.9242296 -0.18676749  1.3010576  0.3924116 -1.0823567
## 5                     1.0378511      1.2124524 -1.6625865 -0.2708332  0.31883473 -0.1526148  0.7071611 -0.1710668
## 6                    -0.4948662     -0.4184016  0.3474558  0.8100232  1.10125555  0.5437209 -0.7758010 -1.4563459
##       SV_EGFR
## 1 -0.07789252
## 2 -0.63500825
## 3  0.71480219
## 4 -0.21909582
## 5  0.12773691
## 6  1.03726140

There are two characteristics of the data that are artificial and would not be observed in reality. First, we simulate (latent) death_time, but deaths are only observed if they occur prior to censoring; that is, since the data is right-censored, we only observe event_time. Second, some patients are not included in the data (i.e., they are hidden), because their event_time occurs prior to the time they would have entered the dataset (their entry_time). These patients are left-truncated.

To create a realistic LTRC dataset we will consequently remove the latent death_time variable and subset the data to patients who were not left-truncated.

simdata <- simdata[simdata$hidden == 0, ]
simdata$death_time <- NULL

2.1 Train and test sets

For prediction problems, it is common to split the data into training and test sets. Here we use the rsample package to randomly split the data so that 75% of patients are in the training set and 25% are in the test set.

data_split <- initial_split(simdata, prop = .75)
train_data <- training(data_split)
test_data <- testing(data_split)

3 Model fitting

3.1 Model data

3.1.1 Input matrix

glmnet requires the predictors to be supplied as an input matrix x, which must be a matrix object (not a data frame). Here we consider a simple example with a small number of predictors contained in x.

x_vars <- attr(simdata, "x_vars")
x_vars
##  [1] "PracticeTypeCOMMUNITY"         "index_date_year"               "RaceBlack_or_African_American"
##  [4] "RaceOther_Race"                "RaceWhite"                     "age_at_dx"                    
##  [7] "RE_TP53"                       "CN_TP53"                       "SV_TP53"                      
## [10] "SV_KRAS"                       "SV_EGFR"

We create x matrices for both the training and the test sets.

x_train <- as.matrix(train_data[, x_vars])
x_test <- as.matrix(test_data[, x_vars])

3.1.2 Response

When fitting Cox proportional hazards models with glmnet, the response should be a survival::Surv object. The key difference between modeling LTRC and RC data is that the response must contain start-stop intervals when using LTRC data. Start-stop Surv objects are created by passing the time that a patient enters the dataset (time), the follow-up time (time2), and a status indicator equal to 1 if a patient has died and 0 if they are alive.

As with x, we create y for both the training and the test sets.

y_train <- Surv(time = train_data$entry_time,
                time2 = train_data$event_time,
                event = train_data$dead)

y_test <- Surv(test_data$entry_time,
               test_data$event_time,
               test_data$dead)

3.2 Cross validation

We will fit a penalized lasso model appropriate for LTRC data using cv.glmnet() (note: requires glmnet version 4.1 or later). The value of the tuning parameter \(\lambda\) that minimizes the partial likelihood deviance (lambda.min) is selected using 5-fold cross-validation. For more details, see the vignette on the glmnet website.

cvfit <- cv.glmnet(x_train, y_train, family = "cox", nfolds = 5)
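The selected values of \(\lambda\) can be inspected directly. A small sketch, assuming the cvfit object fit above:

```r
# Tuning parameter values chosen by cross-validation (assumes `cvfit`
# from above): lambda.min minimizes the CV deviance, while lambda.1se
# is the largest lambda within one standard error of that minimum.
cvfit$lambda.min
cvfit$lambda.1se
```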

3.3 Model coefficients

The coefficients (i.e., log hazard ratios) depend on the value of \(\lambda\). We can easily view the coefficients for a specific value of \(\lambda\) using coef(). We might want to examine coefficients from a model that minimizes the deviance:

coef(cvfit, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                          1
## PracticeTypeCOMMUNITY          0.240479460
## index_date_year                0.050843679
## RaceBlack_or_African_American  0.096615350
## RaceOther_Race                 .          
## RaceWhite                      0.162999930
## age_at_dx                     -0.005480715
## RE_TP53                        .          
## CN_TP53                        0.480436211
## SV_TP53                        0.222604771
## SV_KRAS                        0.100041553
## SV_EGFR                       -0.280888080

Or alternatively, we may choose a model that minimizes deviance within one standard error of the minimum (lambda.1se). Note that this results in a sparser model:

coef(cvfit, s = "lambda.1se")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                        1
## PracticeTypeCOMMUNITY         .         
## index_date_year               .         
## RaceBlack_or_African_American .         
## RaceOther_Race                .         
## RaceWhite                     .         
## age_at_dx                     .         
## RE_TP53                       .         
## CN_TP53                       0.03341574
## SV_TP53                       .         
## SV_KRAS                       .         
## SV_EGFR                       .

The cross-validation curve (deviance as a function of \(\lambda\)) can be plotted with plot(). The full regularization path (the coefficient profiles) is stored in the underlying glmnet fit:

plot(cvfit)             # Cross-validation curve
plot(cvfit$glmnet.fit)  # Coefficient paths

4 Model predictions and performance

We will make out-of-sample predictions on the test set (25% of data) from models where the deviance was minimized (lambda.min). With survival data, there is no single best way to summarize predictions or to assess the performance of a model. We will first consider the C-index, which is a measure of the extent to which a model can discriminate across patients. Then we will examine survival probabilities and the extent to which predicted probabilities are well calibrated; that is, the extent to which predicted probabilities are similar to observed probabilities.

4.1 C-index

The C-index can be computed from the linear predictor (i.e., the predicted log hazards). By default, the predict.cv.glmnet() method returns the linear predictor for each patient.

lp_test <- predict(cvfit, newx = x_test, s = "lambda.min")
head(lp_test)
##             1
## 6  -0.1816055
## 20 -0.3884870
## 22 -1.2152285
## 29 -0.5825268
## 34 -0.1583408
## 41 -0.7225797

The survival::concordance() function will compute the C-index in a manner appropriate for LTRC data. When making out-of-sample predictions, it is convenient to specify the object argument as a formula that relates the response to the linear predictor.

concordance(y_test ~ lp_test, reverse = TRUE)
## Call:
## concordance.formula(object = y_test ~ lp_test, reverse = TRUE)
## 
## n= 102 
## Concordance= 0.6887 se= 0.03401
## concordant discordant     tied.x     tied.y    tied.xy 
##       1352        611          0          0          0

4.2 Survival probabilities and calibration

4.2.1 Survival probabilities

Models fit with glmnet now have a survfit method which can be used to generate survival curves. Note that the values of x and y used to train the model must also be passed so that the baseline hazard can be computed.

surv_test <- survfit(cvfit, newx = x_test, s = "lambda.min", 
                    x = x_train, y = y_train)

Plots using base R graphics can be quickly generated using the plot.survfit() method, which produces a survival curve for each patient in the test set.

plot(surv_test)

Note that we could have alternatively used a general purpose plotting package like ggplot2 or a specialized plotting package for survival analysis like survminer to produce more modern-looking graphics. It might also be a good idea to aggregate predicted survival curves across patients in a meaningful way.
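As a minimal sketch of such an aggregation (assuming the surv_test object created above), the per-patient curves can be averaged into a single summary curve:

```r
# Sketch (assumes `surv_test` from above): with glmnet, survfit() returns
# a $surv matrix with one row per event time and one column per patient
# in newx, so averaging across columns gives a mean predicted curve.
mean_surv <- rowMeans(surv_test$surv)
plot(surv_test$time, mean_surv, type = "s",
     xlab = "Time", ylab = "Mean predicted survival probability")
```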

4.2.2 Calibration

While there are many great R packages such as rms and pec that can be used to evaluate a survival model (including via calibration), none (to our knowledge) work with LTRC data. We consequently wrote our own function to calibrate survival models for the paper (available in the file R/calibrate.R), which we will use here. calibrate() is a generic function with a method for survfit objects. We also created a generic autoplot() method to quickly plot the output produced by calibrate().

Let’s calibrate the model at semi-annual intervals for up to 3 years. Each point consists of patients in a given decile of predicted survival at a particular time point. The x-axis is the average predicted survival probability and the y-axis is the pseudo observed survival probability computed using the (left truncation adjusted) Kaplan-Meier estimator. Perfectly calibrated predictions are those that lie along the 45-degree line so that predicted and observed survival probabilities are equal.

# calibrate.survfit() method
cal_test <- calibrate(object = surv_test, times = seq(.5, 3, .5), y = y_test)

# autoplot.calibrate() method
autoplot(cal_test)