Course material provided by Professor Kevin Thorpe and modifications were made by myself.
Load the data frame tut1
into R from the file
tut1.RData
available on the course page. The data frame
contains the variables x and y.
Fit a simple regression with y as the response and x as the predictor. Use diagnostic plots to determine if a non-linear relationship should be modelled. If you believe a non-linear relationship is present, model it using a restricted cubic spline. The number of knots is up to you.
# Load packages and data
library(rms)
library(lattice)
library(tidyverse)
load("./CHL5202/tut1.RData")
dd <- datadist(tut1)
options(datadist="dd")
Some notes taken from here.
Comparing ols()
vs lm()
: one advantage
for ols()
is that the entire variance-covariance matrix is
saved
fit0 <- ols(y~x,data=tut1)
fit0
## Linear Regression Model
##
## ols(formula = y ~ x, data = tut1)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 100 LR chi2 32.07 R2 0.274
## sigma1.0827 d.f. 1 R2 adj 0.267
## d.f. 98 Pr(> chi2) 0.0000 g 0.769
##
## Residuals
##
## Min 1Q Median 3Q Max
## -2.21787 -0.73287 0.02416 0.74941 2.75329
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 1.2185 0.2149 5.67 <0.0001
## x -0.3597 0.0591 -6.09 <0.0001
anova
to an ols
model, it
separates out the linear and non-linear components of restricted cubic
splines and polynomial terms (and product terms if they are
present).lm()
, using anova()
with
ols()
we have partial F-tests that describe the marginal
impact of removing each covariate from the model.anova(fit0)
## Analysis of Variance Response: y
##
## Factor d.f. Partial SS MS F P
## x 1 43.43251 43.432506 37.05 <.0001
## REGRESSION 1 43.43251 43.432506 37.05 <.0001
## ERROR 98 114.88584 1.172305
# for diagnostic plots
dx0 <- data.frame(tut1, e = resid(fit0), yhat = fitted(fit0))
# Normal QQ Plot
qqnorm(dx0$e)
qqline(dx0$e)
xyplot()
: Common Bivariate Trellis Plots. Produces
conditional scatterplots, typically between two continuous variables.
Below we have residual plots.# Residual Plots
xyplot(e~x, data = dx0, type = c("p","smooth"))
xyplot(e~yhat, data = dx0, type = c("p","smooth"))
fit1 <- ols(y ~ rcs(x, 4), data = tut1)
fit1
## Linear Regression Model
##
## ols(formula = y ~ rcs(x, 4), data = tut1)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 100 LR chi2 46.60 R2 0.372
## sigma1.0173 d.f. 3 R2 adj 0.353
## d.f. 96 Pr(> chi2) 0.0000 g 0.875
##
## Residuals
##
## Min 1Q Median 3Q Max
## -2.0412 -0.7268 0.1088 0.5716 2.3664
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 0.2834 0.3389 0.84 0.4050
## x 0.6559 0.2760 2.38 0.0195
## x' -3.3087 0.8566 -3.86 0.0002
## x'' 9.3222 2.4876 3.75 0.0003
anova()
output below is more useful and shows a significant
non-linear component.anova(fit1)
## Analysis of Variance Response: y
##
## Factor d.f. Partial SS MS F P
## x 3 58.97203 19.657342 19.00 <.0001
## Nonlinear 2 15.53952 7.769761 7.51 9e-04
## REGRESSION 3 58.97203 19.657342 19.00 <.0001
## ERROR 96 99.34632 1.034858
# Plot predicted fit with confidence interval
ggplot(Predict(fit1))
# for diagnostic plots
dx1 <- data.frame(tut1, e = resid(fit1), yhat = fitted(fit1))
# Normal QQ Plot
qqnorm(dx1$e)
qqline(dx1$e)
# Residual plots
xyplot(e ~ x, data = dx1, type = c("p", "smooth"))
xyplot(e ~ yhat, data = dx1, type = c("p", "smooth"))
Load the data frame FEV from the file FEV.RData. For these data, use the variable fev as the response and the rest as the explanatory covariates.
Fit an additive linear model to fev using the other variables as the covariates. Evaluate whether any of the continuous variables should be fit as non-linear terms.
Fit an another additive model but this time model the continuous covariates with restricted cubic splines with four knots. Describe the results.
load("./CHL5202/FEV.RData")
fev.dd <- datadist(FEV)
options(datadist = "fev.dd")
# `dplyr()`: package for data wrangling - part of `tidyverse()`
library(dplyr)
# Useful for quickly viewing the data
glimpse(FEV)
## Rows: 654
## Columns: 6
## $ id <int> 301, 451, 501, 642, 901, 1701, 1752, 1753, 1901, 1951, 1952, 20…
## $ age <int> 9, 8, 7, 9, 9, 8, 6, 6, 8, 9, 6, 8, 8, 8, 8, 7, 5, 6, 9, 9, 5, …
## $ fev <dbl> 1.708, 1.724, 1.720, 1.558, 1.895, 2.336, 1.919, 1.415, 1.987, …
## $ height <dbl> 57.0, 67.5, 54.5, 53.0, 57.0, 61.0, 58.0, 56.0, 58.5, 60.0, 53.…
## $ sex <fct> female, female, female, male, male, female, female, female, fem…
## $ smoke <fct> non-current smoker, non-current smoker, non-current smoker, non…
summary(FEV)
## id age fev height sex
## Min. : 201 Min. : 3.000 Min. :0.791 Min. :46.00 female:318
## 1st Qu.:15811 1st Qu.: 8.000 1st Qu.:1.981 1st Qu.:57.00 male :336
## Median :36071 Median :10.000 Median :2.547 Median :61.50
## Mean :37170 Mean : 9.931 Mean :2.637 Mean :61.14
## 3rd Qu.:53639 3rd Qu.:12.000 3rd Qu.:3.119 3rd Qu.:65.50
## Max. :90001 Max. :19.000 Max. :5.793 Max. :74.00
## smoke
## non-current smoker:589
## current smoker : 65
##
##
##
##
# Fit initial additive model
fev.lin <- ols(fev ~ age + height + sex + smoke, data = FEV)
fev.lin
## Linear Regression Model
##
## ols(formula = fev ~ age + height + sex + smoke, data = FEV)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 654 LR chi2 976.59 R2 0.775
## sigma0.4122 d.f. 4 R2 adj 0.774
## d.f. 649 Pr(> chi2) 0.0000 g 0.869
##
## Residuals
##
## Min 1Q Median 3Q Max
## -1.376562 -0.250333 0.008938 0.255879 1.920471
##
##
## Coef S.E. t Pr(>|t|)
## Intercept -4.4570 0.2228 -20.00 <0.0001
## age 0.0655 0.0095 6.90 <0.0001
## height 0.1042 0.0048 21.90 <0.0001
## sex=male 0.1571 0.0332 4.73 <0.0001
## smoke=current smoker -0.0872 0.0593 -1.47 0.1414
anova(fev.lin)
## Analysis of Variance Response: fev
##
## Factor d.f. Partial SS MS F P
## age 1 8.0994719 8.0994719 47.67 <.0001
## height 1 81.5049421 81.5049421 479.66 <.0001
## sex 1 3.8032782 3.8032782 22.38 <.0001
## smoke 1 0.3683978 0.3683978 2.17 0.1414
## REGRESSION 4 380.6402823 95.1600706 560.02 <.0001
## ERROR 649 110.2795540 0.1699223
age
and height
are continuous variables,
so we can use residual plots.sex
and smoke
are categorical variables
(in fact, they’re binary) so residual plots won’t be useful.dx.fev <- data.frame(FEV, e = resid(fev.lin), yhat = fitted(fev.lin))
xyplot(e ~ age, data = dx.fev, type = c("p", "smooth"))
xyplot(e ~ height, data = dx.fev, type = c("p", "smooth"))
fev.full <- ols(fev ~ rcs(age, 4) + rcs(height, 4) + sex + smoke, data = FEV)
fev.full
## Linear Regression Model
##
## ols(formula = fev ~ rcs(age, 4) + rcs(height, 4) + sex + smoke,
## data = FEV)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 654 LR chi2 1039.74 R2 0.796
## sigma0.3940 d.f. 8 R2 adj 0.794
## d.f. 645 Pr(> chi2) 0.0000 g 0.878
##
## Residuals
##
## Min 1Q Median 3Q Max
## -1.637487 -0.223633 0.008879 0.236143 1.776429
##
##
## Coef S.E. t Pr(>|t|)
## Intercept -2.7618 0.6718 -4.11 <0.0001
## age -0.0109 0.0343 -0.32 0.7509
## age' 0.1815 0.0733 2.48 0.0135
## age'' -0.7112 0.3094 -2.30 0.0218
## height 0.0834 0.0151 5.51 <0.0001
## height' 0.0114 0.0308 0.37 0.7114
## height'' 0.1252 0.1349 0.93 0.3540
## sex=male 0.0951 0.0335 2.83 0.0047
## smoke=current smoker -0.1485 0.0576 -2.58 0.0102
anova(fev.full)
## Analysis of Variance Response: fev
##
## Factor d.f. Partial SS MS F P
## age 3 10.2137671 3.4045890 21.93 <.0001
## Nonlinear 2 0.9847797 0.4923898 3.17 0.0426
## height 3 80.2711567 26.7570522 172.36 <.0001
## Nonlinear 2 3.1249430 1.5624715 10.06 <.0001
## sex 1 1.2475614 1.2475614 8.04 0.0047
## smoke 1 1.0306405 1.0306405 6.64 0.0102
## TOTAL NONLINEAR 4 10.1495441 2.5373860 16.34 <.0001
## REGRESSION 8 390.7898263 48.8487283 314.67 <.0001
## ERROR 645 100.1300100 0.1552403
anova()
output suggests that the non-linear
components for age and height are statistically significant.dx.full <- data.frame(FEV, e = resid(fev.full), yhat = fitted(fev.full))
# Residual plots look good
xyplot(e ~ age, data = dx.full, type = c("p", "smooth"))
xyplot(e ~ height, data = dx.full, type = c("p", "smooth"))
summary()
is the best way to get the effects of
the categorical variables.summary(fev.full)
## Effects Response : fev
##
## Factor Low High Diff. Effect S.E.
## age 8 12.0 4.0 0.344100 0.059919
## height 57 65.5 8.5 0.888570 0.061187
## sex - female:male 2 1.0 NA -0.095061 0.033533
## smoke - current smoker:non-current smoker 1 2.0 NA -0.148540 0.057649
## Lower 0.95 Upper 0.95
## 0.22644 0.461760
## 0.76842 1.008700
## -0.16091 -0.029214
## -0.26174 -0.035337
ggplot(Predict(fev.full, age))
ggplot(Predict(fev.full, height))
%>%
: pipe operator - very useful for performing
multiple data wrangling stepsggplot2()
: plotting package, included in
rms()
and tidyverse()
# Boxplot
FEV %>%
ggplot(mapping = aes(x = sex, y = height, fill = sex)) +
geom_boxplot() +
theme_bw() +
labs(title = "Boxplot of heights by sex")
Use the hwy2.RData file available from the course page. After
attaching the rms package and doing the usual
datadist()/options()
task, fit the following model:
fit <- ols(rate~rcs(trks,4)+rcs(sigs1,4)+type+hwy,data=hwy2,x=TRUE,y=TRUE)
Run both the ordinary bootstrap validation and the .632 bootstrap validation on this model. Use the same number of iterations and seed for both validations (to compare the results). What are your conclusions?
library(rms)
load("./CHL5202/hwy2.RData")
h.dd <- datadist(hwy2)
options(datadist="h.dd")
fit <- ols(rate ~ rcs(trks,4) + rcs(sigs1,4) + type + hwy, data=hwy2, x=TRUE, y=TRUE)
# Standard bootstrap
set.seed(321) # Set seed is reproducibility. Choose any number you like.
validate(fit, B=100)
##
## Divergence or singularity in 11 samples
## index.orig training test optimism index.corrected n
## R-square 0.7110 0.7802 0.3163 0.4640 0.2471 89
## MSE 1.1106 0.7859 2.6277 -1.8418 2.9523 89
## g 1.8604 1.8808 1.6933 0.1876 1.6728 89
## Intercept 0.0000 0.0000 0.6535 -0.6535 0.6535 89
## Slope 1.0000 1.0000 0.8493 0.1507 0.8493 89
# .632 bootstrap
set.seed(321)
validate(fit, method = ".632", B=100)
##
## Divergence or singularity in 11 samples
## index.orig training test optimism index.corrected n
## R-square 0.7110 0.7802 -0.3786 0.6886 0.0224 89
## MSE 1.1106 0.7859 5.0256 -2.4743 3.5849 89
## g 1.8604 1.8808 1.2664 0.3754 1.4850 89
## Intercept 0.0000 0.0000 1.2077 -0.7633 0.7633 89
## Slope 1.0000 1.0000 0.5981 0.2540 0.7460 89
Conclusions:
There is a discrepancy in standard bootstrap vs .632 method – the standard bootstrap may be overfitting so we use the .632 method to double check the validation results. The .632 bootstrap applies a correction factor to reduce the bias.
Standard bootstrap may believe the model is less overfit than
what it actually is, i.e. low optimism so index.corrected
is closer to index.orig
.
The drop in R-square is very large from validation (went from positive to negative), indicating the original model was very overfit to the point where the relationship inverted.
Slope change seems to imply an adjusted slope estimate should be around 0.75 of what the original model guessed.
Let’s investigate the outputs from a simpler model.
# We remove the restricted cubic splines which may be the main source of overfitting
fit2 <- ols(rate ~ trks + sigs1 + type + hwy, data=hwy2, x=TRUE, y=TRUE)
set.seed(321)
validate(fit2, B=100)
##
## Divergence or singularity in 11 samples
## index.orig training test optimism index.corrected n
## R-square 0.6031 0.6535 0.5132 0.1402 0.4628 89
## MSE 1.5255 1.2468 1.8708 -0.6240 2.1494 89
## g 1.6155 1.6571 1.5645 0.0926 1.5230 89
## Intercept 0.0000 0.0000 0.3534 -0.3534 0.3534 89
## Slope 1.0000 1.0000 0.9146 0.0854 0.9146 89
set.seed(321)
validate(fit2, method = ".632", B=100)
##
## Divergence or singularity in 11 samples
## index.orig training test optimism index.corrected n
## R-square 0.6031 0.6535 0.1379 0.2940 0.3091 89
## MSE 1.5255 1.2468 2.4057 -0.5563 2.0818 89
## g 1.6155 1.6571 1.2776 0.2136 1.4019 89
## Intercept 0.0000 0.0000 0.5816 -0.3676 0.3676 89
## Slope 1.0000 1.0000 0.7459 0.1606 0.8394 89
For the .632 bootstrap, the magnitude of the optimism for all 5 values is smaller than in the saturated model which showed overfitting.
Hard to say whether this model is still overfitting but it is certainly an improvement.
The R-square term in particular shows a major improvement, as the test R-square is positive rather than negative like previously shown.
Load the data from tutdata
from the file
tutdata.RData
posted on the course page. Fit an additive
model where y
is the outcome and the predictors are
blood.pressure
, sex
, age
, and
cholesterol
but they should use a restricted cubic spline
with four knots for cholesterol.
library(rms)
load("./CHL5202/tutdata.RData")
dd <- datadist(tutdata)
options(datadist="dd")
fit <- lrm(y ~ blood.pressure + sex + age + rcs(cholesterol,4), data = tutdata)
Which “variables” are significant at the 5% level?
anova(fit)
## Wald Statistics Response: y
##
## Factor Chi-Square d.f. P
## blood.pressure 0.33 1 0.5640
## sex 13.45 1 0.0002
## age 25.58 1 <.0001
## cholesterol 1.27 3 0.7368
## Nonlinear 0.78 2 0.6758
## TOTAL 38.63 6 <.0001
sex
and age
are statistically significant with p-values of 0.0002 and < 0.0001
respectively.Display the cholesterol effect.
# log odds
plot(Predict(fit,cholesterol))
# probability
plot(Predict(fit,cholesterol,fun=plogis), ylab="Probability")
# odds
plot(Predict(fit,cholesterol,fun=exp), ylab="Odds")
Describe the effect of sex.
summary(fit)
## Effects Response : y
##
## Factor Low High Diff. Effect S.E. Lower 0.95
## blood.pressure 109.580 130.580 20.991 -0.052121 0.090347 -0.22920
## Odds Ratio 109.580 130.580 20.991 0.949210 NA 0.79517
## age 43.531 56.677 13.146 0.438050 0.086609 0.26830
## Odds Ratio 43.531 56.677 13.146 1.549700 NA 1.30770
## cholesterol 183.730 216.530 32.796 0.194330 0.174400 -0.14748
## Odds Ratio 183.730 216.530 32.796 1.214500 NA 0.86288
## sex - male:female 1.000 2.000 NA 0.477030 0.130050 0.22213
## Odds Ratio 1.000 2.000 NA 1.611300 NA 1.24870
## Upper 0.95
## 0.12496
## 1.13310
## 0.60780
## 1.83640
## 0.53615
## 1.70940
## 0.73192
## 2.07910
y
for males is 1.61 times the
odds for females (with 95% CI [1.25,2.08]) controlled for blood
pressure, age and cholesterol.Use the data from last week (tutdata.RData
) and re-fit
the model from last week which was an additive model where
y
is the outcome and the predictors are
blood.pressure
, sex
, age
, and
cholesterol
except this time use a restricted cubic spline
with four knots for cholesterol
. Re-run the anova also.
library(rms)
load("./CHL5202/tutdata.RData")
dd <- datadist(tutdata)
options(datadist="dd")
fit1 <- lrm(y ~ blood.pressure + sex + age + rcs(cholesterol,4), data = tutdata)
anova(fit1)
## Wald Statistics Response: y
##
## Factor Chi-Square d.f. P
## blood.pressure 0.33 1 0.5640
## sex 13.45 1 0.0002
## age 25.58 1 <.0001
## cholesterol 1.27 3 0.7368
## Nonlinear 0.78 2 0.6758
## TOTAL 38.63 6 <.0001
Fit a model as above only this time, add interactions between sex and
age as well as sex and the RCS for cholesterol and run
anova()
on this model.
fit2 <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)), data = tutdata)
anova(fit2)
## Wald Statistics Response: y
##
## Factor Chi-Square d.f. P
## blood.pressure 0.26 1 0.6105
## sex (Factor+Higher Order Factors) 38.71 5 <.0001
## All Interactions 26.13 4 <.0001
## age (Factor+Higher Order Factors) 30.42 2 <.0001
## All Interactions 3.86 1 0.0495
## cholesterol (Factor+Higher Order Factors) 23.78 6 0.0006
## All Interactions 22.47 3 0.0001
## Nonlinear (Factor+Higher Order Factors) 5.33 4 0.2550
## sex * age (Factor+Higher Order Factors) 3.86 1 0.0495
## sex * cholesterol (Factor+Higher Order Factors) 22.47 3 0.0001
## Nonlinear 4.79 2 0.0911
## Nonlinear Interaction : f(A,B) vs. AB 4.79 2 0.0911
## TOTAL NONLINEAR 5.33 4 0.2550
## TOTAL INTERACTION 26.13 4 <.0001
## TOTAL NONLINEAR + INTERACTION 26.81 6 0.0002
## TOTAL 62.26 10 <.0001
sex
are statistically
significant.Compare the two models with a likelihood ratio test. What does it show?
lrtest(fit1,fit2)
##
## Model 1: y ~ blood.pressure + sex + age + rcs(cholesterol, 4)
## Model 2: y ~ blood.pressure + sex * (age + rcs(cholesterol, 4))
##
## L.R. Chisq d.f. P
## 2.831616e+01 4.000000e+00 1.076137e-05
TOTAL INTERACTION
in the anova(fit2)
output as
this is what we are testing.Describe the sex/age and sex/cholesterol relationships with outcome.
plot(Predict(fit2,age,sex))
plot(Predict(fit2,cholesterol,sex))
# Fancy way of combining the plots together but is harder to interpret
bplot(Predict(fit2,cholesterol,age,sex,np=20),lfun=wireframe,drape=TRUE)
sex
and the other 3 variables, including
blood pressure
.fit3 <- lrm(y ~ sex * (blood.pressure + age + rcs(cholesterol,4)), data = tutdata)
# Alternate way
#lrm(y ~ sex + blood.pressure + age + rcs(cholesterol,4) + sex:blood.pressure + sex:age + sex:rcs(cholesterol,4), data = tutdata)
anova(fit3)
## Wald Statistics Response: y
##
## Factor Chi-Square d.f. P
## sex (Factor+Higher Order Factors) 39.12 6 <.0001
## All Interactions 26.58 5 0.0001
## blood.pressure (Factor+Higher Order Factors) 0.73 2 0.6954
## All Interactions 0.47 1 0.4940
## age (Factor+Higher Order Factors) 30.58 2 <.0001
## All Interactions 3.94 1 0.0471
## cholesterol (Factor+Higher Order Factors) 23.96 6 0.0005
## All Interactions 22.59 3 <.0001
## Nonlinear (Factor+Higher Order Factors) 5.24 4 0.2634
## sex * blood.pressure (Factor+Higher Order Factors) 0.47 1 0.4940
## sex * age (Factor+Higher Order Factors) 3.94 1 0.0471
## sex * cholesterol (Factor+Higher Order Factors) 22.59 3 <.0001
## Nonlinear 4.67 2 0.0966
## Nonlinear Interaction : f(A,B) vs. AB 4.67 2 0.0966
## TOTAL NONLINEAR 5.24 4 0.2634
## TOTAL INTERACTION 26.58 5 0.0001
## TOTAL NONLINEAR + INTERACTION 27.25 7 0.0003
## TOTAL 62.64 11 <.0001
lrtest(fit1,fit3)
##
## Model 1: y ~ blood.pressure + sex + age + rcs(cholesterol, 4)
## Model 2: y ~ sex * (blood.pressure + age + rcs(cholesterol, 4))
##
## L.R. Chisq d.f. P
## 2.878410e+01 5.000000e+00 2.556193e-05
lrtest(fit2,fit3)
##
## Model 1: y ~ blood.pressure + sex * (age + rcs(cholesterol, 4))
## Model 2: y ~ sex * (blood.pressure + age + rcs(cholesterol, 4))
##
## L.R. Chisq d.f. P
## 0.4679391 1.0000000 0.4939368
plot(Predict(fit3,blood.pressure,sex))
sex:age
and sex:cholesterol
interactions. The
confidence intervals also overlap for the most part, except in the
115-130 range.Using the data from the last few tutorials, fit the following model.
fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)), data = tutdata, x=TRUE, y=TRUE)
library(rms)
load("./CHL5202/tutdata.RData")
dd <- datadist(tutdata)
options(datadist = "dd")
fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol, 4)), data = tutdata,
x = TRUE, y = TRUE)
fit
## Logistic Regression Model
##
## lrm(formula = y ~ blood.pressure + sex * (age + rcs(cholesterol,
## 4)), data = tutdata, x = TRUE, y = TRUE)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 1000 LR chi2 69.39 R2 0.090 C 0.642
## 0 460 d.f. 10 R2(10,1000)0.058 Dxy 0.283
## 1 540 Pr(> chi2) <0.0001 R2(10,745.2)0.077 gamma 0.283
## max |deriv| 1e-08 Brier 0.232 tau-a 0.141
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 3.2474 2.3426 1.39 0.1657
## blood.pressure -0.0022 0.0044 -0.51 0.6105
## sex=male -9.3612 3.5422 -2.64 0.0082
## age 0.0474 0.0094 5.06 <0.0001
## cholesterol -0.0308 0.0133 -2.31 0.0209
## cholesterol' 0.0738 0.0376 1.96 0.0499
## cholesterol'' -0.3072 0.1552 -1.98 0.0477
## sex=male * age -0.0263 0.0134 -1.96 0.0495
## sex=male * cholesterol 0.0620 0.0203 3.06 0.0022
## sex=male * cholesterol' -0.1241 0.0569 -2.18 0.0293
## sex=male * cholesterol'' 0.5116 0.2362 2.17 0.0303
Validate the model using the bootstrap with 100 replications and comment on the overfitting.
set.seed(123)
validate(fit, B = 100)
## index.orig training test optimism index.corrected n
## Dxy 0.2832 0.3025 0.2685 0.0340 0.2493 100
## R2 0.0896 0.1024 0.0792 0.0232 0.0664 100
## Intercept 0.0000 0.0000 0.0108 -0.0108 0.0108 100
## Slope 1.0000 1.0000 0.8768 0.1232 0.8768 100
## Emax 0.0000 0.0000 0.0311 0.0311 0.0311 100
## D 0.0684 0.0788 0.0601 0.0187 0.0496 100
## U -0.0020 -0.0020 0.0013 -0.0033 0.0013 100
## Q 0.0704 0.0808 0.0588 0.0220 0.0484 100
## B 0.2316 0.2289 0.2341 -0.0052 0.2367 100
## g 0.6230 0.6708 0.5787 0.0921 0.5309 100
## gp 0.1457 0.1541 0.1358 0.0183 0.1273 100
B
) do not change much at all.Investigate influential observations and partial residual plots.
dff <- resid(fit, "dffits")
plot(dff)
show.influence(which.influence(fit), data.frame(tutdata, dffits = dff), report = c("dffits"))
## Count sex cholesterol dffits
## 143 2 female *274.0808 1.6108318
## 173 4 *female *138.5056 -1.1265158
## 181 2 *male 146.4978 0.8026558
## 408 2 *male 148.1649 0.6813596
## 411 1 female 146.3522 0.5594519
## 834 2 female *146.6413 -0.4908278
## 840 2 female *147.0025 -0.5604495
which.influence()
gives the indices of
influential observations, which are listed in the first column of the
dataframe above.Below we investigate partial residual plots which can be used to verify the functional form of continuous covariates.
# Option below saves the following plots to a pdf file
# pdf("resid_plots.pdf")
resid(fit, "partial", pl = "loess")
Over the next few weeks the tutorials will use the data frame
mgus2
which may be found in the survival package. Since the
survival package is attached when rms
is, nothing special
is required to access it, other than loading the rms
package. Update: it needs to be loaded from the survival
package.
library(rms)
library(survival)
str(mgus2)
## 'data.frame': 1384 obs. of 11 variables:
## $ id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ age : num 88 78 94 68 90 90 89 87 86 79 ...
## ..- attr(*, "label")= chr "AGE AT mgus_sp"
## $ sex : Factor w/ 2 levels "F","M": 1 1 2 2 1 2 1 1 1 1 ...
## ..- attr(*, "label")= chr "Sex"
## ..- attr(*, "format")= chr "$desex"
## $ dxyr : num 1981 1968 1980 1977 1973 ...
## $ hgb : num 13.1 11.5 10.5 15.2 10.7 12.9 10.5 12.3 14.5 9.4 ...
## $ creat : num 1.3 1.2 1.5 1.2 0.8 1 0.9 1.2 0.9 1.1 ...
## $ mspike: num 0.5 2 2.6 1.2 1 0.5 1.3 1.6 2.4 2.3 ...
## ..- attr(*, "label")= chr "SEP SPIKE AT mgus_sp"
## $ ptime : num 30 25 46 92 8 4 151 2 57 136 ...
## $ pstat : num 0 0 0 0 0 0 0 0 0 0 ...
## $ futime: num 30 25 46 92 8 4 151 2 57 136 ...
## $ death : num 1 1 1 1 1 1 1 1 0 1 ...
More information about the data can be found in the help file
(e.g. ?mgus2
).
Estimate the Kaplan-Meier curves for survival for males and females and plot the result.
# The rms package has a function npsurv which is used in place of the usual survfit.
fit <- npsurv(Surv(futime, death) ~ sex, data = mgus2)
# Two methods below:
plot(fit, lty = c(1, 2)) # standard method
survplot(fit, n.risk = TRUE) #a more flexible function from rms
Test \(H_0 : S_F(t) = S_M(t)\) versus \(H_0 : S_F(t) \neq S_M(t).\)
# This requires the Mantel-Haenzel (log-rank) test.
print(fit.mh <- survdiff(Surv(futime, death) ~ sex, data = mgus2))
## Call:
## survdiff(formula = Surv(futime, death) ~ sex, data = mgus2)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=F 631 423 471 4.88 9.67
## sex=M 753 540 492 4.67 9.67
##
## Chisq= 9.7 on 1 degrees of freedom, p= 0.002
Compute the risk difference at one year.
(summ <- summary(fit, times = 12)) # time is in months
## Call: npsurv(formula = Surv(futime, death) ~ sex, data = mgus2)
##
## sex=F
## time n.risk n.event survival std.err lower 95% CI
## 12.0000 569.0000 61.0000 0.9032 0.0118 0.8804
## upper 95% CI
## 0.9266
##
## sex=M
## time n.risk n.event survival std.err lower 95% CI
## 12.000 646.000 112.000 0.851 0.013 0.826
## upper 95% CI
## 0.877
\[ \begin{aligned} \widehat{RD} &= \hat{F}_C(t_0)-\hat{F}_T(t_0) \\ &= (1-\hat{S}_C(t_0))(1-\hat{S}_T(t_0)) \\ &= \hat{S}_T(t_0)-\hat{S}_C(t_0) \\ &= 0.851 - 0.9032 \\ &= -0.0522 \end{aligned} \]
(rd <- summ$surv[2] - summ$surv[1])
## [1] -0.05194881
fit$surv[length(fit$surv)-232+12] - fit$surv[12] # alternative
## [1] -0.05194881
Interpretation: The observed risk of death in females at one year is 5.2% lower than in males.
(Extra) A 95% confidence interval for the risk difference is:
\[ \begin{aligned} &\widehat{RD} \pm 1.96\sqrt{\hat{V}[\hat{S}_C(t_0)] + \hat{V}[\hat{S}_T(t_0)]} \\ &= -0.0522 \pm 1.96\sqrt{0.0118^2 + 0.013^2} \\ &= (-0.0866,-0.0178) \end{aligned} \]
bound <- 1.96*sqrt(summ$std.err[1]^2 + summ$std.err[2]^2)
c(rd-bound, rd+bound)
## [1] -0.08628336 -0.01761427
Compute a hazard ratio using the output from
survdiff
.
\[ \widehat{HR} = \frac{O_T/E_T}{O_C/E_C} = \frac{540/492}{423/471} = 1.222 \]
fit.mh
:# We can extract the numbers from fit.mh
fit.mh$obs
## [1] 423 540
fit.mh$exp
## [1] 470.9283 492.0717
# Observed / Expected
female <- fit.mh$obs[1]/fit.mh$exp[1] # 423/471
male <- fit.mh$obs[2]/fit.mh$exp[2] # 540/492
# Hazard Ratio Using females as reference
hr <- male/female # (540/492)/(423/471)
hr
## [1] 1.221743
The hazard ratio of 1.222 means that the rate of death for males is 1.222 times the rate of death for females.
(Extra) Obtaining a 95% CI for the hazard ratio, we start with the standard error:
\[ SE[\log\widehat{HR}] = \sqrt{\frac{1}{E_T} + \frac{1}{E_C}} = \sqrt{\frac{1}{492} + \frac{1}{471}} = 0.06446 \]
\[ 1.222 \times e^{\pm1.96\times0.06446} = (1.0769,1.3866) \]
# Standard error of log hazard ratio
se <- sqrt((1/fit.mh$exp[2]) + (1/fit.mh$exp[1]))
# 95% Confidence interval
c(1.222*exp(-1.96*se), 1.222*exp(+1.96*se))
## [1] 1.076956 1.386579
You will again use the data frame mgus2
from the
rms
package.
library(rms)
str(mgus2)
## 'data.frame': 1384 obs. of 11 variables:
## $ id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ age : num 88 78 94 68 90 90 89 87 86 79 ...
## ..- attr(*, "label")= chr "AGE AT mgus_sp"
## $ sex : Factor w/ 2 levels "F","M": 1 1 2 2 1 2 1 1 1 1 ...
## ..- attr(*, "label")= chr "Sex"
## ..- attr(*, "format")= chr "$desex"
## $ dxyr : num 1981 1968 1980 1977 1973 ...
## $ hgb : num 13.1 11.5 10.5 15.2 10.7 12.9 10.5 12.3 14.5 9.4 ...
## $ creat : num 1.3 1.2 1.5 1.2 0.8 1 0.9 1.2 0.9 1.1 ...
## $ mspike: num 0.5 2 2.6 1.2 1 0.5 1.3 1.6 2.4 2.3 ...
## ..- attr(*, "label")= chr "SEP SPIKE AT mgus_sp"
## $ ptime : num 30 25 46 92 8 4 151 2 57 136 ...
## $ pstat : num 0 0 0 0 0 0 0 0 0 0 ...
## $ futime: num 30 25 46 92 8 4 151 2 57 136 ...
## $ death : num 1 1 1 1 1 1 1 1 0 1 ...
dd <- datadist(mgus2)
options(datadist = "dd")
Investigate if either the exponential or Weibull models would be appropriate to model the dependency of survival on sex.
fit.km <- npsurv(Surv(futime, death) ~ sex, data = mgus2)
# Check exponential
plot(fit.km, fun = "cumhaz")
# Check Weibull
plot(fit.km, fun = "cloglog")
Fit and interpret an exponential model and plot the result.
print(fit1 <- psm(Surv(futime,death)~sex,data=mgus2,dist="exponential"))
## Parametric Survival Model: Exponential Distribution
##
## psm(formula = Surv(futime, death) ~ sex, data = mgus2, dist = "exponential")
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 1384 LR chi2 9.99 R2 0.007
## Events 963 d.f. 1 R2(1,1384)0.006
## sigma 1.0000 Pr(> chi2) 0.0016 R2(1,963)0.009
## Dxy 0.062
##
## Coef S.E. Wald Z Pr(>|Z|)
## (Intercept) 5.0344 0.0486 103.54 <0.0001
## sex=M -0.2045 0.0649 -3.15 0.0016
# HR for sex: M vs. F
exp(-coef(fit1)[2])
## sex=M
## 1.226934
# Survival time ratio
summary(fit1,sex="F")
## Effects Response : Surv(futime, death)
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## sex - M:F 1 2 NA -0.20452 0.06493 -0.33189 -0.077146
## Survival Time Ratio 1 2 NA 0.81504 NA 0.71757 0.925750
survplot(fit1,sex)
survplot(fit.km,add=TRUE,label.curves=FALSE,conf="none",col="red")
Fit and interpret a Weibull model and plot the result.
print(fit2 <- psm(Surv(futime,death)~sex,data=mgus2,dist="weibull"))
## Parametric Survival Model: Weibull Distribution
##
## psm(formula = Surv(futime, death) ~ sex, data = mgus2, dist = "weibull")
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 1384 LR chi2 9.18 R2 0.007
## Events 963 d.f. 1 R2(1,1384)0.006
## sigma 1.1016 Pr(> chi2) 0.0024 R2(1,963)0.008
## Dxy 0.062
##
## Coef S.E. Wald Z Pr(>|Z|)
## (Intercept) 5.0494 0.0538 93.87 <0.0001
## sex=M -0.2162 0.0716 -3.02 0.0025
## Log(scale) 0.0967 0.0278 3.48 0.0005
# HR for sex: M vs. F
exp(-coef(fit2)[2]/fit2$scale)
## sex=M
## 1.216865
# Survival time ratio
summary(fit2,sex="F")
## Effects Response : Surv(futime, death)
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## sex - M:F 1 2 NA -0.21622 0.071609 -0.35669 -0.075742
## Survival Time Ratio 1 2 NA 0.80556 NA 0.69999 0.927060
survplot(fit2,sex)
survplot(fit.km,add=TRUE,label.curves=FALSE,conf="none",col="red")
Which looks better (exponential or Weibull)?
survplot(fit.km,conf="none")
survplot(fit1,sex,add=TRUE,label.curves=FALSE,col="red") # exponential
survplot(fit2,sex,add=TRUE,label.curves=FALSE,col="blue") # weibull
Fit a Cox model with sex as the covariate and interpret.
print(fit1.cox <- cph(Surv(futime,death)~sex,data=mgus2))
## Cox Proportional Hazards Model
##
## cph(formula = Surv(futime, death) ~ sex, data = mgus2)
##
## Model Tests Discrimination
## Indexes
## Obs 1384 LR chi2 9.68 R2 0.007
## Events 963 d.f. 1 R2(1,1384)0.006
## Center 0.1098 Pr(> chi2) 0.0019 R2(1,963)0.009
## Score chi2 9.65 Dxy 0.062
## Pr(> chi2) 0.0019
##
## Coef S.E. Wald Z Pr(>|Z|)
## sex=M 0.2017 0.0651 3.10 0.0019
summary(fit1.cox,sex="F")
## Effects Response : Surv(futime, death)
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## sex - M:F 1 2 NA 0.20174 0.065062 0.074221 0.32926
## Hazard Ratio 1 2 NA 1.22350 NA 1.077000 1.38990
Add age using a restricted cubic spline with 4 knots. Test it with a LRT and display results.
print(fit2.cox <- cph(Surv(futime,death)~sex+rcs(age,4),data=mgus2))
## Cox Proportional Hazards Model
##
## cph(formula = Surv(futime, death) ~ sex + rcs(age, 4), data = mgus2)
##
## Model Tests Discrimination
## Indexes
## Obs 1384 LR chi2 396.10 R2 0.249
## Events 963 d.f. 4 R2(4,1384)0.247
## Center 3.248 Pr(> chi2) 0.0000 R2(4,963)0.334
## Score chi2 408.99 Dxy 0.334
## Pr(> chi2) 0.0000
##
## Coef S.E. Wald Z Pr(>|Z|)
## sex=M 0.3605 0.0657 5.48 <0.0001
## age 0.0380 0.0108 3.52 0.0004
## age' 0.0473 0.0214 2.21 0.0272
## age'' -0.2396 0.1233 -1.94 0.0520
# LRT
lrtest(fit1.cox,fit2.cox)
##
## Model 1: Surv(futime, death) ~ sex
## Model 2: Surv(futime, death) ~ sex + rcs(age, 4)
##
## L.R. Chisq d.f. P
## 386.42 3.00 0.00
# Adjusted effect for sex
summary(fit2.cox,sex="F")
## Effects Response : Surv(futime, death)
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## age 63 79 16 1.13420 0.094988 0.94802 1.32040
## Hazard Ratio 63 79 16 3.10870 NA 2.58060 3.74480
## sex - M:F 1 2 NA 0.36049 0.065742 0.23164 0.48935
## Hazard Ratio 1 2 NA 1.43400 NA 1.26070 1.63120
sex
.age
.# You can specify the specific age groups
summary(fit2.cox,sex="F",age=c(40,80))
## Effects Response : Surv(futime, death)
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## age 40 80 40 2.16870 0.194200 1.78810 2.54930
## Hazard Ratio 40 80 40 8.74700 NA 5.97800 12.79800
## sex - M:F 1 2 NA 0.36049 0.065742 0.23164 0.48935
## Hazard Ratio 1 2 NA 1.43400 NA 1.26070 1.63120
sex
remains
unchanged.# Plot of age effect
plot(Predict(fit2.cox,age,sex))
sex
effect.age
.You will again use the data frame mgus2
from the
rms
package.
library(rms)
str(mgus2)
## 'data.frame': 1384 obs. of 11 variables:
## $ id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ age : num 88 78 94 68 90 90 89 87 86 79 ...
## ..- attr(*, "label")= chr "AGE AT mgus_sp"
## $ sex : Factor w/ 2 levels "F","M": 1 1 2 2 1 2 1 1 1 1 ...
## ..- attr(*, "label")= chr "Sex"
## ..- attr(*, "format")= chr "$desex"
## $ dxyr : num 1981 1968 1980 1977 1973 ...
## $ hgb : num 13.1 11.5 10.5 15.2 10.7 12.9 10.5 12.3 14.5 9.4 ...
## $ creat : num 1.3 1.2 1.5 1.2 0.8 1 0.9 1.2 0.9 1.1 ...
## $ mspike: num 0.5 2 2.6 1.2 1 0.5 1.3 1.6 2.4 2.3 ...
## ..- attr(*, "label")= chr "SEP SPIKE AT mgus_sp"
## $ ptime : num 30 25 46 92 8 4 151 2 57 136 ...
## $ pstat : num 0 0 0 0 0 0 0 0 0 0 ...
## $ futime: num 30 25 46 92 8 4 151 2 57 136 ...
## $ death : num 1 1 1 1 1 1 1 1 0 1 ...
dd <- datadist(mgus2)
options(datadist = "dd")
Fit a Cox model with sex
as the only covariate and test
the proportional hazards assumption.
fit1 <- cph(Surv(futime, death) ~ sex, data = mgus2, x = TRUE, y = TRUE)
print(fit1.zph <- cox.zph(fit1))
## chisq df p
## sex 2.47 1 0.12
## GLOBAL 2.47 1 0.12
# scaled Schoenfeld residuals
plot(fit1.zph)
abline(h=coef(fit1)[1], col="red", lwd=2, lty=3)
Estimate a suitable transformation of hgb
after
adjustment for sex
.
res <- resid(fit1)
res.lo <- loess(res ~ hgb, data = mgus2)
# restricted cubic spline with 5 knots
res.ols <- ols(res ~ rcs(hgb, 5), data = mgus2)
plot(Predict(res.ols, hgb), addpanel = function(...) {
panel.points(mgus2$hgb, res)
panel.lines(seq(5, 20, length.out = 25),
predict(res.lo, seq(5, 20, length.out = 25)),
col = "red")
}, ylim = 1.15 * range(res), ylab = "Martingale Residual", xlab = "Hgb")
Fit a Cox model that includes sex
, hgb
,
creat
and age.
Model the continuous variables
with 5 knot splines and include interactions between sex
and the other variables. Validate the model for overfitting.
fit2 <- cph(Surv(futime, death) ~ sex * (rcs(age, 5) + rcs(hgb, 5) + rcs(creat,5)),
data = mgus2, x = TRUE, y = TRUE)
anova(fit2)
## Wald Statistics Response: Surv(futime, death)
##
## Factor Chi-Square d.f. P
## sex (Factor+Higher Order Factors) 40.49 13 0.0001
## All Interactions 14.79 12 0.2529
## age (Factor+Higher Order Factors) 251.27 8 <.0001
## All Interactions 0.27 4 0.9919
## Nonlinear (Factor+Higher Order Factors) 7.46 6 0.2805
## hgb (Factor+Higher Order Factors) 56.74 8 <.0001
## All Interactions 0.36 4 0.9859
## Nonlinear (Factor+Higher Order Factors) 12.45 6 0.0527
## creat (Factor+Higher Order Factors) 44.76 8 <.0001
## All Interactions 13.57 4 0.0088
## Nonlinear (Factor+Higher Order Factors) 28.66 6 0.0001
## sex * age (Factor+Higher Order Factors) 0.27 4 0.9919
## Nonlinear 0.23 3 0.9718
## Nonlinear Interaction : f(A,B) vs. AB 0.23 3 0.9718
## sex * hgb (Factor+Higher Order Factors) 0.36 4 0.9859
## Nonlinear 0.08 3 0.9945
## Nonlinear Interaction : f(A,B) vs. AB 0.08 3 0.9945
## sex * creat (Factor+Higher Order Factors) 13.57 4 0.0088
## Nonlinear 11.36 3 0.0100
## Nonlinear Interaction : f(A,B) vs. AB 11.36 3 0.0100
## TOTAL NONLINEAR 48.33 18 0.0001
## TOTAL INTERACTION 14.79 12 0.2529
## TOTAL NONLINEAR + INTERACTION 56.21 21 <.0001
## TOTAL 468.25 25 <.0001
set.seed(1)
validate(fit2,B=100)
## Warning in fitter(..., strata = strata, rownames = rownames, offset = offset, :
## Loglik converged before variable 24 ; coefficient may be infinite.
## Warning in fitter(..., strata = strata, rownames = rownames, offset = offset, :
## Loglik converged before variable 15 ; coefficient may be infinite.
## Warning in fitter(..., strata = strata, rownames = rownames, offset = offset, :
## Loglik converged before variable 18 ; coefficient may be infinite.
## index.orig training test optimism index.corrected n
## Dxy 0.4023 0.4112 0.3951 0.0162 0.3862 100
## R2 0.3091 0.3220 0.2878 0.0342 0.2749 100
## Slope 1.0000 1.0000 0.8882 0.1118 0.8882 100
## D 0.0407 0.0428 0.0374 0.0054 0.0353 100
## U -0.0002 -0.0002 0.0021 -0.0023 0.0021 100
## Q 0.0408 0.0429 0.0352 0.0077 0.0331 100
## g 0.9390 0.9708 0.8629 0.1079 0.8312 100
index.corrected
values do not change much from
index.orig
.