Analysis of Pima Dataset in R

11 minute read

Published:

In this blog post I would like to provide an anylsis of the Pima dataset that is available in the faraway R-package. The exercises below are part of the course MAST90139: Statistical Modelling for Data Science at the University of Melbourne. Analysis of Pima dataset in R (Diabetes survey on Pima Indians)

First we clean up any variables that we may have left in the environment

rm(list = ls())

Load data and perform EDA

library(faraway)
library(ggplot2)
data(pima)
head(pima)
help(pima)
dim(pima)
## [1] 768   9

Exercise 1:

Create a factor version of the test results and use this to produce an interleaved histogram to show how the distribution of insulin differs between those testing positive and negative. Do you notice anything unbelievable about the plot?

pima$test = as.factor(pima$test)
levels(pima$test) <- c("negative","positive"); pima[1,]
par(mfrow=c(1,2)); plot(insulin ~ test, pima)

ggplot(pima, aes(x=insulin, color=test)) + geom_histogram(position="dodge", binwidth=30)

library(ggplot2)
ggplot(pima, aes(x = insulin, color = test)) + geom_histogram(position="dodge",
                       binwidth=30, aes(y=..density..))

summary(pima$test[pima$insulin==0])
## negative positive 
##      236      138

High values of insulin seem to correlate with signs of diabetes!

Exercise 2:

Replace the zero values of insulin with the missing value code NA. Recreate the interleaved histogram plot and comment on the distribution.

pima$insulin[pima$insulin == 0] <- NA
ggplot(pima, aes(x = insulin, color = test)) + geom_histogram(position="dodge",
                       binwidth=30, aes(y=..density..))
## Warning: Removed 374 rows containing non-finite values (stat_bin).

After replacing the zero values with NA, the relationship becomes even more clearer!

Exercise 3:

Replace the incredible zeroes in other variables with the missing value code. Fit a model with the result of the diabetes test as the response and all the other variables as predictors. How many observations were used in the model fitting? Why is this less than the number of observations in the data frame?

pima[pima == 0] <- NA
# Fit logistic regression model from binomial family
model1 <- glm(test ~ pregnant + glucose + diastolic + triceps + insulin + bmi + diabetes + age,family =  binomial, pima)
summary(model1)
## 
## Call:
## glm(formula = test ~ pregnant + glucose + diastolic + triceps + 
##     insulin + bmi + diabetes + age, family = binomial, data = pima)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8619  -0.6557  -0.3295   0.6158   2.6339  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.083e+01  1.423e+00  -7.610 2.73e-14 ***
## pregnant     7.364e-02  5.973e-02   1.233   0.2176    
## glucose      3.616e-02  6.249e-03   5.785 7.23e-09 ***
## diastolic    5.993e-03  1.320e-02   0.454   0.6497    
## triceps      1.110e-02  1.869e-02   0.594   0.5527    
## insulin      3.231e-05  1.445e-03   0.022   0.9822    
## bmi          7.615e-02  3.174e-02   2.399   0.0164 *  
## diabetes     1.097e+00  4.777e-01   2.297   0.0216 *  
## age          4.075e-02  1.919e-02   2.123   0.0337 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 426.34  on 335  degrees of freedom
## Residual deviance: 288.92  on 327  degrees of freedom
##   (432 observations deleted due to missingness)
## AIC: 306.92
## 
## Number of Fisher Scoring iterations: 5

In the pima dataframe, there are 768 observations, but only 327+9=336 observations were used to fit the model. 432 observations were deleted due to missingness.

plot(model1)

summary(pima$insulin)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14.00   76.25  125.00  155.55  190.00  846.00     374
summary(pima$insulin[pima$test=="negative"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    15.0    66.0   102.5   130.3   161.2   744.0     236
summary(pima$insulin[pima$test=="positive"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    14.0   127.5   169.5   206.8   239.2   846.0     138

Exercise 4:

Refit the model but now without the insulin and triceps predictors. How many observations were used in fitting this model? Devise a test to compare this model with that in the previous question.

# Fit logistic regression model from binomial family
model2 <- glm(test ~ pregnant + glucose + diastolic  + bmi + diabetes + age,family =  binomial, pima)
summary(model2)
## 
## Call:
## glm(formula = test ~ pregnant + glucose + diastolic + bmi + diabetes + 
##     age, family = binomial, data = pima)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8459  -0.7067  -0.3827   0.7018   2.4302  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.354750   0.915697 -10.216  < 2e-16 ***
## pregnant     0.130695   0.037880   3.450  0.00056 ***
## glucose      0.035337   0.003900   9.061  < 2e-16 ***
## diastolic   -0.008673   0.009422  -0.920  0.35734    
## bmi          0.098547   0.017768   5.546 2.92e-08 ***
## diabetes     1.020669   0.336136   3.036  0.00239 ** 
## age          0.016642   0.010553   1.577  0.11478    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 807.12  on 624  degrees of freedom
## Residual deviance: 577.80  on 618  degrees of freedom
##   (143 observations deleted due to missingness)
## AIC: 591.8
## 
## Number of Fisher Scoring iterations: 5

We can’t use ANOVA to perform model test because amount of data has been used during model fitting.

Not including insulin and triceps into the model, the model is fitted using 625 observations. So it can not be compared with the model in 3. because the number of observations used in 3. is 336. These two models can only be compared of each other based on the same data.

We make this possible by using data pimaN which removes all cases containing NAs. The results can be seen in comparing lmodNA1 with lmodNA2 in R, with p-value of 0.8386. Thus there is no significant difference between the two models in terms of adequacy of fit.

pimaN <- na.omit(pima)
lmodNA1 <- glm(test ~ pregnant+glucose+diastolic+triceps+insulin+bmi+diabetes+age, family = binomial, pimaN) 
lmodNA2 <- glm(test ~ pregnant+glucose+diastolic+bmi+diabetes+age, family = binomial, pimaN)
anova(lmodNA2, lmodNA1, test="Chi")

Exercise 5:

Use AIC to select a model. You will need to take account of the missing values. Which predictors are selected? How many cases are used in your selected model?

lmodNAr <- step(lmodNA1, trace=0)
summary(lmodNAr)
## 
## Call:
## glm(formula = test ~ glucose + bmi + diabetes + age, family = binomial, 
##     data = pimaN)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8112  -0.6673  -0.3433   0.6128   2.6207  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.810466   1.253806  -8.622  < 2e-16 ***
## glucose       0.036394   0.005495   6.624 3.51e-11 ***
## bmi           0.089165   0.024301   3.669 0.000243 ***
## diabetes      1.055880   0.465979   2.266 0.023455 *  
## age           0.059405   0.014515   4.093 4.26e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 426.34  on 335  degrees of freedom
## Residual deviance: 291.12  on 331  degrees of freedom
## AIC: 301.12
## 
## Number of Fisher Scoring iterations: 5

Exercise 6:

*Create a variable that indicates whether the case contains a missing value. Use this variable as a predictor of the test result. Is missingness associated with the test result? Refit the selected model, but now using as much of the data as reasonable. Explain why it is appropriate to do this.**

pima$misIndicator<-apply(pima,1, anyNA); xtabs(~test + misIndicator, pima)
##           misIndicator
## test       FALSE TRUE
##   negative   225  275
##   positive   111  157
summary(glm(test~misIndicator, family=binomial, pima))$coef
##                    Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)      -0.7065702  0.1159890 -6.0917001 1.117178e-09
## misIndicatorTRUE  0.1460449  0.1531641  0.9535193 3.403270e-01
anova(glm(test ~ misIndicator, family=binomial, pima), test="Chi")
chisq.test(pima$test, pima$misIndicator, correct=F)
## 
##  Pearson's Chi-squared test
## 
## data:  pima$test and pima$misIndicator
## X-squared = 0.90974, df = 1, p-value = 0.3402
lmodNArs <- glm(test ~ glucose + bmi + diabetes + age, family = binomial, data = pima)
summary(lmodNArs)
## 
## Call:
## glm(formula = test ~ glucose + bmi + diabetes + age, family = binomial, 
##     data = pima)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7389  -0.7362  -0.4103   0.7239   2.4344  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.302177   0.728380 -12.771  < 2e-16 ***
## glucose      0.035281   0.003517  10.030  < 2e-16 ***
## bmi          0.086372   0.014448   5.978 2.25e-09 ***
## diabetes     0.866221   0.298356   2.903 0.003692 ** 
## age          0.028764   0.007852   3.663 0.000249 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 974.75  on 751  degrees of freedom
## Residual deviance: 716.30  on 747  degrees of freedom
##   (16 observations deleted due to missingness)
## AIC: 726.3
## 
## Number of Fisher Scoring iterations: 5

Exercise 7:

Using the last fitted model of the previous question, what is the difference in the log-odds of testing positive for diabetes for a woman with a BMI at the first quartile compared with a woman at the third quartile, assuming that all other factors are held constant? Then calculate the associated odds ratio value, and give a 95% confidence interval for this odds ratio.

summary(pima$bmi)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.20   27.50   32.30   32.46   36.60   67.10      11
# Log-odds difference
diff = 0.086372 * (36.60 - 27.50)

# Estimated log-odds ratio
exp_diff = exp(0.086372 * (36.60 - 27.50))

#95% conf int for odds ratio
conf_int_odds <- cbind(diff - 1.96*0.014448*(36.6 - 27.50), diff + 1.96*0.014448*(36.6 - 27.50))

#95% conf int for estimated odds ratio
(conf_int_exp <- cbind(exp(conf_int_odds[1]), exp(conf_int_odds[2])))
##          [,1]     [,2]
## [1,] 1.696031 2.839647

Exercise 8:

Do women who test positive have higher diastolic blood pressures? Is the diastolic blood pressure significant in the logistic regression model? Explain the distinction between the two questions and discuss why the answers are only apparently contradictory.

Diastolic values tend to be higher for those positives. But the interleaved histograms of the diastolic between those testing positive and negative do not seem to be significantly different. However, both the two-sample t test and the Wilcoxon rank- sum test suggest the positive cases have significantly higher diastolic blood pressures (with p-values of 0.03576 and 3.779 × 10−5 respectively).

On the other hand, diastolicN is not found to be significant to the odds of positive test vs. negative test based on the aforementioned logistic models. The means a given difference between the diastolic pressures of two women does not lead to a significant value of odds ratio of positive test vs. negative test between the two women. Therefore, although the two answers appear to be contradictory, they are actually not.

summary(pima$diastolic[pima$test=="negative"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   24.00   62.00   70.00   70.88   78.00  122.00      19
summary(pima$diastolic[pima$test=="positive"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   30.00   68.00   74.50   75.32   84.00  114.00      16
t.test(diastolic~test, alternative="less",data=pima, var.equal=T)
## 
##  Two Sample t-test
## 
## data:  diastolic by test
## t = -4.6808, df = 731, p-value = 1.703e-06
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -2.880447
## sample estimates:
## mean in group negative mean in group positive 
##               70.87734               75.32143
wilcox.test(diastolic~test, alternative="less",data=pima)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  diastolic by test
## W = 47566, p-value = 8.143e-07
## alternative hypothesis: true location shift is less than 0
ggplot(pima, aes(x=diastolic, color=test)) + geom_histogram(position="dodge", binwidth=10)
## Warning: Removed 35 rows containing non-finite values (stat_bin).

ggplot(pima, aes(x=diastolic, color=test)) + geom_histogram(position="dodge", binwidth=10, aes(y=..density..))
## Warning: Removed 35 rows containing non-finite values (stat_bin).