Analysis of Swiss Dataset in R
Published:
Analysis of Swiss dataset in R (Swiss Fertility and Socioeconomic Indicators 1888)
First we clean up any variables that may be left in the existing R environment.
rm(list = ls())
Load data from Faraway.
library(faraway); require(graphics);
data(swiss)
?swiss
dim(swiss);
## [1] 47 6
head(swiss)
1. Initial data analysis that explores the numerical and graphical characteristics of the data
Numerical characteristics
Print out numerical summary of variables
summary(swiss)
## Fertility Agriculture Examination Education
## Min. :35.00 Min. : 1.20 Min. : 3.00 Min. : 1.00
## 1st Qu.:64.70 1st Qu.:35.90 1st Qu.:12.00 1st Qu.: 6.00
## Median :70.40 Median :54.10 Median :16.00 Median : 8.00
## Mean :70.14 Mean :50.66 Mean :16.49 Mean :10.98
## 3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:22.00 3rd Qu.:12.00
## Max. :92.50 Max. :89.70 Max. :37.00 Max. :53.00
## Catholic Infant.Mortality
## Min. : 2.150 Min. :10.80
## 1st Qu.: 5.195 1st Qu.:18.15
## Median : 15.140 Median :20.00
## Mean : 41.144 Mean :19.94
## 3rd Qu.: 93.125 3rd Qu.:21.70
## Max. :100.000 Max. :26.60
cor(swiss)
## Fertility Agriculture Examination Education Catholic
## Fertility 1.0000000 0.35307918 -0.6458827 -0.66378886 0.4636847
## Agriculture 0.3530792 1.00000000 -0.6865422 -0.63952252 0.4010951
## Examination -0.6458827 -0.68654221 1.0000000 0.69841530 -0.5727418
## Education -0.6637889 -0.63952252 0.6984153 1.00000000 -0.1538589
## Catholic 0.4636847 0.40109505 -0.5727418 -0.15385892 1.0000000
## Infant.Mortality 0.4165560 -0.06085861 -0.1140216 -0.09932185 0.1754959
## Infant.Mortality
## Fertility 0.41655603
## Agriculture -0.06085861
## Examination -0.11402160
## Education -0.09932185
## Catholic 0.17549591
## Infant.Mortality 1.00000000
The numerical summary of the data shows that all the 6 variables are numerical with weak to moderate linear correlations among them.
Graphical characteristics
pairs(swiss, panel = panel.smooth, main = "swiss data", col = 3 + (swiss$Catholic > 50))
plot(density(swiss$Fertility),main="Fertility",xlab="Fertility")
rug(swiss$Fertility)
hist(swiss$Fertility,freq=F,add=T)
qqnorm(swiss$Fertility, ylab="Fertility")
qqline(swiss$Fertility)
It seems the distribution of Fertility is not too different from the normal except for small values of Fertility.
plot(swiss)
A matrix of scatter-plots for the 6 variables indicates * Fertility has positive correlation with Agriculture and Infant.Mortality; * * Fertility has negative correlation with Examination and Education; * Fertility hasa curvature correlation with Catholic.
plot(Fertility ~ Agriculture, swiss, xlab="", las=3)
# Interesting observation (higher degree of catholic comes with higher fertility )
plot(Fertility ~ Catholic, swiss, xlab="", las=3)
plot(Fertility ~ Education, swiss, xlab="", las=3)
plot(Fertility ~ Infant.Mortality, swiss, xlab="", las=3)
2. Variable selection to choose the best model
We start by fitting a linear regression model.
lmod <- lm(Fertility ~ ., swiss);
summary(lmod)
##
## Call:
## lm(formula = Fertility ~ ., data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Examination -0.25801 0.25388 -1.016 0.31546
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
# Use drop1(lmod, test="F") alternatively
lmod_reduced = step(lmod)
## Start: AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
##
## Df Sum of Sq RSS AIC
## - Examination 1 53.03 2158.1 189.86
## <none> 2105.0 190.69
## - Agriculture 1 307.72 2412.8 195.10
## - Infant.Mortality 1 408.75 2513.8 197.03
## - Catholic 1 447.71 2552.8 197.75
## - Education 1 1162.56 3267.6 209.36
##
## Step: AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
##
## Df Sum of Sq RSS AIC
## <none> 2158.1 189.86
## - Agriculture 1 264.18 2422.2 193.29
## - Infant.Mortality 1 409.81 2567.9 196.03
## - Catholic 1 956.57 3114.6 205.10
## - Education 1 2249.97 4408.0 221.43
summary(lmod_reduced)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
anova(lmod, lmod_reduced)
By both a t-test and an ANOVA F test we find Examination does not have significant effect on Fertility.
We then treat Fertility ∼ (Agriculture + Education + Catholic + Infant.Mortality)^2 as the full model, and use step() with BIC for selecting the best model.
# Interaction term doesn't seem to bring major improvements
lmodi = lm(Fertility ~ (Agriculture + Education + Catholic + Infant.Mortality)^2, data = swiss)
lmodi_reduced = step(lmodi, trace = FALSE, k = log(47))
summary(lmodi_reduced)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality + Education:Catholic, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9060 -5.4997 0.9556 3.6698 13.8934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.752308 9.919330 5.419 2.89e-06 ***
## Agriculture -0.134055 0.065843 -2.036 0.04825 *
## Education -0.515105 0.252478 -2.040 0.04781 *
## Catholic 0.207038 0.046184 4.483 5.81e-05 ***
## Infant.Mortality 1.239697 0.372195 3.331 0.00184 **
## Education:Catholic -0.011255 0.005058 -2.225 0.03161 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.853 on 41 degrees of freedom
## Multiple R-squared: 0.7318, Adjusted R-squared: 0.699
## F-statistic: 22.37 on 5 and 41 DF, p-value: 9.443e-11
The fitted best model is Fertility = 53.75 − 0.134Agriculture − 0.515Education + 0.207Catholic + 1.24Infant.Mortality − 0.011Education:Catholic
with R2 = 0.7318 and Ra2 = 0.699.
drop1(lmodi_reduced)
summary(lmodi_reduced)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality + Education:Catholic, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9060 -5.4997 0.9556 3.6698 13.8934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.752308 9.919330 5.419 2.89e-06 ***
## Agriculture -0.134055 0.065843 -2.036 0.04825 *
## Education -0.515105 0.252478 -2.040 0.04781 *
## Catholic 0.207038 0.046184 4.483 5.81e-05 ***
## Infant.Mortality 1.239697 0.372195 3.331 0.00184 **
## Education:Catholic -0.011255 0.005058 -2.225 0.03161 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.853 on 41 degrees of freedom
## Multiple R-squared: 0.7318, Adjusted R-squared: 0.699
## F-statistic: 22.37 on 5 and 41 DF, p-value: 9.443e-11
The terms in this model cannot be further reduced by the drop1() command.
par(mfrow=c(2,2)); termplot(lmodi_reduced,partial=T,terms=NULL); plot(lmodi_reduced)
## Warning in termplot(lmodi_reduced, partial = T, terms = NULL): 'model' appears
## to involve interactions: see the help page
3. Exploration of transformations to improve the fit of the model
The model does not seem to need a transformation on the response variable because the empirical distribution of Fertility is not far from the normal.
On the other hand, the relationship between Fertility and Catholoc seems to be curvature.
_Hence we investigate the transformation of the “Catholic” variable because it has a curvature effect on Fertility. We replace “Catholic” with a quadratic term
library(MASS)
# poly(3) constructs a transformation of poly w/ vec, vec^1, vec^2 such that correlation is minimized
# vec = c(1,2,3,4)
# poly(vec, 3, raw=TRUE)
Wlmodp<-lm(Fertility~Agriculture+Education+poly(Catholic,2)+Infant.Mortality + Education:poly(Catholic,2), swiss)
summary(Wlmodp)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + poly(Catholic,
## 2) + Infant.Mortality + Education:poly(Catholic, 2), data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5584 -5.0451 0.0393 3.5404 15.3300
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.37216 9.90815 6.093 3.84e-07 ***
## Agriculture -0.13234 0.07102 -1.863 0.069955 .
## Education -0.68355 0.26160 -2.613 0.012684 *
## poly(Catholic, 2)1 51.29861 14.04601 3.652 0.000763 ***
## poly(Catholic, 2)2 1.52390 12.92429 0.118 0.906744
## Infant.Mortality 1.21767 0.37666 3.233 0.002496 **
## Education:poly(Catholic, 2)1 -1.79870 1.73121 -1.039 0.305211
## Education:poly(Catholic, 2)2 0.76806 0.72902 1.054 0.298573
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.851 on 39 degrees of freedom
## Multiple R-squared: 0.745, Adjusted R-squared: 0.6992
## F-statistic: 16.28 on 7 and 39 DF, p-value: 8.356e-10
Wlmodp1<-lm(Fertility~Agriculture + Education + poly(Catholic,2) + Infant.Mortality + Education:Catholic, swiss)
summary(Wlmodp1)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + poly(Catholic,
## 2) + Infant.Mortality + Education:Catholic, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.8316 -5.2273 0.2632 4.0651 14.2838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.826469 9.825002 6.293 1.83e-07 ***
## Agriculture -0.152171 0.068575 -2.219 0.032221 *
## Education -0.517682 0.252752 -2.048 0.047145 *
## poly(Catholic, 2)1 55.884416 13.372902 4.179 0.000155 ***
## poly(Catholic, 2)2 9.820777 10.261947 0.957 0.344311
## Infant.Mortality 1.269834 0.373906 3.396 0.001556 **
## Education:Catholic -0.009239 0.005484 -1.685 0.099837 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.86 on 40 degrees of freedom
## Multiple R-squared: 0.7378, Adjusted R-squared: 0.6984
## F-statistic: 18.75 on 6 and 40 DF, p-value: 3.078e-10
plot(Wlmodp1)
termplot(Wlmodp,partial=T,terms=3)
library(splines)
Wlmods<-lm(Fertility~Agriculture+Education+bs(Catholic,3)+Infant.Mortality + Education:bs(Catholic,3), swiss)
summary(Wlmods)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + bs(Catholic,
## 3) + Infant.Mortality + Education:bs(Catholic, 3), data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5473 -5.0681 0.5734 3.2353 15.5592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.15343 11.66882 5.069 1.14e-05 ***
## Agriculture -0.12260 0.07291 -1.681 0.1011
## Education -0.55823 0.41581 -1.343 0.1876
## bs(Catholic, 3)1 -11.07029 18.70215 -0.592 0.5575
## bs(Catholic, 3)2 35.45274 28.03511 1.265 0.2139
## bs(Catholic, 3)3 13.63965 6.35112 2.148 0.0384 *
## Infant.Mortality 1.00679 0.43136 2.334 0.0251 *
## Education:bs(Catholic, 3)1 0.66876 1.50115 0.445 0.6586
## Education:bs(Catholic, 3)2 -2.42495 1.48402 -1.634 0.1107
## Education:bs(Catholic, 3)3 -0.25491 0.84904 -0.300 0.7657
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.931 on 37 degrees of freedom
## Multiple R-squared: 0.7524, Adjusted R-squared: 0.6921
## F-statistic: 12.49 on 9 and 37 DF, p-value: 8.342e-09
Wlmods1<-lm(Fertility~Agriculture+Education+bs(Catholic,3)+Infant.Mortality + Education:Catholic, swiss)
summary(Wlmods1)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + bs(Catholic,
## 3) + Infant.Mortality + Education:Catholic, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.377 -5.072 0.321 4.014 14.446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.455117 10.117571 5.481 2.72e-06 ***
## Agriculture -0.148741 0.070264 -2.117 0.040702 *
## Education -0.479725 0.284130 -1.688 0.099316 .
## bs(Catholic, 3)1 -3.217588 11.717565 -0.275 0.785077
## bs(Catholic, 3)2 11.056766 19.100029 0.579 0.565994
## bs(Catholic, 3)3 19.202744 4.706975 4.080 0.000216 ***
## Infant.Mortality 1.259959 0.379587 3.319 0.001964 **
## Education:Catholic -0.010382 0.006687 -1.553 0.128614
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.939 on 39 degrees of freedom
## Multiple R-squared: 0.7384, Adjusted R-squared: 0.6914
## F-statistic: 15.72 on 7 and 39 DF, p-value: 1.35e-09
plot(Wlmods1)
termplot(Wlmods,partial=T,terms=NULL)
## Warning in termplot(Wlmods, partial = T, terms = NULL): 'model' appears to
## involve interactions: see the help page
Thus, we replace Catholic by poly(Catholic, 2) or bs(Catholic, 3) in the model lmodi_reduced to see whether any improvement of fit can be made.
It does not seem to achieve any significant improvement by doing this. Here the order 2 in poly() and df 3 in bs() are selected using a try-and-error approach
4. Diagnostics to check the assumptions of the model.
plot(lmodi_reduced)
The diagnostics plots given by plot(smallm) show that the model provides a good fit to the data in general.
The residuals vs. leverage plot identifies 4 provinces that have the largest Cook distance values and are influential to model fitting. These 4 provinces are Porrentruy, Sierre, Sion, and Rive Gauche.
We try to filter out the outliers by getting the indizes first:
(1:47)[rownames(swiss)=="Sion"] #38
## [1] 38
(1:47)[rownames(swiss)=="Sierre"] #37
## [1] 37
(1:47)[rownames(swiss)=="Porrentruy"] #6
## [1] 6
(1:47)[rownames(swiss)=="Rive Gauche"] #47
## [1] 47
swiss[c(6,37,38,47),]
rownames(swiss)[c(6,37,38,47)]
## [1] "Porrentruy" "Sierre" "Sion" "Rive Gauche"
hatvalues(lmodi_reduced)
## Courtelary Delemont Franches-Mnt Moutier Neuveville Porrentruy
## 0.10260793 0.07580639 0.13623845 0.05973401 0.07103360 0.18823843
## Broye Glane Gruyere Sarine Veveyse Aigle
## 0.08842886 0.11689602 0.07577713 0.14642334 0.10934403 0.07917772
## Aubonne Avenches Cossonay Echallens Grandson Lausanne
## 0.07598143 0.12553452 0.08628630 0.07370320 0.06376129 0.22670153
## La Vallee Lavaux Morges Moudon Nyone Orbe
## 0.34956953 0.10836510 0.05764165 0.09609790 0.05289707 0.11454455
## Oron Payerne Paysd'enhaut Rolle Vevey Yverdon
## 0.13083077 0.11033637 0.10103769 0.07368892 0.06659714 0.06513090
## Conthey Entremont Herens Martigwy Monthey St Maurice
## 0.19704637 0.09502856 0.15680036 0.07880332 0.10133459 0.10272389
## Sierre Sion Boudry La Chauxdfnd Le Locle Neuchatel
## 0.15149713 0.15886464 0.04426588 0.15781432 0.10040794 0.31804461
## Val de Ruz ValdeTravers V. De Geneve Rive Droite Rive Gauche
## 0.06265152 0.14516128 0.48313011 0.18286595 0.23514778
influence.measures(lmodi_reduced)
## Influence measures of
## lm(formula = Fertility ~ Agriculture + Education + Catholic + Infant.Mortality + Education:Catholic, data = swiss) :
##
## dfb.1_ dfb.Agrc dfb.Edct dfb.Cthl dfb.In.M dfb.Ed.C dffit
## Courtelary 0.04432 -0.23133 -0.04702 0.02321 0.08976 -0.045367 0.3477
## Delemont 0.02296 -0.07573 -0.06375 0.02187 0.02022 0.045778 0.1524
## Franches-Mnt 0.21223 -0.34081 -0.13079 0.24396 -0.10810 -0.040840 0.4714
## Moutier 0.23577 -0.31232 -0.24858 -0.04524 -0.06493 0.109009 0.4287
## Neuveville -0.19949 0.14133 0.30260 0.07440 0.17956 -0.277192 0.4697
## Porrentruy 0.28682 0.51113 0.17016 -0.33337 -0.65439 0.042706 -1.1444
## Broye -0.05121 0.02808 0.01518 0.02464 0.05232 -0.009638 0.0894
## Glane -0.33831 0.14506 0.05770 0.09969 0.37261 -0.003844 0.5522
## Gruyere 0.00769 -0.01512 -0.00899 0.01531 -0.00207 0.002235 0.0360
## Sarine -0.04332 -0.08637 -0.17368 -0.08170 0.14313 0.224415 0.3991
## Veveyse -0.02780 0.00534 0.00927 0.02191 0.03192 -0.011992 0.0512
## Aigle 0.02555 0.08203 0.04882 -0.02599 -0.06348 -0.027970 0.1668
## Aubonne -0.00590 0.04462 -0.00259 -0.04341 0.00245 0.011573 0.0789
## Avenches -0.04235 0.03756 0.03065 -0.00626 0.03866 -0.020754 0.0588
## Cossonay -0.01869 -0.07902 0.04641 0.11646 0.01740 -0.059470 -0.1765
## Echallens 0.04652 -0.10426 0.05875 0.11913 -0.06872 -0.059214 -0.2316
## Grandson 0.02560 -0.03053 -0.02578 -0.01816 -0.00468 0.010647 0.0563
## Lausanne 0.15517 -0.03529 -0.35875 -0.22776 -0.10298 0.323257 -0.4258
## La Vallee -0.04166 0.02470 -0.00858 -0.01925 0.04586 0.018153 -0.0606
## Lavaux -0.01514 0.02795 0.01095 -0.01275 0.00996 -0.003650 0.0359
## Morges 0.00589 0.03622 0.01248 -0.02478 -0.01456 -0.005858 0.0774
## Moudon 0.02621 -0.01703 0.20818 0.27871 -0.15585 -0.177715 -0.4225
## Nyone -0.10466 -0.01819 -0.01116 0.02788 0.12435 0.010459 -0.2100
## Orbe -0.24083 0.04520 0.14986 0.13634 0.22332 -0.114807 -0.3093
## Oron -0.00167 0.04374 -0.06701 -0.10697 0.02335 0.071831 0.1384
## Payerne -0.07553 0.05346 0.01904 -0.03989 0.08947 -0.008693 0.1207
## Paysd'enhaut 0.12733 0.02770 -0.16655 -0.20464 -0.09151 0.154218 0.2853
## Rolle -0.01353 -0.01267 -0.00167 0.01062 0.01910 -0.000469 -0.0375
## Vevey 0.06764 0.05712 -0.16425 -0.10122 -0.09586 0.160539 -0.3186
## Yverdon 0.09036 -0.04223 0.00781 0.09329 -0.14709 -0.001803 -0.2467
## Conthey -0.06158 -0.03564 -0.05678 -0.13143 0.11872 0.091731 -0.2257
## Entremont 0.10322 -0.21873 -0.11342 -0.15669 -0.00473 0.077303 -0.4227
## Herens 0.04272 -0.11975 -0.13976 -0.19995 0.03890 0.172081 -0.3090
## Martigwy 0.02012 -0.10830 -0.05430 -0.14520 0.04214 0.050813 -0.3363
## Monthey -0.00772 0.04713 -0.06527 -0.24601 0.01466 0.158212 -0.3184
## St Maurice -0.09479 -0.07063 0.06414 -0.01520 0.14726 -0.111047 -0.3379
## Sierre 0.19380 0.20869 0.24983 0.55966 -0.44117 -0.370574 0.9782
## Sion 0.37408 -0.03665 -0.44940 -0.24868 -0.37496 0.625641 0.8890
## Boudry -0.00192 -0.00941 0.01487 -0.00289 0.01547 -0.023829 0.0765
## La Chauxdfnd -0.28260 0.48690 0.21199 -0.02541 0.06726 -0.030768 -0.5489
## Le Locle 0.11188 -0.15224 -0.04212 0.02140 -0.05499 -0.014786 0.1923
## Neuchatel -0.25321 0.08241 0.36291 0.19664 0.20863 -0.293513 0.4274
## Val de Ruz 0.12806 -0.13513 -0.14788 -0.11270 -0.02544 0.077760 0.2757
## ValdeTravers -0.23155 0.28786 0.20848 0.06018 0.09330 -0.091000 -0.3449
## V. De Geneve -0.01186 0.03937 0.13125 -0.07455 -0.05176 0.134480 0.5617
## Rive Droite -0.00920 -0.10566 -0.00117 0.13190 0.05586 -0.167981 -0.3347
## Rive Gauche -0.22843 0.11133 0.31880 0.40254 0.16802 -0.638422 -0.8724
## cov.r cook.d hat inf
## Courtelary 1.105 0.020115 0.1026
## Delemont 1.203 0.003941 0.0758
## Franches-Mnt 1.091 0.036669 0.1362
## Moutier 0.811 0.029276 0.0597
## Neuveville 0.822 0.035158 0.0710
## Porrentruy 0.647 0.196042 0.1882
## Broye 1.257 0.001362 0.0884
## Glane 0.939 0.049247 0.1169
## Gruyere 1.252 0.000221 0.0758
## Sarine 1.184 0.026593 0.1464
## Veveyse 1.298 0.000447 0.1093
## Aigle 1.200 0.004715 0.0792
## Aubonne 1.241 0.001061 0.0760
## Avenches 1.321 0.000590 0.1255
## Cossonay 1.208 0.005280 0.0863
## Echallens 1.133 0.009011 0.0737
## Grandson 1.230 0.000540 0.0638
## Lausanne 1.368 0.030501 0.2267
## La Vallee 1.781 0.000627 0.3496 *
## Lavaux 1.299 0.000221 0.1084
## Morges 1.213 0.001022 0.0576
## Moudon 1.003 0.029263 0.0961
## Nyone 1.089 0.007385 0.0529
## Orbe 1.173 0.016043 0.1145
## Oron 1.309 0.003262 0.1308
## Payerne 1.281 0.002483 0.1103
## Paysd'enhaut 1.158 0.013654 0.1010
## Rolle 1.249 0.000240 0.0737
## Vevey 1.007 0.016744 0.0666
## Yverdon 1.090 0.010175 0.0651
## Conthey 1.400 0.008659 0.1970
## Entremont 0.998 0.029280 0.0950
## Herens 1.274 0.016108 0.1568
## Martigwy 1.036 0.018707 0.0788
## Monthey 1.129 0.016938 0.1013
## St Maurice 1.115 0.019030 0.1027
## Sierre 0.643 0.144148 0.1515
## Sion 0.759 0.122232 0.1589
## Boudry 1.191 0.000996 0.0443
## La Chauxdfnd 1.087 0.049474 0.1578
## Le Locle 1.227 0.006262 0.1004
## Neuchatel 1.604 0.030904 0.3180 *
## Val de Ruz 1.046 0.012622 0.0627
## ValdeTravers 1.222 0.019977 0.1452
## V. De Geneve 2.133 0.053448 0.4831 *
## Rive Droite 1.317 0.018903 0.1829
## Rive Gauche 1.058 0.122438 0.2351
cooks.distance(lmodi_reduced)
## Courtelary Delemont Franches-Mnt Moutier Neuveville Porrentruy
## 0.0201154852 0.0039410826 0.0366691513 0.0292755138 0.0351584235 0.1960420472
## Broye Glane Gruyere Sarine Veveyse Aigle
## 0.0013622033 0.0492469430 0.0002214636 0.0265934794 0.0004467372 0.0047153887
## Aubonne Avenches Cossonay Echallens Grandson Lausanne
## 0.0010608615 0.0005897404 0.0052804731 0.0090110500 0.0005400547 0.0305009295
## La Vallee Lavaux Morges Moudon Nyone Orbe
## 0.0006271319 0.0002206554 0.0010221776 0.0292633997 0.0073849102 0.0160432221
## Oron Payerne Paysd'enhaut Rolle Vevey Yverdon
## 0.0032624985 0.0024832884 0.0136542028 0.0002399045 0.0167444949 0.0101748505
## Conthey Entremont Herens Martigwy Monthey St Maurice
## 0.0086585000 0.0292802881 0.0161075451 0.0187071209 0.0169381979 0.0190298484
## Sierre Sion Boudry La Chauxdfnd Le Locle Neuchatel
## 0.1441475824 0.1222319428 0.0009958496 0.0494736931 0.0062623898 0.0309042646
## Val de Ruz ValdeTravers V. De Geneve Rive Droite Rive Gauche
## 0.0126224991 0.0199766071 0.0534481565 0.0189033468 0.1224375602
sort(cooks.distance(lmodi_reduced))
## Lavaux Gruyere Rolle Veveyse Grandson Avenches
## 0.0002206554 0.0002214636 0.0002399045 0.0004467372 0.0005400547 0.0005897404
## La Vallee Boudry Morges Aubonne Broye Payerne
## 0.0006271319 0.0009958496 0.0010221776 0.0010608615 0.0013622033 0.0024832884
## Oron Delemont Aigle Cossonay Le Locle Nyone
## 0.0032624985 0.0039410826 0.0047153887 0.0052804731 0.0062623898 0.0073849102
## Conthey Echallens Yverdon Val de Ruz Paysd'enhaut Orbe
## 0.0086585000 0.0090110500 0.0101748505 0.0126224991 0.0136542028 0.0160432221
## Herens Vevey Monthey Martigwy Rive Droite St Maurice
## 0.0161075451 0.0167444949 0.0169381979 0.0187071209 0.0189033468 0.0190298484
## ValdeTravers Courtelary Sarine Moudon Moutier Entremont
## 0.0199766071 0.0201154852 0.0265934794 0.0292633997 0.0292755138 0.0292802881
## Lausanne Neuchatel Neuveville Franches-Mnt Glane La Chauxdfnd
## 0.0305009295 0.0309042646 0.0351584235 0.0366691513 0.0492469430 0.0494736931
## V. De Geneve Sion Rive Gauche Sierre Porrentruy
## 0.0534481565 0.1222319428 0.1224375602 0.1441475824 0.1960420472
We observe that these four provinces have the most extreme residuals in regard to model lmodi_reduced, but do not have large leverage values. The predictors values of these 4 provinces are mostly unusual in comparison with those of other provinces.
5. Some predictions of future observations for interesting values of the predictors
While there should be many interesting values of the predictors, we chose to predict the Fertility value at the mean values of the predictors.
First we costruct a dataframe to perform prediction on.
predictor_df <- data.frame(Agriculture=mean(swiss$Agriculture), Examination=mean(swiss$Examination), Education=mean(swiss$Education), Catholic=mean(swiss$Catholic), Infant.Mortality=mean(swiss$Infant.Mortality))
pp <- predict(lmodi_reduced,new=predictor_df, se.fit=T);
pp
## $fit
## 1
## 69.46289
##
## $se.fit
## [1] 1.045219
##
## $df
## [1] 41
##
## $residual.scale
## [1] 6.852949
We can also construct a data frame with purely random data.
test = data.frame(Agriculture=80, Education=6, Catholic=12, Infant.Mortality=20)
predict(lmodi_reduced, test)
## 1
## 66.40531
The predicted value of Fertility equals 69.46289 with standard error 1.045219.
predict(lmodi_reduced)
## Courtelary Delemont Franches-Mnt Moutier Neuveville Porrentruy
## 73.53024 79.56271 84.97775 74.75043 65.92926 90.00600
## Broye Glane Gruyere Sarine Veveyse Aigle
## 81.90081 82.77823 81.56129 76.79370 86.14437 60.32787
## Aubonne Avenches Cossonay Echallens Grandson Lausanne
## 65.06722 67.89389 65.49427 73.73686 70.25363 60.46134
## La Vallee Lavaux Morges Moudon Nyone Orbe
## 54.76245 64.42461 63.39349 73.37298 62.54028 62.96261
## Oron Payerne Paysd'enhaut Rolle Vevey Yverdon
## 70.19603 71.95956 66.45264 61.38721 66.15653 71.60261
## Conthey Entremont Herens Martigwy Monthey St Maurice
## 78.32541 77.73235 81.83654 78.03419 85.56745 71.48278
## Sierre Sion Boudry La Chauxdfnd Le Locle Neuchatel
## 78.30662 66.91495 67.99368 73.61544 68.92875 60.83159
## Val de Ruz ValdeTravers V. De Geneve Rive Droite Rive Gauche
## 70.53746 72.92320 32.11418 49.11011 52.06441
We get predictions where the difference of true vs. predicted values is greater then 5%:
# Get data where predictions was wrong by > 5%
wrong_preds = swiss[c((1:47)[abs(swiss$Fertility - predict(lmodi_reduced)) > 5]),]
# Show wrong predictions by order of magnitude
sort(abs(wrong_preds$Fertility - predict(lmodi_reduced, wrong_preds)), decreasing = TRUE)
## Porrentruy Sierre Sion Moutier Neuveville Glane
## 13.905996 13.893381 12.385050 11.049569 10.970740 9.621765
## Rive Gauche Entremont Moudon La Chauxdfnd Vevey Martigwy
## 9.264406 8.432350 8.372979 7.915438 7.856525 7.534193
## Franches-Mnt Val de Ruz Courtelary St Maurice Yverdon Monthey
## 7.522247 7.062543 6.669760 6.482779 6.202610 6.167454
## Sarine Nyone Orbe Paysd'enhaut Echallens ValdeTravers
## 6.106300 5.940283 5.562608 5.547357 5.436856 5.323200
6. Interpretation of the meaning of the model with respect to the particular area of application
summary(lmodi_reduced)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality + Education:Catholic, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9060 -5.4997 0.9556 3.6698 13.8934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.752308 9.919330 5.419 2.89e-06 ***
## Agriculture -0.134055 0.065843 -2.036 0.04825 *
## Education -0.515105 0.252478 -2.040 0.04781 *
## Catholic 0.207038 0.046184 4.483 5.81e-05 ***
## Infant.Mortality 1.239697 0.372195 3.331 0.00184 **
## Education:Catholic -0.011255 0.005058 -2.225 0.03161 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.853 on 41 degrees of freedom
## Multiple R-squared: 0.7318, Adjusted R-squared: 0.699
## F-statistic: 22.37 on 5 and 41 DF, p-value: 9.443e-11
The selected model lmodi_reduced suggests that all predictors except “Examination” are significantly related to “Fertility” with the directions of the relations been given in the summary(lmodi_reduced)
output. In addition, “Education” and “Catholic” have significant interaction effect on Fertility.