Analysis of Swiss Dataset in R

23 minute read

Published:

In this blog post I would like to provide an anylsis of the Swiss dataset which can be accessed in R. The exercises below are part of the course MAST90139: Statistical Modelling for Data Science at the University of Melbourne. Analysis of Swiss dataset in R (Swiss Fertility and Socioeconomic Indicators 1888)

First we clean up any variables that may be left in the existing R environment.

rm(list = ls())

Load data from Faraway.

library(faraway); require(graphics);
data(swiss)
?swiss
dim(swiss); 
## [1] 47  6
head(swiss)

1. Initial data analysis that explores the numerical and graphical characteristics of the data

Numerical characteristics

Print out numerical summary of variables

summary(swiss)
##    Fertility      Agriculture     Examination      Education    
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
##     Catholic       Infant.Mortality
##  Min.   :  2.150   Min.   :10.80   
##  1st Qu.:  5.195   1st Qu.:18.15   
##  Median : 15.140   Median :20.00   
##  Mean   : 41.144   Mean   :19.94   
##  3rd Qu.: 93.125   3rd Qu.:21.70   
##  Max.   :100.000   Max.   :26.60
cor(swiss)
##                   Fertility Agriculture Examination   Education   Catholic
## Fertility         1.0000000  0.35307918  -0.6458827 -0.66378886  0.4636847
## Agriculture       0.3530792  1.00000000  -0.6865422 -0.63952252  0.4010951
## Examination      -0.6458827 -0.68654221   1.0000000  0.69841530 -0.5727418
## Education        -0.6637889 -0.63952252   0.6984153  1.00000000 -0.1538589
## Catholic          0.4636847  0.40109505  -0.5727418 -0.15385892  1.0000000
## Infant.Mortality  0.4165560 -0.06085861  -0.1140216 -0.09932185  0.1754959
##                  Infant.Mortality
## Fertility              0.41655603
## Agriculture           -0.06085861
## Examination           -0.11402160
## Education             -0.09932185
## Catholic               0.17549591
## Infant.Mortality       1.00000000

The numerical summary of the data shows that all the 6 variables are numerical with weak to moderate linear correlations among them.

Graphical characteristics

pairs(swiss, panel = panel.smooth, main = "swiss data", col = 3 + (swiss$Catholic > 50))

plot(density(swiss$Fertility),main="Fertility",xlab="Fertility")
rug(swiss$Fertility)
hist(swiss$Fertility,freq=F,add=T)

qqnorm(swiss$Fertility, ylab="Fertility")
qqline(swiss$Fertility)

It seems the distribution of Fertility is not too different from the normal except for small values of Fertility.

plot(swiss)

A matrix of scatter-plots for the 6 variables indicates * Fertility has positive correlation with Agriculture and Infant.Mortality; * * Fertility has negative correlation with Examination and Education; * Fertility hasa curvature correlation with Catholic.

plot(Fertility ~ Agriculture, swiss, xlab="", las=3)

# Interesting observation (higher degree of catholic comes with higher fertility )
plot(Fertility ~ Catholic, swiss, xlab="", las=3)

plot(Fertility ~ Education, swiss, xlab="", las=3)

plot(Fertility ~ Infant.Mortality, swiss, xlab="", las=3)

2. Variable selection to choose the best model

We start by fitting a linear regression model.

lmod <- lm(Fertility ~ ., swiss);
summary(lmod)
## 
## Call:
## lm(formula = Fertility ~ ., data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
## Examination      -0.25801    0.25388  -1.016  0.31546    
## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic          0.10412    0.03526   2.953  0.00519 ** 
## Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
# Use drop1(lmod, test="F") alternatively
lmod_reduced = step(lmod)
## Start:  AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality
## 
##                    Df Sum of Sq    RSS    AIC
## - Examination       1     53.03 2158.1 189.86
## <none>                          2105.0 190.69
## - Agriculture       1    307.72 2412.8 195.10
## - Infant.Mortality  1    408.75 2513.8 197.03
## - Catholic          1    447.71 2552.8 197.75
## - Education         1   1162.56 3267.6 209.36
## 
## Step:  AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                          2158.1 189.86
## - Agriculture       1    264.18 2422.2 193.29
## - Infant.Mortality  1    409.81 2567.9 196.03
## - Catholic          1    956.57 3114.6 205.10
## - Education         1   2249.97 4408.0 221.43
summary(lmod_reduced)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10
anova(lmod, lmod_reduced)

By both a t-test and an ANOVA F test we find Examination does not have significant effect on Fertility.

We then treat Fertility ∼ (Agriculture + Education + Catholic + Infant.Mortality)^2 as the full model, and use step() with BIC for selecting the best model.

# Interaction term doesn't seem to bring major improvements
lmodi = lm(Fertility ~ (Agriculture + Education + Catholic +  Infant.Mortality)^2, data = swiss)
lmodi_reduced = step(lmodi, trace = FALSE, k = log(47))
summary(lmodi_reduced)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality + Education:Catholic, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.9060  -5.4997   0.9556   3.6698  13.8934 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        53.752308   9.919330   5.419 2.89e-06 ***
## Agriculture        -0.134055   0.065843  -2.036  0.04825 *  
## Education          -0.515105   0.252478  -2.040  0.04781 *  
## Catholic            0.207038   0.046184   4.483 5.81e-05 ***
## Infant.Mortality    1.239697   0.372195   3.331  0.00184 ** 
## Education:Catholic -0.011255   0.005058  -2.225  0.03161 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.853 on 41 degrees of freedom
## Multiple R-squared:  0.7318, Adjusted R-squared:  0.699 
## F-statistic: 22.37 on 5 and 41 DF,  p-value: 9.443e-11

The fitted best model is Fertility = 53.75 − 0.134Agriculture − 0.515Education + 0.207Catholic + 1.24Infant.Mortality − 0.011Education:Catholic

with R2 = 0.7318 and Ra2 = 0.699.

drop1(lmodi_reduced)
summary(lmodi_reduced)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality + Education:Catholic, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.9060  -5.4997   0.9556   3.6698  13.8934 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        53.752308   9.919330   5.419 2.89e-06 ***
## Agriculture        -0.134055   0.065843  -2.036  0.04825 *  
## Education          -0.515105   0.252478  -2.040  0.04781 *  
## Catholic            0.207038   0.046184   4.483 5.81e-05 ***
## Infant.Mortality    1.239697   0.372195   3.331  0.00184 ** 
## Education:Catholic -0.011255   0.005058  -2.225  0.03161 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.853 on 41 degrees of freedom
## Multiple R-squared:  0.7318, Adjusted R-squared:  0.699 
## F-statistic: 22.37 on 5 and 41 DF,  p-value: 9.443e-11

The terms in this model cannot be further reduced by the drop1() command.

par(mfrow=c(2,2)); termplot(lmodi_reduced,partial=T,terms=NULL); plot(lmodi_reduced)
## Warning in termplot(lmodi_reduced, partial = T, terms = NULL): 'model' appears
## to involve interactions: see the help page

3. Exploration of transformations to improve the fit of the model

The model does not seem to need a transformation on the response variable because the empirical distribution of Fertility is not far from the normal.

On the other hand, the relationship between Fertility and Catholoc seems to be curvature.

_Hence we investigate the transformation of the “Catholic” variable because it has a curvature effect on Fertility. We replace “Catholic” with a quadratic term

library(MASS)
# poly(3) constructs a transformation of poly w/ vec, vec^1, vec^2 such that correlation is minimized
# vec = c(1,2,3,4)
# poly(vec, 3, raw=TRUE)
Wlmodp<-lm(Fertility~Agriculture+Education+poly(Catholic,2)+Infant.Mortality + Education:poly(Catholic,2), swiss) 
summary(Wlmodp)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + poly(Catholic, 
##     2) + Infant.Mortality + Education:poly(Catholic, 2), data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5584  -5.0451   0.0393   3.5404  15.3300 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  60.37216    9.90815   6.093 3.84e-07 ***
## Agriculture                  -0.13234    0.07102  -1.863 0.069955 .  
## Education                    -0.68355    0.26160  -2.613 0.012684 *  
## poly(Catholic, 2)1           51.29861   14.04601   3.652 0.000763 ***
## poly(Catholic, 2)2            1.52390   12.92429   0.118 0.906744    
## Infant.Mortality              1.21767    0.37666   3.233 0.002496 ** 
## Education:poly(Catholic, 2)1 -1.79870    1.73121  -1.039 0.305211    
## Education:poly(Catholic, 2)2  0.76806    0.72902   1.054 0.298573    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.851 on 39 degrees of freedom
## Multiple R-squared:  0.745,  Adjusted R-squared:  0.6992 
## F-statistic: 16.28 on 7 and 39 DF,  p-value: 8.356e-10
Wlmodp1<-lm(Fertility~Agriculture + Education + poly(Catholic,2) + Infant.Mortality + Education:Catholic, swiss)
summary(Wlmodp1)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + poly(Catholic, 
##     2) + Infant.Mortality + Education:Catholic, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.8316  -5.2273   0.2632   4.0651  14.2838 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        61.826469   9.825002   6.293 1.83e-07 ***
## Agriculture        -0.152171   0.068575  -2.219 0.032221 *  
## Education          -0.517682   0.252752  -2.048 0.047145 *  
## poly(Catholic, 2)1 55.884416  13.372902   4.179 0.000155 ***
## poly(Catholic, 2)2  9.820777  10.261947   0.957 0.344311    
## Infant.Mortality    1.269834   0.373906   3.396 0.001556 ** 
## Education:Catholic -0.009239   0.005484  -1.685 0.099837 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.86 on 40 degrees of freedom
## Multiple R-squared:  0.7378, Adjusted R-squared:  0.6984 
## F-statistic: 18.75 on 6 and 40 DF,  p-value: 3.078e-10
plot(Wlmodp1)

termplot(Wlmodp,partial=T,terms=3)

library(splines)
Wlmods<-lm(Fertility~Agriculture+Education+bs(Catholic,3)+Infant.Mortality + Education:bs(Catholic,3), swiss) 
summary(Wlmods)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + bs(Catholic, 
##     3) + Infant.Mortality + Education:bs(Catholic, 3), data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5473  -5.0681   0.5734   3.2353  15.5592 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 59.15343   11.66882   5.069 1.14e-05 ***
## Agriculture                 -0.12260    0.07291  -1.681   0.1011    
## Education                   -0.55823    0.41581  -1.343   0.1876    
## bs(Catholic, 3)1           -11.07029   18.70215  -0.592   0.5575    
## bs(Catholic, 3)2            35.45274   28.03511   1.265   0.2139    
## bs(Catholic, 3)3            13.63965    6.35112   2.148   0.0384 *  
## Infant.Mortality             1.00679    0.43136   2.334   0.0251 *  
## Education:bs(Catholic, 3)1   0.66876    1.50115   0.445   0.6586    
## Education:bs(Catholic, 3)2  -2.42495    1.48402  -1.634   0.1107    
## Education:bs(Catholic, 3)3  -0.25491    0.84904  -0.300   0.7657    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.931 on 37 degrees of freedom
## Multiple R-squared:  0.7524, Adjusted R-squared:  0.6921 
## F-statistic: 12.49 on 9 and 37 DF,  p-value: 8.342e-09
Wlmods1<-lm(Fertility~Agriculture+Education+bs(Catholic,3)+Infant.Mortality + Education:Catholic, swiss)
summary(Wlmods1)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + bs(Catholic, 
##     3) + Infant.Mortality + Education:Catholic, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.377  -5.072   0.321   4.014  14.446 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        55.455117  10.117571   5.481 2.72e-06 ***
## Agriculture        -0.148741   0.070264  -2.117 0.040702 *  
## Education          -0.479725   0.284130  -1.688 0.099316 .  
## bs(Catholic, 3)1   -3.217588  11.717565  -0.275 0.785077    
## bs(Catholic, 3)2   11.056766  19.100029   0.579 0.565994    
## bs(Catholic, 3)3   19.202744   4.706975   4.080 0.000216 ***
## Infant.Mortality    1.259959   0.379587   3.319 0.001964 ** 
## Education:Catholic -0.010382   0.006687  -1.553 0.128614    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.939 on 39 degrees of freedom
## Multiple R-squared:  0.7384, Adjusted R-squared:  0.6914 
## F-statistic: 15.72 on 7 and 39 DF,  p-value: 1.35e-09
plot(Wlmods1)

termplot(Wlmods,partial=T,terms=NULL)
## Warning in termplot(Wlmods, partial = T, terms = NULL): 'model' appears to
## involve interactions: see the help page

Thus, we replace Catholic by poly(Catholic, 2) or bs(Catholic, 3) in the model lmodi_reduced to see whether any improvement of fit can be made.

It does not seem to achieve any significant improvement by doing this. Here the order 2 in poly() and df 3 in bs() are selected using a try-and-error approach

4. Diagnostics to check the assumptions of the model.

plot(lmodi_reduced)

The diagnostics plots given by plot(smallm) show that the model provides a good fit to the data in general.

The residuals vs. leverage plot identifies 4 provinces that have the largest Cook distance values and are influential to model fitting. These 4 provinces are Porrentruy, Sierre, Sion, and Rive Gauche.

We try to filter out the outliers by getting the indizes first:

(1:47)[rownames(swiss)=="Sion"]  #38
## [1] 38
(1:47)[rownames(swiss)=="Sierre"] #37
## [1] 37
(1:47)[rownames(swiss)=="Porrentruy"] #6
## [1] 6
(1:47)[rownames(swiss)=="Rive Gauche"] #47
## [1] 47
swiss[c(6,37,38,47),]
rownames(swiss)[c(6,37,38,47)]
## [1] "Porrentruy"  "Sierre"      "Sion"        "Rive Gauche"
hatvalues(lmodi_reduced)
##   Courtelary     Delemont Franches-Mnt      Moutier   Neuveville   Porrentruy 
##   0.10260793   0.07580639   0.13623845   0.05973401   0.07103360   0.18823843 
##        Broye        Glane      Gruyere       Sarine      Veveyse        Aigle 
##   0.08842886   0.11689602   0.07577713   0.14642334   0.10934403   0.07917772 
##      Aubonne     Avenches     Cossonay    Echallens     Grandson     Lausanne 
##   0.07598143   0.12553452   0.08628630   0.07370320   0.06376129   0.22670153 
##    La Vallee       Lavaux       Morges       Moudon        Nyone         Orbe 
##   0.34956953   0.10836510   0.05764165   0.09609790   0.05289707   0.11454455 
##         Oron      Payerne Paysd'enhaut        Rolle        Vevey      Yverdon 
##   0.13083077   0.11033637   0.10103769   0.07368892   0.06659714   0.06513090 
##      Conthey    Entremont       Herens     Martigwy      Monthey   St Maurice 
##   0.19704637   0.09502856   0.15680036   0.07880332   0.10133459   0.10272389 
##       Sierre         Sion       Boudry La Chauxdfnd     Le Locle    Neuchatel 
##   0.15149713   0.15886464   0.04426588   0.15781432   0.10040794   0.31804461 
##   Val de Ruz ValdeTravers V. De Geneve  Rive Droite  Rive Gauche 
##   0.06265152   0.14516128   0.48313011   0.18286595   0.23514778
influence.measures(lmodi_reduced)
## Influence measures of
##   lm(formula = Fertility ~ Agriculture + Education + Catholic +      Infant.Mortality + Education:Catholic, data = swiss) :
## 
##                dfb.1_ dfb.Agrc dfb.Edct dfb.Cthl dfb.In.M  dfb.Ed.C   dffit
## Courtelary    0.04432 -0.23133 -0.04702  0.02321  0.08976 -0.045367  0.3477
## Delemont      0.02296 -0.07573 -0.06375  0.02187  0.02022  0.045778  0.1524
## Franches-Mnt  0.21223 -0.34081 -0.13079  0.24396 -0.10810 -0.040840  0.4714
## Moutier       0.23577 -0.31232 -0.24858 -0.04524 -0.06493  0.109009  0.4287
## Neuveville   -0.19949  0.14133  0.30260  0.07440  0.17956 -0.277192  0.4697
## Porrentruy    0.28682  0.51113  0.17016 -0.33337 -0.65439  0.042706 -1.1444
## Broye        -0.05121  0.02808  0.01518  0.02464  0.05232 -0.009638  0.0894
## Glane        -0.33831  0.14506  0.05770  0.09969  0.37261 -0.003844  0.5522
## Gruyere       0.00769 -0.01512 -0.00899  0.01531 -0.00207  0.002235  0.0360
## Sarine       -0.04332 -0.08637 -0.17368 -0.08170  0.14313  0.224415  0.3991
## Veveyse      -0.02780  0.00534  0.00927  0.02191  0.03192 -0.011992  0.0512
## Aigle         0.02555  0.08203  0.04882 -0.02599 -0.06348 -0.027970  0.1668
## Aubonne      -0.00590  0.04462 -0.00259 -0.04341  0.00245  0.011573  0.0789
## Avenches     -0.04235  0.03756  0.03065 -0.00626  0.03866 -0.020754  0.0588
## Cossonay     -0.01869 -0.07902  0.04641  0.11646  0.01740 -0.059470 -0.1765
## Echallens     0.04652 -0.10426  0.05875  0.11913 -0.06872 -0.059214 -0.2316
## Grandson      0.02560 -0.03053 -0.02578 -0.01816 -0.00468  0.010647  0.0563
## Lausanne      0.15517 -0.03529 -0.35875 -0.22776 -0.10298  0.323257 -0.4258
## La Vallee    -0.04166  0.02470 -0.00858 -0.01925  0.04586  0.018153 -0.0606
## Lavaux       -0.01514  0.02795  0.01095 -0.01275  0.00996 -0.003650  0.0359
## Morges        0.00589  0.03622  0.01248 -0.02478 -0.01456 -0.005858  0.0774
## Moudon        0.02621 -0.01703  0.20818  0.27871 -0.15585 -0.177715 -0.4225
## Nyone        -0.10466 -0.01819 -0.01116  0.02788  0.12435  0.010459 -0.2100
## Orbe         -0.24083  0.04520  0.14986  0.13634  0.22332 -0.114807 -0.3093
## Oron         -0.00167  0.04374 -0.06701 -0.10697  0.02335  0.071831  0.1384
## Payerne      -0.07553  0.05346  0.01904 -0.03989  0.08947 -0.008693  0.1207
## Paysd'enhaut  0.12733  0.02770 -0.16655 -0.20464 -0.09151  0.154218  0.2853
## Rolle        -0.01353 -0.01267 -0.00167  0.01062  0.01910 -0.000469 -0.0375
## Vevey         0.06764  0.05712 -0.16425 -0.10122 -0.09586  0.160539 -0.3186
## Yverdon       0.09036 -0.04223  0.00781  0.09329 -0.14709 -0.001803 -0.2467
## Conthey      -0.06158 -0.03564 -0.05678 -0.13143  0.11872  0.091731 -0.2257
## Entremont     0.10322 -0.21873 -0.11342 -0.15669 -0.00473  0.077303 -0.4227
## Herens        0.04272 -0.11975 -0.13976 -0.19995  0.03890  0.172081 -0.3090
## Martigwy      0.02012 -0.10830 -0.05430 -0.14520  0.04214  0.050813 -0.3363
## Monthey      -0.00772  0.04713 -0.06527 -0.24601  0.01466  0.158212 -0.3184
## St Maurice   -0.09479 -0.07063  0.06414 -0.01520  0.14726 -0.111047 -0.3379
## Sierre        0.19380  0.20869  0.24983  0.55966 -0.44117 -0.370574  0.9782
## Sion          0.37408 -0.03665 -0.44940 -0.24868 -0.37496  0.625641  0.8890
## Boudry       -0.00192 -0.00941  0.01487 -0.00289  0.01547 -0.023829  0.0765
## La Chauxdfnd -0.28260  0.48690  0.21199 -0.02541  0.06726 -0.030768 -0.5489
## Le Locle      0.11188 -0.15224 -0.04212  0.02140 -0.05499 -0.014786  0.1923
## Neuchatel    -0.25321  0.08241  0.36291  0.19664  0.20863 -0.293513  0.4274
## Val de Ruz    0.12806 -0.13513 -0.14788 -0.11270 -0.02544  0.077760  0.2757
## ValdeTravers -0.23155  0.28786  0.20848  0.06018  0.09330 -0.091000 -0.3449
## V. De Geneve -0.01186  0.03937  0.13125 -0.07455 -0.05176  0.134480  0.5617
## Rive Droite  -0.00920 -0.10566 -0.00117  0.13190  0.05586 -0.167981 -0.3347
## Rive Gauche  -0.22843  0.11133  0.31880  0.40254  0.16802 -0.638422 -0.8724
##              cov.r   cook.d    hat inf
## Courtelary   1.105 0.020115 0.1026    
## Delemont     1.203 0.003941 0.0758    
## Franches-Mnt 1.091 0.036669 0.1362    
## Moutier      0.811 0.029276 0.0597    
## Neuveville   0.822 0.035158 0.0710    
## Porrentruy   0.647 0.196042 0.1882    
## Broye        1.257 0.001362 0.0884    
## Glane        0.939 0.049247 0.1169    
## Gruyere      1.252 0.000221 0.0758    
## Sarine       1.184 0.026593 0.1464    
## Veveyse      1.298 0.000447 0.1093    
## Aigle        1.200 0.004715 0.0792    
## Aubonne      1.241 0.001061 0.0760    
## Avenches     1.321 0.000590 0.1255    
## Cossonay     1.208 0.005280 0.0863    
## Echallens    1.133 0.009011 0.0737    
## Grandson     1.230 0.000540 0.0638    
## Lausanne     1.368 0.030501 0.2267    
## La Vallee    1.781 0.000627 0.3496   *
## Lavaux       1.299 0.000221 0.1084    
## Morges       1.213 0.001022 0.0576    
## Moudon       1.003 0.029263 0.0961    
## Nyone        1.089 0.007385 0.0529    
## Orbe         1.173 0.016043 0.1145    
## Oron         1.309 0.003262 0.1308    
## Payerne      1.281 0.002483 0.1103    
## Paysd'enhaut 1.158 0.013654 0.1010    
## Rolle        1.249 0.000240 0.0737    
## Vevey        1.007 0.016744 0.0666    
## Yverdon      1.090 0.010175 0.0651    
## Conthey      1.400 0.008659 0.1970    
## Entremont    0.998 0.029280 0.0950    
## Herens       1.274 0.016108 0.1568    
## Martigwy     1.036 0.018707 0.0788    
## Monthey      1.129 0.016938 0.1013    
## St Maurice   1.115 0.019030 0.1027    
## Sierre       0.643 0.144148 0.1515    
## Sion         0.759 0.122232 0.1589    
## Boudry       1.191 0.000996 0.0443    
## La Chauxdfnd 1.087 0.049474 0.1578    
## Le Locle     1.227 0.006262 0.1004    
## Neuchatel    1.604 0.030904 0.3180   *
## Val de Ruz   1.046 0.012622 0.0627    
## ValdeTravers 1.222 0.019977 0.1452    
## V. De Geneve 2.133 0.053448 0.4831   *
## Rive Droite  1.317 0.018903 0.1829    
## Rive Gauche  1.058 0.122438 0.2351
cooks.distance(lmodi_reduced)
##   Courtelary     Delemont Franches-Mnt      Moutier   Neuveville   Porrentruy 
## 0.0201154852 0.0039410826 0.0366691513 0.0292755138 0.0351584235 0.1960420472 
##        Broye        Glane      Gruyere       Sarine      Veveyse        Aigle 
## 0.0013622033 0.0492469430 0.0002214636 0.0265934794 0.0004467372 0.0047153887 
##      Aubonne     Avenches     Cossonay    Echallens     Grandson     Lausanne 
## 0.0010608615 0.0005897404 0.0052804731 0.0090110500 0.0005400547 0.0305009295 
##    La Vallee       Lavaux       Morges       Moudon        Nyone         Orbe 
## 0.0006271319 0.0002206554 0.0010221776 0.0292633997 0.0073849102 0.0160432221 
##         Oron      Payerne Paysd'enhaut        Rolle        Vevey      Yverdon 
## 0.0032624985 0.0024832884 0.0136542028 0.0002399045 0.0167444949 0.0101748505 
##      Conthey    Entremont       Herens     Martigwy      Monthey   St Maurice 
## 0.0086585000 0.0292802881 0.0161075451 0.0187071209 0.0169381979 0.0190298484 
##       Sierre         Sion       Boudry La Chauxdfnd     Le Locle    Neuchatel 
## 0.1441475824 0.1222319428 0.0009958496 0.0494736931 0.0062623898 0.0309042646 
##   Val de Ruz ValdeTravers V. De Geneve  Rive Droite  Rive Gauche 
## 0.0126224991 0.0199766071 0.0534481565 0.0189033468 0.1224375602
sort(cooks.distance(lmodi_reduced))
##       Lavaux      Gruyere        Rolle      Veveyse     Grandson     Avenches 
## 0.0002206554 0.0002214636 0.0002399045 0.0004467372 0.0005400547 0.0005897404 
##    La Vallee       Boudry       Morges      Aubonne        Broye      Payerne 
## 0.0006271319 0.0009958496 0.0010221776 0.0010608615 0.0013622033 0.0024832884 
##         Oron     Delemont        Aigle     Cossonay     Le Locle        Nyone 
## 0.0032624985 0.0039410826 0.0047153887 0.0052804731 0.0062623898 0.0073849102 
##      Conthey    Echallens      Yverdon   Val de Ruz Paysd'enhaut         Orbe 
## 0.0086585000 0.0090110500 0.0101748505 0.0126224991 0.0136542028 0.0160432221 
##       Herens        Vevey      Monthey     Martigwy  Rive Droite   St Maurice 
## 0.0161075451 0.0167444949 0.0169381979 0.0187071209 0.0189033468 0.0190298484 
## ValdeTravers   Courtelary       Sarine       Moudon      Moutier    Entremont 
## 0.0199766071 0.0201154852 0.0265934794 0.0292633997 0.0292755138 0.0292802881 
##     Lausanne    Neuchatel   Neuveville Franches-Mnt        Glane La Chauxdfnd 
## 0.0305009295 0.0309042646 0.0351584235 0.0366691513 0.0492469430 0.0494736931 
## V. De Geneve         Sion  Rive Gauche       Sierre   Porrentruy 
## 0.0534481565 0.1222319428 0.1224375602 0.1441475824 0.1960420472

We observe that these four provinces have the most extreme residuals in regard to model lmodi_reduced, but do not have large leverage values. The predictors values of these 4 provinces are mostly unusual in comparison with those of other provinces.

5. Some predictions of future observations for interesting values of the predictors

While there should be many interesting values of the predictors, we chose to predict the Fertility value at the mean values of the predictors.

First we costruct a dataframe to perform prediction on.

predictor_df <- data.frame(Agriculture=mean(swiss$Agriculture), Examination=mean(swiss$Examination), Education=mean(swiss$Education), Catholic=mean(swiss$Catholic), Infant.Mortality=mean(swiss$Infant.Mortality))
pp <- predict(lmodi_reduced,new=predictor_df, se.fit=T); 
pp
## $fit
##        1 
## 69.46289 
## 
## $se.fit
## [1] 1.045219
## 
## $df
## [1] 41
## 
## $residual.scale
## [1] 6.852949

We can also construct a data frame with purely random data.

test = data.frame(Agriculture=80, Education=6, Catholic=12, Infant.Mortality=20)
predict(lmodi_reduced, test)
##        1 
## 66.40531

The predicted value of Fertility equals 69.46289 with standard error 1.045219.

predict(lmodi_reduced)
##   Courtelary     Delemont Franches-Mnt      Moutier   Neuveville   Porrentruy 
##     73.53024     79.56271     84.97775     74.75043     65.92926     90.00600 
##        Broye        Glane      Gruyere       Sarine      Veveyse        Aigle 
##     81.90081     82.77823     81.56129     76.79370     86.14437     60.32787 
##      Aubonne     Avenches     Cossonay    Echallens     Grandson     Lausanne 
##     65.06722     67.89389     65.49427     73.73686     70.25363     60.46134 
##    La Vallee       Lavaux       Morges       Moudon        Nyone         Orbe 
##     54.76245     64.42461     63.39349     73.37298     62.54028     62.96261 
##         Oron      Payerne Paysd'enhaut        Rolle        Vevey      Yverdon 
##     70.19603     71.95956     66.45264     61.38721     66.15653     71.60261 
##      Conthey    Entremont       Herens     Martigwy      Monthey   St Maurice 
##     78.32541     77.73235     81.83654     78.03419     85.56745     71.48278 
##       Sierre         Sion       Boudry La Chauxdfnd     Le Locle    Neuchatel 
##     78.30662     66.91495     67.99368     73.61544     68.92875     60.83159 
##   Val de Ruz ValdeTravers V. De Geneve  Rive Droite  Rive Gauche 
##     70.53746     72.92320     32.11418     49.11011     52.06441

We get predictions where the difference of true vs. predicted values is greater then 5%:

# Get data where predictions was wrong by > 5%
wrong_preds = swiss[c((1:47)[abs(swiss$Fertility - predict(lmodi_reduced)) > 5]),]
# Show wrong predictions by order of magnitude
sort(abs(wrong_preds$Fertility - predict(lmodi_reduced, wrong_preds)), decreasing = TRUE)
##   Porrentruy       Sierre         Sion      Moutier   Neuveville        Glane 
##    13.905996    13.893381    12.385050    11.049569    10.970740     9.621765 
##  Rive Gauche    Entremont       Moudon La Chauxdfnd        Vevey     Martigwy 
##     9.264406     8.432350     8.372979     7.915438     7.856525     7.534193 
## Franches-Mnt   Val de Ruz   Courtelary   St Maurice      Yverdon      Monthey 
##     7.522247     7.062543     6.669760     6.482779     6.202610     6.167454 
##       Sarine        Nyone         Orbe Paysd'enhaut    Echallens ValdeTravers 
##     6.106300     5.940283     5.562608     5.547357     5.436856     5.323200

6. Interpretation of the meaning of the model with respect to the particular area of application

summary(lmodi_reduced)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality + Education:Catholic, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.9060  -5.4997   0.9556   3.6698  13.8934 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        53.752308   9.919330   5.419 2.89e-06 ***
## Agriculture        -0.134055   0.065843  -2.036  0.04825 *  
## Education          -0.515105   0.252478  -2.040  0.04781 *  
## Catholic            0.207038   0.046184   4.483 5.81e-05 ***
## Infant.Mortality    1.239697   0.372195   3.331  0.00184 ** 
## Education:Catholic -0.011255   0.005058  -2.225  0.03161 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.853 on 41 degrees of freedom
## Multiple R-squared:  0.7318, Adjusted R-squared:  0.699 
## F-statistic: 22.37 on 5 and 41 DF,  p-value: 9.443e-11

The selected model lmodi_reduced suggests that all predictors except “Examination” are significantly related to “Fertility” with the directions of the relations been given in the summary(lmodi_reduced) output. In addition, “Education” and “Catholic” have significant interaction effect on Fertility.