Databases rely on indexes to quickly locate and retrieve data that is stored on disks. While traditional database indexes use tree data structures such as B+ Trees to find the position of a given query key in the index, a learned index structure considers this problem as a prediction task and uses a machine learning model to “predict” the position of the query key.
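To make this concrete, here is a minimal sketch in R (illustrative only, not taken from any real index implementation): we fit a linear model to the empirical CDF of a sorted key array and use it to “predict” the position of a query key.

# Minimal learned-index sketch: approximate the empirical CDF of a sorted
# key array with a linear model and use it to predict key positions
set.seed(1)
keys <- sort(rlnorm(1000))                 # sorted array of keys
pos  <- seq_along(keys)                    # true positions (scaled CDF)
model <- lm(pos ~ keys)                    # the "learned" part of the index
predict_position <- function(q) round(predict(model, data.frame(keys = q)))
predict_position(keys[500])                # should be close to 500
# A real learned index performs a local (e.g. binary) search within the
# model's maximum error bound around the predicted position.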
[Figures: traditional and learned indexes; ML models to approximate the CDF]
This novel approach of implementing database indexes has inspired a surge of recent research aimed at studying the effectiveness of learned index structures. However, while the main advantage of learned index structures is their ability to adjust to the data via their underlying ML model, this also carries the risk of exploitation by a malicious adversary.
This post will show some experiments that I have conducted as a follow-up to the research on adversarial machine learning in the context of learned index structures that was part of my master’s thesis at The University of Melbourne.
In my master’s thesis, I executed a large-scale poisoning attack on dynamic learned index structures, based on the CDF poisoning attack proposed by Kornaropoulos et al. The attack targets linear regression models and works by manipulating the cumulative distribution function (CDF) on which the model is trained: injecting a set of poisoning keys into the dataset deteriorates the fit of the underlying ML model, which increases the model’s prediction error and thus degrades the overall performance of the learned index structure. The source code for the poisoning attack is available on GitHub.
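The actual attack solves an optimization problem over the CDF; the toy R sketch below merely illustrates the intuition (it is not the Kornaropoulos et al. algorithm): concentrating injected keys in one region of the key space skews the fitted regression line and inflates its prediction error.

# Toy illustration (not the actual CDF poisoning attack): a cluster of
# injected keys skews the regression line fitted to the CDF
set.seed(1)
legit    <- sort(runif(1000, 0, 100))
poisoned <- sort(c(legit, rep(99.9, 50)))  # 50 poisoning keys in one spot
max_err <- function(keys) {
  pos <- seq_along(keys)
  max(abs(pos - predict(lm(pos ~ keys))))  # worst-case prediction error
}
max_err(legit)
max_err(poisoned)                          # noticeably larger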
As part of the experiments for my master’s thesis, I evaluated three index implementations by measuring their throughput in million operations per second. The evaluated indexes comprise two learned index structures, ALEX and Dynamic-PGM, as well as a traditional B+ Tree. Because indexes are usually used to speed up data retrieval over massive amounts of data, I evaluated the indexes on the SOSD benchmark datasets, which consist of 200 million keys each.
Unfortunately, executing the poisoning attack by Kornaropoulos et al. is computationally very intensive, so I had to run it with a fixed poisoning threshold of $p=0.0001$, generating 20,000 poisoning keys for a dataset of 200 million keys. This poisoning threshold can be considered relatively low, as previous work on poisoning attacks has used thresholds of up to $p=0.20$.
To test the robustness of learned indexes more rigorously, I set up a flexible microbenchmark that can be used to quickly evaluate the robustness of different index implementations against poisoning attacks. The microbenchmark is based on the source code published by Eppert et al., which I extended to implement the CDF poisoning attack against different types of regression models as well as the learned index implementations ALEX and PGM-Index.
The corresponding source code can be found here: https://github.com/Bachfischer/LogarithmicErrorRegression.
To test the robustness of the learned indexes, I generated a synthetic dataset of 1,000 keys and ran the poisoning attack against each index implementation while varying the poisoning threshold from $p=0.01$ to $p=0.20$.
The graphs below show the performance deterioration, calculated as the ratio between the mean lookup time (in nanoseconds) on the poisoned datasets and on the legitimate (non-poisoned) dataset.
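In other words, for a poisoning threshold $p$, $\text{deterioration}(p) = \bar{t}_{\text{poisoned}}(p) / \bar{t}_{\text{legit}}$, where $\bar{t}$ denotes the mean lookup time of the respective run; a value of 1 means the attack had no measurable effect.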
[Graphs: performance deterioration per poisoning threshold for SLR, LogTE, DLogTE, 2P, TheilSen, LAD, ALEX, and PGM]
From the graphs, we can observe that simple linear regression (SLR) is particularly prone to the poisoning attack, as this regression model shows a steep increase in the mean lookup time when evaluated on the poisoned data.
The competitors that optimize a different error function, such as LogTE, DLogTE, and 2P (introduced in A Tailored Regression for Learned Indexes), are more robust against adversarial attacks: for these regression models, the mean lookup time remains relatively stable even when the poisoning threshold is increased substantially.
Because SLR is the de facto standard in learned index structures and is used internally by the ALEX and PGM-Index implementations, we would expect these two indexes to also exhibit a relatively high performance deterioration on the poisoned dataset. Surprisingly, ALEX does not show any significant performance impact, most likely because its gapped arrays allow the model to easily absorb outliers in the data (an effect that can likely be attributed to the small keyset size). The performance of the PGM-Index deteriorates by a factor of up to 1.3x.
To put things into a broader perspective, the graph below shows the overall mean lookup time for the evaluated learned indexes, averaged across all experiments.
We can see that ALEX outperforms all of the other evaluated learned index structures. The performance of the regression models SLR, LogTE, DLogTE, 2P, TheilSen, and LAD is relatively similar, in the range of 30 to 40 nanoseconds.
In these experiments, the PGM-Index performs worst, with a mean lookup time of more than 50 nanoseconds. This is most likely because the PGM-Index is optimized for large-scale data workloads and exhibits subpar performance in this microbenchmark, where the dataset consists of only 1,000 keys.
I consider this research a highly interesting study of the robustness of learned index structures. The poisoning attack and microbenchmark described in this post are open source and can easily be adapted for future research. If you have any further thoughts or ideas, please let me know!
The task of the project was to develop a method to estimate the coordinates from which an image was taken. The dataset for this task was published by the COMP90086 teaching team and consisted of a collection of images taken in and around an art museum (the Getty Center in Los Angeles, U.S.A.).
The training dataset contained a total of 7,500 images labeled with their corresponding (x, y) coordinates. The test dataset, for which the coordinates had to be predicted, contained 1,200 images.
For our submission, we chose to use the pre-trained SuperGlue model to extract features from the images in the training set and match them with the images in the test set.
The image above shows the SuperGlue network architecture and is taken from the original SuperGlue paper by Sarlin et al. SuperGlue combines a graph neural network architecture with an attention mechanism to match local image features by finding correspondences and dismissing unmatchable points. It consists of two main components:
In the first component (Attentional Graph Neural Network), SuperGlue borrows the self-attention mechanism from the Transformer and embeds it into a graph neural network. The attentional GNN leverages the spatial relationships of keypoints and descriptors. It works by first employing an encoder to map the keypoint positions $p$ and their associated descriptors $d$ into a single vector. In the next step, self-attention and cross-attention layers are used to generate more powerful representations $f$. This component consists of a total of 9 layers of self- and cross-attention with 4 heads each.
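For intuition, generic single-head scaled dot-product attention can be sketched in a few lines of R (this is the textbook operation only, not SuperGlue’s actual PyTorch implementation; cross-attention simply takes the other image’s features as keys and values):

# Generic scaled dot-product attention (illustrative sketch)
attention <- function(Q, K, V) {
  S <- Q %*% t(K) / sqrt(ncol(K))   # similarity scores
  W <- exp(S - apply(S, 1, max))    # numerically stable row-wise softmax
  W <- W / rowSums(W)
  W %*% V                           # attention-weighted combination of values
}
set.seed(1)
f <- matrix(rnorm(5 * 4), 5, 4)     # 5 keypoint representations of dimension 4
attention(f, f, f)                  # self-attention: Q = K = V
# cross-attention between images A and B would be attention(fA, fB, fB)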
The second component (Optimal Matching Layer) creates an $M \times N$ score matrix and finds the optimal partial assignment between two sets of local features by using the Sinkhorn algorithm for $T = 100$ iterations.
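The Sinkhorn iterations themselves are simple to sketch on a square score matrix (SuperGlue first augments the $M \times N$ scores with an extra “dustbin” row and column for unmatchable points, a detail omitted in this hedged R sketch):

# Sinkhorn normalization sketch: alternately normalize rows and columns
# of exp(S) to approximate a doubly-stochastic assignment matrix
sinkhorn <- function(S, iters = 100) {
  P <- exp(S)
  for (t in seq_len(iters)) {
    P <- P / rowSums(P)                  # normalize rows
    P <- sweep(P, 2, colSums(P), "/")    # normalize columns
  }
  P
}
set.seed(1)
round(sinkhorn(matrix(rnorm(16), 4, 4)), 3)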
The pre-trained SuperGlue model, consisting of approx. 12M parameters, is implemented in PyTorch and available on GitHub. It can be combined with any local feature detector and descriptor technique, such as SIFT or SuperPoint, to extract sparse keypoints and perform matching. In our experiments, SuperGlue was able to estimate almost all correct matches while rejecting the majority of outliers.
Shown below is an example from the COMP90086 dataset. An example image from the test set is shown on the left, and the corresponding image from the training set is shown on the right. All detected matches are colored based on their predicted confidence in a jet colormap (red: more confident, blue: less confident).
By using the SuperGlue model, we managed to achieve a mean absolute error (MAE) of 5.15683, which put us at rank 9 out of 215 participants in the final Kaggle competition. The MAE score was calculated via the formula $\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left( |x_i - \hat{x}_i| + |y_i - \hat{y}_i| \right)$.
A detailed write-up of the implementation details as well as other experiments that we performed (e.g. using SIFT or an Autoencoder architecture to match images based on their similarity) is available here, and if you are interested in further details, please refer to the following repository: COMP90086-Fine-grained-localisation.
The task of the project was to build a binary classifier that predicts whether a Twitter event constitutes a rumour.
The dataset for the project was published by the COMP90042 teaching team and consisted of a set of source tweets and their replies (incl. corresponding metadata) collected via the Twitter API. In total, the training data consisted of 4,641 events, each labeled as either RUMOUR or NON-RUMOUR (binary classification).
For this project, I have implemented three classification systems:
Using the best-performing model, BERTweet, I managed to achieve an F1 score of 86.17%, which put me at rank 12 out of 308 participants in the final CodaLab competition.
A detailed write-up of the implementation details (pre-processing routine etc.) for the models mentioned above is available here, and if you are interested in further details, please refer to the following repository: COMP90042-Rumour-Detection-on-Twitter
I have also used BERTweet to participate in the “Disaster Tweets” Kaggle challenge. The notebook is available here: Disaster Tweets - BERTweet
The exercises below use data from the faraway R package. They are part of the course MAST90139: Statistical Modelling for Data Science at the University of Melbourne.
First we clean up any variables that we may have left in the environment.
rm(list = ls())
library(faraway)
library(ggplot2)
data(pima)
head(pima)
help(pima)
dim(pima)
## [1] 768 9
Create a factor version of the test results and use this to produce an interleaved histogram to show how the distribution of insulin differs between those testing positive and negative. Do you notice anything unbelievable about the plot?
pima$test = as.factor(pima$test)
levels(pima$test) <- c("negative","positive"); pima[1,]
par(mfrow=c(1,2)); plot(insulin ~ test, pima)
ggplot(pima, aes(x=insulin, color=test)) + geom_histogram(position="dodge", binwidth=30)
library(ggplot2)
ggplot(pima, aes(x = insulin, color = test)) + geom_histogram(position="dodge",
binwidth=30, aes(y=..density..))
summary(pima$test[pima$insulin==0])
## negative positive
## 236 138
High values of insulin seem to correlate with signs of diabetes!
Replace the zero values of insulin with the missing value code NA. Recreate the interleaved histogram plot and comment on the distribution.
pima$insulin[pima$insulin == 0] <- NA
ggplot(pima, aes(x = insulin, color = test)) + geom_histogram(position="dodge",
binwidth=30, aes(y=..density..))
## Warning: Removed 374 rows containing non-finite values (stat_bin).
After replacing the zero values with NA, the relationship becomes even clearer!
Replace the incredible zeroes in other variables with the missing value code. Fit a model with the result of the diabetes test as the response and all the other variables as predictors. How many observations were used in the model fitting? Why is this less than the number of observations in the data frame?
# Replace zeroes across the whole data frame; note that this also blanks
# pregnant == 0, which is arguably a credible value
pima[pima == 0] <- NA
# Fit logistic regression model from binomial family
model1 <- glm(test ~ pregnant + glucose + diastolic + triceps + insulin + bmi + diabetes + age,family = binomial, pima)
summary(model1)
##
## Call:
## glm(formula = test ~ pregnant + glucose + diastolic + triceps +
## insulin + bmi + diabetes + age, family = binomial, data = pima)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8619 -0.6557 -0.3295 0.6158 2.6339
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.083e+01 1.423e+00 -7.610 2.73e-14 ***
## pregnant 7.364e-02 5.973e-02 1.233 0.2176
## glucose 3.616e-02 6.249e-03 5.785 7.23e-09 ***
## diastolic 5.993e-03 1.320e-02 0.454 0.6497
## triceps 1.110e-02 1.869e-02 0.594 0.5527
## insulin 3.231e-05 1.445e-03 0.022 0.9822
## bmi 7.615e-02 3.174e-02 2.399 0.0164 *
## diabetes 1.097e+00 4.777e-01 2.297 0.0216 *
## age 4.075e-02 1.919e-02 2.123 0.0337 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 426.34 on 335 degrees of freedom
## Residual deviance: 288.92 on 327 degrees of freedom
## (432 observations deleted due to missingness)
## AIC: 306.92
##
## Number of Fisher Scoring iterations: 5
In the pima data frame there are 768 observations, but only 327 + 9 = 336 (residual degrees of freedom plus the 9 estimated parameters) were used to fit the model; 432 observations were deleted due to missingness.
plot(model1)
summary(pima$insulin)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 14.00 76.25 125.00 155.55 190.00 846.00 374
summary(pima$insulin[pima$test=="negative"])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 15.0 66.0 102.5 130.3 161.2 744.0 236
summary(pima$insulin[pima$test=="positive"])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 14.0 127.5 169.5 206.8 239.2 846.0 138
Refit the model but now without the insulin and triceps predictors. How many observations were used in fitting this model? Devise a test to compare this model with that in the previous question.
# Fit logistic regression model from binomial family
model2 <- glm(test ~ pregnant + glucose + diastolic + bmi + diabetes + age,family = binomial, pima)
summary(model2)
##
## Call:
## glm(formula = test ~ pregnant + glucose + diastolic + bmi + diabetes +
## age, family = binomial, data = pima)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8459 -0.7067 -0.3827 0.7018 2.4302
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.354750 0.915697 -10.216 < 2e-16 ***
## pregnant 0.130695 0.037880 3.450 0.00056 ***
## glucose 0.035337 0.003900 9.061 < 2e-16 ***
## diastolic -0.008673 0.009422 -0.920 0.35734
## bmi 0.098547 0.017768 5.546 2.92e-08 ***
## diabetes 1.020669 0.336136 3.036 0.00239 **
## age 0.016642 0.010553 1.577 0.11478
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 807.12 on 624 degrees of freedom
## Residual deviance: 577.80 on 618 degrees of freedom
## (143 observations deleted due to missingness)
## AIC: 591.8
##
## Number of Fisher Scoring iterations: 5
We cannot compare the two models with ANOVA directly, because different amounts of data were used during model fitting. Without insulin and triceps, the model is fitted on 625 observations, whereas the model in the previous question used only 336. The two models can only be compared when they are fitted on the same data.
We make this possible by using the data set pimaN, which removes all cases containing NAs. Comparing lmodNA1 with lmodNA2 in R gives a p-value of 0.8386, so there is no significant difference between the two models in terms of adequacy of fit.
pimaN <- na.omit(pima)
lmodNA1 <- glm(test ~ pregnant+glucose+diastolic+triceps+insulin+bmi+diabetes+age, family = binomial, pimaN)
lmodNA2 <- glm(test ~ pregnant+glucose+diastolic+bmi+diabetes+age, family = binomial, pimaN)
anova(lmodNA2, lmodNA1, test="Chi")
Use AIC to select a model. You will need to take account of the missing values. Which predictors are selected? How many cases are used in your selected model?
lmodNAr <- step(lmodNA1, trace=0)
summary(lmodNAr)
##
## Call:
## glm(formula = test ~ glucose + bmi + diabetes + age, family = binomial,
## data = pimaN)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8112 -0.6673 -0.3433 0.6128 2.6207
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.810466 1.253806 -8.622 < 2e-16 ***
## glucose 0.036394 0.005495 6.624 3.51e-11 ***
## bmi 0.089165 0.024301 3.669 0.000243 ***
## diabetes 1.055880 0.465979 2.266 0.023455 *
## age 0.059405 0.014515 4.093 4.26e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 426.34 on 335 degrees of freedom
## Residual deviance: 291.12 on 331 degrees of freedom
## AIC: 301.12
##
## Number of Fisher Scoring iterations: 5
Create a variable that indicates whether the case contains a missing value. Use this variable as a predictor of the test result. Is missingness associated with the test result? Refit the selected model, but now using as much of the data as reasonable. Explain why it is appropriate to do this.
pima$misIndicator<-apply(pima,1, anyNA); xtabs(~test + misIndicator, pima)
## misIndicator
## test FALSE TRUE
## negative 225 275
## positive 111 157
summary(glm(test~misIndicator, family=binomial, pima))$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7065702 0.1159890 -6.0917001 1.117178e-09
## misIndicatorTRUE 0.1460449 0.1531641 0.9535193 3.403270e-01
anova(glm(test ~ misIndicator, family=binomial, pima), test="Chi")
chisq.test(pima$test, pima$misIndicator, correct=F)
##
## Pearson's Chi-squared test
##
## data: pima$test and pima$misIndicator
## X-squared = 0.90974, df = 1, p-value = 0.3402
lmodNArs <- glm(test ~ glucose + bmi + diabetes + age, family = binomial, data = pima)
summary(lmodNArs)
##
## Call:
## glm(formula = test ~ glucose + bmi + diabetes + age, family = binomial,
## data = pima)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7389 -0.7362 -0.4103 0.7239 2.4344
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.302177 0.728380 -12.771 < 2e-16 ***
## glucose 0.035281 0.003517 10.030 < 2e-16 ***
## bmi 0.086372 0.014448 5.978 2.25e-09 ***
## diabetes 0.866221 0.298356 2.903 0.003692 **
## age 0.028764 0.007852 3.663 0.000249 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 974.75 on 751 degrees of freedom
## Residual deviance: 716.30 on 747 degrees of freedom
## (16 observations deleted due to missingness)
## AIC: 726.3
##
## Number of Fisher Scoring iterations: 5
Using the last fitted model of the previous question, what is the difference in the log-odds of testing positive for diabetes for a woman with a BMI at the first quartile compared with a woman at the third quartile, assuming that all other factors are held constant? Then calculate the associated odds ratio value, and give a 95% confidence interval for this odds ratio.
summary(pima$bmi)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.20 27.50 32.30 32.46 36.60 67.10 11
# Difference in log-odds between the 3rd and 1st quartile of bmi
diff = 0.086372 * (36.60 - 27.50)
# Associated odds ratio
exp_diff = exp(0.086372 * (36.60 - 27.50))
# 95% conf. int. for the log-odds difference
conf_int_odds <- cbind(diff - 1.96*0.014448*(36.6 - 27.50), diff + 1.96*0.014448*(36.6 - 27.50))
# 95% conf. int. for the odds ratio
(conf_int_exp <- cbind(exp(conf_int_odds[1]), exp(conf_int_odds[2])))
## [,1] [,2]
## [1,] 1.696031 2.839647
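The same quantities can also be computed directly from the fitted model object lmodNArs instead of hard-coding the estimates:

# Odds ratio and 95% CI computed from the model object
b  <- coef(lmodNArs)["bmi"]
se <- sqrt(vcov(lmodNArs)["bmi", "bmi"])
d  <- 36.60 - 27.50                    # 3rd quartile minus 1st quartile of bmi
exp(b * d)                             # odds ratio
exp(b * d + c(-1, 1) * 1.96 * se * d)  # 95% confidence interval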
Do women who test positive have higher diastolic blood pressures? Is the diastolic blood pressure significant in the logistic regression model? Explain the distinction between the two questions and discuss why the answers are only apparently contradictory.
Diastolic values tend to be higher for those testing positive, although the interleaved histograms of diastolic for the positive and negative groups do not look dramatically different. Both the two-sample t-test and the Wilcoxon rank-sum test below suggest that the positive cases have significantly higher diastolic blood pressures (with p-values of 1.703e-06 and 8.143e-07, respectively).
On the other hand, diastolic is not found to be significant for the odds of a positive vs. negative test in the aforementioned logistic models. This means that a given difference between the diastolic pressures of two women does not lead to a significant odds ratio of a positive vs. negative test between the two women. The first question concerns the marginal association between diastolic and the test result, while the second concerns the effect of diastolic after adjusting for the other predictors; a marginal association can disappear once correlated predictors (such as BMI) are accounted for. Therefore, although the two answers appear to be contradictory, they actually are not.
summary(pima$diastolic[pima$test=="negative"])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 24.00 62.00 70.00 70.88 78.00 122.00 19
summary(pima$diastolic[pima$test=="positive"])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 30.00 68.00 74.50 75.32 84.00 114.00 16
t.test(diastolic~test, alternative="less",data=pima, var.equal=T)
##
## Two Sample t-test
##
## data: diastolic by test
## t = -4.6808, df = 731, p-value = 1.703e-06
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -2.880447
## sample estimates:
## mean in group negative mean in group positive
## 70.87734 75.32143
wilcox.test(diastolic~test, alternative="less",data=pima)
##
## Wilcoxon rank sum test with continuity correction
##
## data: diastolic by test
## W = 47566, p-value = 8.143e-07
## alternative hypothesis: true location shift is less than 0
ggplot(pima, aes(x=diastolic, color=test)) + geom_histogram(position="dodge", binwidth=10)
## Warning: Removed 35 rows containing non-finite values (stat_bin).
ggplot(pima, aes(x=diastolic, color=test)) + geom_histogram(position="dodge", binwidth=10, aes(y=..density..))
## Warning: Removed 35 rows containing non-finite values (stat_bin).
First we clean up any variables that may be left in the existing R environment.
rm(list = ls())
Load data from Faraway.
library(faraway); require(graphics);
data(swiss)
?swiss
dim(swiss);
## [1] 47 6
head(swiss)
Print out numerical summary of variables
summary(swiss)
## Fertility Agriculture Examination Education
## Min. :35.00 Min. : 1.20 Min. : 3.00 Min. : 1.00
## 1st Qu.:64.70 1st Qu.:35.90 1st Qu.:12.00 1st Qu.: 6.00
## Median :70.40 Median :54.10 Median :16.00 Median : 8.00
## Mean :70.14 Mean :50.66 Mean :16.49 Mean :10.98
## 3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:22.00 3rd Qu.:12.00
## Max. :92.50 Max. :89.70 Max. :37.00 Max. :53.00
## Catholic Infant.Mortality
## Min. : 2.150 Min. :10.80
## 1st Qu.: 5.195 1st Qu.:18.15
## Median : 15.140 Median :20.00
## Mean : 41.144 Mean :19.94
## 3rd Qu.: 93.125 3rd Qu.:21.70
## Max. :100.000 Max. :26.60
cor(swiss)
## Fertility Agriculture Examination Education Catholic
## Fertility 1.0000000 0.35307918 -0.6458827 -0.66378886 0.4636847
## Agriculture 0.3530792 1.00000000 -0.6865422 -0.63952252 0.4010951
## Examination -0.6458827 -0.68654221 1.0000000 0.69841530 -0.5727418
## Education -0.6637889 -0.63952252 0.6984153 1.00000000 -0.1538589
## Catholic 0.4636847 0.40109505 -0.5727418 -0.15385892 1.0000000
## Infant.Mortality 0.4165560 -0.06085861 -0.1140216 -0.09932185 0.1754959
## Infant.Mortality
## Fertility 0.41655603
## Agriculture -0.06085861
## Examination -0.11402160
## Education -0.09932185
## Catholic 0.17549591
## Infant.Mortality 1.00000000
The numerical summary of the data shows that all six variables are numeric, with weak to fairly strong linear correlations among them (|r| up to about 0.70).
pairs(swiss, panel = panel.smooth, main = "swiss data", col = 3 + (swiss$Catholic > 50))
plot(density(swiss$Fertility),main="Fertility",xlab="Fertility")
rug(swiss$Fertility)
hist(swiss$Fertility,freq=F,add=T)
qqnorm(swiss$Fertility, ylab="Fertility")
qqline(swiss$Fertility)
It seems the distribution of Fertility is not too different from the normal except for small values of Fertility.
plot(swiss)
A matrix of scatter-plots for the 6 variables indicates that:

  * Fertility has a positive correlation with Agriculture and Infant.Mortality;
  * Fertility has a negative correlation with Examination and Education;
  * Fertility has a curvilinear relationship with Catholic.
plot(Fertility ~ Agriculture, swiss, xlab="", las=3)
# Interesting observation: a higher share of Catholics comes with higher fertility
plot(Fertility ~ Catholic, swiss, xlab="", las=3)
plot(Fertility ~ Education, swiss, xlab="", las=3)
plot(Fertility ~ Infant.Mortality, swiss, xlab="", las=3)
We start by fitting a linear regression model.
lmod <- lm(Fertility ~ ., swiss);
summary(lmod)
##
## Call:
## lm(formula = Fertility ~ ., data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Examination -0.25801 0.25388 -1.016 0.31546
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
# Use drop1(lmod, test="F") alternatively
lmod_reduced = step(lmod)
## Start: AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
##
## Df Sum of Sq RSS AIC
## - Examination 1 53.03 2158.1 189.86
## <none> 2105.0 190.69
## - Agriculture 1 307.72 2412.8 195.10
## - Infant.Mortality 1 408.75 2513.8 197.03
## - Catholic 1 447.71 2552.8 197.75
## - Education 1 1162.56 3267.6 209.36
##
## Step: AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
##
## Df Sum of Sq RSS AIC
## <none> 2158.1 189.86
## - Agriculture 1 264.18 2422.2 193.29
## - Infant.Mortality 1 409.81 2567.9 196.03
## - Catholic 1 956.57 3114.6 205.10
## - Education 1 2249.97 4408.0 221.43
summary(lmod_reduced)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
anova(lmod, lmod_reduced)
By both a t-test and an ANOVA F-test we find that Examination does not have a significant effect on Fertility.
We then treat Fertility ~ (Agriculture + Education + Catholic + Infant.Mortality)^2 as the full model and use step() with the BIC penalty (k = log(47)) to select the best model.
# Interaction term doesn't seem to bring major improvements
lmodi = lm(Fertility ~ (Agriculture + Education + Catholic + Infant.Mortality)^2, data = swiss)
lmodi_reduced = step(lmodi, trace = FALSE, k = log(47))
summary(lmodi_reduced)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality + Education:Catholic, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9060 -5.4997 0.9556 3.6698 13.8934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.752308 9.919330 5.419 2.89e-06 ***
## Agriculture -0.134055 0.065843 -2.036 0.04825 *
## Education -0.515105 0.252478 -2.040 0.04781 *
## Catholic 0.207038 0.046184 4.483 5.81e-05 ***
## Infant.Mortality 1.239697 0.372195 3.331 0.00184 **
## Education:Catholic -0.011255 0.005058 -2.225 0.03161 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.853 on 41 degrees of freedom
## Multiple R-squared: 0.7318, Adjusted R-squared: 0.699
## F-statistic: 22.37 on 5 and 41 DF, p-value: 9.443e-11
The fitted best model is

$\text{Fertility} = 53.75 - 0.134\,\text{Agriculture} - 0.515\,\text{Education} + 0.207\,\text{Catholic} + 1.24\,\text{Infant.Mortality} - 0.011\,\text{Education:Catholic}$

with $R^2 = 0.7318$ and adjusted $R^2 = 0.699$.
drop1(lmodi_reduced)
summary(lmodi_reduced)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality + Education:Catholic, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9060 -5.4997 0.9556 3.6698 13.8934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.752308 9.919330 5.419 2.89e-06 ***
## Agriculture -0.134055 0.065843 -2.036 0.04825 *
## Education -0.515105 0.252478 -2.040 0.04781 *
## Catholic 0.207038 0.046184 4.483 5.81e-05 ***
## Infant.Mortality 1.239697 0.372195 3.331 0.00184 **
## Education:Catholic -0.011255 0.005058 -2.225 0.03161 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.853 on 41 degrees of freedom
## Multiple R-squared: 0.7318, Adjusted R-squared: 0.699
## F-statistic: 22.37 on 5 and 41 DF, p-value: 9.443e-11
The terms in this model cannot be further reduced by the drop1() command.
par(mfrow=c(2,2)); termplot(lmodi_reduced,partial=T,terms=NULL); plot(lmodi_reduced)
## Warning in termplot(lmodi_reduced, partial = T, terms = NULL): 'model' appears
## to involve interactions: see the help page
The model does not seem to need a transformation of the response variable, because the empirical distribution of Fertility is not far from normal.
On the other hand, the relationship between Fertility and Catholic appears to be curvilinear.
Hence we investigate a transformation of the Catholic variable, which has a curvilinear effect on Fertility. We first replace Catholic with a quadratic term.
library(MASS)
# poly(Catholic, 2) constructs orthogonal polynomial terms of degree 1 and 2,
# which are uncorrelated by construction; raw = TRUE would return the plain
# powers instead, e.g. poly(c(1, 2, 3, 4), 3, raw = TRUE)
Wlmodp<-lm(Fertility~Agriculture+Education+poly(Catholic,2)+Infant.Mortality + Education:poly(Catholic,2), swiss)
summary(Wlmodp)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + poly(Catholic,
## 2) + Infant.Mortality + Education:poly(Catholic, 2), data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5584 -5.0451 0.0393 3.5404 15.3300
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.37216 9.90815 6.093 3.84e-07 ***
## Agriculture -0.13234 0.07102 -1.863 0.069955 .
## Education -0.68355 0.26160 -2.613 0.012684 *
## poly(Catholic, 2)1 51.29861 14.04601 3.652 0.000763 ***
## poly(Catholic, 2)2 1.52390 12.92429 0.118 0.906744
## Infant.Mortality 1.21767 0.37666 3.233 0.002496 **
## Education:poly(Catholic, 2)1 -1.79870 1.73121 -1.039 0.305211
## Education:poly(Catholic, 2)2 0.76806 0.72902 1.054 0.298573
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.851 on 39 degrees of freedom
## Multiple R-squared: 0.745, Adjusted R-squared: 0.6992
## F-statistic: 16.28 on 7 and 39 DF, p-value: 8.356e-10
Wlmodp1<-lm(Fertility~Agriculture + Education + poly(Catholic,2) + Infant.Mortality + Education:Catholic, swiss)
summary(Wlmodp1)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + poly(Catholic,
## 2) + Infant.Mortality + Education:Catholic, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.8316 -5.2273 0.2632 4.0651 14.2838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.826469 9.825002 6.293 1.83e-07 ***
## Agriculture -0.152171 0.068575 -2.219 0.032221 *
## Education -0.517682 0.252752 -2.048 0.047145 *
## poly(Catholic, 2)1 55.884416 13.372902 4.179 0.000155 ***
## poly(Catholic, 2)2 9.820777 10.261947 0.957 0.344311
## Infant.Mortality 1.269834 0.373906 3.396 0.001556 **
## Education:Catholic -0.009239 0.005484 -1.685 0.099837 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.86 on 40 degrees of freedom
## Multiple R-squared: 0.7378, Adjusted R-squared: 0.6984
## F-statistic: 18.75 on 6 and 40 DF, p-value: 3.078e-10
plot(Wlmodp1)
termplot(Wlmodp,partial=T,terms=3)
library(splines)
Wlmods<-lm(Fertility~Agriculture+Education+bs(Catholic,3)+Infant.Mortality + Education:bs(Catholic,3), swiss)
summary(Wlmods)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + bs(Catholic,
## 3) + Infant.Mortality + Education:bs(Catholic, 3), data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5473 -5.0681 0.5734 3.2353 15.5592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.15343 11.66882 5.069 1.14e-05 ***
## Agriculture -0.12260 0.07291 -1.681 0.1011
## Education -0.55823 0.41581 -1.343 0.1876
## bs(Catholic, 3)1 -11.07029 18.70215 -0.592 0.5575
## bs(Catholic, 3)2 35.45274 28.03511 1.265 0.2139
## bs(Catholic, 3)3 13.63965 6.35112 2.148 0.0384 *
## Infant.Mortality 1.00679 0.43136 2.334 0.0251 *
## Education:bs(Catholic, 3)1 0.66876 1.50115 0.445 0.6586
## Education:bs(Catholic, 3)2 -2.42495 1.48402 -1.634 0.1107
## Education:bs(Catholic, 3)3 -0.25491 0.84904 -0.300 0.7657
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.931 on 37 degrees of freedom
## Multiple R-squared: 0.7524, Adjusted R-squared: 0.6921
## F-statistic: 12.49 on 9 and 37 DF, p-value: 8.342e-09
Wlmods1<-lm(Fertility~Agriculture+Education+bs(Catholic,3)+Infant.Mortality + Education:Catholic, swiss)
summary(Wlmods1)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + bs(Catholic,
## 3) + Infant.Mortality + Education:Catholic, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.377 -5.072 0.321 4.014 14.446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.455117 10.117571 5.481 2.72e-06 ***
## Agriculture -0.148741 0.070264 -2.117 0.040702 *
## Education -0.479725 0.284130 -1.688 0.099316 .
## bs(Catholic, 3)1 -3.217588 11.717565 -0.275 0.785077
## bs(Catholic, 3)2 11.056766 19.100029 0.579 0.565994
## bs(Catholic, 3)3 19.202744 4.706975 4.080 0.000216 ***
## Infant.Mortality 1.259959 0.379587 3.319 0.001964 **
## Education:Catholic -0.010382 0.006687 -1.553 0.128614
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.939 on 39 degrees of freedom
## Multiple R-squared: 0.7384, Adjusted R-squared: 0.6914
## F-statistic: 15.72 on 7 and 39 DF, p-value: 1.35e-09
plot(Wlmods1)
termplot(Wlmods,partial=T,terms=NULL)
## Warning in termplot(Wlmods, partial = T, terms = NULL): 'model' appears to
## involve interactions: see the help page
Thus, we replace Catholic by poly(Catholic, 2) or bs(Catholic, 3) in the model lmodi_reduced to see whether the fit can be improved.
Doing so does not achieve any significant improvement. Here the order 2 in poly() and the 3 degrees of freedom in bs() were selected by trial and error.
plot(lmodi_reduced)
The diagnostic plots given by plot(lmodi_reduced) show that the model provides a good fit to the data in general.
The residuals vs. leverage plot identifies 4 provinces that have the largest Cook’s distance values and are influential in the model fitting. These 4 provinces are Porrentruy, Sierre, Sion, and Rive Gauche.
We look up the row indices of these outliers first:
(1:47)[rownames(swiss)=="Sion"] #38
## [1] 38
(1:47)[rownames(swiss)=="Sierre"] #37
## [1] 37
(1:47)[rownames(swiss)=="Porrentruy"] #6
## [1] 6
(1:47)[rownames(swiss)=="Rive Gauche"] #47
## [1] 47
swiss[c(6,37,38,47),]
rownames(swiss)[c(6,37,38,47)]
## [1] "Porrentruy" "Sierre" "Sion" "Rive Gauche"
hatvalues(lmodi_reduced)
## Courtelary Delemont Franches-Mnt Moutier Neuveville Porrentruy
## 0.10260793 0.07580639 0.13623845 0.05973401 0.07103360 0.18823843
## Broye Glane Gruyere Sarine Veveyse Aigle
## 0.08842886 0.11689602 0.07577713 0.14642334 0.10934403 0.07917772
## Aubonne Avenches Cossonay Echallens Grandson Lausanne
## 0.07598143 0.12553452 0.08628630 0.07370320 0.06376129 0.22670153
## La Vallee Lavaux Morges Moudon Nyone Orbe
## 0.34956953 0.10836510 0.05764165 0.09609790 0.05289707 0.11454455
## Oron Payerne Paysd'enhaut Rolle Vevey Yverdon
## 0.13083077 0.11033637 0.10103769 0.07368892 0.06659714 0.06513090
## Conthey Entremont Herens Martigwy Monthey St Maurice
## 0.19704637 0.09502856 0.15680036 0.07880332 0.10133459 0.10272389
## Sierre Sion Boudry La Chauxdfnd Le Locle Neuchatel
## 0.15149713 0.15886464 0.04426588 0.15781432 0.10040794 0.31804461
## Val de Ruz ValdeTravers V. De Geneve Rive Droite Rive Gauche
## 0.06265152 0.14516128 0.48313011 0.18286595 0.23514778
influence.measures(lmodi_reduced)
## Influence measures of
## lm(formula = Fertility ~ Agriculture + Education + Catholic + Infant.Mortality + Education:Catholic, data = swiss) :
##
## dfb.1_ dfb.Agrc dfb.Edct dfb.Cthl dfb.In.M dfb.Ed.C dffit
## Courtelary 0.04432 -0.23133 -0.04702 0.02321 0.08976 -0.045367 0.3477
## Delemont 0.02296 -0.07573 -0.06375 0.02187 0.02022 0.045778 0.1524
## Franches-Mnt 0.21223 -0.34081 -0.13079 0.24396 -0.10810 -0.040840 0.4714
## Moutier 0.23577 -0.31232 -0.24858 -0.04524 -0.06493 0.109009 0.4287
## Neuveville -0.19949 0.14133 0.30260 0.07440 0.17956 -0.277192 0.4697
## Porrentruy 0.28682 0.51113 0.17016 -0.33337 -0.65439 0.042706 -1.1444
## Broye -0.05121 0.02808 0.01518 0.02464 0.05232 -0.009638 0.0894
## Glane -0.33831 0.14506 0.05770 0.09969 0.37261 -0.003844 0.5522
## Gruyere 0.00769 -0.01512 -0.00899 0.01531 -0.00207 0.002235 0.0360
## Sarine -0.04332 -0.08637 -0.17368 -0.08170 0.14313 0.224415 0.3991
## Veveyse -0.02780 0.00534 0.00927 0.02191 0.03192 -0.011992 0.0512
## Aigle 0.02555 0.08203 0.04882 -0.02599 -0.06348 -0.027970 0.1668
## Aubonne -0.00590 0.04462 -0.00259 -0.04341 0.00245 0.011573 0.0789
## Avenches -0.04235 0.03756 0.03065 -0.00626 0.03866 -0.020754 0.0588
## Cossonay -0.01869 -0.07902 0.04641 0.11646 0.01740 -0.059470 -0.1765
## Echallens 0.04652 -0.10426 0.05875 0.11913 -0.06872 -0.059214 -0.2316
## Grandson 0.02560 -0.03053 -0.02578 -0.01816 -0.00468 0.010647 0.0563
## Lausanne 0.15517 -0.03529 -0.35875 -0.22776 -0.10298 0.323257 -0.4258
## La Vallee -0.04166 0.02470 -0.00858 -0.01925 0.04586 0.018153 -0.0606
## Lavaux -0.01514 0.02795 0.01095 -0.01275 0.00996 -0.003650 0.0359
## Morges 0.00589 0.03622 0.01248 -0.02478 -0.01456 -0.005858 0.0774
## Moudon 0.02621 -0.01703 0.20818 0.27871 -0.15585 -0.177715 -0.4225
## Nyone -0.10466 -0.01819 -0.01116 0.02788 0.12435 0.010459 -0.2100
## Orbe -0.24083 0.04520 0.14986 0.13634 0.22332 -0.114807 -0.3093
## Oron -0.00167 0.04374 -0.06701 -0.10697 0.02335 0.071831 0.1384
## Payerne -0.07553 0.05346 0.01904 -0.03989 0.08947 -0.008693 0.1207
## Paysd'enhaut 0.12733 0.02770 -0.16655 -0.20464 -0.09151 0.154218 0.2853
## Rolle -0.01353 -0.01267 -0.00167 0.01062 0.01910 -0.000469 -0.0375
## Vevey 0.06764 0.05712 -0.16425 -0.10122 -0.09586 0.160539 -0.3186
## Yverdon 0.09036 -0.04223 0.00781 0.09329 -0.14709 -0.001803 -0.2467
## Conthey -0.06158 -0.03564 -0.05678 -0.13143 0.11872 0.091731 -0.2257
## Entremont 0.10322 -0.21873 -0.11342 -0.15669 -0.00473 0.077303 -0.4227
## Herens 0.04272 -0.11975 -0.13976 -0.19995 0.03890 0.172081 -0.3090
## Martigwy 0.02012 -0.10830 -0.05430 -0.14520 0.04214 0.050813 -0.3363
## Monthey -0.00772 0.04713 -0.06527 -0.24601 0.01466 0.158212 -0.3184
## St Maurice -0.09479 -0.07063 0.06414 -0.01520 0.14726 -0.111047 -0.3379
## Sierre 0.19380 0.20869 0.24983 0.55966 -0.44117 -0.370574 0.9782
## Sion 0.37408 -0.03665 -0.44940 -0.24868 -0.37496 0.625641 0.8890
## Boudry -0.00192 -0.00941 0.01487 -0.00289 0.01547 -0.023829 0.0765
## La Chauxdfnd -0.28260 0.48690 0.21199 -0.02541 0.06726 -0.030768 -0.5489
## Le Locle 0.11188 -0.15224 -0.04212 0.02140 -0.05499 -0.014786 0.1923
## Neuchatel -0.25321 0.08241 0.36291 0.19664 0.20863 -0.293513 0.4274
## Val de Ruz 0.12806 -0.13513 -0.14788 -0.11270 -0.02544 0.077760 0.2757
## ValdeTravers -0.23155 0.28786 0.20848 0.06018 0.09330 -0.091000 -0.3449
## V. De Geneve -0.01186 0.03937 0.13125 -0.07455 -0.05176 0.134480 0.5617
## Rive Droite -0.00920 -0.10566 -0.00117 0.13190 0.05586 -0.167981 -0.3347
## Rive Gauche -0.22843 0.11133 0.31880 0.40254 0.16802 -0.638422 -0.8724
## cov.r cook.d hat inf
## Courtelary 1.105 0.020115 0.1026
## Delemont 1.203 0.003941 0.0758
## Franches-Mnt 1.091 0.036669 0.1362
## Moutier 0.811 0.029276 0.0597
## Neuveville 0.822 0.035158 0.0710
## Porrentruy 0.647 0.196042 0.1882
## Broye 1.257 0.001362 0.0884
## Glane 0.939 0.049247 0.1169
## Gruyere 1.252 0.000221 0.0758
## Sarine 1.184 0.026593 0.1464
## Veveyse 1.298 0.000447 0.1093
## Aigle 1.200 0.004715 0.0792
## Aubonne 1.241 0.001061 0.0760
## Avenches 1.321 0.000590 0.1255
## Cossonay 1.208 0.005280 0.0863
## Echallens 1.133 0.009011 0.0737
## Grandson 1.230 0.000540 0.0638
## Lausanne 1.368 0.030501 0.2267
## La Vallee 1.781 0.000627 0.3496 *
## Lavaux 1.299 0.000221 0.1084
## Morges 1.213 0.001022 0.0576
## Moudon 1.003 0.029263 0.0961
## Nyone 1.089 0.007385 0.0529
## Orbe 1.173 0.016043 0.1145
## Oron 1.309 0.003262 0.1308
## Payerne 1.281 0.002483 0.1103
## Paysd'enhaut 1.158 0.013654 0.1010
## Rolle 1.249 0.000240 0.0737
## Vevey 1.007 0.016744 0.0666
## Yverdon 1.090 0.010175 0.0651
## Conthey 1.400 0.008659 0.1970
## Entremont 0.998 0.029280 0.0950
## Herens 1.274 0.016108 0.1568
## Martigwy 1.036 0.018707 0.0788
## Monthey 1.129 0.016938 0.1013
## St Maurice 1.115 0.019030 0.1027
## Sierre 0.643 0.144148 0.1515
## Sion 0.759 0.122232 0.1589
## Boudry 1.191 0.000996 0.0443
## La Chauxdfnd 1.087 0.049474 0.1578
## Le Locle 1.227 0.006262 0.1004
## Neuchatel 1.604 0.030904 0.3180 *
## Val de Ruz 1.046 0.012622 0.0627
## ValdeTravers 1.222 0.019977 0.1452
## V. De Geneve 2.133 0.053448 0.4831 *
## Rive Droite 1.317 0.018903 0.1829
## Rive Gauche 1.058 0.122438 0.2351
cooks.distance(lmodi_reduced)
## Courtelary Delemont Franches-Mnt Moutier Neuveville Porrentruy
## 0.0201154852 0.0039410826 0.0366691513 0.0292755138 0.0351584235 0.1960420472
## Broye Glane Gruyere Sarine Veveyse Aigle
## 0.0013622033 0.0492469430 0.0002214636 0.0265934794 0.0004467372 0.0047153887
## Aubonne Avenches Cossonay Echallens Grandson Lausanne
## 0.0010608615 0.0005897404 0.0052804731 0.0090110500 0.0005400547 0.0305009295
## La Vallee Lavaux Morges Moudon Nyone Orbe
## 0.0006271319 0.0002206554 0.0010221776 0.0292633997 0.0073849102 0.0160432221
## Oron Payerne Paysd'enhaut Rolle Vevey Yverdon
## 0.0032624985 0.0024832884 0.0136542028 0.0002399045 0.0167444949 0.0101748505
## Conthey Entremont Herens Martigwy Monthey St Maurice
## 0.0086585000 0.0292802881 0.0161075451 0.0187071209 0.0169381979 0.0190298484
## Sierre Sion Boudry La Chauxdfnd Le Locle Neuchatel
## 0.1441475824 0.1222319428 0.0009958496 0.0494736931 0.0062623898 0.0309042646
## Val de Ruz ValdeTravers V. De Geneve Rive Droite Rive Gauche
## 0.0126224991 0.0199766071 0.0534481565 0.0189033468 0.1224375602
sort(cooks.distance(lmodi_reduced))
## Lavaux Gruyere Rolle Veveyse Grandson Avenches
## 0.0002206554 0.0002214636 0.0002399045 0.0004467372 0.0005400547 0.0005897404
## La Vallee Boudry Morges Aubonne Broye Payerne
## 0.0006271319 0.0009958496 0.0010221776 0.0010608615 0.0013622033 0.0024832884
## Oron Delemont Aigle Cossonay Le Locle Nyone
## 0.0032624985 0.0039410826 0.0047153887 0.0052804731 0.0062623898 0.0073849102
## Conthey Echallens Yverdon Val de Ruz Paysd'enhaut Orbe
## 0.0086585000 0.0090110500 0.0101748505 0.0126224991 0.0136542028 0.0160432221
## Herens Vevey Monthey Martigwy Rive Droite St Maurice
## 0.0161075451 0.0167444949 0.0169381979 0.0187071209 0.0189033468 0.0190298484
## ValdeTravers Courtelary Sarine Moudon Moutier Entremont
## 0.0199766071 0.0201154852 0.0265934794 0.0292633997 0.0292755138 0.0292802881
## Lausanne Neuchatel Neuveville Franches-Mnt Glane La Chauxdfnd
## 0.0305009295 0.0309042646 0.0351584235 0.0366691513 0.0492469430 0.0494736931
## V. De Geneve Sion Rive Gauche Sierre Porrentruy
## 0.0534481565 0.1222319428 0.1224375602 0.1441475824 0.1960420472
We observe that these four provinces have the most extreme residuals with respect to the model lmodi_reduced, but they do not have large leverage values: their predictor values are mostly not unusual in comparison with those of the other provinces.
Of the many potentially interesting predictor values, we chose to predict the Fertility value at the mean values of the predictors.
First we construct a data frame to perform the prediction on.
predictor_df <- data.frame(Agriculture=mean(swiss$Agriculture), Examination=mean(swiss$Examination), Education=mean(swiss$Education), Catholic=mean(swiss$Catholic), Infant.Mortality=mean(swiss$Infant.Mortality))
pp <- predict(lmodi_reduced,new=predictor_df, se.fit=T);
pp
## $fit
## 1
## 69.46289
##
## $se.fit
## [1] 1.045219
##
## $df
## [1] 41
##
## $residual.scale
## [1] 6.852949
We can also construct a data frame with arbitrary, hypothetical predictor values.
test = data.frame(Agriculture=80, Education=6, Catholic=12, Infant.Mortality=20)
predict(lmodi_reduced, test)
## 1
## 66.40531
At the mean predictor values, the predicted value of Fertility equals 69.46289 with standard error 1.045219.
predict(lmodi_reduced)
## Courtelary Delemont Franches-Mnt Moutier Neuveville Porrentruy
## 73.53024 79.56271 84.97775 74.75043 65.92926 90.00600
## Broye Glane Gruyere Sarine Veveyse Aigle
## 81.90081 82.77823 81.56129 76.79370 86.14437 60.32787
## Aubonne Avenches Cossonay Echallens Grandson Lausanne
## 65.06722 67.89389 65.49427 73.73686 70.25363 60.46134
## La Vallee Lavaux Morges Moudon Nyone Orbe
## 54.76245 64.42461 63.39349 73.37298 62.54028 62.96261
## Oron Payerne Paysd'enhaut Rolle Vevey Yverdon
## 70.19603 71.95956 66.45264 61.38721 66.15653 71.60261
## Conthey Entremont Herens Martigwy Monthey St Maurice
## 78.32541 77.73235 81.83654 78.03419 85.56745 71.48278
## Sierre Sion Boudry La Chauxdfnd Le Locle Neuchatel
## 78.30662 66.91495 67.99368 73.61544 68.92875 60.83159
## Val de Ruz ValdeTravers V. De Geneve Rive Droite Rive Gauche
## 70.53746 72.92320 32.11418 49.11011 52.06441
We list the cases where the prediction differs from the true Fertility value by more than 5:
# Cases where the prediction is off by more than 5
wrong_preds = swiss[c((1:47)[abs(swiss$Fertility - predict(lmodi_reduced)) > 5]),]
# Show the prediction errors in decreasing order of magnitude
sort(abs(wrong_preds$Fertility - predict(lmodi_reduced, wrong_preds)), decreasing = TRUE)
## Porrentruy Sierre Sion Moutier Neuveville Glane
## 13.905996 13.893381 12.385050 11.049569 10.970740 9.621765
## Rive Gauche Entremont Moudon La Chauxdfnd Vevey Martigwy
## 9.264406 8.432350 8.372979 7.915438 7.856525 7.534193
## Franches-Mnt Val de Ruz Courtelary St Maurice Yverdon Monthey
## 7.522247 7.062543 6.669760 6.482779 6.202610 6.167454
## Sarine Nyone Orbe Paysd'enhaut Echallens ValdeTravers
## 6.106300 5.940283 5.562608 5.547357 5.436856 5.323200
summary(lmodi_reduced)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality + Education:Catholic, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9060 -5.4997 0.9556 3.6698 13.8934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.752308 9.919330 5.419 2.89e-06 ***
## Agriculture -0.134055 0.065843 -2.036 0.04825 *
## Education -0.515105 0.252478 -2.040 0.04781 *
## Catholic 0.207038 0.046184 4.483 5.81e-05 ***
## Infant.Mortality 1.239697 0.372195 3.331 0.00184 **
## Education:Catholic -0.011255 0.005058 -2.225 0.03161 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.853 on 41 degrees of freedom
## Multiple R-squared: 0.7318, Adjusted R-squared: 0.699
## F-statistic: 22.37 on 5 and 41 DF, p-value: 9.443e-11
The selected model lmodi_reduced suggests that all predictors except Examination are significantly related to Fertility, with the directions of the relationships given in the summary(lmodi_reduced) output. In addition, Education and Catholic have a significant interaction effect on Fertility.
The area of “AI for Social Impact” is a fast-emerging field of scientific research that refers to the use of Artificial Intelligence to solve challenging problems in our society.
This blog post will show that in the area of AI for Social Impact, we are not only interested in algorithmic advancements, but also aim to deliver real-world social impact. This post is intended to provide interested researchers and practitioners with an understanding of this growing area of research and give an overview of some of the problems that can be solved by applying AI for Social Impact.
Even though “AI for Social Impact” is a subdiscipline within AI, there are three key aspects in which “AI for Social Impact” differs from traditional AI research:
This means that “AI for Social Impact” researchers have to invest their resources differently to make contributions to problems of great social importance.
A high-level overview of a step-wise approach towards deploying an “AI for Social Impact” model has been laid out by Andrew Perrault et al. in their paper on “learning and planning in the data-to-deployment pipeline”. It consists of four steps:
As a first step, immersion in the domain is crucial to get an understanding of the problems, constraints, and datasets. This may be a step that involves discussions with various stakeholders (including the impacted community). In this step, it is also important to build interdisciplinary partnerships and understand the challenges from the perspective of domain experts.
Following an in-depth understanding of the problem situation, the next step is building a predictive model using machine learning or domain expert input. Such a predictive model may, for example, predict high-risk vs low-risk cases in a population.
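As a toy illustration of this step (purely synthetic data; not taken from any of the deployed systems discussed below), a logistic regression can serve as such a risk model:

# Toy "predictive model" step: flag high-risk cases with a logistic regression
set.seed(1)
pop <- data.frame(age = rnorm(500, 40, 12), visits = rpois(500, 3))
pop$high_risk <- rbinom(500, 1, plogis(-6 + 0.1 * pop$age + 0.3 * pop$visits))
risk_model <- glm(high_risk ~ age + visits, family = binomial, data = pop)
pop$risk <- predict(risk_model, type = "response")
head(pop[order(-pop$risk), ])          # cases to prioritize for intervention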
The next step is the prescriptive algorithm phase, which plans interventions. Work on AI for social impact often focuses on domains where access to data is difficult (e.g., low-resource communities or emerging-market countries). Hence, the challenge is often to plan interventions even though the data is uncertain and sparse.
The final step is field testing and deployment. This phase helps researchers learn about the social impact as well as the key limitations of the models and algorithms (and might even lead to fundamentally new research questions). Crucial for this phase are the interdisciplinary partnerships with the respective communities for immersion and field testing.
In this section, we want to give an overview of some exemplary problems that have already been tackled by “AI for Social Impact” research. This list is by no means exhaustive - we merely decided to focus on areas of research that we find interesting from both a societal as well as technical perspective.
HIV is a serious threat to public health. Homeless youth are particularly susceptible to the spread of HIV because of injection drug use and unsafe sexual practices. It is therefore of major importance to raise awareness of HIV among homeless youth. To address this, various researchers have used techniques from sequential decision making and influence maximization.
Youth social workers routinely launch “peer leader” programs to inform selected participants about HIV prevention, hoping that these peer leaders will spread the information to other homeless youth. These programs are often constrained by their available resources (financial and human) and cannot reach every homeless youth. It is therefore important to select the “right” set of peer leaders.
Previously, service providers have used the degree centrality measure of a social graph to determine the most popular youth. This is not necessarily the best selection method (e.g., in cases where the chosen peer leaders might be unwilling to spread information).
The authors formulate the planning problem as a partially observable MDP (POMDP) with the objective of maximizing influence over the social network. Since existing POMDP solvers do not scale to the size of the problem at hand (a large, only partially observable social graph), Yadav et al. proposed a hierarchical approach called HEALER, which decomposes the POMDP into smaller ones.
HEALER then solves these smaller POMDPs using an approach called Tree Aggregation for Sequential Planning (based on a variant of the Upper Confidence Bounds algorithm applied to Trees, i.e. UCT) and subsequently aggregates the results. At each time step, the planning algorithm selects a small set of homeless youth as the “peer leader” who will participate in the program at the service provider.
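As a toy illustration of the influence-maximization objective (a simplified independent-cascade simulation with a greedy heuristic; nothing like HEALER’s actual POMDP/UCT machinery), one can compare greedy seed selection against the degree-centrality baseline mentioned above:

# Toy influence maximization on a random "friendship" graph
library(igraph)
set.seed(1)
g <- sample_gnp(100, 0.05)
# Simplified cascade: each round, every inactive neighbour of the frontier
# is activated with probability p (one activation trial per round)
spread <- function(g, seeds, p = 0.1, reps = 100) {
  mean(replicate(reps, {
    active <- seeds; frontier <- seeds
    while (length(frontier) > 0) {
      nbrs  <- setdiff(unique(unlist(ego(g, 1, frontier, mindist = 1))), active)
      newly <- nbrs[runif(length(nbrs)) < p]
      active <- c(active, newly); frontier <- newly
    }
    length(active)
  }))
}
# Greedy selection: repeatedly add the node with the largest marginal spread
greedy_seeds <- function(g, k) {
  seeds <- integer(0)
  for (i in seq_len(k)) {
    cand  <- setdiff(as.integer(V(g)), seeds)
    seeds <- c(seeds, cand[which.max(sapply(cand, function(v) spread(g, c(seeds, v))))])
  }
  seeds
}
spread(g, greedy_seeds(g, 3))                        # greedy peer leaders
spread(g, order(degree(g), decreasing = TRUE)[1:3])  # degree-centrality baseline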
The dataset in this study is the social network connectivity, i.e. which homeless youths are friends with each other. This information was gathered by parsing the youth’s Facebook contact list to determine the friendship status. The data was further augmented based on reports gathered from the service providers who perform interviews with the homeless youths.
Central to the success of the pilot study was the collaboration between the authors and homeless youth service providers. This helped to facilitate the recruitment of the youth and the implementation of the program. In addition to that, the engagement of social work researchers has also provided the necessary context and skills to communicate with the youth.
In spring 2016, the authors performed a pilot field test of HEALER, comparing it to the baseline of degree centrality. The experiment showed that HEALER is significantly more effective at spreading information - it reaches around 75% of non-peer leaders, compared to only 25% for the degree centrality approach. As a result, HEALER is more effective at causing youth to start testing for HIV: around 30-40% of the community began testing (compared to 0% for degree centrality).
Wildlife poaching is a major threat to the ecological diversity of our planet. Because of the high profits to be made from poaching, poachers have become increasingly sophisticated. Rangers protect wildlife from poachers, but to be effective they need well-designed patrol routes. In a series of papers, researchers have developed and deployed a suite of game-theoretic tools called “PAWS” that generate patrol routes to combat poaching.
Patrollers in wildlife conservation areas have lots of experience conducting patrols and design their patrol routes based on their knowledge of the area. However, since poachers are highly strategic in evading the patrollers, such routes are very susceptible to gaming. While experience can help the patrollers, it may also keep them from visiting underexplored areas where poaching might be frequent. A game-theoretic planner for patrol routes can help address these issues.
The authors propose PAWS (Protection Assistant for Wildlife Security), which is based on game-theoretic techniques usually applied in Stackelberg security games. PAWS models a two-player zero-sum game between an attacker (the poacher) and a defender (the patroller).
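The game-theoretic core can be illustrated with a toy zero-sum patrolling game solved by linear programming in R (hypothetical payoffs; this is a sketch of the standard maximin LP, not the PAWS code):

# Toy zero-sum patrolling game: rows = patrol routes, cols = poacher targets
library(lpSolve)
A <- matrix(c( 3, -1,
              -2,  4), nrow = 2, byrow = TRUE)  # payoff to the defender
A <- A - min(A)              # shift payoffs so the game value is nonnegative
m <- nrow(A); n <- ncol(A)   # (lp() assumes nonnegative variables)
# Variables: x_1..x_m (mixed patrol strategy) and v (game value);
# maximize v s.t. sum_i x_i * A[i,j] >= v for every target j, sum_i x_i = 1
sol <- lp("max",
          objective.in = c(rep(0, m), 1),
          const.mat    = rbind(cbind(t(A), -1), c(rep(1, m), 0)),
          const.dir    = c(rep(">=", n), "="),
          const.rhs    = c(rep(0, n), 1))
sol$solution                 # optimal patrol mix, followed by the (shifted) value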
The authors use ML to learn poachers’ behavior patterns from historical data. The ML methods have gone through several stages of development and involve (among others) a variant of decision tree ensembles, a hybrid model of decision trees and Markov random fields, and Gaussian processes. Solving for the equilibrium of this game through mathematical programming yields an optimal patrol strategy for the patroller. Further enhancements to the system provide coordinated patrol plans for patrollers and use online learning to design patrols that trade off exploitation and exploration.
PAWS uses the animal activity data to estimate the animal density which plays a role in determining the payoff of each patrol route. PAWS also uses the poacher activity data to aid the poacher modeling. All these data and previous patrol tracks are obtained from the collaborating conservation agency. To consider the terrain and elevation information, PAWS also uses topographical data.
The development of the PAWS system has been going on for several years and has involved multiple AI researchers. One of the most important factors for the successful deployment of PAWS in the wild has been the collaboration with conservation agencies. This has helped to identify suitable research problems and put the results in the field.
PAWS has been field-tested in multiple conservation sites in Uganda, Cambodia, Malaysia, and China. The authors claim that PAWS proved effective in all these deployments, often leading the patrollers to routes never used before, where they nevertheless discovered poacher activities. In 2019, PAWS entered a partnership with SMART (a popular wildlife conservation software platform). This partnership will hopefully allow PAWS to be scaled to over 800 wildlife conservation sites worldwide in the near future.
In the global south, Community Health Workers (CHWs) play an important role in public healthcare systems. CHWs complement primary health facilities by providing health education, screening, and basic emergency care in local communities. CHWs are often responsible for hundreds of patients but only have access to limited resources, which restricts the number of patients that can be monitored and intervened upon each day. To maximize welfare, the CHWs’ resources need to be allocated effectively; this is commonly referred to as the health monitoring and intervention problem (HMIP).
Existing solutions to the HMIP do not factor risk-sensitivity into their planning models. This carries the risk that some patients are ignored because they are deemed less important to intervene upon. In traditional HMIP models, patients are intervened upon in a round-robin order, which does not consider the level of care a patient requires and might lead to more interventions than are really necessary.
The authors introduce Collapsing Bandit models, a subclass of restless multi-armed bandit models (RMABs) which they claim generalizes better than previous RMAB approaches. Each bandit arm represents a patient, and each time an arm is played (i.e., the patient is intervened upon), the patient’s true state is observed and the belief “collapses” to one of two states: a “good” state in which the patient adheres to the medication plan, or a “bad” state in which they do not.
The CHWs’ goal is to find a policy that maximizes the total reward across all arms. To find an optimal policy, the collapsing bandit model builds on the Whittle index technique and leverages Lagrangian relaxation to solve the optimization problem. As a result, it achieves a 3x speedup over other RMAB techniques without impairing the model’s performance.
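To illustrate the “collapsing” belief dynamics at the heart of the model, here is a minimal sketch for a single two-state arm; the transition probabilities are made-up values, not parameters from the paper:

```python
import numpy as np

# Two-state arm: 0 = not adhering, 1 = adhering. Transition probabilities are
# illustrative assumptions, not parameters from the paper.
P = np.array([[0.9, 0.1],    # from "not adhering": stays bad w.p. 0.9
              [0.2, 0.8]])   # from "adhering": stays good w.p. 0.8

def step_belief(b, observed_state=None):
    """Advance the belief b = P(patient is adhering) by one time step.

    If the arm is played, the patient's state is observed and the belief
    collapses to 0 or 1 before evolving; otherwise it drifts under P.
    """
    if observed_state is not None:
        b = float(observed_state)
    return (1 - b) * P[0, 1] + b * P[1, 1]

b = 0.8
for _ in range(3):
    b = step_belief(b)                      # passive days: uncertainty grows
print("belief after 3 passive days:", round(b, 3))
print("belief after observing adherence:", round(step_belief(b, observed_state=1), 3))
```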
The experiments make use of a real-world healthcare dataset compiled by Killian et al. The dataset contains data on 17,000 tuberculosis patients across 292 health centers in Mumbai, India, who received a total of 2.1 million doses while following a 6-month medication plan.
Local CHWs play an important role in bridging health resources and local communities. With directly observed treatment, CHWs can directly confirm that a patient is taking the prescribed medications. Despite this, the social stigma surrounding the illness and the financial burden of travelling to health facilities can still increase the likelihood that follow-up visits are missed.
For this reason, digital adherence technology (DAT) plays an important role in supporting the eradication of diseases such as tuberculosis by monitoring medication adherence electronically. Patients can, for example, send a text message or a photo of their pillbox to confirm drug intake, which allows CHWs to focus their limited time on high-risk patients.
The authors have evaluated their algorithm on a variety of real and synthetic datasets. In particular, Mate et al. used the tuberculosis medication adherence data from Killian et al. To the best of our knowledge, no RMAB-based planning framework for health interventions has been put into production yet.
Looking to the future, we believe AI is of major importance for improving society and fighting social injustice. To that end, in pushing forward the agenda of “AI for Social Impact”, we need to engage in interdisciplinary collaborations and bring the benefits of AI to populations that have not benefited from it so far.
We hope that you found the case studies that we presented useful. In publishing this blog post, we wish to demonstrate the social impact that AI can have in the real world. From our perspective, we are only at the beginning of the journey.
If you want to interact with the bot, you must include #FamousQuotesFromRoding in your tweet and it’ll reply to you (Roding is a small village in Bavaria and also my hometown).
The bot works by using Markov chains, which can generate text that looks superficially good but is actually quite nonsensical. I trained it on Adam Smith’s book “An Inquiry into the Nature and Causes of the Wealth of Nations”, so don’t be surprised if it engages you in a discussion on free markets and economic affairs.
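To give an idea of the underlying mechanics, here is a generic sketch of a word-level Markov chain generator; the placeholder corpus and the chain order are assumptions for illustration, not the bot’s actual code:

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each tuple of `order` consecutive words to the words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=30):
    """Random-walk the chain to produce text that looks plausible but means little."""
    state = random.choice(list(chain))
    out = list(state)
    for _ in range(length):
        followers = chain.get(state)
        if not followers:
            break
        out.append(random.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)

# In the real bot the corpus is the full text of "The Wealth of Nations";
# a short placeholder string keeps this sketch self-contained.
corpus = "the division of labour is the great cause of the wealth of nations"
print(generate(build_chain(corpus)))
```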
At some point in the future, I might consider improving the bot to use GPT-2 (or another language model), or writing up a more detailed blog post on the tech stack I used to deploy the bot (Google Cloud App Engine).
Towards the end of last year, I had the pleasure of competing in the Pacman CTF competition that was run as part of the COMP90054 course at the University of Melbourne (Semester 2, 2020).
The CTF competition involves a multi-player capture-the-flag variant of Pacman in which the students make use of classical planning as well as reinforcement learning techniques to design agents that play Pacman against each other in a tournament.
The objective of the Pacman agents is to eat as much food as possible on the far side of the map while defending the food on their home side (the contest was originally designed by Berkeley and is described in further detail here).
In accordance with the COMP90054 Code of Honour, my team and I are not allowed to release the code that we used for our Pacman agents, but nonetheless I would like to use this blog post to discuss which approaches we considered and which we found to perform best in the competition.
If you are interested in further details, please refer to the Wiki that is part of the following repository: COMP90054-Pacman-Competition
At the beginning of the competition, we experimented with a variety of techniques such as classical planning with PDDL or value iteration on a model-based MDP. In the interest of time (the competition ran for approx. 6 weeks), we decided to settle on two main approaches, with which we competed in the tournament and achieved satisfying results (a top-10% position on the leaderboard).
These two approaches were:

1. Approximate Q-Learning
2. Behaviour Trees with A* Heuristic Search
In the remainder of this blog post, I would like to talk about the various advantages and disadvantages of both techniques.
The motivation for this approach was to produce approximate Q-learning agents (both offensive and defensive) which learn feature weights of states (described below) that enable the agents to act within the Pacman contest environment.
Approximate Q-Learning is a means of approximating the Q-functions used in traditional/simple Q-learning. This method utilises reward shaping (providing an agent with useful, intermediate rewards) in addition to function approximation in order to reduce a once-exponentially large state space to a more feasible domain. This is done by the following steps (a minimal code sketch follows the list):
Extracting features deemed necessary for the problem task;
Performing updates on the weights of said features;
Estimating Q-values by summing features and their weights.
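Putting these three steps together, here is a minimal sketch of the linear function approximation behind approximate Q-learning; the feature names, learning rate, and reward values are illustrative assumptions rather than our actual contest features:

```python
def q_value(weights, features):
    """Q(s, a) is approximated as a weighted sum of state-action features."""
    return sum(weights.get(f, 0.0) * v for f, v in features.items())

def update(weights, features, reward, max_next_q, alpha=0.05, gamma=0.9):
    """One approximate Q-learning step: nudge every weight along the TD error."""
    td_error = reward + gamma * max_next_q - q_value(weights, features)
    for f, v in features.items():
        weights[f] = weights.get(f, 0.0) + alpha * td_error * v
    return weights

# Hypothetical Pacman-style features for one (state, action) pair.
weights = {}
features = {"dist-to-food": -0.4, "num-ghosts-nearby": 1.0, "bias": 1.0}
weights = update(weights, features, reward=-1.0, max_next_q=2.0)
print(weights)
```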
The following is a list of improvements that eventually became the behaviour protocols to which we attribute the agent’s success:
The motivation for this approach was to produce an agent which uses behaviour trees as well as A* heuristic search to accomplish different goals within the Pacman contest environment.
Behaviour trees are trees of hierarchical nodes which control the flow of an agent’s decision making. These trees are directed acyclic graphs with internal nodes corresponding to events/stimuli and external nodes corresponding to behaviours (in contrast with hierarchical state machines, where stimuli lead to states rather than behaviours). The image below illustrates a simplistic behaviour tree for a two-armed robot (by Aliekor at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=39804218):
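In code, the same control flow can be sketched with selector and sequence nodes; this is a generic illustration of the pattern, not our contest implementation, and the node names are hypothetical:

```python
# Minimal behaviour tree sketch: a selector ticks children until one succeeds,
# a sequence ticks children until one fails.
class Selector:
    def __init__(self, *children): self.children = children
    def tick(self, state):
        return any(child.tick(state) for child in self.children)

class Sequence:
    def __init__(self, *children): self.children = children
    def tick(self, state):
        return all(child.tick(state) for child in self.children)

class Condition:
    def __init__(self, predicate): self.predicate = predicate
    def tick(self, state): return self.predicate(state)

class Action:
    def __init__(self, name): self.name = name
    def tick(self, state):
        state["last_action"] = self.name  # pretend the behaviour executed
        return True

# Hypothetical offensive-agent tree: flee if a ghost is close, otherwise eat food.
tree = Selector(
    Sequence(Condition(lambda s: s["ghost_distance"] <= 2), Action("flee-home")),
    Action("seek-food"),
)
state = {"ghost_distance": 1}
tree.tick(state)
print(state["last_action"])  # flee-home
```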
A* (“A star”) is a heuristic search algorithm which generates a lowest-cost path from the start node to the goal node. A* utilises a function f(n) which estimates the total cost of a solution path through node n. The function is calculated as f(n) = g(n) + h(n), where g(n) is the cost to reach node n and h(n) is the estimated cost from n to the goal node (requiring a heuristic cost function). A* is complete for safe heuristics and optimal for admissible heuristics.
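A minimal sketch of A* in Python, assuming the caller supplies a `neighbours` function and a heuristic `h`; the toy 1-D example is illustrative, not contest code:

```python
import heapq

def a_star(start, goal, neighbours, h):
    """A* search: `neighbours(n)` yields (neighbour, step_cost) pairs and
    `h(n)` estimates the cost from n to the goal. Returns the cheapest path."""
    frontier = [(h(start), 0, start, [start])]   # entries are (f = g + h, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for nxt, cost in neighbours(node):
            g_next = g + cost
            if g_next < best_g.get(nxt, float("inf")):
                best_g[nxt] = g_next
                heapq.heappush(frontier, (g_next + h(nxt), g_next, nxt, path + [nxt]))
    return None

# Toy 1-D example: walk from 0 to 5 with unit step costs and |goal - n| as heuristic.
print(a_star(0, 5, lambda n: [(n - 1, 1), (n + 1, 1)], lambda n: abs(5 - n)))
```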
An agent utilising a behaviour tree would be able to make smart, informed decisions which have been pre-programmed by the agent designer. Possibilities for evolving the behaviour would stem from manual supervision of matches in an effort to assess which stimuli should trigger certain behaviours.
Features:
Other parameters:
Features:
Additionally, the training hyper-parameters (how the agent’s Q-function is structured, i.e. how the agent learns and what its priorities are) were tuned in an effort to optimise training for both agents; the hyper-parameters are described below:
It’s an essential read for every knowledge worker, as it covers the constant distractions and temptations we face (an endless flow of e-mails, social media activity, etc.) and provides helpful guidance on how we can structure our lives to become more focused and more productive (in terms of getting the kind of work done that really matters).
Even though I found some of his advice a little too intrusive for my taste (e.g. disconnecting from all social media for a full 30 days straight), I can nonetheless absolutely imagine the benefits that following it might bring.
The first blog post is a rather technical introduction to the diagnostic capabilities of modern cars, while the second elaborates on what it takes to bring such a product from the concept stage to implementation.
Upon personal reflection, I think that connected car services are a highly interesting topic for future research, and I’m already more than excited for the next opportunity to explore it further!
Post 1: OBD - a heavily underrated bridging technology for the implementation of connected car scenarios
Post 2: Leveraging OBD access to build up functional and technical capabilities for connected car services