The Computer Use Demo operates in a Docker container that provides access to predefined tools (browser, code interpreter, etc.). While the model doesn’t directly control the computer, it has been trained to interpret screen content and generate appropriate tool instructions. Below is a screenshot of the interface:
This is the prompt I used:
1. Download the parquet file hosted at https://openml1.win.tue.nl/datasets/0004/41214/dataset_41214.pq
2. Install sklearn with bash tool
3. Write a .py file read in the data and split it into train and test
4. Train a model of your choice to predict the mean claim amount `AvgClaimAmount` (you may need to inspect the data and/or iterate if this goes poorly at first, but don't get discouraged!).
5. Perform hyperparameter optimization to find the best model.
6. Interpret the result scores and come up with some way to visualize the results in the browser.
I had some initial issues with timeouts of the execution environment (likely due to the dataset’s size of approximately 680,000 entries), but after I asked the model to train on only 30% of the data, it worked like a charm.
Below is a summary of the results by Claude. I have also pasted the code it wrote into this gist.
Let me summarize the results and provide some insights:
Model Performance:
R-squared Score: 0.0817 (8.17%)
Mean Absolute Error: 0.5302
Root Mean Squared Error: 5.0768
Feature Importance (top 5):
Vehicle Age (21.01%)
Exposure (17.69%)
Density (16.01%)
Region (13.67%)
Driver Age (11.70%)
Insights:
The model's performance is relatively weak (R² = 0.0817), suggesting that predicting claim amounts is a challenging task with the given features
Vehicle Age is the most important predictor, followed by Exposure and Density
Area (location) has surprisingly little impact on predictions
Vehicle-related features (Age, Power, Brand) collectively account for about 30% of the prediction power
Recommendations for improvement:
Feature Engineering:
Create interaction terms between related features (e.g., VehAge * VehPower)
Bin continuous variables into categories
Create new features combining demographic and vehicle information
Model Improvements:
Try other algorithms (Random Forest, XGBoost)
Perform more extensive hyperparameter tuning
Consider treating this as a classification problem (binning claim amounts)
Handle potential class imbalance in the target variable
Would you like me to:
Implement any of these improvements?
Try a different modeling approach?
Generate additional visualizations to better understand the data?
Perform more detailed analysis of specific features?
Please let me know how you'd like to proceed!
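The three scores in the summary above are standard regression metrics. As a minimal sketch of how they are defined (the function name `regression_scores` and the toy values are illustrative assumptions, not taken from the code Claude wrote), they can be computed like this:

```python
import math

def regression_scores(y_true, y_pred):
    """Return (r2, mae, rmse) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse

# Hypothetical toy values, not the insurance data from the post.
y_true = [0.0, 0.0, 10.0, 2.0]
y_pred = [1.0, 0.0, 6.0, 2.0]
r2, mae, rmse = regression_scores(y_true, y_pred)
print(r2, mae, rmse)
```

Because RMSE squares the errors before averaging, an RMSE far above the MAE (as in the summary) usually points to a small number of very large errors.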
The results are remarkable not just for their technical merit, but for what they reveal about technical assessment practices. In approximately five minutes, Claude 3.5 Sonnet successfully:
This raises questions about the effectiveness of traditional live coding assessments in technical interviews. When AI systems can rapidly complete such tasks while providing insightful analysis, we may need to reconsider how we evaluate a candidate’s capabilities. The focus should shift from basic implementation skills to higher-order thinking, problem-solving approaches, and the ability to critically evaluate and improve upon initial solutions.
The future of technical assessment may lie not in testing whether a candidate can implement basic solutions under pressure, but in evaluating their ability to think strategically about problem-solving approaches, understand trade-offs, and improve upon initial implementations – skills that remain distinctly human despite rapid AI advancement.
Databases rely on indexes to quickly locate and retrieve data that is stored on disks. While traditional database indexes use tree data structures such as B+ Trees to find the position of a given query key in the index, a learned index structure considers this problem as a prediction task and uses a machine learning model to “predict” the position of the query key.
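To illustrate the "position as prediction" idea, here is a minimal sketch of a learned index built on a single linear regression model. The class name `LearnedIndex`, its helpers and the toy keyset are assumptions made for illustration; real implementations such as ALEX or the PGM-Index are considerably more elaborate. The sketch assumes at least two distinct keys.

```python
import bisect

def fit_linear(keys):
    """Least-squares fit of position ~ slope * key + intercept on sorted keys."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2  # mean of the positions 0..n-1
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = cov / var
    return slope, mean_p - slope * mean_k

class LearnedIndex:
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.slope, self.intercept = fit_linear(self.keys)
        # The largest prediction error on the stored keys bounds the search window.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return round(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of key, or None if it is not stored."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        # "Last mile" search, restricted to the window around the prediction.
        pos = bisect.bisect_left(self.keys, key, lo, hi)
        if pos < n and self.keys[pos] == key:
            return pos
        return None

keys = [2, 3, 5, 8, 13, 21, 34, 55]  # hypothetical toy keyset
index = LearnedIndex(keys)
print(index.lookup(13))  # a stored key
print(index.lookup(4))   # a key that is not stored
```

The worse the model fits the key distribution, the larger `max_err` becomes and the more work the final search has to do, which is exactly the lever a poisoning attack pulls.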
[Figures: "Traditional and learned indexes" and "ML models to approximate the CDF"]
This novel approach of implementing database indexes has inspired a surge of recent research aimed at studying the effectiveness of learned index structures. However, while the main advantage of learned index structures is their ability to adjust to the data via their underlying ML model, this also carries the risk of exploitation by a malicious adversary.
This post will show some experiments that I have conducted as a follow-up to the research on adversarial machine learning in the context of learned index structures that was part of my master’s thesis at The University of Melbourne.
In my master’s thesis, I executed a large-scale poisoning attack on dynamic learned index structures based on the CDF poisoning attack proposed by Kornaropoulos et al. The poisoning attack targets linear regression models and works by manipulating the cumulative distribution function (CDF) on which the model is trained. The attack degrades the fit of the underlying ML model by injecting a set of poisoning keys into the dataset, which increases the prediction error of the model and thus reduces the overall performance of the learned index structure. The source code for the poisoning attack is available on GitHub.
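The mechanism can be sketched with a brute-force greedy loop: try every unoccupied key, keep the one that hurts the regression on the (key, rank) pairs the most, and repeat. This is a simplified illustration that assumes integer keys; the function names are hypothetical, and it is neither the optimized algorithm from Kornaropoulos et al. nor the code from the thesis repository.

```python
def regression_mse(keys):
    """MSE of the least-squares line fitted to the (key, rank) pairs of sorted keys."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_r = (n + 1) / 2  # mean of the ranks 1..n
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (r - mean_r) for r, k in enumerate(keys, start=1))
    slope = cov / var
    intercept = mean_r - slope * mean_k
    return sum((slope * k + intercept - r) ** 2
               for r, k in enumerate(keys, start=1)) / n

def greedy_poison(keys, num_poison):
    """Insert up to num_poison unoccupied integer keys, one at a time,
    each time choosing the key that maximizes the regression MSE."""
    poisoned = sorted(keys)
    for _ in range(num_poison):
        occupied = set(poisoned)
        candidates = [k for k in range(poisoned[0] + 1, poisoned[-1])
                      if k not in occupied]
        if not candidates:
            break
        best = max(candidates,
                   key=lambda k: regression_mse(sorted(poisoned + [k])))
        poisoned = sorted(poisoned + [best])
    return poisoned

legit = list(range(0, 60, 10))  # hypothetical keyset with a perfectly linear CDF
poisoned = greedy_poison(legit, num_poison=2)
print(regression_mse(legit), regression_mse(poisoned))
```

Inserting a key shifts the rank of every larger key by one, so a few well-placed keys can bend an otherwise linear CDF and inflate the model's error.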
As part of the experiments for my master’s thesis, I evaluated three index implementations by measuring their throughput in million operations per second. The evaluated indexes consist of two learned index structures, ALEX and Dynamic-PGM, as well as a traditional B+ Tree. Because indexes are usually used to speed up data retrieval when dealing with massive amounts of data, I chose to evaluate the performance of the indexes based on the SOSD benchmark datasets that consist of 200 million keys each.
Unfortunately, executing the poisoning attack by Kornaropoulos et al. is computationally very expensive, so I had to run it with a fixed poisoning threshold of $p=0.0001$, thus generating 20,000 poisoning keys for a dataset of 200 million keys. This poisoning threshold can be considered relatively low, as previous work on poisoning attacks has used poisoning thresholds of up to $p=0.20$.
To test the robustness of learned indexes more rigorously, I have set up a flexible microbenchmark that can be used to quickly evaluate the robustness of different index implementations against poisoning attacks. The microbenchmark is based on the source code that was published by Eppert et al., which I have extended to implement the CDF poisoning attack against different types of regression models and the learned index implementations ALEX and PGM-Index.
The corresponding source code can be found here: https://github.com/Bachfischer/LogarithmicErrorRegression.
To test the robustness of the learned indexes, I generated a synthetic dataset of 1000 keys and ran the poisoning attack against each index implementation while varying the poisoning threshold from $p=0.01$ to $p=0.20$.
The graphs below show the performance deterioration calculated as the ratio between the mean lookup time in nanoseconds for the poisoned datasets and the legitimate (non-poisoned) dataset.
[Figure panels: SLR, LogTE, DLogTE, 2P, TheilSen, LAD, ALEX, PGM]
From the graphs, we can observe that simple linear regression (SLR) is particularly prone to the poisoning attack, as this regression model shows a steep increase in the mean lookup time when evaluated on the poisoned data.
The competitors that optimize a different error function, such as LogTE, DLogTE and 2P (introduced in A Tailored Regression for Learned Indexes), are more robust against adversarial attacks. For these regression models, the mean lookup time remains relatively stable even when the poisoning threshold is increased substantially.
Because SLR is the de facto standard in learned index structures and is used internally by the ALEX and PGM-Index implementations, we would expect these two models to also exhibit a relatively high performance deterioration when evaluated on the poisoned dataset. Surprisingly, ALEX does not show any significant performance impact, most likely due to the usage of gapped arrays that allow the model to easily capture outliers in the data (this effect can likely be attributed to the small keyset size). The performance of the PGM-Index deteriorates by a factor of up to 1.3x.
To put things into a broader perspective, I have also calculated the overall mean lookup time for the evaluated learned indexes (averaged across all experiments) in the graph below.
We can see that ALEX outperforms all other learned index structures. The regression models SLR, LogTE, DLogTE, 2P, TheilSen and LAD perform similarly to each other, with mean lookup times between 30 and 40 nanoseconds.
In the experiments, PGM-Index performs worst, with a mean lookup time of more than 50 nanoseconds. This is most likely because PGM-Index is optimized for large-scale data workloads and exhibits subpar performance in this microbenchmark, where the dataset consists of only 1000 keys.
I consider the results from this research to be a highly interesting study of the robustness of learned index structures. The poisoning attack and microbenchmark described in this post are open-source and can be easily adapted for future research purposes. If you have any further thoughts or ideas, please let me know!
The task of the project was to develop a method to estimate the coordinates from which an image was taken. The dataset for this task was published by the COMP90086 teaching team and consisted of a collection of images taken in and around an art museum (the Getty Center in Los Angeles, U.S.A.).
The training dataset contained a total of 7500 images labeled with their corresponding (x,y) coordinates. The test dataset for which the coordinates should be predicted contained 1200 images.
For our participation, we chose to use the pre-trained SuperGlue model to extract features from the images in the training set and match them with the images in the test set.
The image above shows the SuperGlue network architecture and is taken from the original SuperGlue paper by Sarlin et al. SuperGlue combines a graph neural network architecture and an attention mechanism to match local image features by finding correspondences and dismissing unmatchable points. It consists of two main components:
In the first component (Attentional Graph Neural Network), SuperGlue borrows the self-attention mechanism from the Transformer and embeds it into a Graph Neural Network. The attentional GNN leverages spatial relationships of keypoints and descriptors. It works by first employing an encoder to map keypoint positions $p$ and their associated descriptors $d$ into a single vector. In the next step, self-attention and cross-attention layers are used to generate more powerful representations $f$. This component consists of a total of 9 layers of self- and cross-attention with 4 heads each.
The second component (Optimal Matching Layer) creates an $M \times N$ score matrix and finds the optimal partial assignment between two sets of local features by using the Sinkhorn algorithm for $T = 100$ iterations.
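The core of the Sinkhorn algorithm is easy to sketch: exponentiate the score matrix, then alternately normalize its rows and columns until it becomes (approximately) doubly stochastic. The version below is a simplified illustration on a hypothetical square score matrix; it leaves out the extra "dustbin" row and column and the log-domain arithmetic that SuperGlue uses to handle unmatched keypoints and numerical stability.

```python
import math

def sinkhorn(scores, iterations=100):
    """Alternately normalize the rows and columns of exp(scores)."""
    P = [[math.exp(s) for s in row] for row in scores]
    for _ in range(iterations):
        # Row normalization: every row sums to 1
        P = [[v / sum(row) for v in row] for row in P]
        # Column normalization: every column sums to 1
        col_sums = [sum(col) for col in zip(*P)]
        P = [[v / c for v, c in zip(row, col_sums)] for row in P]
    return P

# Hypothetical 2 x 2 score matrix: feature 0 prefers feature 0, feature 1 prefers feature 1.
scores = [[2.0, 0.5],
          [0.0, 1.5]]
P = sinkhorn(scores)
for row in P:
    print([round(v, 4) for v in row])
```

Each entry of the result can be read as a soft assignment between a feature in the first image and a feature in the second image; taking the largest entry per row yields the predicted matches.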
The pre-trained SuperGlue model, consisting of approx. 12M parameters, is implemented in PyTorch and available on GitHub. It can be combined with any local feature detector and descriptor, such as SIFT or SuperPoint, to extract sparse keypoints and perform matching. In our experiments, SuperGlue was able to estimate almost all correct matches while rejecting the majority of outliers.
Shown below is an example from the COMP90086 dataset. An example image from the test set is shown on the left, and the corresponding image from the training set is shown on the right. All detected matches are colored based on their predicted confidence in a jet colormap (red: more confident, blue: less confident).
By using the SuperGlue model, we achieved a mean absolute error (MAE) of 5.15683, which placed us 9th out of 215 participants in the final Kaggle competition. The MAE score was calculated via the formula $MAE = \frac{1}{N} \sum_{i=1}^{N} \left( \left| x_i - \hat{x}_i \right| + \left| y_i - \hat{y}_i \right| \right)$.
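As a small worked example of this metric, the sketch below assumes that the sum covers both absolute differences for each image before dividing by $N$. The function name `localisation_mae` and the coordinates are hypothetical and not taken from the competition data.

```python
def localisation_mae(true_xy, pred_xy):
    """Mean over all images of |x - x_hat| + |y - y_hat|."""
    n = len(true_xy)
    total = sum(abs(x - px) + abs(y - py)
                for (x, y), (px, py) in zip(true_xy, pred_xy))
    return total / n

# Hypothetical coordinates for two test images.
true_xy = [(0.0, 0.0), (10.0, 5.0)]
pred_xy = [(1.0, -2.0), (10.0, 9.0)]
print(localisation_mae(true_xy, pred_xy))  # (1 + 2 + 0 + 4) / 2 = 3.5
```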
A detailed write-up of the implementation details as well as other experiments that we performed (e.g. using SIFT or an Autoencoder architecture to match images based on their similarity) is available here, and if you are interested in further details, please refer to the following repository: COMP90086-Fine-grained-localisation.
Authors:
The task of the project was to
The dataset for the project was published by the COMP90042 teaching team and consisted of a set of source tweets and their replies (incl. corresponding metadata) that had been extracted from the Twitter API. In total, the training data consisted of 4641 events that had been labeled as either RUMOUR or NON-RUMOUR (binary classification).
For this project, I have implemented three classification systems:
Using the best-performing model BERTweet, I managed to achieve an F1 score of 86.17% (which placed me 12th out of 308 participants in the final CodaLab competition).
A detailed write-up of the implementation details (pre-processing routine etc.) for the models mentioned above is available here, and if you are interested in further details, please refer to the following repository: COMP90042-Rumour-Detection-on-Twitter
I have also used BERTweet to participate in the “Disaster Tweets” Kaggle challenge. The notebook is available here: Disaster Tweets - BERTweet
faraway R package. The exercises below are part of the course MAST90139: Statistical Modelling for Data Science at the University of Melbourne.
First we clean up any variables that we may have left in the environment
rm(list = ls())
library(faraway)
library(ggplot2)
data(pima)
head(pima)
help(pima)
dim(pima)
## [1] 768 9
Create a factor version of the test results and use this to produce an interleaved histogram to show how the distribution of insulin differs between those testing positive and negative. Do you notice anything unbelievable about the plot?
pima$test = as.factor(pima$test)
levels(pima$test) <- c("negative","positive"); pima[1,]
par(mfrow=c(1,2)); plot(insulin ~ test, pima)
ggplot(pima, aes(x=insulin, color=test)) + geom_histogram(position="dodge", binwidth=30)
library(ggplot2)
ggplot(pima, aes(x = insulin, color = test)) + geom_histogram(position="dodge",
binwidth=30, aes(y=..density..))
summary(pima$test[pima$insulin==0])
## negative positive
## 236 138
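The unbelievable part of the plot is the spike at zero: 374 observations (236 negative, 138 positive) record an insulin value of 0, which is not plausible and most likely encodes missing data. Apart from that, high values of insulin seem to correlate with signs of diabetes!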
Replace the zero values of insulin with the missing value code NA. Recreate the interleaved histogram plot and comment on the distribution.
pima$insulin[pima$insulin == 0] <- NA
ggplot(pima, aes(x = insulin, color = test)) + geom_histogram(position="dodge",
binwidth=30, aes(y=..density..))
## Warning: Removed 374 rows containing non-finite values (stat_bin).
After replacing the zero values with NA, the relationship becomes even clearer!
Replace the incredible zeroes in other variables with the missing value code. Fit a model with the result of the diabetes test as the response and all the other variables as predictors. How many observations were used in the model fitting? Why is this less than the number of observations in the data frame?
# Note: this sets zeros to NA in every column, including pregnant, where 0 is a plausible value
pima[pima == 0] <- NA
# Fit logistic regression model from binomial family
model1 <- glm(test ~ pregnant + glucose + diastolic + triceps + insulin + bmi + diabetes + age,family = binomial, pima)
summary(model1)
##
## Call:
## glm(formula = test ~ pregnant + glucose + diastolic + triceps +
## insulin + bmi + diabetes + age, family = binomial, data = pima)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8619 -0.6557 -0.3295 0.6158 2.6339
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.083e+01 1.423e+00 -7.610 2.73e-14 ***
## pregnant 7.364e-02 5.973e-02 1.233 0.2176
## glucose 3.616e-02 6.249e-03 5.785 7.23e-09 ***
## diastolic 5.993e-03 1.320e-02 0.454 0.6497
## triceps 1.110e-02 1.869e-02 0.594 0.5527
## insulin 3.231e-05 1.445e-03 0.022 0.9822
## bmi 7.615e-02 3.174e-02 2.399 0.0164 *
## diabetes 1.097e+00 4.777e-01 2.297 0.0216 *
## age 4.075e-02 1.919e-02 2.123 0.0337 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 426.34 on 335 degrees of freedom
## Residual deviance: 288.92 on 327 degrees of freedom
## (432 observations deleted due to missingness)
## AIC: 306.92
##
## Number of Fisher Scoring iterations: 5
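In the pima data frame, there are 768 observations, but only 336 were used to fit the model (327 residual degrees of freedom plus 9 estimated coefficients, or equivalently 768 - 432). By default, glm() drops every row that contains an NA in the response or in any predictor, so the 432 observations with at least one missing value were deleted.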
plot(model1)