Reflections on Pacman AI Competition

12 minute read

Published: February 18, 2021

Pacman Capture-The-Flag

Towards the end of last year, I had the pleasure of competing in the Pacman CTF competition, which was run as part of the COMP90054 course at the University of Melbourne (Semester 2, 2020).

The CTF competition involves a multi-player capture-the-flag variant of Pacman in which the students make use of classical planning as well as reinforcement learning techniques to design agents that play Pacman against each other in a tournament.

The objective of the Pacman agents is to eat as much food as possible on the far side of the map, while defending the food on their home side (the contest was originally designed by Berkeley and is described in further detail here).

In accordance with the COMP90054 Code of Honour, my team and I are not allowed to release the code that we used for our Pacman agents, but nonetheless I would like to use this blog post to discuss which approaches we considered and which we found to perform best in the competition.

If you are interested in further details, please refer to the Wiki that is part of the following repository: COMP90054-Pacman-Competition

At the beginning of the competition, we experimented with a variety of techniques such as classical planning with PDDL or value iteration using a model-based MDP. In the interest of time (the competition took approx. 6 weeks), we decided to settle on two main approaches with which we competed in the tournament and achieved satisfying results (top 10% position in the leaderboard).

These two approaches were:

1. Approximate Q-Learning
2. Behaviour Trees with A-Star Heuristic Search

In the remainder of this blog post, I would like to talk about the various advantages and disadvantages of both techniques.

1. Approximate Q-Learning

Motivation

The motivation for this approach was to produce approximate Q-learning agents (both offensive and defensive) that learn feature weights (described below), which in turn enable the agents to act within the Pacman contest environment.

Theory

Approximate Q-Learning is a means of approximating the Q-function used in traditional/simple (tabular) Q-learning. Our method uses reward shaping (providing an agent with useful, intermediate rewards) in addition to function approximation, which replaces an exponentially large table of state-action values with a small, feasible set of feature weights. This is done by:

Extracting features deemed necessary for the problem task;
Performing updates on the weights of said features;
Estimating Q-values as a weighted sum of the features, i.e. each feature value multiplied by its weight and the products summed.
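The three steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code we used in the competition: the feature names, the learning rate `alpha` and the discount `gamma` are assumptions chosen for the example.

```python
from typing import Dict

Features = Dict[str, float]


def q_value(weights: Dict[str, float], features: Features) -> float:
    """Estimate Q(s, a) as the sum of weight * feature value."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def update_weights(
    weights: Dict[str, float],
    features: Features,
    reward: float,
    max_next_q: float,
    alpha: float = 0.1,  # learning rate (assumed value)
    gamma: float = 0.9,  # discount factor (assumed value)
) -> Dict[str, float]:
    """Perform one approximate Q-learning update and return the new weights."""
    # Temporal-difference error: observed target minus current estimate.
    difference = (reward + gamma * max_next_q) - q_value(weights, features)
    new_weights = dict(weights)
    for name, value in features.items():
        # Each weight moves in proportion to how active its feature was.
        new_weights[name] = new_weights.get(name, 0.0) + alpha * difference * value
    return new_weights


# Hypothetical features for one (state, action) pair.
weights = {"bias": 0.0, "closest_food": 0.0}
features = {"bias": 1.0, "closest_food": 0.5}
weights = update_weights(weights, features, reward=1.0, max_next_q=0.0)
```

Starting from zero weights, a reward of 1.0 gives a difference of 1.0, so the weights become roughly 0.1 for `bias` and 0.05 for `closest_food`. Only the weight vector is stored, which is why the memory cost no longer grows with the number of states.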

Trade-offs

Advantages:

Makes Q-learning feasible without the exponentially-increasing domain size problem, i.e. it replaces a very large Q-table with a small set of feature weights.
- This advantage is especially salient given the 1 second time restriction for agent actions: our agents have neither the time nor the computational capability to run simple Q-learning.
Forces consistent behaviour patterns, i.e. agents using Q-learning are more likely to act in a consistent manner in similar situations (chasing an enemy, eating food, running from an enemy, etc.).
Allows the designers to play a hand in deciding which aspects of states within the Pacman environment are important for our agent, e.g. the closest food to the agent, or the number of enemies within a certain radius. These features are all programmable into our feature vectors.
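To make the idea of hand-picked features concrete, here is a minimal sketch of a feature extractor for the two examples mentioned above. The function name, the Manhattan distance measure, the scaling of the food feature and the default radius are all assumptions made for illustration; a real agent would more likely use maze distances supplied by the contest framework.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]


def manhattan(a: Position, b: Position) -> int:
    """Grid distance that ignores walls (a simplifying assumption)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def extract_features(
    agent: Position,
    food: List[Position],
    enemies: List[Position],
    radius: int = 3,  # assumed detection radius
) -> Dict[str, float]:
    """Build a small, hypothetical feature vector for one state."""
    features = {"bias": 1.0}
    if food:
        closest = min(manhattan(agent, f) for f in food)
        # Scale into (0, 1] so that nearer food gives a larger value.
        features["closest_food"] = 1.0 / (1.0 + closest)
    # Count the enemies that are within `radius` steps of the agent.
    features["enemies_nearby"] = float(
        sum(1 for e in enemies if manhattan(agent, e) <= radius)
    )
    return features


features = extract_features(
    agent=(1, 1),
    food=[(1, 4), (5, 5)],
    enemies=[(2, 2), (9, 9)],
)
```

In this example the nearest food is 3 steps away, so `closest_food` is 0.25, and only one of the two enemies falls inside the radius. Keeping feature values in a small range like this tends to make the weight updates better behaved.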

Disadvantages:

Requires complex feature extraction, the success of which depends on trial and error, domain knowledge, research papers, intuition, etc.
- Additionally, this reduces the so-called “generality” of our agent, as human input in the form of domain knowledge is required to implement approximate Q-learning.
The accuracy of the estimated Q-values is reduced, as the true/optimal Q-function may not be a linear function of the extracted features.

Implementation

Behaviour Tree:

[Figure: Behaviour Tree]

Evolution:

This agent, in its initial form, was coded to investigate the performance of A* heuristic search combined with a simple decision tree, as well as to provide a performance baseline for the other agents which were being investigated at the time.
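For readers unfamiliar with A* heuristic search, the following is a minimal sketch on a small grid. It is not our competition code: the grid encoding (`#` for walls), the 4-connected movement, the unit step cost and the Manhattan heuristic are assumptions made for the example.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Position = Tuple[int, int]  # (row, column)


def manhattan(a: Position, b: Position) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def a_star(grid: List[str], start: Position, goal: Position) -> Optional[List[Position]]:
    """Return a shortest path from start to goal, or None if unreachable."""
    # Frontier entries are (estimated total cost, cost so far, position).
    frontier = [(manhattan(start, goal), 0, start)]
    came_from: Dict[Position, Optional[Position]] = {start: None}
    cost_so_far: Dict[Position, int] = {start: 0}
    while frontier:
        _, cost, current = heapq.heappop(frontier)
        if current == goal:
            # Walk the parent links back to the start.
            path = []
            node: Optional[Position] = current
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        if cost > cost_so_far[current]:
            continue  # stale entry; a cheaper route was found already
        row, col = current
        for nxt in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
            r, c = nxt
            in_bounds = 0 <= r < len(grid) and 0 <= c < len(grid[r])
            if not in_bounds or grid[r][c] == "#":
                continue
            new_cost = cost + 1
            if new_cost < cost_so_far.get(nxt, float("inf")):
                cost_so_far[nxt] = new_cost
                came_from[nxt] = current
                heapq.heappush(frontier, (new_cost + manhattan(nxt, goal), new_cost, nxt))
    return None


grid = [
    "....",
    ".##.",
    "....",
]
path = a_star(grid, start=(0, 0), goal=(2, 3))
```

The Manhattan heuristic never overestimates the remaining distance on a grid with unit step costs, so the first time the goal is popped from the frontier the path found is a shortest one. Here the path contains 6 cells, going around the wall in the middle.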
From then onward, different ideas for improvement were investigated and, one after another, they increased this agent’s performance in the competition; these improvements arose from questions such as:
- Can we prevent our agent from being eaten when they search for food?
- What should our agent do when no optimal action exists?
- How can our agent avoid a deadlock with an enemy agent?
- How should our agent act knowing it will be eaten?
The major challenge in programming this agent was the design of its behaviour protocols (see Gameplay below); this process required extensive research, which began with investigating the simple mechanics of the baseline agent and ended with devising complex strategies to combat smarter opponents (inspiration for which was gleaned from other agents being built at the time, as well as from wider research papers on the topic).

Gameplay:

This agent acts as a dual offensive-defensive agent, but it focuses primarily on offensive strategies. Thus, the agent embodies the saying “the best defence is a good offence”.
The agent’s general strategies are controlled by behaviour tree-like mechanisms (see Behaviour Tree above).
The agent acts in one of five ways depending on the environment and its past actions, each of which has been generalised/simplified below:
1. Eat enemy food: basic offensive strategy to find nearest food (acting in consideration of teammate’s position)
2. Find different attack path: secondary offensive strategy to find point of attack farthest away from current position
3. Return home: agent finds shortest path to home territory
4. Escape: primary evasive strategy to avoid enemy ghost (returns home; eats capsule; etc.)
5. Defend remaining food: sole defensive strategy to chase enemy offensive agents in territory
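The way such a behaviour tree-like mechanism picks one of the five behaviours can be sketched as a priority-ordered list of rules, where the first rule whose condition holds wins. The conditions, their ordering and the carry limit of 5 below are hypothetical and chosen only for illustration; they are not the rules our agent actually used.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Situation:
    """Hypothetical summary of what the agent currently observes."""
    ghost_nearby: bool = False
    invader_at_home: bool = False
    repeating_moves: bool = False
    food_carried: int = 0


Rule = Tuple[Callable[[Situation], bool], str]

# Rules are checked top to bottom; this ordering is an assumption.
RULES: List[Rule] = [
    (lambda s: s.ghost_nearby, "escape"),
    (lambda s: s.invader_at_home, "defend remaining food"),
    (lambda s: s.food_carried >= 5, "return home"),
    (lambda s: s.repeating_moves, "find different attack path"),
]


def choose_behaviour(situation: Situation, default: str = "eat enemy food") -> str:
    """Return the first behaviour whose condition is met, else the default."""
    for condition, behaviour in RULES:
        if condition(situation):
            return behaviour
    return default


behaviour = choose_behaviour(Situation(food_carried=6))
```

With no ghost nearby and no invader at home, an agent carrying 6 food pellets would choose to return home. Falling through to "eat enemy food" as the default reflects the primarily offensive focus described above.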

The following is a list of improvements that eventually became the behaviour protocols to which we attribute the agent’s success:

Different Path Protocol: agent finds different path (than current) to attack enemy agents
Enemy Avoidance Protocol: agent avoids enemies when searching for paths to food

[Figure: Enemy Avoidance Protocol]

DRP (Don’t Repeat the Past) Protocol: agent finds different attack path if performing repeated actions

[Figure: DRP Protocol]

Last Resort Protocol: agent consumes capsule if it cannot return home whilst being chased

[Figure: Last Resort Protocol]
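Two of the protocols above lend themselves to a short sketch. The first helper marks cells close to enemies so that a path search could treat them as blocked (one possible reading of enemy avoidance), and the second detects when the agent keeps revisiting the same few cells (one possible trigger for DRP). The function names, the radius, the window size and the threshold are all assumptions for illustration and do not describe our actual implementation.

```python
from collections import deque
from typing import Deque, Iterable, Set, Tuple

Position = Tuple[int, int]


def danger_zone(enemies: Iterable[Position], radius: int = 2) -> Set[Position]:
    """Cells within Manhattan distance `radius` of any enemy."""
    cells: Set[Position] = set()
    for ex, ey in enemies:
        for dx in range(-radius, radius + 1):
            # The remaining distance budget limits how far dy may reach.
            reach = radius - abs(dx)
            for dy in range(-reach, reach + 1):
                cells.add((ex + dx, ey + dy))
    return cells


def is_repeating(history: Deque[Position], window: int = 8, max_distinct: int = 2) -> bool:
    """True if the last `window` positions cover at most `max_distinct` cells."""
    recent = list(history)[-window:]
    return len(recent) == window and len(set(recent)) <= max_distinct


blocked = danger_zone([(5, 5)], radius=1)

# An agent oscillating between two cells, as in a deadlock with an enemy.
history: Deque[Position] = deque([(1, 1), (1, 2)] * 4, maxlen=8)
stuck = is_repeating(history)
```

In this example `blocked` contains the enemy's cell and its four neighbours, and `stuck` is true because the last eight positions alternate between only two cells. A search that skips cells in `blocked` would route around the enemy, while a true `stuck` flag could prompt the agent to look for a different attack path.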

Further Improvements:

Investigation into game theory with regard to multi-agent systems could have provided further insight into better strategies.
Possible integration with approaches other than A* heuristic search may have provided overall better agents.
Investigation into developing a more balanced team, i.e. designing both an offensive and defensive agent, may have yielded better competition results.