### Rules of the Game

Meta-tic-tac-toe is played on a board consisting of 9 different tic-tac-toe boards, each of which can be won independently for points. Boards are won in an identical fashion to tic-tac-toe, and like tic-tac-toe, turns alternate between player one (1's) and player two (2's). Once a board has been won or is full of pieces, neither player can play into it. The nine boards are arranged in a three by three grid pattern with the coordinate of the previous move determining which board the next player can play into. (Example: a place into the center position of any of the nine tic-tac-toe boards means that the next player must play into the central tic-tac-toe board.) The first player to move must play into the center board. If a player is supposed to play into a board which has already been won or is completely full of pieces, he/she instead chooses which of the remaining boards to play into and plays in an empty position in it. Play is stopped when all individual boards have been won or are completely full of pieces. At this point, the agent with the most points wins.

### How the AI Works

The AI uses the minimax search algorithm to find a next move. Basically the AI looks at all the possible moves that the player MAX can make (player MAX because the current player is trying to maximize the utility of its move), and selects the move with the highest predicted utility. How does the player MAX calculate the predicted utility of a move or state? Player MAX makes a recursive call to the minimax algorithm, except, this time the next player will be player MIN who is trying to minimize the utility of the positions. Player MIN, like player MAX looks at all the possible moves, but choses the move with the lowest predicted utility. How does player MIN calculate the predicted utility of a move? Well you can already guess: making a recursive call to the minimax algorithm. However, at some point these recursive calls must stop, so an additional parameter is passed into the minimax algorithm that keeps track of depth. When the maximum depth has been reached, the minimax algorithm, rather than making a recursive call to itself, makes a call to a utility function that assigns a rough estimate of the value of the state.

### The Utility Function

The utility function looks at a given state and using some heuristics assigns an approximate value to the state. Oftentimes a utility function is composed of a number of basis functions. The return value of each basis function is then multiplied by some constant and then the whole thing is added together:
```def utility(state):
v1 = basis_1(state) * c1
v2 = basis_2(state) * c2
.
.
.

vn = basis_n(state) * cn
return v1 + v2 + ...  + vn
```
For a more familiar game like chess, one can imagine these basis functions being based on a number of things including the relative number of pieces that each player has. For instance, if relative number of pawns basis function is multiplied by a constant equal to 1, then the relative number of rooks basis function should be multiplied by a constant equal to 5 according to chess strategy. In the case of meta tic-tac-toe, my utility function is based on six different basis functions:
• f1_score (multiplied by Constant 1) - the relative "score" of each player. This is calculated by subtracting the score of player MIN from the score of player MAX.
• f2_center (multiplied by Constant 2) - the relative number of center pieces. This is calculated by subtracting the number of center pieces that MIN has from the number of center pieces that MAX has.
• f3_corner (multiplied by Constant 3) - the relative number of corner pieces (e.g. top right).
• f4_side (multiplied by Constant 4) - the relative number of side pieces (e.g. top middle).
• f5_blocking (multiplied by Constant 5) - the relative number of blocking positions. For example any time there are three pieces in a row/column/diagonal that are not all of the same type, that is considered a blocking position.
• f6_potential (multiplied by Constant 6) - the relative number of potential positions. Any position that could lead to a three in a row, at the next opportunity is considered a potential position.
• These six basis functions are components of the utility function, but they should not be weighted equally. f1_score, is clearly much more important than f4_side. In fact, in my implementation, f1_score is 3.07:0.61 times more important than f4_side. One can experiment with how changing these constants will change the behavior of the AI. However, without a lengthy analysis, the precise weighting is difficult to determine. To determine these constants analytically, I used temporal difference learning.

### Learning Utility Function Constants Using Temporal Difference Learning

Temporal difference learning works by comparing two evaluations of the utility of a state: one evaluation that is not very ideal and the other evaluation that is somehow closer to the true value. Then the constants are updated so that if the less ideal evaluation were computed again, it would return a value closer to what the more idea evaluation had returned. In my case, one evaluation is simply the utility function applied to the current state, and the other evaluation that is somehow more "ideal" is what the minimax search predicts will occur -- in other words, it is the utility function evaluation of the state that the minimax search predicts will be played. (In a sense, temporal difference learning is vaguely analogous to supervised learning except rather than having labeled/training instances, we have these more "ideal" predictions based on using the minimax search.) My implementation of temporal difference learning is expressed in only a few lines of python code:
```def td_learning(terminal_state, TD_CONSTS, prev_state):
'''This function modifies TD_CONSTS according to the temporal difference algorithm.
TD_CONSTS: the constants multiplied by the basis functions to compute the utility function.
prev_state: the current state that the minimax search has just found a next move for.
terminal_state: the state that the minimax search predicts will occur
if both players play ideally.
This state is not the state that the AI choses as its next move, but rather the state that the minimax search predicts will
occur in the future if both players play ideally.

Note: the number of moves between prev_state and terminal_state is equal to the number of ply searched,
e.g. the depth_limit parameter passed to the minimax search function, ab.
'''
change_total_utility = utility(terminal_state, TD_CONSTS) - utility(prev_state, TD_CONSTS)
sub1 = sub_utility(terminal_state, TD_CONSTS)
sub2 = sub_utility(prev_state, TD_CONSTS)
change_sub_utility = [ (sub1[i] - sub2[i]) for i in range(len(sub1)) ]
for i in range(len(TD_CONSTS)):
TD_CONSTS['c' + str(i+1)] += ALPHA * change_total_utility * (change_sub_utility[i]) * (-1)

# normalize
TD_CONSTS = normalize(TD_CONSTS)
return TD_CONSTS
```
The training regime proceeds by playing a game where the AI makes moves for both sides. After only a few games of training, one can see dramatic improvements in the relative value of the constants. After about 15-20 training games, the learning AI is typically capable of beating the static AI. Figure 1 shows how the static AI's constants improve over a series of twenty training games. In this series of games, both AIs search to a depth of 3 ply, and the static AI's constants are the same as the default values in this application (i.e. they are {'c1':3, 'c2':2, 'c3':0.5, 'c4':0.5, 'c5':0.5, 'c6':0.5}). Notice that the static AI starts by consistently beating the learning AI, but the last 5 games are all won by the learning AI. Other patterns are also noticeable from the graph. The ending moves of a game tend to produce the most change in the constants. This makes sense because most of the points in a game are typically won in the last quarter of a game. When points are being won rapidly, then the difference between the predicted utility and the actual utility will be at its highest. ### Turning Python into JavaScript with pyjs/pyjamas

The AI for this game is programmed in python. During initial development, the only interface to play the AI was through the python terminal. When I decided to expand this project into a web app, I had to choose whether the AI would run on the client or the server. I quickly decided that this computationally intensive task should be put client side which necessitated somehow transforming my python script into javascript. I realized that I could translate my python code into javascript manually, but a superior solution would be finding an adequate python to javascript compiler/translator. Pyjs/pyjamas seemed adequate for this job, and it also provides a convenient library for creating a user interface. Indeed, pyjs/pyjamas has been able to do everything I needed it to do, but if I had to start over again, I would probably not use pyjs/pyjamas. Due to cryptic or non existent error messages, debugging pyjs/pyjamas is an arduous process. Indeed, the most difficult step was the initial translation of my python script into javascript which required a substantial change in my existing implementation to get around a bug. In addition to cryptic error messages, the output of the pyjs/pyjamas compiler is extremely slow and innefficient. While the python CLI AI can perform a minimax search to a depth of 5 ply, the pyjs/pyjamas AI in the same time can only search to a depth of 3. As a result, the AI for this web app has an inferior skill relative to the command line interface AI.

Written by Chet Weger. Questions, comments, bugs? Contact me at chetweger [at] gmail.com.

See my similar projects including Interactive Meta-Tic-Tac-Toe and Interactive Tic-Tac-Toe. Code for this project is available at https://github.com/chetweger/min-max-games.

Return to my home page.