Rules of the Game
Meta-tic-tac-toe is played on a board consisting of 9 different tic-tac-toe boards, each of which can be won independently for points. Boards are won in an identical fashion to tic-tac-toe, and like tic-tac-toe, turns alternate between player one (1's)
and player two (2's)
. Once a board has been won or is full of pieces, neither player can play into it. The nine boards are arranged in a three by three grid pattern with the coordinate of the previous move
determining which board the next player can play into. (Example: a place into the center
position of any of the nine tic-tac-toe boards means that the next player must play into the central
tic-tac-toe board.) The first player to move must play into the center board. If a player is supposed to play into a board which has already been won or is completely full of pieces, he/she instead chooses which of the remaining boards to play into and plays in an empty position in
it. Play is stopped when all individual boards have been won or are completely full of pieces. At this point, the agent with the most points wins.
How the AI Works
The AI uses the minimax
search algorithm to find a next move along with alpha beta pruning which I describe in more detail here
. In minimax search, the AI looks at all the possible moves that the player MAX can make (player MAX because the current player is trying to maximize the utility of its move), and selects the move with the highest predicted utility. How does the player MAX calculate the predicted utility of a move or state? Player MAX makes a recursive call to the minimax algorithm, except,
this time the next player will be player MIN who is trying to minimize the utility of the positions. Player MIN, like player MAX looks at all the possible moves, but choses the move with the lowest predicted utility. How does player MIN calculate the predicted utility of a move? Well you can already guess: making a recursive call to the minimax algorithm. However, at some point these recursive calls must stop, so an additional parameter is passed into the minimax algorithm that
keeps track of depth. When the maximum depth has been reached, the minimax algorithm, rather than making a recursive call to itself, makes a call to a utility function that assigns a rough estimate of the value of the state.
The Utility Function
The utility function looks at a given state and using some heuristics assigns an approximate value to the state. Oftentimes a utility function is composed of a number of basis functions. The return value of each basis function is then multiplied by some constant and then the whole thing is added together:
v1 = basis_1(state) * c1
v2 = basis_2(state) * c2
vn = basis_n(state) * cn
return v1 + v2 + ... + vn
For a more familiar game like chess, one can imagine these basis functions being based on a number of things including the relative number of pieces that each player has. For instance, if relative number of pawns basis function is multiplied by a constant equal to 1, then the relative number of rooks basis function should be multiplied by a constant equal to 5 according to chess strategy
the case of meta tic-tac-toe, my utility function is based on six different basis functions:
f1_score (Constant 1) - the relative "score" of each player. This is calculated by subtracting the score of player MIN from the score of player MAX.
f2_center (Constant 2) - the relative number of center pieces. This is calculated by subtracting the number of center pieces that MIN has from the number of center pieces that MAX has.
f3_corner (Constant 3) - the relative number of corner pieces (e.g. top right).
f4_side (Constant 4) - the relative number of side pieces (e.g. top middle).
f5_blocking (Constant 5) - the relative number of blocking positions. For example any time there are three pieces in a row/column/diagonal that are not all of the same type, that is considered a blocking position.
f6_potential (Constant 6) - the relative number of potential positions. Any position that could lead to a three in a row, at the next opportunity is considered a potential position.
These six basis functions are components of the utility function, but they should not be weighted equally.
, is clearly much more important than f4_side
. In fact, in my implementation, f1_score
is 3.07:0.61 times more important than f4_side
. One can experiment with how changing these constants will change the behavior of the AI. However, without a lengthy analysis, the precise weighting is difficult to determine. To determine these constants analytically, I used temporal difference learning.
Learning Utility Function Constants Using Temporal Difference Learning
Temporal difference learning
works by comparing two evaluations of the utility of a state: one evaluation that is not very ideal and the other evaluation that is somehow closer to the true value. Then the constants are updated so that if the less ideal evaluation were computed again, it would return a value closer to what the more idea evaluation had returned. In my case, one evaluation is simply the utility function applied to the current state, and the other evaluation that is somehow
more "ideal" is what the minimax search predicts
will occur -- in other words, it is the utility function evaluation of the state that the minimax search predicts will be played. (In a sense, temporal difference learning is vaguely analogous to supervised learning except rather than having labeled/training instances, we have these more "ideal" predictions based on using the minimax search.) My implementation of temporal difference learning is expressed in only a few lines of python code:
def td_learning(terminal_state, TD_CONSTS, prev_state):
'''This function modifies TD_CONSTS according to the temporal difference algorithm.
TD_CONSTS: the constants multiplied by the basis functions to compute the utility function.
prev_state: the current state that the minimax search has just found a next move for.
terminal_state: the state that the minimax search predicts will occur
if both players play ideally.
This state is not the state that the AI choses as its next move, but rather the state that the minimax search predicts will
occur in the future if both players play ideally.
Note: the number of moves between prev_state and terminal_state is equal to the number of ply searched,
e.g. the depth_limit parameter passed to the minimax search function, ab.
change_total_utility = utility(terminal_state, TD_CONSTS) - utility(prev_state, TD_CONSTS)
sub1 = sub_utility(terminal_state, TD_CONSTS)
sub2 = sub_utility(prev_state, TD_CONSTS)
change_sub_utility = [ (sub1[i] - sub2[i]) for i in range(len(sub1)) ]
for i in range(len(TD_CONSTS)):
TD_CONSTS['c' + str(i+1)] += ALPHA * change_total_utility * (change_sub_utility[i]) * (-1)
TD_CONSTS = normalize(TD_CONSTS)
The training regime
for temporal difference learning proceeds by playing a game where the AI makes moves for both sides. After only a few games of training, one can see dramatic improvements in the relative value of the constants.
use pyjs/pyjamas. Due to cryptic or non existent error messages, debugging pyjs/pyjamas is an
. In addition to cryptic error messages, the output of the pyjs/pyjamas compiler is extremely slow and innefficient. While the python CLI AI can perform a minimax search to a depth of 5 ply, the pyjs/pyjamas AI in the same time can only search to a depth
of 3. As a result, the AI for this web app has an inferior skill relative to the command line interface AI.
Written by Chet Weger. Questions, comments, bugs? Contact me at chetweger [at] gmail.com.
See my similar projects including Self-Learning Meta-Tic-Tac-Toe and Interactive Tic-Tac-Toe. Code for this project is available at https://github.com/chetweger/min-max-games.
Return to my home page.