For a more familiar game like chess, one can imagine these basis functions being based on a number of features, including the relative number of pieces that each player has. For instance, if the relative-pawn-count basis function is multiplied by a constant of 1, then the relative-rook-count basis function should be multiplied by a constant of 5, following standard chess piece values. In the case of meta tic-tac-toe, my utility function is based on six different basis functions:

```python
def utility(state):
    v1 = basis_1(state) * c1
    v2 = basis_2(state) * c2
    # ... one weighted term per basis function ...
    vn = basis_n(state) * cn
    return v1 + v2 + ... + vn
```
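As a runnable sketch of the same idea (the basis functions, constants, and state representation below are illustrative stand-ins for the chess example, not the ones used in this project):

```python
def utility(state, consts, basis_functions):
    """Weighted sum of basis functions: c1*b1(state) + ... + cn*bn(state)."""
    return sum(consts['c' + str(i + 1)] * basis(state)
               for i, basis in enumerate(basis_functions))

# Toy example: two basis functions over a state that counts pieces.
basis_functions = [
    lambda s: s['my_pawns'] - s['their_pawns'],  # relative pawn count, weight 1
    lambda s: s['my_rooks'] - s['their_rooks'],  # relative rook count, weight 5
]
consts = {'c1': 1, 'c2': 5}
state = {'my_pawns': 6, 'their_pawns': 5, 'my_rooks': 2, 'their_rooks': 1}
print(utility(state, consts, basis_functions))  # 1*1 + 5*1 = 6
```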

```python
def td_learning(terminal_state, TD_CONSTS, prev_state):
    '''Modify TD_CONSTS according to the temporal difference algorithm.

    TD_CONSTS: the constants multiplied by the basis functions to
        compute the utility function.
    prev_state: the current state that the minimax search has just
        found a next move for.
    terminal_state: the state that the minimax search predicts will
        occur in the future if both players play ideally.  This is not
        the state that the AI chooses as its next move.

    Note: the number of moves between prev_state and terminal_state is
    equal to the number of ply searched, i.e. the depth_limit parameter
    passed to the minimax search function, ab.
    '''
    change_total_utility = (utility(terminal_state, TD_CONSTS)
                            - utility(prev_state, TD_CONSTS))
    sub1 = sub_utility(terminal_state, TD_CONSTS)
    sub2 = sub_utility(prev_state, TD_CONSTS)
    change_sub_utility = [sub1[i] - sub2[i] for i in range(len(sub1))]
    for i in range(len(TD_CONSTS)):
        TD_CONSTS['c' + str(i + 1)] -= ALPHA * change_total_utility * change_sub_utility[i]
    # Rescale the constants so their magnitudes stay comparable.
    TD_CONSTS = normalize(TD_CONSTS)
    return TD_CONSTS
```

The training regime proceeds by playing games in which the AI makes moves for both sides. After only a few games of training, one can see dramatic improvements in the relative values of the constants. After about 15-20 training games, the learning AI is typically capable of beating the static AI. Figure 1 shows how the learning AI's constants improve over a series of twenty training games. In this series of games, both AIs search to a depth of 3 ply, and the static AI's constants are the same as the default values in this application (i.e. {'c1': 3, 'c2': 2, 'c3': 0.5, 'c4': 0.5, 'c5': 0.5, 'c6': 0.5}). Notice that the static AI starts by consistently beating the learning AI, but the last five games are all won by the learning AI. Other patterns are also noticeable from the graph.
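To make the update concrete, here is a self-contained toy version of the TD step. The learning rate ALPHA and all the numbers are illustrative, and sub_prev/sub_term stand in for the per-basis contributions that sub_utility would return; normalization is omitted:

```python
ALPHA = 0.01  # learning rate; illustrative value, not from the project

def td_update(td_consts, sub_prev, sub_term):
    """One TD step: nudge each constant against the prediction error.

    sub_prev / sub_term: per-basis utility contributions (c_i * basis_i)
    for the previous state and the predicted terminal state.
    """
    change_total = sum(sub_term) - sum(sub_prev)
    for i in range(len(td_consts)):
        key = 'c' + str(i + 1)
        td_consts[key] -= ALPHA * change_total * (sub_term[i] - sub_prev[i])
    return td_consts

consts = {'c1': 3.0, 'c2': 2.0}
# Toy numbers: total utility jumped by (4+1) - (2+1) = 2 between states,
# and all of that jump came from the first basis function.
updated = td_update(consts, sub_prev=[2.0, 1.0], sub_term=[4.0, 1.0])
print(updated)  # c1 decreases by 0.01 * 2 * 2 = 0.04 -> 2.96; c2 unchanged
```

Note that a large jump in total utility (as happens near the end of a game, when points are scored quickly) produces a correspondingly large adjustment to the constants.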
The ending moves of a game tend to produce the largest changes in the constants. This makes sense because most of the points in a game are typically scored in its last quarter. When points are being won rapidly, the difference between the predicted utility and the actual utility is at its largest.

Written by Chet Weger. Questions, comments, bugs? Contact me at chetweger [at] gmail.com.

See my similar projects including Interactive Meta-Tic-Tac-Toe and Interactive Tic-Tac-Toe. Code for this project is available at https://github.com/chetweger/min-max-games.