Teaching an AI to Draft Magic: the Gathering

By Zachary Witten

or, How To Train Your Robot To Train Your Dragon

In a Magic: the Gathering Draft, players iteratively select cards from packs to make the best deck they can. Using data from drafts carried out by humans, I trained a neural network to predict what card the humans would take out of each pack. It reached about 60% accuracy at this task. And in the roughly 40% of picks where the human and the model differ, the model’s predicted choice is often *better* than what the human picked. A notebook with the model training code and sample results can be found here.

  • Magic: the Gathering [MTG]: the best card game ever made. “Kind of like a cross between poker and chess, but with dragons”, is how I describe it to rideshare drivers on my way to tournaments. As a player, you summon and command mythical creatures of all sorts, which then slug it out with your opponent’s creatures as the two of you cast magical spells to spur them to greater heights.
  • Constructed: the most common way to play MTG. Before the match begins, each player constructs a deck of 60 cards and brings it to battle. Pros: self-expression (you get to choose exactly what cards go in your deck). Cons: having to acquire all the darn cards in advance ($$$).
  • Draft: the most fun way to play MTG, in this author’s opinion. You and 7 other players sit in a circle at a table. Each of you opens a pack of 15 cards. You draft one card from your pack, and then pass the remaining 14 to your left. You draft one card from those 14, and then pass 13 to your left. Repeat until all the cards are gone; you now have 15 cards in front of you. Then the cycle begins again: you and the other 7 each open a second pack. Repeat a third time and you have 45 cards. Then make the best deck you can out of the cards you drafted. The beauty of Draft is that everyone starts on a level playing field, and that each time is different because you’ll never draft the same deck twice.

To master Draft, you have to balance the raw power level of each card, the synergy of the cards with each other, and the signals that you’re receiving from the cards that are left in the pack at any given time.

The proprietor of Magic Flooey was kind enough to send me a .json file with 2130 drafts of the second-most-recent MTG set, Guilds of Ravnica.

I also scraped the archives of Draftsim, a site where humans can practice drafting against bots with predetermined pick orders. Models trained on this data reached similar accuracy to models trained on the Flooey data, but their picks were noticeably worse. The Draftsim users were making worse picks than the Flooey users, likely because they weren’t playing for stakes, and the model learned to imitate them. Having orders of magnitude more Draftsim drafts did not make up for this.

To represent the set of cards available to choose from in a pack, we can use a vector of length N, where N is the number of different cards in Guilds of Ravnica (250). Each card gets an index. The first pack you see has 15 cards in it, so its vector will have 15 ones (the indices of the cards in the pack), and 235 zeros. The second pack will have 14 ones and 236 zeros, etc.

To represent the cards we’ve already selected, we can use another vector of length N, where the value at each index is the number of that card that we’ve already drafted. For the first pick in every draft, it’ll be all zeros. For the second pick, there will be 1 one and 249 zeros. For the third pick, it’ll probably be 2 ones and 248 zeros… unless we took the same card out of both of the first two packs, in which case it’d be a 2 and 249 zeros.

In summary, the input to our model is a vector of length 2N.

The output is a one-hot encoded vector of length N set to be 1 at the index of the card taken by the human, and 0 elsewhere.
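The encoding described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the notebook: the function name `encode_example` and the example card indices are made up for the sketch, and cards are assumed to be numbered 0 to N-1.

```python
N = 250  # number of distinct cards in the set, per the text

def encode_example(pack, picked_so_far, human_pick, n_cards=N):
    """Return (x, y): x is the 2N-length input, y the N-length one-hot target.

    pack, picked_so_far: lists of card indices in range(n_cards).
    human_pick: index of the card the human took from this pack.
    (All names here are hypothetical, chosen for illustration.)
    """
    pack_vec = [0] * n_cards
    for card in pack:
        pack_vec[card] = 1        # 1 if the card is available in this pack
    picked_vec = [0] * n_cards
    for card in picked_so_far:
        picked_vec[card] += 1     # number of copies already drafted
    target = [0] * n_cards
    target[human_pick] = 1        # one-hot label for the human's pick
    return pack_vec + picked_vec, target

# Third pick of a draft: 13 cards left in the pack, and the drafter
# happened to take card 7 out of both of the first two packs.
x, y = encode_example(pack=list(range(10, 23)), picked_so_far=[7, 7], human_pick=12)
```

Here `x` has length 500: the first 250 entries mark the pack contents and the last 250 count the cards already drafted.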

This representation ignores one important feature of Draft: memory of what cards you saw in a previous pack and *didn’t* take. Professional players use this information to make guesses about what cards they’re likely to see in future packs. We could represent this by adding 45 more vectors of length N to each input. The first of these vectors would represent the cards we saw for our first pick, unless this *is* our first pick, in which case it’s all zeros. In general, the Kth of these vectors represents the cards we saw at our Kth pick if we have already had K picks; otherwise it’s all zeros.

Including this info in the model multiplies the size of the input representation by a factor of roughly 24 (from 2N to 47N values), which isn’t ideal with only about 2,000 × 45 data points (roughly 96,000 picks). With more data I think it could be a valuable feature to add.
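To make the size of that extra feature concrete, here is a hedged sketch of how the 45 history vectors could be laid out, along with the arithmetic behind the growth factor. The helper `encode_history` is hypothetical; I did not build this feature.

```python
N = 250      # distinct cards in the set
PICKS = 45   # picks per draft (3 packs of 15)

def encode_history(packs_seen, n_cards=N, n_picks=PICKS):
    """Flatten the packs seen so far into n_picks * n_cards values.

    packs_seen[k] lists the card indices seen at pick k+1.
    Slots for picks that have not happened yet stay all zeros.
    """
    history = [0] * (n_picks * n_cards)
    for k, pack in enumerate(packs_seen):
        for card in pack:
            history[k * n_cards + card] = 1
    return history

# About to make the second pick: only the first pack has been seen.
history = encode_history([[3, 8, 21]])

base_len = 2 * N                  # pack vector + picked vector
full_len = base_len + PICKS * N   # plus 45 history vectors
growth = full_len / base_len      # 47N / 2N
```

The ratio works out to 47N / 2N = 23.5, which is the "factor of roughly 24" above.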

I experimented with a linear SGD classifier that reached 50% accuracy. I then tried a neural network with a couple of 512-neuron dense layers, separated by dropout layers, feeding into an N-length softmax layer, and got to 59% accuracy.
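For readers who want to see the shape of that network, here is a forward pass written in NumPy. It is a sketch under assumptions, not the notebook's code: the ReLU activation, the weight initialization, and the dropout rate are my guesses for illustration, and the weights are random rather than trained.

```python
import numpy as np

N = 250        # distinct cards in the set
HIDDEN = 512   # neurons per dense layer, per the text

rng = np.random.default_rng(0)

def init_params(n_cards=N, hidden=HIDDEN):
    """Random (weights, bias) pairs for layers 2N -> 512 -> 512 -> N."""
    sizes = [2 * n_cards, hidden, hidden, n_cards]
    return [(rng.normal(0.0, 0.05, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x, drop_rate=0.0):
    """x has shape (batch, 2N); returns (batch, N) pick probabilities."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)    # dense + ReLU (activation is an assumption)
        if drop_rate > 0.0:               # inverted dropout, used only when training
            keep = rng.random(h.shape) >= drop_rate
            h = h * keep / (1.0 - drop_rate)
    w, b = params[-1]
    logits = h @ w + b
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)   # softmax over the N cards

params = init_params()
probs = forward(params, np.zeros((4, 2 * N)))     # a dummy batch of 4 inputs
```

Each row of `probs` sums to 1 and assigns a probability to every card in the set. Training (a cross-entropy loss against the one-hot human pick) is left out of the sketch.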

I went through some of the drafts in the test set and had the model say at each stage what card it would have picked, given the cards already picked, alongside the cards the human picked. You can see them in the notebook accompanying this post.
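The comparison boils down to turning the model's output into a single pick and checking it against the human's. A small sketch follows. One loud assumption: it only considers cards that are actually in the pack, which this post does not specify. The function names are hypothetical.

```python
def model_pick(probs, pack):
    """Return the card index in `pack` with the highest predicted probability.

    Restricting the choice to cards in the pack is an assumption of this sketch.
    """
    return max(pack, key=lambda card: probs[card])

def top1_accuracy(examples):
    """examples: iterable of (probs, pack, human_pick) triples."""
    examples = list(examples)
    hits = sum(model_pick(probs, pack) == human for probs, pack, human in examples)
    return hits / len(examples)

# Toy 4-card "set": card 1 scores highest overall but is not in the first pack.
toy_probs = [0.1, 0.5, 0.3, 0.1]
first = model_pick(toy_probs, pack=[0, 2, 3])
accuracy = top1_accuracy([
    (toy_probs, [0, 2, 3], 2),   # model and human agree on card 2
    (toy_probs, [0, 1], 0),      # model prefers card 1, human took card 0
])
```

In the toy example the model agrees with the human on one of two picks, so `accuracy` is 0.5.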

The model definitely understands the concept of color. In MTG there are 5 colors, and any given draft deck will likely only play cards from 2 or 3 of those colors. So if you’ve already taken a blue card, you should be more likely to take blue cards in future picks. We didn’t tell the model about this, and we also didn’t tell it which cards were which color. But it learned anyway, by observing which cards were often drafted in combination with each other.

Since MTG hasn’t been solved the way Chess and Go sort of have, it’s impossible to say with certainty exactly how good the model’s picks are. I’m a decent player and when I look at the cards it takes, I don’t see anything obviously wrong. In fact, when it and the human drafter disagreed, I found I tended to prefer the model’s picks. But I’m biased, so please look at the picks yourself and tell me what you think, or have your most MTG-knowledgeable friend check it out and report back.

Ask a thousand humans to guess how many jelly beans are in a big jar of jelly beans. Average the absolute error of each guess. Then average the guesses, and take the absolute error of *that*. The latter will be less wrong.
 http://wisdomofcrowds.blogspot.com/2009/12/chapter-one-part-i.html

I think something like that is happening here. Everyone overrates some cards, and underrates others. Average all those errors together and you end up in a better place.
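The jelly bean effect is easy to simulate. The sketch below assumes guesses scattered symmetrically around the true count, with made-up numbers for the truth and the spread; real crowds need not behave this way.

```python
import random
from statistics import mean

random.seed(0)

truth = 1000   # true number of jelly beans (made up for the demo)
# Assumption: each guess is the truth plus independent, unbiased noise.
guesses = [truth + random.gauss(0, 300) for _ in range(1000)]

# Average of each guesser's absolute error.
mean_individual_error = mean(abs(g - truth) for g in guesses)

# Absolute error of the averaged guess.
crowd_error = abs(mean(guesses) - truth)
```

By the triangle inequality, `crowd_error` can never exceed `mean_individual_error`, whatever the guesses are. How much smaller it gets depends on how much the individual errors cancel out: if everyone overrates the same card, averaging won't fix that.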

If we had match results for each deck in the training set (i.e. how many games the decks ended up winning), we could have the model optimize not just for similarity to human picks, but for the quality of the expected deck it’d get after making a given pick. For each card in the pack, the model would dream through thousands of ways the draft might go on from that point, and evaluate the resulting decks. Then it takes the card that gave it the best dreams. I.e. we could go full AlphaZero.
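A very rough sketch of the "dreaming" step, to show the shape of the idea. It is flat Monte Carlo with future picks drawn uniformly at random, which is far simpler than AlphaZero's tree search guided by learned networks. The `evaluate_deck` function is hypothetical: a real one would have to be learned from match results, which I don't have. The toy scorer below just rewards sticking to one "color", where a card's color is faked as its index modulo 5.

```python
import random
from collections import Counter

def best_pick_by_rollouts(pack, picked_so_far, future_pool, evaluate_deck,
                          picks_remaining, n_rollouts=200, seed=0):
    """Score each card in `pack` by the average value of imagined final decks."""
    rng = random.Random(seed)
    # Imagine n_rollouts ways the rest of the draft could go. Here the future
    # picks are uniform random draws, a placeholder for a real draft simulation.
    futures = [[rng.choice(future_pool) for _ in range(picks_remaining)]
               for _ in range(n_rollouts)]
    scores = {}
    for candidate in pack:
        total = sum(evaluate_deck(picked_so_far + [candidate] + future)
                    for future in futures)
        scores[candidate] = total / n_rollouts
    return max(scores, key=scores.get), scores

def toy_evaluate_deck(deck):
    """Hypothetical stand-in for deck quality: size of the largest 'color' group."""
    return max(Counter(card % 5 for card in deck).values())

best, scores = best_pick_by_rollouts(
    pack=[15, 1, 2],
    picked_so_far=[0, 5, 10],        # three cards of toy color 0
    future_pool=list(range(250)),
    evaluate_deck=toy_evaluate_deck,
    picks_remaining=3,
)
```

With three toy-color-0 cards already drafted, the rollouts favor card 15, the only card in the pack that shares that color.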

Another exciting possibility would be to add data from MTGJSON into the model. Right now, each card is just an arbitrary integer between 1 and 250, so the model knows nothing about a card beyond how humans drafted it. With the info from MTGJSON, we’d be adding explicit info about what cards are similar. This would allow the model to draft a new set without seeing humans draft it first.
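As a sketch of what "explicit info" could look like, the snippet below turns a card record into a small feature vector. The record is a hand-written, hypothetical example, and the field names (`colors`, `types`, `mana_value`) are illustrative assumptions that may not match MTGJSON's actual schema.

```python
COLORS = ["W", "U", "B", "R", "G"]   # the five MTG colors
TYPES = ["Creature", "Instant", "Sorcery", "Enchantment", "Artifact", "Land"]  # partial list

def card_features(card):
    """Encode a card dict as color bits + type bits + mana value.

    The dict layout is an assumption made for this sketch.
    """
    color_bits = [1.0 if c in card["colors"] else 0.0 for c in COLORS]
    type_bits = [1.0 if t in card["types"] else 0.0 for t in TYPES]
    return color_bits + type_bits + [float(card["mana_value"])]

# A made-up card, not a real one.
example_card = {
    "name": "Hypothetical Drake",
    "colors": ["U"],
    "types": ["Creature"],
    "mana_value": 3,
}
features = card_features(example_card)
```

Two cards with similar feature vectors would look similar to the model even if no human in the training data had ever drafted them.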

Until next time, may you be unable to distinguish sufficiently advanced Magic from technology.