AI image generation models can produce impressive images. But how good are they at producing the images we actually want?
We have a lot of metrics for evaluating the quality of images generated by models (e.g. FID, CLIP score). We call these producibility metrics. But we're missing steerability metrics, which tell us how well models can be guided by people to produce the specific outputs they want.
Steering Arena is a game that helps us measure the steerability of generative models. When you play the game, you'll see an image that was originally generated by some model. Your task is to prompt the model to reproduce the same image. Because the goal image is generated by the model itself, this game measures a model's ability to be steered by people independently of the quality of images it can produce.
Models that produce high-quality images but can't be steered by people will not be useful. Conversely, models that are steerable for artificial reasons (e.g. a model that can only produce a single image is trivially easy to steer) will have poor producibility. It's important to measure both to understand a model's capabilities and how to improve them.
You'll occasionally be asked to rate which of two pairs of images is more similar to one another. These images come from other people playing the game. These ratings help form a model leaderboard that ranks models based on their steerability.
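This page doesn't specify how pairwise ratings are aggregated into a ranking, but one standard approach is a Bradley-Terry model, which fits a latent score to each model from head-to-head comparison counts. The sketch below is purely illustrative: the model names, win counts, and the choice of Bradley-Terry itself are assumptions, not a description of Steering Arena's actual method.

```python
# Hypothetical sketch: aggregating pairwise similarity judgments into a
# leaderboard with a Bradley-Terry model (Zermelo's iterative algorithm).
# The aggregation method and all data here are assumed for illustration.

def bradley_terry(wins, n_iters=100):
    """wins[(a, b)] = number of times model a 'beat' model b,
    i.e. a's reproduction was rated closer to its goal image."""
    models = {m for pair in wins for m in pair}
    p = {m: 1.0 for m in models}  # initialize all scores equally
    for _ in range(n_iters):
        new_p = {}
        for i in models:
            # total wins for model i
            num = sum(w for (a, b), w in wins.items() if a == i)
            # expected comparisons weighted by current scores
            den = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in models if j != i
            )
            new_p[i] = num / den if den > 0 else p[i]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}  # normalize to sum to 1
    return p

# Made-up example counts: model_a wins 7 of 10 comparisons against model_b.
wins = {("model_a", "model_b"): 7, ("model_b", "model_a"): 3}
scores = bradley_terry(wins)
leaderboard = sorted(scores, key=scores.get, reverse=True)
```

With these toy counts the fitted scores converge to 0.7 and 0.3, putting model_a first; with more models, the same iteration ranks all of them from the full matrix of pairwise judgments.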
For more details, see our paper:
“What’s Producible May Not Be Reachable: Measuring the Steerability of Generative Models”
by Keyon Vafa, Sarah Bentley, Jon Kleinberg, and Sendhil Mullainathan (2025).
We hope you’ll give the game a try and help us understand the steerability of generative models. If you have any questions or want to see your model on the leaderboard, email us at [email protected].