The Importance of Benchmarking Foundation Models: Insights from Percy Liang
Stanford professor Percy Liang explains why benchmarking is crucial for AI progress and how HELM provides transparent, holistic evaluation of foundation models.

Welcome to our coverage of the HubSpot for Startups annual AI Summit in San Francisco. In this session, we're privileged to share insights from Percy Liang, Associate Professor at Stanford University, Director of the Stanford Center for Research on Foundation Models, and co-founder of Together.xyz.
Percy Liang opened his keynote by emphasizing how benchmarks fundamentally orient AI development. He posed thought-provoking questions: "If it weren't for ImageNet, what would the deep learning revolution look like in computer vision? If it weren't for SQuAD or SuperGLUE, what would the deep learning revolution look like in NLP?"
Benchmarks serve as the North Star for the AI community, providing clear targets to aim for. However, Liang stressed that benchmarking is more than just measurement; it also encodes values, shaping what the community chooses to optimize.
We're living in an era of foundation models, with new models emerging monthly. This rapid pace of innovation has created what Liang describes as a "crisis" in our ability to measure model quality.
"The pace of innovation is happening so quickly that we're not able to measure," Liang noted. While benchmarks historically lasted years, foundation models are now advancing faster than our ability to evaluate them properly.
Language models represent a paradigm shift in AI: a seemingly simple box that takes text in and produces text out, yet capable of an enormous variety of applications.
This versatility is precisely what makes benchmarking so challenging. How do you comprehensively evaluate systems that can do so many different things?
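To make that text-in, text-out interface concrete, here is a minimal sketch using the Hugging Face transformers library. The model (GPT-2) and prompts are arbitrary illustrations chosen for brevity, not examples from the talk; the point is simply that the same box serves very different tasks depending only on the prompt.

```python
from transformers import pipeline

# Any causal language model exposes the same interface:
# a string goes in, a continuation comes out.
# GPT-2 is used here only because it is small and widely available.
generator = pipeline("text-generation", model="gpt2")

# The same box can be pointed at different "applications"
# just by changing the prompt.
prompts = [
    "Summarize in one sentence: Benchmarks orient the AI community by",
    "Q: What is a foundation model?\nA:",
]

for prompt in prompts:
    result = generator(prompt, max_new_tokens=30, do_sample=False)
    print(result[0]["generated_text"])
```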
Liang outlined several critical requirements for effective benchmarking. Beyond just measuring accuracy, he emphasized the importance of evaluating qualities such as calibration, robustness, fairness, and efficiency.
Last year, the Stanford Center for Research on Foundation Models released HELM (Holistic Evaluation of Language Models), designed around three core principles: broad coverage with explicit recognition of incompleteness, multi-metric measurement, and standardization.
Rather than creating an arbitrary list of tasks, HELM employs a systematic taxonomy, defining evaluation scenarios in terms of the task, the domain of the text, and the language.
This approach acknowledges gaps in coverage while providing a framework for expansion.
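One way to picture such a taxonomy is as a grid of scenarios, each described by a task, a domain, and a language. The sketch below is only an illustrative data model with made-up field values, not HELM's actual schema, but it shows how enumerating the grid makes both coverage and gaps explicit.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    """A point in the benchmark space: what task, on what kind of data."""
    task: str      # e.g. "question_answering", "summarization"
    domain: str    # what the text is about and where it comes from
    language: str  # which language the text is in

# Hypothetical axes; real taxonomies would be far larger.
tasks = ["question_answering", "summarization", "toxicity_detection"]
domains = ["news", "biomedical", "social_media"]
languages = ["English", "Spanish"]

scenarios = [
    Scenario(task, domain, language)
    for task in tasks
    for domain in domains
    for language in languages
]

# Pretend only the English scenarios are currently covered:
covered = {s for s in scenarios if s.language == "English"}
print(f"{len(covered)}/{len(scenarios)} scenarios covered")  # gaps are visible
```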
Across these scenarios, HELM measures multiple dimensions: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.
HELM systematically evaluates all models on the same scenarios, ensuring fair comparisons across a wide range of use cases.
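The sketch below illustrates what that kind of standardization might look like in code: every model is run on every scenario and scored on the same set of metrics. The model names, scenario identifiers, and the evaluate function are placeholders for illustration, not HELM's actual API.

```python
import random

# Placeholder identifiers, not real systems or HELM scenario names.
MODELS = ["model_a", "model_b", "model_c"]
SCENARIOS = ["qa_news", "summarization_biomedical", "toxicity_social_media"]
METRICS = ["accuracy", "calibration", "robustness", "fairness",
           "bias", "toxicity", "efficiency"]

def evaluate(model: str, scenario: str, metric: str) -> float:
    """Stand-in for actually running `model` on `scenario` and scoring `metric`."""
    return round(random.random(), 3)  # dummy score in place of a real evaluation

# The key property of standardization: the grid is dense. Every model is run
# on every scenario, and every run is scored on the same metrics, so any two
# models can be compared head to head on any scenario.
results = {
    (model, scenario, metric): evaluate(model, scenario, metric)
    for model in MODELS
    for scenario in SCENARIOS
    for metric in METRICS
}

for model in MODELS:
    accuracy_row = [results[(model, scenario, "accuracy")] for scenario in SCENARIOS]
    print(model, accuracy_row)
```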
The HELM project evaluated 30 models from organizations including AI21 Labs, Anthropic, Cohere, EleutherAI, BigScience, Google, Meta, Microsoft, NVIDIA, OpenAI, Tsinghua, and Yandex.
A key commitment of HELM is full transparency: the prompts, raw model completions, and results are published openly, so anyone can inspect how each number was produced.
Liang also discussed several extensions to HELM that address broader evaluation needs:
Future benchmarking must also encompass multimodal models that handle both text and images.
Since models like ChatGPT are used interactively, there is also a need to evaluate human-AI interaction.
Evaluating retrieval-augmented systems like Bing Chat and Perplexity also revealed interesting findings, particularly around how often the sources they cite actually support the answers they generate.
Liang concluded by emphasizing that benchmarking is what orients AI development and determines its future direction. He invited the community to contribute to HELM, particularly with use cases from specialized domains like law, medicine, finance, coding, and creative content.
The goal is to create a comprehensive resource that tracks progress in foundation models and provides an up-to-date dashboard for the community. Like a lighthouse, HELM aims to shine light on this rapidly evolving field.
AI Disclaimer: The insights shared in this video or audio were initially distilled through advanced AI summarization technologies, with subsequent refinements made by the writer and our editorial team to ensure clarity and veracity.