The Importance of Benchmarking Foundation Models: Insights from Percy Liang

Stanford professor Percy Liang explains why benchmarking is crucial for AI progress and how HELM provides transparent, holistic evaluation of foundation models.

Welcome to our coverage of the HubSpot for Startups annual AI Summit in San Francisco. In this session, we're privileged to share insights from Percy Liang, Associate Professor at Stanford University, Director of the Stanford Center for Research on Foundation Models, and Co-founder of Together.xyz.

The Critical Role of Benchmarks in AI Development

Percy Liang opened his keynote by emphasizing how benchmarks fundamentally orient AI development. He posed thought-provoking questions: "If it weren't for ImageNet, what would the deep learning revolution look like in computer vision? If it weren't for SQuAD or SuperGLUE, what would the deep learning revolution look like in NLP?"

Benchmarks serve as the North Star for the AI community, providing clear targets to aim for. However, Liang stressed that benchmarking is more than just measurement—it encodes values:

  • Benchmarks determine what the community prioritizes
  • The choice of tasks, languages, and domains shapes how technology develops
  • Without proper benchmarking, innovation lacks direction

The Benchmarking Crisis in Foundation Models

We're living in an era of foundation models, with new models emerging monthly. This rapid pace of innovation has created what Liang describes as a "crisis" in our ability to measure model quality.

"The pace of innovation is happening so quickly that we're not able to measure," Liang noted. While benchmarks historically lasted years, foundation models are now advancing faster than our ability to evaluate them properly.

The Challenge of Evaluating General-Purpose Models

Language models represent a paradigm shift in AI—a seemingly simple box that takes text in and produces text out, yet capable of an infinite variety of applications:

  • Generating SQL queries
  • Writing emails
  • Revising content
  • Explaining jokes

This versatility is precisely what makes benchmarking so challenging. How do you comprehensively evaluate systems that can do so many different things?
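
To make the shape of the problem concrete, here is a minimal sketch of that single text-in, text-out interface being pointed at very different jobs. The complete() function and the prompts are hypothetical stand-ins, not code from the talk:

    def complete(prompt: str) -> str:
        """Stand-in for any foundation model's text-completion endpoint."""
        # A real deployment would call a hosted model here; this stub just echoes.
        return f"[model output for: {prompt!r}]"

    # The same interface handles very different jobs, which is why no single
    # task-specific test can characterize the model:
    sql = complete("Write a SQL query returning the ten most recent orders.")
    email = complete("Draft a short email moving tomorrow's meeting to Friday.")
    explanation = complete("Explain why this pun is funny: 'I put my phone on airplane mode, but it won't fly.'")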

What We Need from Language Model Benchmarking

Liang outlined several critical requirements for effective benchmarking:

  • Transparency: We need objective understanding of what models can and cannot do
  • Industry standards: Standardized evaluation methods are essential for fair comparisons
  • Grounded conversations: Factual foundations enable informed decisions for policymakers and businesses
  • Socially beneficial development: Benchmarks should guide the creation of reliable, beneficial models

Beyond just measuring accuracy, Liang emphasized the importance of evaluating:

  • Bias
  • Robustness
  • Calibration
  • Efficiency
  • Human values

HELM: Holistic Evaluation of Language Models

Last year, the Stanford Center for Research on Foundation Models released HELM (Holistic Evaluation of Language Models), designed with three core principles:

1. Broad Coverage with Recognition of Incompleteness

Rather than creating an arbitrary list of tasks, HELM employs a systematic taxonomy:

  • Tasks
  • Domains
  • Content creators
  • Time periods
  • Languages

This approach acknowledges gaps in coverage while providing a framework for expansion.

2. Multiple Metrics Beyond Accuracy

HELM measures multiple dimensions:

  • Accuracy: Basic correctness of outputs
  • Calibration: Whether models know what they don't know (see the sketch after this list)
  • Robustness: Consistency when inputs change slightly
  • Fairness: Performance across demographic groups
  • Bias: Tendency to generate stereotyped content
  • Toxicity: Propensity to produce harmful content
  • Efficiency: Practical considerations for deployment

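Calibration is the least self-explanatory of these dimensions, so here is a minimal sketch of one standard way to quantify it, expected calibration error: bucket predictions by stated confidence and compare each bucket's average confidence to its actual accuracy. HELM reports a calibration error in this spirit, but the binning choices and toy numbers below are illustrative assumptions, not HELM's implementation.

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        """Average |accuracy - confidence| per confidence bin, weighted by bin size."""
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if in_bin.any():
                gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
                ece += in_bin.mean() * gap
        return ece

    # Toy example: a model that is consistently more confident than it is correct.
    print(expected_calibration_error(
        confidences=[0.95, 0.90, 0.85, 0.80, 0.99, 0.70],
        correct=[1, 0, 1, 0, 1, 0],
    ))

A well-calibrated model keeps this number near zero; an overconfident one drifts upward.
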
3. Standardization

HELM systematically evaluates every model on the same set of scenarios, ensuring fair comparisons across a wide range of use cases (a minimal harness sketch follows the list):

  • Question answering
  • Information extraction
  • Summarization
  • Toxicity classification
  • Sentiment analysis
  • Various text classification tasks
  • Coding and reasoning tasks
  • Legal and medical datasets

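Conceptually, standardization is just a fixed cross-product of models and scenarios scored the same way, rather than each lab reporting its own preferred subset. Below is a minimal sketch of that harness loop with hypothetical model and scenario containers; HELM's actual abstractions, prompting, and metrics are far richer.

    from typing import Callable, Dict, List

    Model = Callable[[str], str]  # text in, text out

    def evaluate(models: Dict[str, Model],
                 scenarios: Dict[str, List[dict]]) -> Dict[tuple, float]:
        """Score every model on every scenario with the same instances and the
        same (here, exact-match) metric, so comparisons are apples to apples."""
        results = {}
        for scenario_name, instances in scenarios.items():
            for model_name, model in models.items():
                hits = sum(model(i["prompt"]).strip() == i["reference"] for i in instances)
                results[(model_name, scenario_name)] = hits / len(instances)
        return results
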
Models and Transparency

The HELM project evaluated 30 models from organizations including AI21, Anthropic, Cohere, EleutherAI, BigScience, Google, Meta, Microsoft, NVIDIA, OpenAI, Tsinghua, and Yandex.

A key commitment of HELM is full transparency:

  • All data tables are available online
  • Users can examine individual instances and model predictions
  • All code is open-source
  • The community can contribute scenarios, metrics, and models

Beyond Text: Expanding Evaluation

Liang also discussed several extensions to HELM that address broader evaluation needs:

Multimodal Models

Future benchmarking must encompass models that handle both text and images, evaluating aspects like:

  • Quality
  • Originality
  • Knowledge
  • Bias and toxicity

Human-in-the-Loop Evaluation

Since models like ChatGPT are used interactively, there's a need to evaluate human-AI interactions:

  • Interactive metrics don't always correlate with automatic metrics
  • Models might optimize for responses that are difficult for humans to work with

Generative Search Engines

Evaluating systems like Bing Chat and Perplexity revealed interesting findings:

  • Only 74% of citations actually support the generated statements (see the sketch after this list)
  • There's a trade-off between perceived utility and citation accuracy
  • Some models generate helpful-sounding but fabricated information
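
The 74% figure corresponds to what this line of work calls citation precision: of all the citations a system emits, the fraction judged to support the sentence they are attached to. Here is a toy illustration of that bookkeeping, with made-up sentences and support judgments:

    # Each generated sentence carries zero or more citations; an annotator marks
    # whether each citation actually supports its sentence.
    answers = [
        {"sentence": "The Eiffel Tower opened in 1889.", "citations_support": [True]},
        {"sentence": "It is repainted every seven years.", "citations_support": [True, False]},
        {"sentence": "It was originally planned for Barcelona.", "citations_support": []},  # uncited claim
    ]

    all_citations = [s for a in answers for s in a["citations_support"]]
    citation_precision = sum(all_citations) / len(all_citations)              # 2/3 in this toy case
    cited_sentence_rate = sum(any(a["citations_support"]) for a in answers) / len(answers)

    print(f"citation precision: {citation_precision:.0%}, sentences with any support: {cited_sentence_rate:.0%}")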

The Path Forward

Liang concluded by emphasizing that benchmarking is what orients AI development and determines its future direction. He invited the community to contribute to HELM, particularly with use cases from specialized domains like law, medicine, finance, coding, and creative content.

The goal is to create a comprehensive resource that tracks progress in foundation models and provides an up-to-date dashboard for the community. Like a lighthouse, HELM aims to shine light on this rapidly evolving field.

AI Disclaimer: The insights shared in this video or audio were initially distilled through advanced AI summarization technologies, with subsequent refinements made by the writer and our editorial team to ensure clarity and veracity.
