The Importance of Benchmarking Foundation Models: Insights from Percy Liang

Stanford professor Percy Liang explains why benchmarking is crucial for AI progress and how HELM provides transparent, holistic evaluation of foundation models.

Welcome to our coverage of the HubSpot for Startups annual AI Summit in San Francisco. In this session, we're privileged to share insights from Percy Liang, Associate Professor at Stanford University, Director of the Stanford Center for Research on Foundation Models, and Co-founder of Together.xyz.

The Critical Role of Benchmarks in AI Development

Percy Liang opened his keynote by emphasizing how benchmarks fundamentally orient AI development. He posed thought-provoking questions: "If it weren't for ImageNet, what would the deep learning revolution look like in computer vision? If it weren't for SQuAD or SuperGLUE, what would the deep learning revolution look like in NLP?"

Benchmarks serve as the North Star for the AI community, providing clear targets to aim for. However, Liang stressed that benchmarking is more than just measurement—it encodes values:

  • Benchmarks determine what the community prioritizes
  • The choice of tasks, languages, and domains shapes how technology develops
  • Without proper benchmarking, innovation lacks direction

The Benchmarking Crisis in Foundation Models

We're living in an era of foundation models, with new models emerging monthly. This rapid pace of innovation has created what Liang describes as a "crisis" in our ability to measure model quality.

"The pace of innovation is happening so quickly that we're not able to measure," Liang noted. While benchmarks historically lasted years, foundation models are now advancing faster than our ability to evaluate them properly.

The Challenge of Evaluating General-Purpose Models

Language models represent a paradigm shift in AI—a seemingly simple box that takes text in and produces text out, yet capable of an infinite variety of applications:

  • Generating SQL queries
  • Writing emails
  • Revising content
  • Explaining jokes

This versatility is precisely what makes benchmarking so challenging. How do you comprehensively evaluate systems that can do so many different things?
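
To make the shape of the problem concrete, here is a minimal sketch of that single text-in, text-out interface being pointed at very different jobs. The complete() function and the prompts are hypothetical stand-ins, not code from the talk:

    def complete(prompt: str) -> str:
        """Stand-in for any foundation model's text-completion endpoint."""
        # A real deployment would call a hosted model here; this stub just echoes.
        return f"[model output for: {prompt!r}]"

    # The same interface handles very different jobs, which is why no single
    # task-specific test can characterize the model:
    sql = complete("Write a SQL query returning the ten most recent orders.")
    email = complete("Draft a short email moving tomorrow's meeting to Friday.")
    explanation = complete("Explain why this pun is funny: 'I put my phone on airplane mode, but it won't fly.'")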

What We Need from Language Model Benchmarking

Liang outlined several critical requirements for effective benchmarking:

  • Transparency: We need objective understanding of what models can and cannot do
  • Industry standards: Standardized evaluation methods are essential for fair comparisons
  • Grounded conversations: Factual foundations enable informed decisions for policymakers and businesses
  • Socially beneficial development: Benchmarks should guide the creation of reliable, beneficial models

Beyond just measuring accuracy, Liang emphasized the importance of evaluating:

  • Bias
  • Robustness
  • Calibration
  • Efficiency
  • Human values

HELM: Holistic Evaluation of Language Models

Last year, the Stanford Center for Research on Foundation Models released HELM (Holistic Evaluation of Language Models), designed with three core principles:

1. Broad Coverage with Recognition of Incompleteness

Rather than creating an arbitrary list of tasks, HELM employs a systematic taxonomy:

  • Tasks
  • Domains
  • Content creators
  • Time periods
  • Languages

This approach acknowledges gaps in coverage while providing a framework for expansion.

2. Multiple Metrics Beyond Accuracy

HELM measures multiple dimensions:

  • Accuracy: Basic correctness of outputs
  • Calibration: Whether models know what they don't know (see the sketch after this list)
  • Robustness: Consistency when inputs change slightly
  • Fairness: Performance across demographic groups
  • Bias: Tendency to generate stereotyped content
  • Toxicity: Propensity to produce harmful content
  • Efficiency: Practical considerations for deployment

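Calibration is the least self-explanatory of these dimensions, so here is a minimal sketch of one standard way to quantify it, expected calibration error: bucket predictions by stated confidence and compare each bucket's average confidence to its actual accuracy. HELM reports a calibration error in this spirit, but the binning choices and toy numbers below are illustrative assumptions, not HELM's implementation.

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        """Average |accuracy - confidence| per confidence bin, weighted by bin size."""
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if in_bin.any():
                gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
                ece += in_bin.mean() * gap
        return ece

    # Toy example: a model that is consistently more confident than it is correct.
    print(expected_calibration_error(
        confidences=[0.95, 0.90, 0.85, 0.80, 0.99, 0.70],
        correct=[1, 0, 1, 0, 1, 0],
    ))

A well-calibrated model keeps this number near zero; an overconfident one drifts upward.
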
3. Standardization

HELM systematically evaluates every model on the same set of scenarios, ensuring fair comparisons across a wide range of use cases (a minimal harness sketch follows the list):

  • Question answering
  • Information extraction
  • Summarization
  • Toxicity classification
  • Sentiment analysis
  • Various text classification tasks
  • Coding and reasoning tasks
  • Legal and medical datasets

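Conceptually, standardization is just a fixed cross-product of models and scenarios scored the same way, rather than each lab reporting its own preferred subset. Below is a minimal sketch of that harness loop with hypothetical model and scenario containers; HELM's actual abstractions, prompting, and metrics are far richer.

    from typing import Callable, Dict, List

    Model = Callable[[str], str]  # text in, text out

    def evaluate(models: Dict[str, Model],
                 scenarios: Dict[str, List[dict]]) -> Dict[tuple, float]:
        """Score every model on every scenario with the same instances and the
        same (here, exact-match) metric, so comparisons are apples to apples."""
        results = {}
        for scenario_name, instances in scenarios.items():
            for model_name, model in models.items():
                hits = sum(model(i["prompt"]).strip() == i["reference"] for i in instances)
                results[(model_name, scenario_name)] = hits / len(instances)
        return results
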
Models and Transparency

The HELM project evaluated 30 models from organizations including AI21, Anthropic, Cohere, EleutherAI, BigScience, Google, Meta, Microsoft, NVIDIA, OpenAI, Tsinghua, and Yandex.

A key commitment of HELM is full transparency:

  • All data tables are available online
  • Users can examine individual instances and model predictions
  • All code is open-source
  • The community can contribute scenarios, metrics, and models

Beyond Text: Expanding Evaluation

Liang also discussed several extensions to HELM that address broader evaluation needs:

Multimodal Models

Future benchmarking must encompass models that handle both text and images, evaluating aspects like:

  • Quality
  • Originality
  • Knowledge
  • Bias and toxicity

Human-in-the-Loop Evaluation

Since models like ChatGPT are used interactively, there's a need to evaluate human-AI interactions:

  • Interactive metrics don't always correlate with automatic metrics
  • Models might optimize for responses that are difficult for humans to work with

Generative Search Engines

Evaluating systems like Bing Chat and Perplexity revealed interesting findings:

  • Only 74% of citations actually support the generated statements (see the sketch after this list)
  • There's a trade-off between perceived utility and citation accuracy
  • Some models generate helpful-sounding but fabricated information
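
The 74% figure corresponds to what this line of work calls citation precision: of all the citations a system emits, the fraction judged to support the sentence they are attached to. Here is a toy illustration of that bookkeeping, with made-up sentences and support judgments:

    # Each generated sentence carries zero or more citations; an annotator marks
    # whether each citation actually supports its sentence.
    answers = [
        {"sentence": "The Eiffel Tower opened in 1889.", "citations_support": [True]},
        {"sentence": "It is repainted every seven years.", "citations_support": [True, False]},
        {"sentence": "It was originally planned for Barcelona.", "citations_support": []},  # uncited claim
    ]

    all_citations = [s for a in answers for s in a["citations_support"]]
    citation_precision = sum(all_citations) / len(all_citations)              # 2/3 in this toy case
    cited_sentence_rate = sum(any(a["citations_support"]) for a in answers) / len(answers)

    print(f"citation precision: {citation_precision:.0%}, sentences with any support: {cited_sentence_rate:.0%}")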

The Path Forward

Liang concluded by emphasizing that benchmarking is what orients AI development and determines its future direction. He invited the community to contribute to HELM, particularly with use cases from specialized domains like law, medicine, finance, coding, and creative content.

The goal is to create a comprehensive resource that tracks progress in foundation models and provides an up-to-date dashboard for the community. Like a lighthouse, HELM aims to shine light on this rapidly evolving field.

AI Disclaimer: The insights shared in this video or audio were initially distilled through advanced AI summarization technologies, with subsequent refinements made by the writer and our editorial team to ensure clarity and veracity.
