In a recent tweet, Computer Science Professor Sameer Singh asked, “Are Natural Language Processing models as good as they seem on leaderboards?” He then answered his own question: “We know they’re not, but there isn’t a structured way to test them.” He went on to introduce CheckList, a task-agnostic methodology for testing NLP models that he developed in collaboration with Marco Tulio Ribeiro of Microsoft Research and Tongshuang Wu and Carlos Guestrin at the University of Washington. The team presented their paper, “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList,” at the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). Their work was not only well received; it also won the Best Paper Award.

“As a community, we increasingly see NLP models that beat humans on accuracy on various datasets, yet we know that these models are not as good as humans for many of these tasks,” says Singh, explaining the motivation behind the work. “This made us think, what can we do about this mismatch in how we currently evaluate these models and what we think is their ‘true’ performance?”

Better Understanding NLP Models
The researchers addressed this mismatch using a matrix of general linguistic capabilities and test types that facilitates comprehensive test ideation. In addition, they developed a software tool that can quickly generate a large number of diverse test cases. The resulting CheckList tool confirmed what the researchers suspected: although certain commercial and research models could pass benchmark tests, further prodding revealed a variety of severe bugs and an inability to handle basic linguistic phenomena.

For example, user studies discovered new and actionable bugs in cloud AI offerings from Amazon, Google and Microsoft. The models struggled to identify the sentiment of each of the following statements, exposing previously unidentified failures when tested for temporal, negation and Semantic Role Labeling (SRL) capabilities, respectively:

  • “I used to hate this airline, although now I like it.”
  • “I thought the plane would be awful, but it wasn’t.”
  • “Do I think the pilot was fantastic? No.”
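Statements like these can be produced in bulk by filling slots in a template. The sketch below illustrates that idea in plain Python. It is not the CheckList tool's own API, and the `expand_template` helper and the slot word lists are hypothetical, chosen only to mirror the negation example above.

```python
from itertools import product


def expand_template(template, **slots):
    """Return one sentence for every combination of slot values."""
    names = list(slots)
    return [
        template.format(**dict(zip(names, combo)))
        for combo in product(*(slots[name] for name in names))
    ]


# Hypothetical slot values; a real test suite would draw on larger lexicons.
cases = expand_template(
    "I thought the {thing} would be {neg_adj}, but it {verb}.",
    thing=["plane", "crew", "food"],
    neg_adj=["awful", "terrible"],
    verb=["wasn't", "was not"],
)

# Every generated sentence negates a negative expectation, so a sentiment
# model should not label any of them as negative.
for sentence in cases[:3]:
    print(sentence)
```

Three small word lists already yield twelve test sentences, which suggests how a handful of templates can grow into a large and varied suite.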

The models also had problems with an invariance (INV) test, changing the sentiment for the statement “@AmericanAir thank you we got on a different flight to Chicago,” when “Chicago” was replaced with “Dallas.”
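An invariance test of this kind can be sketched in a few lines: perturb the input in a way that should not matter, then check that the prediction stays the same. The example below is an illustration in plain Python, not the CheckList implementation. The `toy_sentiment` model is a hypothetical stand-in with a deliberately planted flaw, and the replacement cities other than Dallas are assumptions.

```python
def toy_sentiment(text):
    """Hypothetical stand-in model: counts cue words to pick a label."""
    positive = {"thank", "great", "love"}
    negative = {"awful", "hate", "dallas"}  # "dallas" is a planted bug
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def invariance_failures(predict, text, original, replacements):
    """List the perturbations that change the model's prediction."""
    baseline = predict(text)
    failures = []
    for replacement in replacements:
        perturbed = text.replace(original, replacement)
        prediction = predict(perturbed)
        if prediction != baseline:
            failures.append((perturbed, baseline, prediction))
    return failures


tweet = "@AmericanAir thank you we got on a different flight to Chicago."
failures = invariance_failures(
    toy_sentiment, tweet, "Chicago", ["Dallas", "Boston", "Denver"]
)
for perturbed, before, after in failures:
    print(f"{before} -> {after}: {perturbed}")
```

The test needs no gold labels: it only asserts that swapping one city name for another should leave the predicted sentiment unchanged, so any change counts as a failure.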

This work illustrates the need for systematic testing in addition to standard evaluation. “These tasks may be considered ‘solved’ based on benchmark accuracy results,” note the authors, “but the tests highlight various areas of improvement — in particular, failure to demonstrate basic skills that are de facto needs for the task at hand (e.g. basic negation, agent/object distinction, etc.).”

The authors also evaluated Google’s BERT and Facebook AI’s RoBERTa models and, as pointed out in a recent VentureBeat article, found that BERT “exhibited gender bias in machine comprehension, overwhelmingly predicting men as doctors.” It also made “positive predictions about people who are straight or Asian and negative predictions when dealing with text about people who are atheist, Black, gay, or lesbian.” Identifying such algorithmic bias is critical to understanding how AI and machine learning can disrupt rather than perpetuate certain stereotypes.

Implementing CheckList
Development teams that participated in the user studies were very receptive to the work. One team responsible for a popular commercial sentiment analysis product said that CheckList helped them test capabilities that they hadn’t considered or that weren’t in the benchmarks. They were also able to more extensively test benchmarked capabilities and identify bugs that they could address in the next model iteration. The team was eager to incorporate CheckList into its development cycle.

“In our chats with NLP practitioners in the industry, it’s clear that current practices in ‘testing’ NLP systems, or machine learning in general, are quite heuristic and ad hoc,” says Singh. “CheckList provides a structure and the tooling to make it easier to think about testing, and provides a lot of evidence that thinking more about testing can lead to really important insights and discovery of bugs.”

All of the tests presented in the paper are part of CheckList’s open-source release and can be easily incorporated to complement existing benchmarks. An online video walks viewers through the findings. Furthermore, CheckList can be used to collectively create more exhaustive test suites for a variety of tasks, helping developers gain a more accurate understanding of the true capabilities, and flaws, of their systems.

Shani Murray