An Infrastructure for Testing Embodied AI

Simon Box
June 1, 2024

Virtual experiences are the unit tests of Embodied AI. Here we explore the infrastructure and tooling required to unlock rapid insights through testing with virtual experiences.

Why We Are Rebuilding Software Testing for AI

The first blog post in this Testing AI series - “Testing Times Ahead for Embodied AI” - looked at how advancements in AI are fundamentally changing the nature of software, and especially how we test it.

Here’s a lightning summary, in case you missed the first post.

Attempts to embody AI in robots, drones, and self-driving cars raise the bar on the reliability that we require of that AI software. Reliability goes hand-in-hand with great testing, but the stalwart of Software 1.0 testing, the unit test, does not help the modern, Software 2.0 AI stack, where the logic under test is increasingly encoded in data, not source code.

Nonetheless, the core tenets of Software 1.0 testing - Coverage, Automation, and Rapid Insights - are more relevant than ever. In this, the second post in our Testing AI series, we will explore how tools being developed at ReSim and elsewhere are providing Coverage, Automation, and Rapid Insights for the world of Software 2.0 and embodied AI.

Virtual Experiences are the Unit Tests of Embodied AI

Evaluating the behavior of Embodied AI requires feeding sequences of input data into the AI software to analyze its planned behavior. Take the example of a drone landing to deliver a package. The sequences of input data in this case are the images from the drone’s camera and any other onboard sensors that generate thousands or millions of data points during the few seconds required to successfully land. Those data are fed into the drone’s AI software to analyze the trajectory it plans to fly.

One method for running that test is to load the AI software onto a real drone and fly an actual delivery! This, however, lacks the convenience, repeatability, and scalability of software unit tests, which Software 1.0 development teams run in the thousands to evaluate every code change.

It is more efficient to fly the mission once, record the data and store it - enabling the easy replay of the data through the drone’s AI software at any time. Further, we can do this on any computer, from a developer’s laptop to the cloud. And we can do it over-and-over, as often as needed, including on new versions of the AI. We call these virtual experiences.
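To make this concrete, here is a minimal sketch of a replay harness in Python. The data shapes and the `plan_step` hook are hypothetical placeholders for whatever your log format and autonomy stack actually provide; the point is simply that recorded sensor frames are fed through the AI under test, step by step, and its planned output is collected for analysis.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical data shapes - stand-ins for whatever the drone's logs contain.
@dataclass
class SensorFrame:
    timestamp: float
    images: list   # raw camera frames
    imu: dict      # accelerometer / gyro readings

@dataclass
class PlannedCommand:
    timestamp: float
    velocity_setpoint: tuple

def replay(frames: Iterable[SensorFrame],
           plan_step: Callable[[SensorFrame], PlannedCommand]) -> List[PlannedCommand]:
    """Feed recorded sensor frames through the AI under test, one step at a
    time, and collect the trajectory it plans. `plan_step` wraps the autonomy
    stack; the same recording can be replayed against any version of it."""
    return [plan_step(frame) for frame in frames]
```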

This type of virtual experience is known as a ‘replay’; however, virtual experiences can also be built using a simulator. Another alternative is to use generative AI models, trained on replays, to produce new virtual experiences.

There are, of course, pros and cons for each approach to creating virtual experiences, which we will break down in a future installment of this blog series. In this post, we focus on the foundational infrastructure and tooling needed for AI developers to run their preferred virtual experiences with the same scale and ease as the unit tests of Software 1.0.

Rebuilding CI for AI

Software 1.0 developers have access to an incredible array of mature, high-quality tools to support their testing workflows. Continuous Integration (CI) tools help developers analyze the coverage of their unit tests. They leverage automation to run unit tests on every code change. They provide rapid insights back to the developer when tests fail.

If virtual experiences are the unit tests of AI software, we must explore how this changes the requirements on our tooling for Coverage, Automation, and Rapid Insights. 

Coverage

In the case of Software 1.0, the concept of coverage maps neatly to lines of code (a unit test covers a small logical unit of code). In the case of AI software (Software 2.0), that mapping is less clear. Virtual experiences don’t really correspond to any particular subset of the model parameters in the AI under test. Instead, the concept of coverage means the virtual experiences cover the full range of real experiences the AI might encounter.

This can be a daunting task, especially if the AI is operating in a complicated environment. Few AI developers have experienced these challenges more acutely than those working on self-driving vehicles, which operate in complex environments with a vast range of potential experiences.

Inevitably this has led self-driving vehicle manufacturers to build vast numbers of virtual experiences in an effort to obtain good coverage. This is why you hear stories about Waymo driving 20 billion miles in simulation and Aurora running five million virtual experiences every day. While impressive, these numbers should not surprise us. In fact, they are consistent with the number of unit tests that are run daily in comparably large Software 1.0 teams. 

Of course, this problem is not unique to self-driving. Other embodied AI applications, like humanoid robots and delivery drones in urban environments, are staring down the same challenge. In fact, any AI that must behave with high reliability in a complex domain faces this challenge, particularly where the safety of humans is at risk.

The ability to understand and monitor coverage of virtual experiences is extremely important and starts with powerful tools. Embodied AI developers need a library to store and manage volumes of virtual experiences and categorize them into relevant groups. 

Returning to the delivery drone example, the developer may want a group of virtual “landing” experiences, as well as experiences where it is “dark” or “raining”, so they can easily search, filter, and compose testing groups (test suites) - for example, “all the landing experiences where it is both dark and raining.”

Filtering in the Experiences Library
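As a rough illustration of the idea, here is a minimal, in-memory sketch of tag-based filtering in Python. The `Experience` class and the tag names are hypothetical; a real experience library would sit behind a database or managed service, but composing a test suite looks much the same.

```python
from dataclasses import dataclass, field

@dataclass
class Experience:
    name: str
    tags: set = field(default_factory=set)

# A toy library of virtual experiences, each tagged by scenario attributes.
library = [
    Experience("landing_suburb_day", {"landing"}),
    Experience("landing_rooftop_night", {"landing", "dark"}),
    Experience("landing_rain_night", {"landing", "dark", "raining"}),
    Experience("cruise_rain_day", {"raining"}),
]

def build_suite(experiences, required_tags):
    """Compose a test suite: every experience carrying all of the required tags."""
    return [e for e in experiences if required_tags <= e.tags]

# "All the landing experiences where it is both dark and raining."
suite = build_suite(library, {"landing", "dark", "raining"})
# -> [Experience(name="landing_rain_night", tags={"landing", "dark", "raining"})]
```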

AI developers also need to easily correlate their experiences with tests - running the experiences through versions of their AI software. They need to understand how coverage maps to performance. In fact, this is key to understanding whether coverage is sufficient. So, the experience library must be integrated with other tooling for Automation of test runs and Rapid Insights into their results.

Automation

With the help of an experience library, developers can organize their virtual experiences into Test Suites, which may be automatically run against new versions of their AI software. 

Actually running the tests, however, brings a few more challenges. Unit tests tend to be very small - just a few lines of code. As a result, many thousands can be run quickly on a single machine. Virtual experiences tend to be much more computationally demanding, involving the AI running over a period of time, with inference steps that often require a GPU.

Running thousands of virtual experiences in a reasonable time requires leveraging the inherent parallelism of cloud computing, orchestrating many thousands of tests scheduled on hundreds of machines, while gracefully managing prioritization, resource conflicts and unexpected failures. 
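The sketch below illustrates the fan-out pattern at its simplest, using Python’s standard process pool on a single machine. It is not ReSim’s scheduler - a production system distributes the work across many cloud machines and adds GPU-aware placement, prioritization, retries, and quota handling - but the shape of the problem is the same: submit every experience in the suite, then gather results as they complete.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_experience(experience_name: str) -> dict:
    """Placeholder for running one virtual experience (replay or simulation)
    against the AI under test and returning its results."""
    ...  # launch the test job, wait for it, collect metrics
    return {"experience": experience_name, "passed": True}

def run_suite(experience_names, max_workers=8):
    """Fan a test suite out across workers and gather results as they finish."""
    results = []
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_experience, name): name for name in experience_names}
        for future in as_completed(futures):
            results.append(future.result())
    return results
```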

ReSim’s scheduler is designed to manage this parallel execution. It doesn’t matter whether the test suite contains ten virtual experiences or ten million. ReSim’s scheduler deals with them just the same.

ReSim Reporting Page
Automating runs of Test Suites allows teams to track performance and coverage over time as they update their AI software.

Rapid Insights

Extracting rapid insights from unit tests on Software 1.0 is straightforward, with good CI tools making it easy for the developer to find and read test output error messages to understand exactly why the test failed.

Virtual experiences are more complicated because analyzing the planned behavior of an AI in a virtual experience can be subtle. 

In the example of a drone landing to deliver a package, the developer cares not only about the success of the delivery, but how it was accomplished. Did the drone fly safely? Did it detect and avoid people? How efficient was the delivery?

ReSim’s customers consistently tell us they want more than just a simple pass/fail result from their virtual experiences. They want performance metrics and the ability to drill down and visualize the AI’s behavior during each experience.

When a Test Suite is run, the AI developer needs to be able to navigate to a results dashboard that shows them metrics, charts, and visualizations, but the required metrics vary widely depending on the particular application or test. There is no general set of metrics that applies to all Embodied AI applications.

ReSim handles this by making the tooling flexible and customizable, allowing users to write their own code to define metrics, charts, and visualizations. ReSim also supports common open-source data analysis and charting libraries and integrates with popular embodied AI visualizers like Foxglove, making it flexible enough to host almost any user-generated content on the results dashboards.
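As an illustration, a user-defined metric for the drone-landing example might look something like the Python below. The data shapes and metric names are hypothetical, not ReSim’s API; the point is that metrics are ordinary code operating on the logged output of a virtual experience.

```python
import math

# Toy logged output from one virtual landing experience (hypothetical shape).
trajectory = [
    {"position": (0.0, 0.0, 30.0)},
    {"position": (9.0, 3.5, 2.0)},
    {"position": (10.1, 4.2, 0.0)},
]
detected_people = [(14.0, 6.0, 0.0)]

def landing_error_m(traj, pad_position):
    """Distance in meters between the final planned position and the landing pad."""
    return math.dist(traj[-1]["position"], pad_position)

def min_person_clearance_m(traj, people):
    """Closest approach in meters between the drone and any detected person."""
    return min(
        math.dist(state["position"], person)
        for state in traj
        for person in people
    )

# Metrics like these can be rolled up into pass/fail thresholds, charts,
# and dashboard visualizations for each experience in a test suite.
metrics = {
    "landing_error_m": landing_error_m(trajectory, pad_position=(10.0, 4.0, 0.0)),
    "min_person_clearance_m": min_person_clearance_m(trajectory, detected_people),
}
```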

Rapid insights can be achieved with tooling that helps the user arrange their own content in ways that make comparisons easy. In the A/B testing page example below, ReSim enables users to compare two test runs over the same virtual experiences by viewing their visualizations and charts side-by-side or overlaid, helping them quickly identify differences in their metrics between the two runs.

ReSim Reporting Page
Comparing the performance of two versions of AI software on the same virtual experience, with A/B testing.
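Under the hood, much of an A/B comparison reduces to computing per-metric deltas between two runs of the same virtual experience. Here is a minimal sketch, reusing the hypothetical metric names from the example above.

```python
def compare_runs(metrics_a: dict, metrics_b: dict) -> dict:
    """Per-metric delta (run B minus run A) for metrics present in both runs
    of the same virtual experience."""
    return {k: metrics_b[k] - metrics_a[k] for k in metrics_a.keys() & metrics_b.keys()}

run_a = {"landing_error_m": 0.42, "min_person_clearance_m": 5.1}  # baseline AI version
run_b = {"landing_error_m": 0.18, "min_person_clearance_m": 6.3}  # candidate AI version

deltas = compare_runs(run_a, run_b)
# A negative landing-error delta means B lands more accurately than A;
# a positive clearance delta means B keeps more distance from people.
```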

Develop Embodied AI Faster

Great CI tooling has helped Software 1.0 developers achieve high coverage, automate their testing and get rapid insights into the performance of their code. With the new wave of embodied AI tooling being developed at ReSim and elsewhere, embodied AI developers can enjoy these same benefits. 

This helps AI developers move much faster and ensure their embodied AI’s behavior is reliable, safe, and capable. Ultimately this unlocks robotics and embodied AI products, enabling them to move quickly from cool prototypes with impressive YouTube videos to production-quality systems that can be shipped to paying customers. In other words, it unlocks the amazing benefits of embodied AI for all of us.
