Software 1.0 testing is insufficient for ensuring embodied AI is reliable and safe. The time for Software 2.0 testing is now.
The Case for More Reliable AI
How reliable do we need our AI to be? It’s a very reasonable question, and the answer - naturally - depends on the application. Are we asking the AI to help us compose an email, file our tax return, or drive us to school?
AI assistants - like ChatGPT - that leverage Large Language Models (LLMs) have achieved extraordinary utility with imperfect reliability. If the model occasionally returns low-quality output, the responsible human can discard it, refine their prompt, and try again. In the worst case, they can always write the content themselves.
Embodied AI is a whole different story. Autonomous vehicles, drones, and humanoid robots all have to make decisions autonomously in a world shared with humans. Here, even the occasional low-quality model output can have significant downsides.
The reliability threshold that autonomous, embodied AI must clear before it is useful is much, much higher.
The capability of AI has raced ahead, but the reliability of AI is playing catch-up. This is a very exciting moment, because the potential of more capable and reliable AI is a dizzying prospect. When AI is reliable enough to be embodied and autonomous, industries can achieve greater optimization. We can build things where they’re needed, not where labor is cheapest. Emerging economies can be accelerated. Humans - working alongside their AI colleagues - can be protected from dangerous tasks.
The good news is that we humans have been building transformative technologies and making them reliable for some time. We know how to do this. Whether it’s a Saturn V rocket, a new drug, or safety-critical software, the answer is always the same: test the hell out of it.
The bad news is that every new technology challenges the capabilities of the other technologies that we use to test it. Like repeated battles between a virus and antibodies, technologies evolve, and the technologies we use to test them must race to catch up.
The Challenge: How Software 1.0 Testing Falls Short
AI is - fundamentally - software, and we have developed very mature working practices and very capable technologies for testing software. So, let’s start there. Here’s a very brief (and simplified) overview of good software testing practices that have worked well for years, but which we must re-evaluate in the age of embodied AI.
Step 1 - Coverage
First, the good software developer breaks their code up into small logical units. Then - for each logical unit - they create at least one “unit test.” The unit test is simply more code that exercises the unit by applying example inputs (with particular care for edge cases) and asserting the expected outputs. If the expected outputs are observed, the test passes.
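To make this concrete, here is a minimal sketch of a unit test in Python. The `clamp` function and its test values are hypothetical, chosen only to illustrate the pattern of example inputs and asserted outputs:

```python
import unittest

def clamp(value, low, high):
    """Constrain a value to the range [low, high]."""
    return max(low, min(value, high))

class TestClamp(unittest.TestCase):
    def test_value_within_range_is_unchanged(self):
        self.assertEqual(clamp(5, 0, 10), 5)

    def test_value_above_range_is_capped(self):
        self.assertEqual(clamp(42, 0, 10), 10)

    def test_edge_case_value_on_the_boundary(self):
        # Edge case: input exactly at the upper bound.
        self.assertEqual(clamp(10, 0, 10), 10)

if __name__ == "__main__":
    unittest.main()
```

If every assertion holds, the test passes; if any expected output is not observed, the test fails and points the developer at the offending unit.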
Step 2 - Automation
Developers change their code very frequently. In a medium-sized software company, hundreds of changes get merged into the code every day. Companies use Continuous Integration (CI) technology to automatically run the unit tests against every proposed change - not just the tests covering that change, but all the unit tests in the codebase that might be affected by it. Thus, companies run many thousands - even millions - of unit tests a day.
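As a rough illustration of the idea - and only an illustration, since real CI systems derive test selection from dependency analysis rather than a hand-written table - a CI step might select and run the affected tests something like this. The file paths and mapping below are assumptions:

```python
import subprocess
import sys

# Hypothetical mapping from changed source files to the test modules that
# exercise them; a real CI system would compute this automatically.
AFFECTED_TESTS = {
    "robot/planner.py": ["tests/test_planner.py"],
    "robot/controller.py": ["tests/test_controller.py", "tests/test_safety.py"],
}

def tests_for_change(changed_files):
    """Collect every test module that may be affected by the change."""
    selected = set()
    for path in changed_files:
        selected.update(AFFECTED_TESTS.get(path, []))
    return sorted(selected)

if __name__ == "__main__":
    changed = sys.argv[1:]  # e.g. the list of files touched by a proposed change
    tests = tests_for_change(changed)
    if tests:
        # Run the selected tests with pytest and propagate the exit code,
        # so the CI system can block the change if anything fails.
        result = subprocess.run(["pytest", *tests])
        sys.exit(result.returncode)
```

Multiply this by hundreds of changes a day and you arrive at the thousands or millions of test executions mentioned above.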
Step 3 - Rapid Insights
When tests fail, the developer wants to understand why. Good CI technology guides the developer to a list of tests, clearly highlighting which ones failed, and then lets them drill down into the detailed test outputs to quickly understand what went wrong.
A Breaking Change
Implemented well, software testing works - it was almost considered a solved problem - until neural networks, where this approach breaks down. The problem has been glossed over for a long time, and it is only getting worse as those networks grow larger and swallow up more of the logic within applications.
Andrej Karpathy coined the phrase Software 2.0 to describe the modern AI stack. I think this is a perfect phrase because it captures how fundamentally the preeminent deep neural net architecture - the Transformer - breaks the current paradigms of software, including how we test it.
The Transformer breaks software testing at the very first step: coverage. Recall: “First, the good software developer breaks their code up into small logical units.”
The logic under test in a Transformer cannot be broken down this way. All the tortuous computations and branching decisions that are laid out explicitly in the code of Software 1.0 are now opaquely compressed into millions (sometimes billions) of neural network parameters. They’re not code, they’re data. A huge matrix. An endless field of numbers scrolling by that evokes - well - The Matrix.
Reimagining Testing for Software 2.0
Unit tests have a nice symmetry: you send code to test code. From the earliest days of neural nets it was clear that a similar symmetry exists. Nets replace code with data, and we send data to test them. Back when nets were used for constrained tasks like object detection and classification in computer vision applications, we would have humans label datasets, then divide them up, using some to train and some to test. The test: can the net infer the label given the data?
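As a minimal sketch of that classic workflow (the `model` object and labeled examples below are placeholders, not a particular framework), the test amounts to checking how often the net recovers the held-out labels:

```python
def evaluate_accuracy(model, test_examples):
    """Fraction of held-out examples for which the net infers the right label.

    `model` is any object with a `predict(inputs)` method and `test_examples`
    is a list of (inputs, label) pairs - both are placeholders standing in for
    a real network and a human-labeled dataset.
    """
    if not test_examples:
        return 0.0
    correct = 0
    for inputs, label in test_examples:
        if model.predict(inputs) == label:
            correct += 1
    return correct / len(test_examples)
```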
This approach worked well, but now we are asking AIs to do complex tasks that unfold over time. We seek to validate their behavior.
Consider the example of a humanoid robot placing a dish in a dishwasher. What is the label here? We could ask a patient human to write out the full sequence of torques to apply to the robot’s motors to complete the task, but there is no single “right” way to accomplish it - there are countless variations that would be acceptable. We could look only at the end state: is the dish in the dishwasher? But that could miss risks along the way, such as the robot knocking over a vase.
The approach to testing the behavior of embodied AI is to feed in sequences of example input data and observe the planned output behavior. Labeled data can certainly be useful here, but so can metrics of performance.
Consider the dishwasher example again. The engineers care that the task is completed successfully, but they also care whether it could be completed more quickly. They care about the movements of the robot and how those might affect humans nearby. AI engineers need detailed metrics, visualizations, and charts to help them understand the outcome of tests, as opposed to the binary pass/fail of a unit test.
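To make the contrast with binary pass/fail concrete, here is a rough sketch of the kind of report a behavioral test might produce. The metric names, fields, and example values are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class BehaviorReport:
    """Metrics gathered from one simulated run of the dishwasher task.

    The fields below are illustrative assumptions - real teams will define
    their own metrics of performance for the behaviors they care about.
    """
    task_completed: bool            # did the dish end up in the dishwasher?
    time_to_complete_s: float       # how quickly was the task finished?
    objects_disturbed: int          # e.g. a knocked-over vase along the way
    min_distance_to_human_m: float  # closest approach to a nearby person

    def summary(self) -> dict:
        # Rather than a single pass/fail, surface every metric so engineers
        # can drill into what happened and compare runs over time.
        return {
            "task_completed": self.task_completed,
            "time_to_complete_s": self.time_to_complete_s,
            "objects_disturbed": self.objects_disturbed,
            "min_distance_to_human_m": self.min_distance_to_human_m,
        }

# Example: a run that "passes" the end-state check but still raises concerns.
report = BehaviorReport(
    task_completed=True,
    time_to_complete_s=42.5,
    objects_disturbed=1,          # the vase was knocked over
    min_distance_to_human_m=0.3,  # uncomfortably close to a bystander
)
print(report.summary())
```

A report like this would feed the metrics, visualizations, and charts described above, rather than reducing the run to a single green or red mark.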
If embodied AI is going to seamlessly integrate into our world and improve our lives, we need a better approach to testing - one that looks very different from Software 1.0’s unit testing. It’s a much more end-to-end style of testing that treats the system under test as an unknown quantity whose behavior needs to be explored empirically, which makes metrics of performance vital for deploying Software 2.0.
In spite of these differences, the core tenets of Software 1.0 testing remain extremely relevant. Testing for embodied AI should also aspire to have Coverage, Automation and Rapid Insights. AI software developers should be running many thousands of these tests per day - just like their Software 1.0 counterparts.
The implementation of these tenets, however, will look very different. We need to radically rethink the tooling and infrastructure that are required to support embodied AI testing workflows.
In the next post in this series, we will begin to explore some of the ways that Coverage, Automation and Rapid Insights are being fundamentally rebuilt for the world of Software 2.0 and embodied AI.