
Predicting Future States in AI Robotics: GR00T Model Control
The crux of modern robot control algorithms is predicting the future across a horizon. To some extent, all control algorithms predict the future: they assume some model of the system being controlled, use that model to predict how the system responds to control inputs from its current state, and use those predictions to optimize future inputs so the system best reaches some goal. Predicting across a horizon (rather than just the next step) lets the controller anticipate future disturbances and be robust to outliers and measurement errors.
Introduction to Predictive Robot Control
Let’s consider a classic model predictive walking controller based on quadratic programming:
- The robot’s state x consists of the position, speed, and acceleration of the body and legs
- The control inputs u are the forces the robot’s feet apply to the ground
- The goal is to keep the position of the robot’s center of mass on a horizontal line
The model predictive controller uses a linearized model (“everything is a straight line if you look closely enough”) of the robot’s dynamics in order to keep the computations tractable:
dx/dt = Ax + Bu
Where A and B do not depend on x.
If we discretize this equation into timesteps x0, x1, …, xK (where K is the horizon length):
x0 = x0
x1 = x0 + A0x0 + B0u0
x2 = x1 + A1x1 + B1u1 = x0 + A0x0 + B0u0 + A1(x0 + A0x0 + B0u0) + B1u1
…
If we stack the x into one big vector X = [x0 x1 x2 …] and likewise stack all of the u into one big vector U, we see that:
X = Aqx0 + BqU
In other words, the future state of the robot can be predicted from the current measured state and the future control inputs. Our goal is to find U such that ||X - Xref|| is minimized, subject to constraints on the elements of U (e.g. the legs can only apply finite forces to the ground) and given the current measured state x0. This is a straightforward quadratic programming problem, which can be solved using one of several numerical algorithms. Solving it yields U, the vector of the next K predicted actions the robot should take. We then execute u0 (the first of the predicted actions), and repeat the whole process with a newly measured x0.
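To make this concrete, here is a minimal sketch of the stacked prediction and the constrained solve for a toy double-integrator stand-in. The function names (`build_prediction_matrices`, `solve_mpc_step`), the dimensions, and the use of SciPy's `lsq_linear` are our choices for illustration only; a real walking controller has a much richer state and uses a production-grade QP solver.

```python
import numpy as np
from scipy.optimize import lsq_linear

# Toy double integrator standing in for the linearized walking dynamics:
# state x = [position, velocity], input u = force. The discrete-time A and B
# below already absorb the timestep (they play the role of I + A*dt and B*dt).
dt = 0.05
A = np.array([[1.0, dt],
              [0.0, 1.0]])
B = np.array([[0.0],
              [dt]])
nx, nu = A.shape[0], B.shape[1]
K = 20          # horizon length
u_max = 5.0     # the "legs" can only apply finite force

def build_prediction_matrices(A, B, K):
    """Stack the dynamics so that [x1; ...; xK] = Aq @ x0 + Bq @ U."""
    Aq = np.vstack([np.linalg.matrix_power(A, k + 1) for k in range(K)])
    Bq = np.zeros((K * nx, K * nu))
    for row in range(K):
        for col in range(row + 1):
            Bq[row * nx:(row + 1) * nx, col * nu:(col + 1) * nu] = (
                np.linalg.matrix_power(A, row - col) @ B)
    return Aq, Bq

def solve_mpc_step(x0, x_ref):
    """min ||Aq x0 + Bq U - Xref||^2 subject to |u| <= u_max; return u0."""
    Aq, Bq = build_prediction_matrices(A, B, K)
    X_ref = np.tile(x_ref, K)                      # track a constant reference
    sol = lsq_linear(Bq, X_ref - Aq @ x0, bounds=(-u_max, u_max))
    return sol.x.reshape(K, nu)[0]

# Receding horizon: measure, solve, execute only the first input, repeat.
x = np.array([1.0, 0.0])                           # start 1 m away from the target
for _ in range(100):
    u0 = solve_mpc_step(x, x_ref=np.zeros(2))
    x = A @ x + B @ u0                             # here the "plant" is the same model
print("final state:", x)
```

Only the first input of each solved horizon is ever applied; the rest of the horizon exists purely to make that first input well-informed.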
If the math was a bit confusing, that’s fine. The important takeaway is that we:
1. Wrote down a forward model of the robot, which predicts future states from control inputs and historical states;
2. Solved the forward model, inverting it to compute the control inputs from desired future states and known historical states; and
3. Executed the first of the predicted control inputs.
Using Machine Learning for Robust Walking
The reason we had to use a linear model in step 1 is that the explicit solution for anything other than a linear model becomes computationally intractable. Even for the linear model, the resulting constrained optimization problem requires a lengthy iterative numerical algorithm to solve, and the original implementation on the MIT Cheetah 3 required carefully optimized kernels to run in real time.
What if we combine steps (1) and (2)? Instead of writing down a forward model and explicitly solving it, we can create a model that takes measured and desired states as input and generates the control inputs directly. Now the process becomes:
1. Measure the current state of the robot.
2. Feed the historical measured states and the desired future states into the prediction model to generate future control inputs.
3. Execute the first predicted control input.
The model in step 2 looks a lot like something we could use machine learning for! In fact, we can train it entirely on synthetic data: we can use a physics simulator to model millions of different control inputs to the robot, look at how the robot interacts with its environment, and invert the inputs and outputs to learn to predict the control inputs. We can also vary the terrain, simulate external forces, and model hardware faults. Because the model learns across a horizon, it implicitly discovers not just how to generate control inputs, but how to deduce from historical measurements that something has gone wrong and correct its own course before the robot falls over.
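As a toy illustration of this idea, the sketch below rolls out random control inputs on the same double-integrator stand-in used earlier, then trains a small network to map (current state, resulting future states) back to the controls that produced them; at run time the desired future states take the place of the simulated ones. The dynamics, shapes, and names here are placeholders, not a real training recipe.

```python
import numpy as np
import torch
import torch.nn as nn

# Toy stand-in dynamics; in practice this would be a full physics simulator
# with randomized terrain, external pushes, and simulated hardware faults.
dt, H = 0.05, 8
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])

def rollout(x0, U):
    """Simulate H steps of controls and return the visited states."""
    xs, x = [], x0
    for u in U:
        x = A @ x + B @ u
        xs.append(x)
    return np.stack(xs)

def make_dataset(n=20000, rng=np.random.default_rng(0)):
    """Invert the simulator: resulting states become inputs, controls become labels."""
    x0 = rng.normal(size=(n, 2))
    U = rng.uniform(-5.0, 5.0, size=(n, H, 1))
    X_future = np.stack([rollout(x0[i], U[i]) for i in range(n)])
    feats = np.concatenate([x0, X_future.reshape(n, -1)], axis=1)
    return (torch.tensor(feats, dtype=torch.float32),
            torch.tensor(U.reshape(n, -1), dtype=torch.float32))

features, targets = make_dataset()
policy = nn.Sequential(nn.Linear(features.shape[1], 128), nn.ReLU(),
                       nn.Linear(128, 128), nn.ReLU(),
                       nn.Linear(128, targets.shape[1]))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for epoch in range(50):                          # full-batch training, toy scale
    loss = nn.functional.mse_loss(policy(features), targets)
    opt.zero_grad(); loss.backward(); opt.step()

# At run time: measured state + *desired* future states in, controls out;
# only the first control would actually be executed.
x0 = torch.tensor([1.0, 0.0])
desired = torch.zeros(H * 2)                     # "come to rest at the origin"
u_next = policy(torch.cat([x0, desired]))[:1]
```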
Applying Prediction to Real-World Tasks
Having figured out walking, we set our sights on something far more useful: making ham sandwiches. Going by our plan, we need to:
1. Measure the current state of the robot
This seems easy. We can read out the robot pose from its internal sensors, and capture the environment with cameras.
2. Use the measured states and desired states to generate control inputs
Hmm, things are getting a bit complicated. Let’s break this down into a few steps:
2a: Generate desired states
We’ll use AI to generate the future states: first, we will operate the robot remotely to make millions of ham sandwiches in all sorts of different kitchens with all manner of hams and breads, carefully collecting the video feed, pose data, and control inputs. Then, we’ll train a diffusion model to generate sequences of desired future video frames and robot poses, conditioned on past video and pose data.
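Mechanically, this generation step is a conditional diffusion sampler. The sketch below shows DDPM-style ancestral sampling with an untrained toy denoiser; the observation dimensions are placeholders (a real model works on video latents and full pose trajectories), and the conditioning vector would come from an encoder over the past video and poses rather than random numbers.

```python
import torch
import torch.nn as nn

# Placeholder sizes: H future steps of a small observation vector (frame
# embedding + pose), a conditioning vector for the past, T diffusion steps.
H, D_OBS, D_COND, T = 10, 32, 64, 50

denoiser = nn.Sequential(                  # untrained eps-prediction network
    nn.Linear(H * D_OBS + D_COND + 1, 256), nn.ReLU(),
    nn.Linear(256, H * D_OBS))

betas = torch.linspace(1e-4, 0.02, T)      # standard DDPM noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_future(cond):
    """Ancestral sampling of H future observation vectors given `cond`,
    an encoding of the past video frames and poses."""
    x = torch.randn(H * D_OBS)             # start from pure noise
    for t in reversed(range(T)):
        t_emb = torch.tensor([t / T])
        eps = denoiser(torch.cat([x, cond, t_emb]))
        coef = (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else mean
    return x.reshape(H, D_OBS)             # the "desired future states"

past_encoding = torch.randn(D_COND)        # placeholder for encoded past video/pose
future = sample_future(past_encoding)
```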
2b: Generate control inputs
Our state prediction model generates, say, the next five seconds of video and poses. Our goal is to make the robot move in a way such that the sensor and camera readings match the prediction. To do this, we’ll take the generated data and send it through a learned decoder model, which outputs a vector of control inputs given a time sequence of video frames and sensor samples.
To train the decoder model, we use the same data we used to train the diffusion model, this time using the video and pose data to predict the control inputs.
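Here is a rough sketch of that decoder training, with placeholder tensors standing in for the teleoperation logs; for brevity the decoder here maps each observation step to one control vector, whereas the text describes a model conditioned on the whole time sequence.

```python
import torch
import torch.nn as nn

# Placeholder shapes and tensors standing in for the teleoperation logs:
# each example is H steps of (frame embedding + pose) paired with the H
# control vectors the human operator actually issued.
N, H, D_FRAME, D_POSE, D_CTRL = 4096, 10, 32, 12, 8
obs = torch.randn(N, H, D_FRAME + D_POSE)       # logged video + pose (stand-in)
ctrl = torch.randn(N, H, D_CTRL)                # logged control inputs (stand-in)

decoder = nn.Sequential(                        # per-step decoder for simplicity
    nn.Linear(D_FRAME + D_POSE, 256), nn.ReLU(),
    nn.Linear(256, D_CTRL))

opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
for step in range(200):
    pred = decoder(obs)                         # applied to every timestep
    loss = nn.functional.mse_loss(pred, ctrl)   # regress the operator's controls
    opt.zero_grad(); loss.backward(); opt.step()

# At run time the decoder consumes the diffusion model's generated future
# observations; only the controls for the first predicted step get executed.
generated = torch.randn(H, D_FRAME + D_POSE)    # would come from step 2a
u_next = decoder(generated)[0]
```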
3. Execute the first predicted control input
We’ll take the control inputs corresponding to the first frame of the predicted video, and execute them.
Challenges in Video Generation for Robotics
Collecting the data for step 2a is not going to happen. The ham alone would be a million-dollar project, let alone the bread, cheese, and labor.
Generation speed is also an issue. Five seconds of video take about an hour to generate, so our sandwich will not be complete for about three days, by which point we’ll have either starved or lost interest.
We need some algorithmic improvements, and for that, we turn to the video representations from the last article in this series. The crux is in step 2a: instead of predicting video frames, we predict the representations. These representations are much smaller and have a much narrower data distribution, so the prediction happens much faster.
Predicting representations does much more than just make the computations faster. Remember, the video representations we developed last time (latent actions) are more than just compressed videos: the training process was carefully designed so that the entries in the vectors correspond to real-life motions. This means they capture the recurring patterns in videos, like ‘ham’, ‘arm’, ‘grasp’, and ‘pick up’, while being agnostic to the exact appearance of objects. By abstracting away exact visual details, we can train our sandwich model on millions of sandwich-making videos from different robots. In fact, we can train the model on videos of people making sandwiches - after all, people have arms, grasp objects, and pick them up, just like our robots do.
This is what lets us scale. Rather than building thousands of robots, buying thousands of tons of ham, and finding thousands of kitchens to put them in, we can collect heterogeneous data from different sandwich makers across the entire world, giving the model the data quantity and diversity it needs during pretraining to operate robustly in the real world.
The action decoder does not change. It still requires training on robot-specific data, but it needs much less data since the distribution of data for a particular hardware embodiment is much narrower.
System 1 and System 2 in GR00T Models
Nvidia is a great technology company with a lousy marketing department, and unfortunately, the marketing guys got their hands into the GR00T paper. System 2 is just a feature extraction model. That’s it - nothing more, nothing less. It isn’t even a new model: it uses the hidden state of a vision-language model as the features.
We mentioned earlier that the diffusion model predicts videos based on previously seen video and pose. When shifting to predicting latent actions instead of videos, there is a little bit of subtlety: we now predict latent actions (which look like the difference between video frames), but we don’t really want to condition on past actions; when we make sandwiches, the current position of the ham matters much more than how the ham got to where it is now.
The solution is to condition the model on a different video representation. In this case, we just take each frame of the video and generate a vector that looks like a description of the image content. This vector is really easy to generate - it’s just the raw numeric output of a visual question answering model before it gets decoded into text - and while it does have some limitations (most notably, it is not motion-aware), it works well enough for the use cases in the paper.
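The sketch below shows where that conditioning vector comes from, using small untrained stand-ins for the two halves of a vision-language model. The pooling, module choices, and dimensions are our own simplifications; the only point being illustrated is that the feature is the model's hidden state, taken before any decoding into text.

```python
import torch
import torch.nn as nn

# Stand-in modules: `vision_encoder` and `language_model` play the role of
# the VQA / vision-language model's two halves. In a real system these would
# be a pretrained VLM, and the feature would be its internal hidden state.
D_MODEL = 512
vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, D_MODEL))
language_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True),
    num_layers=2)

@torch.no_grad()
def frame_to_feature(frame, prompt_tokens):
    """Return a 'description-like' vector for one video frame: the pooled
    hidden state of the (stand-in) VLM, never decoded into actual text."""
    img_token = vision_encoder(frame.unsqueeze(0)).unsqueeze(1)   # (1, 1, D)
    tokens = torch.cat([img_token, prompt_tokens], dim=1)          # image + text tokens
    hidden = language_model(tokens)                                # (1, L, D)
    return hidden.mean(dim=1).squeeze(0)                           # pooled feature

frame = torch.rand(3, 64, 64)                    # one (downscaled) video frame
prompt = torch.randn(1, 8, D_MODEL)              # embedded task prompt tokens
cond_vector = frame_to_feature(frame, prompt)    # conditioning for System 1
```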
System 1 is the diffusion model described above, which generates the latent actions.
The Complete GR00T Prediction Pipeline
Assembling all the pieces, we get GR00T. The GR00T model pipeline essentially (1) generates text-like descriptions of what the robot has seen, (2) uses those descriptions to predict what the robot should see in the future, and (3) decodes those predictions into motor actions the robot should take. With this high-level concept in mind, it’s pretty easy to see that (1) should be a transformer model that captions images, (2) should be a diffusion model that generates video, and (3) should be a transformer that turns videos into actions. In practice, thanks to (2) predicting latent actions instead of videos, (3) is much simpler: the latent actions are close enough to motor actions that the decoder can be a tiny MLP.
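Putting the three stages together, here is a compact, untrained sketch of the inference loop. Every module name and dimension is a stand-in chosen only to show how the pieces connect, not the actual GR00T architecture.

```python
import torch
import torch.nn as nn

# Toy, untrained stand-ins for the three stages described above.
D_FEAT, D_LATENT, D_MOTOR, H, T = 256, 32, 16, 8, 30

vlm_features = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, D_FEAT))   # stage 1
denoiser = nn.Sequential(nn.Linear(H * D_LATENT + D_FEAT + 1, 512), nn.ReLU(),
                         nn.Linear(512, H * D_LATENT))                        # stage 2
action_head = nn.Sequential(nn.Linear(D_LATENT, 64), nn.ReLU(),
                            nn.Linear(64, D_MOTOR))                           # stage 3: tiny MLP

betas = torch.linspace(1e-4, 0.02, T)
alphas, alpha_bars = 1 - betas, torch.cumprod(1 - betas, dim=0)

@torch.no_grad()
def act(frame):
    # (1) describe what the robot currently sees
    cond = vlm_features(frame.unsqueeze(0)).squeeze(0)
    # (2) diffuse a horizon of H latent actions conditioned on that description
    z = torch.randn(H * D_LATENT)
    for t in reversed(range(T)):
        eps = denoiser(torch.cat([z, cond, torch.tensor([t / T])]))
        mean = (z - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        z = mean + betas[t].sqrt() * torch.randn_like(z) if t > 0 else mean
    latents = z.reshape(H, D_LATENT)
    # (3) decode latents into motor commands, execute only the first
    return action_head(latents[0])

motor_command = act(torch.rand(3, 64, 64))
```

As in the walking controller, only the first decoded action is executed before the whole loop runs again on fresh observations.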
Hopefully you understand a bit more about the architectural inspiration behind GR00T and other modern robotics models now! In the next and final part of the series, we’ll explore what it takes to bring robotics foundation models out of the lab and into the real world.