
The Gr00t Article Series Part 3: How to Predict the Future
The crux of modern robot control algorithms is predicting the future across a horizon. To some extent, all control algorithms predict the future: they assume some model of the system being controlled, use that model to predict how the system responds to control inputs based on its current state, and use those predictions to optimize future inputs to the system in order to best reach some goal. Predicting across a horizon (rather than just the next step) allows the controller to anticipate future disturbances and be robust to outliers or measurement errors.
How to Walk
Let’s consider a classic model predictive walking controller based on quadratic programming:
- The robot’s state x consists of the position, speed, and acceleration of the body and legs
- The control inputs u are the forces the robot’s feet apply to the ground
- The goal is to keep the position of the robot’s center of mass on a horizontal line
The model predictive controller uses a linearized model (“everything is a straight line if you look closely enough”) of the robot’s dynamics in order to keep the computations tractable:
dx/dt = Ax + Bu
where A and B do not depend on x.
If we discretize this equation into timesteps x0, x1, …xK (where K is the horizon length):
x0 = x0
x1 = x0 + A0x0 + B0u0
x2 = x1 + A1x1 + B1u1 = x0 + A0x0 + B0u0 + A1(x0 + A0x0 + B0u0) + B1u1
…
If we stack the x into one big vector X = [x0 x1 x2 …] and likewise stack all of the u into one big vector U, we see that:
X = Aqx0 + BqU
In other words, the future state of the robot can be predicted from the current measured state and the future control inputs. Our goal is to find U such that ||X - Xref|| is minimized, subject to constraints on the elements of U (e.g. the legs can only apply finite forces to the ground) and given the current measured state x0. This is a straightforward quadratic programming problem, which can be solved using one of several numerical algorithms. This yields U, the vector of the next K predicted actions the robot should take. We then execute u0 (the first of the future predicted actions), and repeat the whole process above with a new x0.
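To make this concrete, here is a minimal sketch of the condensed problem using NumPy and SciPy, with a toy point-mass model standing in for the legged robot. The dimensions, force bounds, and dynamics are illustrative assumptions (not values from the Cheetah 3 controller), and the discrete-time form x_{k+1} = Ad x_k + Bd u_k folds the identity term from the update above into Ad.

```python
import numpy as np
from scipy.optimize import lsq_linear

dt = 0.05            # timestep [s]
K = 20               # horizon length
nx, nu = 2, 1        # toy state: [position, velocity]; input: force

# Discrete-time linear dynamics: x_{k+1} = Ad @ x_k + Bd @ u_k
Ad = np.array([[1.0, dt],
               [0.0, 1.0]])
Bd = np.array([[0.0],
               [dt]])

# Build the stacked matrices so that X = Aq @ x0 + Bq @ U,
# where X stacks the predicted future states x_1 ... x_K.
Aq = np.zeros((K * nx, nx))
Bq = np.zeros((K * nx, K * nu))
for k in range(K):
    Aq[k * nx:(k + 1) * nx, :] = np.linalg.matrix_power(Ad, k + 1)
    for j in range(k + 1):
        Bq[k * nx:(k + 1) * nx, j * nu:(j + 1) * nu] = (
            np.linalg.matrix_power(Ad, k - j) @ Bd
        )

x0 = np.array([0.0, 0.0])                       # current measured state
Xref = np.tile([1.0, 0.0], K)                   # hold position 1.0 at zero velocity

# Solve min ||Aq x0 + Bq U - Xref||^2 subject to |u_k| <= 5 (a box-constrained QP).
result = lsq_linear(Bq, Xref - Aq @ x0, bounds=(-5.0, 5.0))
U = result.x.reshape(K, nu)
print("first control input to execute:", U[0])  # execute u0, then re-measure and repeat
```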
If the math was a bit confusing, that’s fine. The important takeaway is we:
1. Wrote down a forward model of the robot, which predicts future states from control inputs and historical states;
2. Solved the forward model, inverting it to compute the control inputs from desired future states and known historical states; and
3. Executed the first of the predicted control inputs.
Learning to Walk
We had to use a linear model in Step 1 because the explicit solution becomes computationally intractable for anything other than a linear model. Even for the linear model, the resulting constrained optimization problem required a lengthy iterative numerical algorithm to solve, and the original implementation on the MIT Cheetah 3 required carefully optimized kernels to run in real time.
What if we combine (1) and (2)? Instead of writing down a forward model and explicitly finding the solution, we can create a model which takes as input measured and desired states and generates the control inputs. Now the process becomes:
- Measure the current state of the robot.
- Feed the historical measured states and the desired future states into the prediction model to generate future control inputs.
- Execute the first predicted control input.
The model in step 2 looks a lot like something we could use machine learning for! In fact, we can train the model entirely on synthetic data: we can use a physics simulator to model millions of different control inputs to the robot, look at how the robot interacts with its environment, and invert the inputs and outputs to learn to predict the control inputs. We can also vary the terrain, simulate external forces, and model hardware faults. Because the model learns across a horizon, it implicitly discovers not just how to generate control inputs, but how to deduce that something has gone wrong based on historical measurements and correct its own course before the robot falls over.
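As a rough illustration (not the training recipe of any particular robot), the inversion might look like the sketch below: a stub "simulator" is run on random control inputs, and a small network is trained to map (historical states, desired future states) back to the control input that produced them. All shapes, layer sizes, and the simulator itself are made-up placeholders.

```python
import torch
import torch.nn as nn

H, K = 5, 10          # history length and horizon
nx, nu = 12, 4        # assumed state and control dimensions

def simulate_rollout():
    """Placeholder for a physics simulator rollout under random control inputs."""
    u = torch.randn(H + K, nu)                              # random inputs
    x = torch.cumsum(0.1 * torch.randn(H + K, nx), dim=0)   # fake resulting states
    return x, u

# "Invert" the simulator: the inputs that produced a trajectory become the labels.
inputs, labels = [], []
for _ in range(1000):
    x, u = simulate_rollout()
    history, future = x[:H].flatten(), x[H:].flatten()
    inputs.append(torch.cat([history, future]))
    labels.append(u[H])                                     # the next control input
X, Y = torch.stack(inputs), torch.stack(labels)

# Small network: (historical states, desired future states) -> control input.
policy = nn.Sequential(nn.Linear((H + K) * nx, 256), nn.ReLU(), nn.Linear(256, nu))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for step in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(policy(X), Y)
    loss.backward()
    opt.step()
```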
Learning to Make Ham Sandwiches
Having figured out walking, we set our sights on something far more useful: making ham sandwiches. Going by our plan, we need to:
1. Measure the current state of the robot
This seems easy. We can read out the robot pose from its internal sensors, and capture the environment with cameras.
2. Use the measured states and desired states to generate control inputs
Hmm, things are getting a bit complicated. Let’s break this down into a few steps:
2a: Generate desired states
We’ll use AI to generate the future states: first, we will operate the robot remotely to make millions of ham sandwiches in all sorts of different kitchens with all manner of hams and breads, carefully collecting the video feed, pose data, and control inputs. Then, we’ll train a diffusion model to generate sequences of desired future video frames and robot poses, conditioned on past video and pose data.
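Purely for illustration, a toy DDPM-style training step for such a generator might look like the sketch below. It predicts future poses only (no video frames), and the network, noise schedule, and dimensions are my own placeholder assumptions rather than anything from the GR00T paper.

```python
import torch
import torch.nn as nn

T = 100                          # number of diffusion steps
H, K, d = 8, 16, 32              # history length, prediction horizon, pose dimension
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

# Predicts the noise added to a future-pose sequence, conditioned on the history
# and the diffusion timestep (a real model would also condition on video).
denoiser = nn.Sequential(
    nn.Linear(K * d + H * d + 1, 256), nn.ReLU(),
    nn.Linear(256, K * d),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

def training_step(history, future):
    """history: (B, H*d) past poses; future: (B, K*d) poses the robot actually reached."""
    B = history.shape[0]
    t = torch.randint(0, T, (B,))
    noise = torch.randn_like(future)
    a = alphas_bar[t].unsqueeze(1)
    noisy_future = a.sqrt() * future + (1.0 - a).sqrt() * noise
    model_in = torch.cat([noisy_future, history, t.unsqueeze(1).float() / T], dim=1)
    loss = nn.functional.mse_loss(denoiser(model_in), noise)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# One training step on stand-in teleoperation data:
print(training_step(torch.randn(64, H * d), torch.randn(64, K * d)))
```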
2b: Generate control inputs
Our state prediction model generates, say, the next five seconds of video and poses. Our goal is to make the robot move such that the sensor and camera readings match the prediction. To do this, we’ll take the generated data and send it through a learned decoder model, which outputs a vector of control inputs given a time sequence of video frames and sensor samples.
To train the decoder model, we use the same data we used to train the diffusion model, this time using the video and pose data to predict the control inputs. A minimal sketch of that setup follows, with placeholder dimensions and stand-in tensors where the encoded video, poses, and logged control inputs would go.
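```python
import torch
import torch.nn as nn

K, d_obs, d_u = 16, 64, 8     # horizon, per-step encoded observation size, control size

# Decoder: a window of (encoded) video frames and poses -> one control input vector.
decoder = nn.Sequential(nn.Linear(K * d_obs, 512), nn.ReLU(), nn.Linear(512, d_u))
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)

# Training pairs come straight from the teleoperation logs: the recorded observation
# window is the input, the control command the operator actually sent is the label.
obs_windows = torch.randn(256, K * d_obs)   # stand-in for encoded video + pose windows
u_recorded = torch.randn(256, d_u)          # stand-in for logged control inputs

for step in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(obs_windows), u_recorded)
    loss.backward()
    opt.step()
```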
3. Execute the first predicted control input.
We’ll take the control input corresponding to the first frame of the predicted video and execute it.
Hang On, Something is Wrong!
The data collection in step 2a is not going to happen. The ham alone would be a million-dollar project, let alone the bread, cheese, and labor.
The generation step is also an issue. Five seconds of video take about an hour to generate, so our sandwich will not be complete for about three days, by which point we’ll have either starved or lost interest.
We need some algorithmic improvements, and for that, we turn to the video representations from the last article in this series. The crux is in step 2a: instead of predicting raw video frames, we predict their representations. These representations are much smaller and have a much narrower data distribution, so the prediction happens much faster.
Predicting representations does much more than just make the computations happen faster. Remember, the video representations we developed last time (latent actions) are more than just compressed videos: the training process was carefully designed so the entries in the vectors correspond to real-life motions. This means they capture the recurring patterns in videos, like ‘ham’, ‘arm’, ‘grasp’, and ‘pick up’, while being agnostic to the exact appearance of objects. By abstracting away exact visual details, we can train our sandwich-making model on millions of sandwich-making videos from different robots. In fact, we can train the model on videos of people making sandwiches - after all, people have arms, grasp objects, and pick them up, just like our robots do.
This is what lets us scale. Rather than building thousands of robots, buying thousands of tons of ham, and finding thousands of kitchens to put them in, we can collect heterogeneous data from different sandwich makers across the entire world, giving the model the data quantity and diversity it needs to see during pretraining to operate robustly in the real world.
The action decoder does not change. It still requires training on robot-specific data, but needs much less data to train since the distribution of data for a particular hardware embodiment is much narrower.
System 1 and System 2
Nvidia is a great technology company with a lousy marketing department, and unfortunately, the marketing guys got their hands on the GR00T paper. System 2 is just a feature extraction model. That’s it - nothing more, nothing less. It isn’t even a new model: it uses the hidden state vector from a vision-language model as the features.
We mentioned earlier that the diffusion model predicts videos based on previously seen video and pose. When we shift to predicting video representations instead of videos, a subtlety arises: we now predict latent actions (which look like the difference between video frames), but we don’t really want to condition on past actions; when we make sandwiches, the current position of the ham matters much more than how the ham got to where it is now.
The solution is to condition the model on a different video representation. In this case, we just take each frame of the video, and generate a vector which looks like a description of the content in the image. This vector is really easy to generate - it's just the raw numeric output of a visual question answering model before it gets decoded into text - and while it does have some limitations (most notably, it is not motion-aware) it works OK for the use cases in the paper.
System 1 is the diffusion model described above which generates the latent actions.
Putting it All Together
Assembling all the pieces, we get GR00T. The GR00T model pipeline essentially (1) generates text descriptions of what the robot has seen (2) uses those descriptions to predict what the robot should see in the future and (3) decodes those predictions into motor actions the robot should take. With this high level concept in mind, it’s pretty easy to see that (1) should be a transformer model that captions images, (2) should be a diffusion model that generates video, and (3) should be a transformer that transforms videos into actions. In practice, thanks to (2) predicting latent actions instead of videos, (3) is much simpler: the latent actions are close enough to motor actions that the decoder can be a tiny MLP model.
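Schematically, the control loop looks something like the sketch below. Every module is a stub, and all names and dimensions (vlm_features, latent_action_diffuser, action_mlp) are invented for illustration, not identifiers from the GR00T codebase.

```python
import torch
import torch.nn as nn

K = 16                                       # action horizon
d_feat, d_latent, d_motor = 768, 32, 24      # assumed dimensions

def vlm_features(frames: torch.Tensor) -> torch.Tensor:
    """(1) System 2: pooled hidden-state features from a VLM (stubbed with noise)."""
    return torch.randn(1, d_feat)

latent_action_diffuser = nn.Sequential(      # (2) System 1 stand-in: predicts a
    nn.Linear(d_feat, 256), nn.ReLU(),       # horizon of K latent actions
    nn.Linear(256, K * d_latent),
)
action_mlp = nn.Sequential(                  # (3) tiny decoder to motor actions
    nn.Linear(d_latent, 64), nn.ReLU(), nn.Linear(64, d_motor),
)

def control_step(frames: torch.Tensor) -> torch.Tensor:
    feats = vlm_features(frames)                                # describe what the robot sees
    latents = latent_action_diffuser(feats).view(K, d_latent)   # predict what happens next
    motor_actions = action_mlp(latents)                         # decode into motor commands
    return motor_actions[0]                                     # execute only the first one

u0 = control_step(torch.randn(4, 3, 224, 224))                  # four fake camera frames
```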
Hopefully you understand a bit more about the architectural inspiration behind GR00T and other modern robotics models now! In the next and final part of the series, we’ll explore what it takes to bring robotics foundation models out of the lab into the real world.