
The Gr00t Article Series Part 4: From GPT-2 to GPT-3
In the fourth and final part of this series, we explore ways to bring robotics foundation models from the GPT-2 era to the GPT-3 era - real models that regular users can interact with to do something compelling, with the robustness needed to ship in real applications. There are plenty of ways to make this happen, so we’ll go through a bunch of them in no particular order.
Synthetic Data
Nvidia’s favorite elephant in the room. It’s important to distinguish between two types of model:
- Kinematics models should be trained on synthetic data, and anyone who says otherwise should be promptly removed from your life. It is trivial to generate huge amounts of synthetic data for manipulation and locomotion - even with very accurate simulator models, GPU-accelerated physics simulation runs faster than real time. Even if it didn’t, it would still be orders of magnitude cheaper than burning robots and experimenter hours to collect real data. Fast simulators also allow for in-the-loop reinforcement learning (a rough sketch of the speed argument follows this list).
- Task models are much trickier. Under current training paradigms, action generation models are structurally very similar to video generation models, and their synthetic data is generated by taking images from real tasks and using video generation models to extend them into full task videos. Ergo, the synthetic data problem is much closer to distillation than it is to data generation.
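To make the speed argument in the first bullet concrete, here is a rough, CPU-only sketch using Gymnasium’s vectorized environments; GPU-accelerated simulators (Isaac Lab, MJX, Brax, and friends) push the same idea much further. The environment choice, step counts, and the 0.05 s control step are illustrative assumptions, not benchmarks.

```python
# A rough stand-in for "simulation runs faster than real time": step many
# environments in parallel and compare simulated time to wall-clock time.
# Assumptions: Pendulum-v1's nominal control step is 0.05 s; a random policy
# stands in for an RL agent.
import time

import gymnasium as gym

NUM_ENVS = 64
STEPS = 500
CONTROL_DT = 0.05  # simulated seconds per environment step (Pendulum-v1 default)

envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("Pendulum-v1") for _ in range(NUM_ENVS)]
)
envs.reset(seed=0)

start = time.perf_counter()
for _ in range(STEPS):
    actions = envs.action_space.sample()  # an RL policy would go here
    obs, reward, terminated, truncated, info = envs.step(actions)
wall = time.perf_counter() - start

sim_seconds = NUM_ENVS * STEPS * CONTROL_DT
print(
    f"Simulated {sim_seconds:.0f} s of robot time in {wall:.1f} s of wall time "
    f"({sim_seconds / wall:.0f}x real time)"
)
envs.close()
```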
There’s nothing wrong with distillation, but to distill, we would need a video generation model that is good at generating task videos, which requires…training on task videos. We end up with a circular problem: the data needed to train that video model is the same task data we would need to train the robotics model, so why not skip the intermediate step and train the robotics model in the first place?
Challenges aside, there are going to be some cool semi-synthetic approaches; for example, a video model trained on a ton of third-person view videos and some first-person view data should be able to generate first-person view videos of tasks it has only seen third-person video of, which is useful because robot policies are typically conditioned on first-person camera views.
Simulator Data
Strictly speaking, this is synthetic data, but the way it is generated is very different (technically speaking, kinematics models are trained on simulator data as well). The idea here is that if we use a deterministic simulator to generate samples, we can perhaps escape the data problem altogether. The simulator is a lot like an open-world game: the robot/player character and its environment interact according to a set of rules to create hopefully endless possibilities.
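As a toy illustration of that “set of rules” framing (everything below is invented purely for illustration), here is a deterministic one-dimensional pick-and-place world: same state and action in, same state out, with the rules written by hand.

```python
# A toy deterministic simulator: the robot and its environment interact under
# a handful of hand-written rules. Real simulators are vastly richer, but
# writing down their rules is exactly the hard part discussed below.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class World:
    robot_pos: int = 0             # position on a 1-D track of 10 cells
    holding: str | None = None     # object currently in the gripper
    objects: dict[str, int] = field(default_factory=lambda: {"cube": 3, "ball": 7})

    def step(self, action: str) -> None:
        """Apply one action under the world's rules (fully deterministic)."""
        if action == "left":
            self.robot_pos = max(0, self.robot_pos - 1)
        elif action == "right":
            self.robot_pos = min(9, self.robot_pos + 1)
        elif action == "pick" and self.holding is None:
            here = [n for n, p in self.objects.items() if p == self.robot_pos]
            if here:
                self.holding = here[0]
                del self.objects[self.holding]
        elif action == "place" and self.holding is not None:
            self.objects[self.holding] = self.robot_pos
            self.holding = None


world = World()
for action in ["right", "right", "right", "pick", "right", "place"]:
    world.step(action)
print(world)  # World(robot_pos=4, holding=None, objects={'ball': 7, 'cube': 4})
```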
The hard part isn’t the game engine or the rendering; it’s the set of rules. Putting together a set of rules that can generate all of reality sounds a lot like finding a dimensionally reduced representation of the real world, which sounds a lot like the kind of problem that can only be solved using machine learning. In fact, the most general version of this simulator is semantically equivalent to…a video generation model…so once again we are back at square one.
It seems like simulations can help when operating in very constrained domains. The biggest successes so far have been neural networks that can generate environments such as the games Doom and Minecraft after being trained on gameplay videos, and these models do appear to display sensible degrees of generalization and emergent behavior. One could imagine a similar route to build robots that handle specific tasks in specific environments: for example, completing a manufacturing task or inspecting an offshore drilling platform. The downside, of course, is the need to train an entire foundation model for every customer, as well as the risk of model collapse and highly pathological edge cases.
“High Quality Data”
People often throw this one around. Basically, there was a discovery in mid-2023 that if you drastically constrain the training distribution of a language model, it converges orders of magnitude more quickly than when training on a randomly selected subset of the Internet. The initial groundbreaking release was a dataset called TinyStories, a collection of stories for four-year-olds generated by OpenAI frontier models. Models trained on TinyStories learned to generate coherent English text three orders of magnitude faster than models trained on the Common Crawl.
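A minimal sketch of that experiment, assuming the dataset is available on the Hugging Face Hub as roneneldan/TinyStories and using an untuned, deliberately tiny GPT-2-style configuration (all hyperparameters here are illustrative):

```python
# Train a very small GPT-style model from scratch on TinyStories.
# The point is convergence on a narrow distribution, not model capacity.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

config = GPT2Config(
    vocab_size=len(tokenizer),
    n_positions=512,
    n_embd=256,
    n_layer=4,
    n_head=4,
)
model = GPT2LMHeadModel(config)

# A small slice keeps the sketch cheap to run.
dataset = load_dataset("roneneldan/TinyStories", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tinystories-tiny-gpt",
        per_device_train_batch_size=16,
        num_train_epochs=1,
        logging_steps=100,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```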
TinyStories helped kick off a whole slew of lightweight models trained on high-quality datasets, beginning with the original Phi models from Microsoft and extending to models such as Gemma and the edge Qwen models. It is very important here to avoid ascribing too many biomimetic attributes to the model: the model isn’t “getting smarter” because it “sees better learning material” during training; rather, it is converging more quickly with fewer parameters because it models a narrower data distribution during training.
The reason this works for language modeling is that the base model (trained on a huge corpus of web-crawl data) has its output distribution collapsed during post-training to create the instruct model. This narrow distribution can then be learned by a much smaller model.
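A minimal sketch of that last step, using a small instruct-tuned checkpoint as a stand-in teacher (the model name, prompts, and corpus size below are placeholder assumptions; real pipelines filter, deduplicate, and generate at vastly larger scale):

```python
# Sample from a post-trained (instruct) teacher, whose output distribution is
# far narrower than raw web text, and save the samples as a training corpus
# for a smaller student (trained with the same recipe as the TinyStories
# sketch above).
from transformers import pipeline

# Stand-in teacher: any small instruct-tuned checkpoint works here.
teacher = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

prompts = [
    "Write a short story for a four-year-old about a lost balloon.",
    "Write a short story for a four-year-old about a brave snail.",
]

synthetic_corpus = [
    teacher(p, max_new_tokens=200, do_sample=True)[0]["generated_text"]
    for p in prompts
]

with open("synthetic_corpus.txt", "w") as f:
    f.write("\n\n".join(synthetic_corpus))
```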