
Advancing Robotics Foundation Models: From the GPT-2 Era to the GPT-3 Era
In the fourth and final part of this series on robotics foundation models, we explore ways to bring these models from the GPT-2 era to the GPT-3 era: real models that regular users can interact with to do something compelling, with the robustness needed to ship in real applications. There are plenty of ways to bring this about, so we’ll go through a bunch of them in no particular order, starting with synthetic data.
Synthetic Data for Manipulation and Tasks
Synthetic data is Nvidia’s favorite elephant in the room. It’s important to distinguish between two types of model here:
Kinematics Models
Kinematics models should be trained on synthetic data, and anyone who says otherwise should be promptly removed from your life. It is trivial to generate huge amounts of synthetic data for manipulation and locomotion: even with very accurate simulator models, GPU-accelerated physics simulation runs faster than real time. Even if it didn’t, it would still be orders of magnitude cheaper than burning robots and experimenter hours to collect real data. Fast simulators also allow for in-the-loop reinforcement learning, as sketched below.
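Here is a minimal sketch of what in-the-loop RL against a batched, GPU-resident simulator looks like. The BatchedSim class and the policy are hypothetical placeholders rather than any specific library's API; real stacks (Isaac-style GPU physics, MJX, and similar) differ in detail but share this shape.

```python
# Minimal sketch of in-the-loop RL against a batched, GPU-resident simulator.
# BatchedSim and the policy are hypothetical placeholders, not a specific
# library's API; real stacks (e.g. Isaac-style GPU physics, MJX) differ in detail.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

class BatchedSim:
    """Steps thousands of environments in parallel, entirely on the accelerator."""
    def __init__(self, num_envs, obs_dim, act_dim):
        self.num_envs, self.obs_dim, self.act_dim = num_envs, obs_dim, act_dim

    def reset(self):
        return torch.zeros(self.num_envs, self.obs_dim, device=device)

    def step(self, actions):
        # A real simulator integrates contact dynamics here; random tensors
        # stand in for observations and rewards to keep the sketch runnable.
        obs = torch.randn(self.num_envs, self.obs_dim, device=device)
        reward = torch.randn(self.num_envs, device=device)
        done = torch.rand(self.num_envs, device=device) < 0.01
        return obs, reward, done

sim = BatchedSim(num_envs=4096, obs_dim=64, act_dim=12)
policy = torch.nn.Sequential(
    torch.nn.Linear(64, 128), torch.nn.Tanh(), torch.nn.Linear(128, 12)
).to(device)

obs = sim.reset()
rollout = []
for step in range(256):
    with torch.no_grad():
        actions = policy(obs)
    next_obs, reward, done = sim.step(actions)
    rollout.append((obs, actions, reward, done))
    obs = next_obs
    # Every N steps a policy-gradient update (e.g. PPO) would run on `rollout`;
    # because simulated steps are cheap, data collection never bottlenecks training.
```

The point of the batched layout is that the environment step is just another GPU kernel, so rollout collection runs far faster than real time and never waits on hardware.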
Challenges in Task Models
Task models are much trickier. Under current training paradigms, action generation models are structurally very similar to video generation models, and synthetic data is generated by taking images from real tasks and using video generation models to extend them into videos. Ergo, the synthetic data problem is much closer to distillation than it is to data generation. There’s nothing wrong with distillation, but to distill, we would need a video generation model that is good at generating task videos, which requires…training on task videos. We end up with a circular problem: if we had the data to train a robotics model, we could use it to train a video model to train a robotics model, but in that case, why not skip straight to the robotics model in the first place? Challenges aside, there are going to be some cool semi-synthetic approaches; for example, a video model trained on a ton of third-person view videos and some first-person view data should be able to generate first-person view videos of tasks it has only seen third-person video of, which is useful for obvious reasons. A rough sketch of that kind of data mixture follows.
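As a concrete (and entirely hypothetical) illustration of the semi-synthetic idea, the sketch below mixes a large third-person video corpus with a small egocentric slice; the dataset names, sizes, and sampling weights are made up for illustration.

```python
# Hypothetical data mixture for a semi-synthetic video model: mostly third-person
# task video, plus a small slice of first-person (egocentric) robot data so the
# model learns the viewpoint mapping. All names and numbers are illustrative.
import random
from dataclasses import dataclass

@dataclass
class DataSource:
    name: str
    viewpoint: str        # "third_person" or "egocentric"
    hours: float
    sampling_weight: float

mixture = [
    DataSource("web_howto_videos", "third_person", hours=500_000, sampling_weight=0.7),
    DataSource("lab_teleop_egocentric", "egocentric", hours=2_000, sampling_weight=0.3),
]

def sample_source(mixture):
    """Oversample the scarce egocentric data relative to its raw hours, while most
    of the task diversity still comes from the third-person corpus."""
    weights = [s.sampling_weight for s in mixture]
    return random.choices(mixture, weights=weights, k=1)[0]

batch_sources = [sample_source(mixture).name for _ in range(8)]
```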
Using Simulators in Robotics Training
Strictly speaking, this is synthetic data too, but the way it is generated is very different (technically speaking, kinematics models are trained on simulator data as well). The idea here is that if we use a deterministic simulator to generate samples, we can perhaps escape the data problem altogether.
Generating Samples Like Open-World Games
The simulator is a lot like an open-world game: the robot/player character and its environment interact according to a set of rules to create hopefully endless possibilities. The hard part isn’t the game engine or the rendering; it’s the set of rules. Putting together a set of rules that can generate all of reality sounds a lot like finding a dimensionally reduced representation of the real world, which sounds a lot like the kind of problem that can only be solved using machine learning. In fact, the most general version of this simulator is semantically equivalent to…a video generation model…so once again we are back at square one. It seems like simulations can help when operating in very constrained domains. The biggest successes so far have been neural networks that can generate environments such as the games Doom and Minecraft after being trained on gameplay videos, and these models do appear to display sensible degrees of generalization and emergent behavior. One could imagine a similar route to build robots that service specific tasks in specific environments: for example, completing a manufacturing task or inspecting an offshore drilling platform.
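To make the learned-simulator idea concrete, here is a rough sketch of a world model in the Doom/Minecraft style: the next observation is predicted from the history of frames and the agent's action, so the "rule set" is learned from video rather than hand-written. WorldModel and all of its dimensions are hypothetical placeholders, not a description of any published system.

```python
# Sketch of a learned "simulator" (a world model in the Doom/Minecraft style):
# the next observation is predicted from past frames and the agent's action.
# WorldModel and its dimensions are hypothetical placeholders.
import torch

class WorldModel(torch.nn.Module):
    """Predicts the next frame latent from a history of frame latents and actions."""
    def __init__(self, latent_dim=256, act_dim=12):
        super().__init__()
        self.rnn = torch.nn.GRU(latent_dim + act_dim, latent_dim, batch_first=True)
        self.head = torch.nn.Linear(latent_dim, latent_dim)

    def forward(self, latents, actions):
        x = torch.cat([latents, actions], dim=-1)   # (batch, time, latent + act)
        h, _ = self.rnn(x)
        return self.head(h[:, -1])                  # next-frame latent

model = WorldModel()
frames = [torch.zeros(1, 1, 256)]                   # encoded first observation
actions = []
for t in range(32):
    actions.append(torch.zeros(1, 1, 12))           # would come from the policy under training
    next_latent = model(torch.cat(frames, dim=1), torch.cat(actions, dim=1))
    frames.append(next_latent.unsqueeze(1))
# Decoding next_latent back to pixels (not shown) yields the "rendered" environment.
```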
Risks of Model Collapse
The downside, of course, is the need to train an entire foundation model for every customer, as well as the risk of model collapse and highly pathological edge cases.
High-Quality Data for Efficient Models
People often throw this one around. Basically, there was a discovery in mid-2023 that if you drastically constrain the training distribution of a language model, you can converge orders of magnitude more quickly than by training on a randomly selected subset of the Internet. The initial groundbreaking release was a dataset called TinyStories, a collection of stories for four-year-olds generated by OpenAI frontier models. Models trained on TinyStories learned to generate coherent English text three orders of magnitude faster than models trained on the Common Crawl. TinyStories helped kick off a whole slew of lightweight models trained on high-quality datasets, beginning with the original Phi models from Microsoft and extending to models such as Gemma and the edge Qwen models. It is very important here to avoid ascribing too many biomimetic attributes to the model: the model isn’t “getting smarter” because it “sees better learning material” during training; rather, it is converging more quickly with fewer parameters because it models a narrower data distribution during training. A toy illustration of this kind of distribution narrowing follows.
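The sketch below is a deliberately crude illustration of "constraining the training distribution", not the TinyStories or Phi recipe: keep only documents that a simple scorer judges to fit a narrow target style, then train on the filtered subset. The heuristic is a placeholder for whatever generation or filtering pipeline is actually used.

```python
# Toy illustration (not the TinyStories/Phi recipe) of constraining the training
# distribution: keep only documents that a crude scorer judges to fit a narrow
# target style. The heuristic is a stand-in for a real filtering pipeline.
def fits_narrow_style(doc: str, max_hard_word_ratio: float = 0.05) -> bool:
    words = doc.lower().split()
    if not words:
        return False
    hard = sum(1 for w in words if len(w) > 8)   # crude proxy for "hard" vocabulary
    return hard / len(words) <= max_hard_word_ratio

corpus = [
    "Once upon a time a little cat went to the park and made a friend.",
    "The eigendecomposition of the covariance matrix yields the principal components.",
]
narrow_corpus = [doc for doc in corpus if fits_narrow_style(doc)]
print(narrow_corpus)  # only the simple story survives the filter
```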
Constraining Data for Faster Convergence
The reason this works for language modeling is that the output distribution of the base model (trained on a huge corpus of web-crawl data) is collapsed during post-training to create the instruct model. This narrow distribution can then be learned by a much smaller model, as in the distillation sketch below.
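A minimal sketch of that final step, assuming a standard logit-matching setup: a small student is trained to match the (already narrowed) next-token distribution of a post-trained teacher. The model sizes, vocabulary, and token batch are placeholders.

```python
# Minimal distillation sketch: the student matches the teacher's next-token
# distribution via a KL objective. Sizes, vocabulary, and the token batch are
# placeholders; real pipelines distill from an actual post-trained checkpoint.
import torch
import torch.nn.functional as F

vocab, d_teacher, d_student = 8_000, 1024, 256
teacher = torch.nn.Sequential(torch.nn.Embedding(vocab, d_teacher),
                              torch.nn.Linear(d_teacher, vocab))
student = torch.nn.Sequential(torch.nn.Embedding(vocab, d_student),
                              torch.nn.Linear(d_student, vocab))
optim = torch.optim.Adam(student.parameters(), lr=1e-4)

tokens = torch.randint(0, vocab, (8, 128))        # stand-in batch of token ids
with torch.no_grad():
    teacher_logits = teacher(tokens)              # narrowed target distribution
student_logits = student(tokens)

# KL divergence between teacher and student next-token distributions.
loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                F.softmax(teacher_logits, dim=-1),
                reduction="batchmean")
loss.backward()
optim.step()
```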