
The Gr00t Article Series Part 1: A brief history of foundation models
Recently, Nvidia released a “humanoid foundation model” with the whimsical and somewhat unfortunate name ‘Gr00t-1’ (imagine telling your customer your robot is powered by “Nvidia Groot technology”!). Gr00t-1 is the GPT2 moment of robotics - clear and demonstrable proof that a foundation-scale approach to robotics control, trained on a highly scalable data modality (video), can generalize across tasks and hardware configurations. While the model itself is limited in what it can do (it only saw 8,000 hours of video during training), the writing is on the wall that we’re one generation and perhaps 18 months away from a ChatGPT moment where robots start solving real-world problems better than humans can.
What do we mean by a “GPT2 moment”? What exactly is a “foundation model”? To answer those questions, a brief journey into the prehistoric days of AI is in order.
In the Beginning: ImageNet and Classification
Modern “AI” (as opposed to data science or machine learning) is widely acknowledged to have started with the ImageNet image classification challenge, a machine vision competition where computers have to classify images into one of 1,000 classes. Broadly speaking, the top approaches to the challenge were all similar: programs would look for features common to large groups of images, then classify a specific image by examining how frequently or strongly those features appeared in it. Earlier approaches had used traditional machine learning, manual engineering, or a combination of the two to decide what these features were.
The watershed moment for image classification was AlexNet (2012), which used a tunable mathematical function called a convolutional neural network to automatically learn which features to extract. During training, the neural network sees millions of images labeled with the correct class and adjusts its parameters to maximize the chance that it guesses that class. During inference, the neural network is shown new images - the idea being that the new images will resemble the millions it has already seen, so it is very likely to guess correctly again.
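To make that training-and-inference loop concrete, here is a minimal sketch in PyTorch (modern tooling, not what the original AlexNet authors used); the `training_step` function and the random stand-in image are purely illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

# A convolutional classifier; torchvision ships an AlexNet implementation.
model = models.alexnet(num_classes=1000)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def training_step(images, labels):
    """One gradient update: nudge the parameters so the labeled class
    becomes more likely for these images."""
    logits = model(images)          # a score for each of the 1,000 classes
    loss = loss_fn(logits, labels)  # penalty for putting weight on the wrong class
    optimizer.zero_grad()
    loss.backward()                 # compute how each parameter should change
    optimizer.step()                # adjust the parameters a little
    return loss.item()

# Inference: show the trained network a new image and take its best guess.
with torch.no_grad():
    new_image = torch.randn(1, 3, 224, 224)   # stand-in for a real photo
    predicted_class = model(new_image).argmax(dim=1)
```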
Then, a miracle happened. Researchers discovered that the features learned during training on the ImageNet dataset (pretraining) were so universal that they could also be used to group objects into classes the model had never seen before. For example, a model pretrained on the ImageNet classes could, with a small amount of additional data, be adapted to distinguish diseased plants from healthy ones. In effect (though we didn’t have the name yet), these were the first foundation models. This triggered a huge wave of early AI applications, mostly focused on detecting objects, but they weren’t very exciting - yes, it might be useful to detect whether people were smiling with 96% accuracy, but it was also kind of creepy and completely unclear how to monetize. And the approach didn’t really scale, because it relied on huge amounts of labeled data for pretraining, and labeled data doesn’t occur in the wild - someone has to produce it by hand.
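That reuse of pretrained features is what we now call transfer learning. Below is a minimal sketch, assuming PyTorch/torchvision and a hypothetical two-class plant-disease task; the specific backbone (ResNet-18) and the task are illustrative choices, not what those early practitioners actually used.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a backbone whose features were already learned on ImageNet
# (the `weights=` argument assumes a recent torchvision release).
backbone = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pretrained feature extractor entirely...
for param in backbone.parameters():
    param.requires_grad = False

# ...and replace the final layer with a tiny new head for a task ImageNet
# never covered, e.g. healthy plant (0) vs. diseased plant (1).
backbone.fc = nn.Linear(backbone.fc.in_features, 2)

# Only the new head gets trained, so a small labeled dataset goes a long way.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
```

Freezing the backbone is the whole trick: all of the general-purpose visual knowledge stays intact, and the small labeled dataset only has to teach the tiny new head.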
BERT and Unsupervised Training
Simultaneous with the rapid advances in machine vision, something else was simmering at Google. Well before consumers could directly interact with language models through chat interfaces, statistical language modeling was being used at Google for everything from machine translation to page retrieval. This culminated in 2018’s BERT (Bidirectional Encoder Representations from Transformers), and the real hero here was the representations, not the transformers.
You see, BERT was an AI model trained on a seemingly useless task: guessing words that had been randomly hidden in Wikipedia articles. The groundbreaking discovery was that this model could be easily adapted to completely unrelated tasks: classifying review sentiment, generating summaries, autocompleting emails, and much more.
What was going on? Well, it turns out that by being forced to fill in words across a huge dataset of articles, the model internally learns a strong representation of language: it begins to understand logic, the relationships between words, and their meanings. All of this understanding lives in an abstract numerical space (the latent space), and by augmenting the model with a small amount of additional, labeled training data, very meaningful, human-readable features can be extracted from that latent space.
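To show how a small amount of labeled data can pull a useful feature out of the latent space, here is a minimal sketch using the Hugging Face transformers library and scikit-learn (modern tooling, not the 2018-era code); the toy sentiment examples are made up for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    """Map sentences into BERT's latent space (one vector per sentence)."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (batch, tokens, 768)
    return hidden[:, 0, :].numpy()                    # the [CLS] token summarizes the sentence

# A handful of labeled examples plus a simple linear classifier is enough
# to pull a "sentiment" feature out of the latent space.
texts = ["I loved this movie", "Absolutely terrible service"]
labels = [1, 0]
classifier = LogisticRegression().fit(embed(texts), labels)
print(classifier.predict(embed(["What a fantastic meal"])))
```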
This was a huge deal. BERT’s pretraining process requires no labels whatsoever; a computer can randomly mask out words from articles and try to predict them. By eliminating labels, BERT’s training process opened up a whole new world of scale, limited only by data and computing resources. Still, because the model wasn’t very interactive, only engineers and language modeling researchers cared; for most people, these advances were hidden behind the scenes inside search engines and auto-completion tools.
GPT2, Autoregressive Language Modeling, and the First Foundation Models
As Google was working on BERT and encoder representations, a little startup called OpenAI was building a different kind of language model. The idea was the same - train a language model on a huge dataset of text - but the strategy was different: rather than guessing hidden words, OpenAI’s model was an autoregressive generative model, predicting future text that didn’t exist yet. Specifically, the model was trained to guess the next word in sentences and articles.
This simple training objective created the first foundation model: an AI trained on a highly scalable data source which could generalize to a huge array of downstream tasks. By exposing the model to a small amount of task-specific data (fine tuning), it could be taught to write emails, stories, poems, and much more.
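The next-word objective itself is remarkably simple. Here is a minimal sketch in PyTorch, assuming a stand-in `model` that maps a token sequence to per-position scores over the vocabulary; GPT2 itself works on subword tokens and a transformer, but the loss boils down to this.

```python
import torch
import torch.nn.functional as F

def next_word_loss(model, token_ids):
    """Autoregressive objective: at every position, predict the next token.

    `model` is any network that maps a token sequence to per-position
    scores over the vocabulary (shape: batch x positions x vocab_size).
    """
    inputs = token_ids[:, :-1]    # everything except the last word
    targets = token_ids[:, 1:]    # the same sequence shifted left by one
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten all positions
        targets.reshape(-1),                   # each target is "the next word"
    )
```

Fine tuning reuses exactly this loss, just computed on a small, task-specific dataset (emails, poems, and so on).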
But still, for the most part, the public didn’t care: the model required extra training data to complete useful tasks, which was too clunky for end users, and it was too small, so it tended to generate random, unrelated text (what we now know as hallucination).
Fortunately for the industry, researchers and engineers saw that GPT2 was a huge deal, big enough that Nvidia built a custom supercomputing appliance - the DGX-2 - to help OpenAI scale up their training for a model that was 100x bigger. The rest is history: we discovered that once models became that big, the scope of their knowledge was such that fine tuning was no longer needed. That led to ChatGPT, modern generative AI, and the trillion-dollar AI industry as we know it today.
GPT2 but for Robots
So what do we mean when we say “Gr00t-1 is the GPT2 of robotics”? Well:
- Gr00t-1 shows clear generalization capabilities: by training on a fairly generic dataset, the model is capable of controlling multiple robotics hardware platforms in a wide variety of settings.
- The model has a scalable architecture: performance can easily be increased by increasing the training data scale and number of parameters.
- The model is trained using an unsupervised approach on a data modality (video) that is relatively straightforward to collect, without requiring position data, force sensors, or other hardware-specific inputs.
But (and this is why Gr00t-1 is the GPT2 and not the GPT3 of robotics), the model also has some shortcomings:
- The model is very small: it is only capable of completing tabletop manipulation tasks.
- Fine tuning is required for any real-world tasks.
These shortcomings mean significant engineering knowledge is required to interact with the model, limiting its mainstream appeal and real-world impact. But the trail has been blazed: with additional scaling, a world where robots can intuitively interact with humans is now within reach.
Up next, we explore latent representations and learn why they are so important for scaling AI up. We’ll also take a deep dive into Gr00t-1’s system architecture, and try to understand what happens under the hood. Finally, we’ll also explore what real-world robots need to ship to real customers, and discuss ways to improve robotics foundation models’ accuracy.