
Foundation Models in AI Robotics: A Brief History from ImageNet to Gr00t-1

Bayley Wang

Recently, Nvidia released a “humanoid foundation model” for AI robotics with the whimsical and somewhat unfortunate name ‘Gr00t-1’ (imagine telling your customer your robot is powered by “Nvidia Groot technology”!). Gr00t-1 is the GPT-2 moment of robotics foundation models: clear and demonstrable proof that a foundation-scale approach to robotic control, trained on a highly scalable data modality (video), can generalize across tasks and hardware configurations. While the model itself is limited in what it can do (it only saw 8,000 hours of video during training), the writing is on the wall that we’re one generation, and perhaps 18 months, away from a ChatGPT moment where robots start solving real-world problems better than humans can.

What do we mean by a “GPT-2 moment” in robotics? What exactly is a “foundation model”? To answer those questions, a brief journey into the prehistoric days of AI is in order.

ImageNet Era: The Start of AI Classification

Modern “AI” (as opposed to data science or machine learning) is widely acknowledged to have started with the ImageNet image classification challenge, a machine vision competition where computers have to classify images into one of 1,000 classes. Broadly speaking, the top approaches to the challenge were all similar: programs would look for features common to large groups of images, then, by examining how frequently or strongly these features appeared in specific images, use that information to perform classification. Previous approaches had used either traditional machine learning, manual engineering, or a combination of the two to decide what these features were.

The watershed moment for image classification was AlexNet (2012), which used a tunable mathematical function called a convolutional neural network to automatically determine the best features to extract. During training, the neural network sees millions of images labeled with the right class and adjusts its parameters to maximize the chance that it guesses the right class. During inference, the neural network is shown new images; the idea is that the new images will be similar to the millions it has already seen, so it is very likely to also make the right guess.
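
To make the train-then-infer loop concrete, here is a minimal sketch in PyTorch. This is not the original AlexNet code; the optimizer settings and model choice are purely illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

# A small off-the-shelf convolutional network standing in for AlexNet;
# the 1,000 output units correspond to the 1,000 ImageNet classes.
model = models.alexnet(num_classes=1000)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_step(images, labels):
    """One gradient update: nudge parameters toward the labeled classes."""
    optimizer.zero_grad()
    logits = model(images)          # scores for each of the 1,000 classes
    loss = loss_fn(logits, labels)  # how far off the guesses were
    loss.backward()                 # compute parameter adjustments
    optimizer.step()                # apply them
    return loss.item()

@torch.no_grad()
def predict(images):
    """Inference: pick the most likely class for unseen images."""
    model.eval()
    return model(images).argmax(dim=1)
```

The training loop simply repeats `train_step` over millions of labeled images; everything interesting lives in how the network's parameters end up encoding useful visual features.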

Then, a miracle happened. Researchers discovered that the features learned during training on the ImageNet dataset (pretraining) were so universal that they could also be used to group objects into classes the model had never seen before. For example, a model trained on the ImageNet classes could also, with a small amount of additional data, be used to distinguish diseased plants from healthy ones. In effect (though we didn’t have the name yet), these were the first foundation models. This triggered a huge wave of early AI applications, mostly focused on detecting objects, but they weren’t very exciting: yes, it might be useful to detect whether people were smiling with 96% accuracy, but it was also kind of creepy and completely unclear how to monetize. The approach also didn’t really scale, because it relied on huge amounts of labeled data for pretraining, and labeled data doesn’t just appear in real life.
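
Reusing pretrained features looks roughly like this in modern tooling. A minimal sketch assuming torchvision’s published ResNet-18 ImageNet weights (the plant dataset itself is omitted, and the two-class head is illustrative):

```python
import torch.nn as nn
from torchvision import models

# Reuse features learned on ImageNet for a task the model never saw:
# telling diseased plants from healthy ones (2 classes).
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pretrained feature extractor...
for param in backbone.parameters():
    param.requires_grad = False

# ...and replace only the final classification layer, which is then
# trained on a small amount of labeled plant data.
backbone.fc = nn.Linear(backbone.fc.in_features, 2)
```

The frozen backbone keeps everything it learned from ImageNet; only the tiny new layer needs the handful of labeled plant images.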

BERT: Advancing Unsupervised Training in AI

Simultaneous with the rapid advances in machine vision, something else was simmering at Google. Well before consumers could directly interact with language models through chat interfaces, statistical language modeling was being used at Google for everything from machine translation to page retrieval. This culminated in 2018’s BERT (Bidirectional Encoder Representations from Transformers), and the real hero here was the representations, not the transformers.

You see, BERT was an AI model trained on a seemingly useless task: guessing randomly hidden words in Wikipedia articles. The groundbreaking discovery was that this model could be easily adapted to completely unrelated tasks: classifying review sentiment, generating summaries, autocompleting emails, and much more.
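
The masked-word game is easy to reproduce today. A small sketch using the Hugging Face transformers library (not Google’s original training code), assuming the publicly released bert-base-uncased checkpoint:

```python
from transformers import pipeline

# BERT's pretraining task in miniature: guess the hidden word.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for guess in fill_mask("The Eiffel Tower is located in [MASK]."):
    print(guess["token_str"], round(guess["score"], 3))
```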

What was going on? It turns out that, by being forced to fill in words across a huge dataset of articles, the model internally learns a strong representation of language: it begins to understand logic, relationships between words, and the meanings of words. All of this understanding exists in an abstract numerical space (the latent space), and by augmenting the model with a small amount of additional, labeled training data, very meaningful, human-readable features could be extracted from the latent space.
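
Extracting a useful signal from that latent space usually means bolting a small labeled head onto the pretrained model. A sketch, again assuming the transformers library and the bert-base-uncased checkpoint, with the two sentiment labels purely illustrative:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Start from the pretrained representation and add a small, randomly
# initialized classification head (2 labels: negative / positive).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# A modest amount of labeled reviews is enough to fine-tune this for
# sentiment classification; until then the head's scores are meaningless.
inputs = tokenizer("The battery life is fantastic.", return_tensors="pt")
logits = model(**inputs).logits
```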

This was a huge deal. BERT’s pretraining process requires no labels whatsoever; a computer can randomly mask out words from articles and try to predict them. By eliminating labels, BERT’s training process opened up a whole new world of scale, limited only by data and computing resources. Still, because the model wasn’t very interactive, only engineers ever really saw it in action.

GPT-2: Autoregressive Models and True Foundation Models

Gr00t-1: The GPT-2 Moment for Robotics Foundation Models
