The Gr00t Article Series Part 2: Labels without Labels

Written by Bayley Wang | Jun 14, 2025 6:41:02 AM

In the last article in this series, I talked a lot about “unsupervised training”: training AI models without the use of labels or categories. Unsupervised training is critical for AI at scale because (1) labels are expensive and (2) it is often unclear what the labels should be.

Training Objectives

If we don’t label the data, how do we train? The crux here is to give the model a trivial task (the training objective) that forces it to learn a deep understanding of recurring patterns found in the training data. For example:

Autoregressive generation (next-token prediction): predicting the next word or data point in a sequence (GPT3, OpenAI 4o image generation, most other language models)

Masked modeling: predicting missing patches in images, or missing words in sentences (BERT, MAGVIT, DINO)
Image compression: compressing an image or video into a fixed data size, or by a fixed amount (VQ-VAE and similar)

These simple tasks generate features that are incredibly powerful:

Text models trained on next-token prediction on the Internet famously exhibit in-context learning (if the initial sequence consists of a few examples, they tend to generate similar samples as they continue the sequence). They can also be fine-tuned (trained on a small number of samples of a specific format) to answer questions.
Mask modeling as a standalone objective is useless. However, the resulting features are highly representative of the meaning of the underlying text - they can be used for sentiment analysis, summarization, question answering, and more. Likewise, image and video features from masked modeling are very strong drivers for downstream tasks such as segmentation, classification, and depth extraction.
Compression is itself a useful task, but oftentimes, the emergent features correspond to physically meaningful concepts in images and video.

So, What Are Features?

We’ve been talking about features for a while without actually explaining what they are. Features are an abstract representation of the most important information in a dataset. Good features have the following hallmarks:

Compact: the size of the feature should be as small as possible, while still being able to capture all of the important information in the data.
Semantically meaningful: mathematically important components of the features should correspond to physically (or linguistically, logically, etc.) important concepts in the data.
Clustered: data which are ‘similar’ in some way should exist ‘close to each other’ in the feature space. Likewise, data which are dissimilar to each other should be ‘far away’. Distinct clusters of data in feature space should correspond to semantically meaningful data categories.
Linear: this one is not always possible, but ideally, ‘half-way’ between two features should correspond to some concept of ‘half-way-ness’ between two data points.

These representations are usually vectors or matrices, in which case mathematical underpinnings of the above hallmarks are straightforward: the dimension of the vectors should be minimal, each component should correspond to something meaningful about the data, similar data should correspond to vectors pointing in the same direction, and ideally, interpolating down an axis of the vectors should correspond to some consistent change in the data itself.

This is all a bit confusing to talk about in the abstract, so let’s take a look at some examples. One of my favorites is DINOv2, an image foundation model that does not get as much attention as it should. DINOv2 is trained using a clever and somewhat complicated objective: predicting that different crops of an image come from the same image, and guessing hidden patches masked away in an image. As a standalone task, this objective is useless, but it forces the model to learn very general features. For example, DINOv2 generates features for every 14x14 patch of pixels in the image. What if we plot the principal components of these features?

We get some striking emergent behavior. The black background is computed by thresholding the first principal component, which turns out to correspond to the main subject of the image. The colors visualize the next three components, which correspond to progressively finer semantic features in the image. That’s already pretty neat - after all, DINOv2 was trained to match images and guess patches, not segment images.

Now look at the leftmost column: it consists of “things which have wings”. DINOv2 generates red features for ‘bodies’, green features for ‘the leading edge of the wing’, blue features for ‘the trailing edge of the wing’, and purple features for ‘tails’. Out of nowhere, the model has figured out that airplanes and birds are structurally similar, and finds the correspondence between the images.

Similarly, the second column is ‘elephants’. DINOv2 correctly learns that the first three images are elephants, including the second image which is from a difficult angle. What is more remarkable is it learns that the final image, a stylized statue of an elephant, is in fact an elephant, and furthermore correctly finds correspondences between the trunk, ears, body, and legs of the statue and the real elephants.

Here’s why these features are awesome. Because they correspond to semantically meaningful components of the image in an elegant way, it is easy to use them for tasks such as monocular depth prediction (center image) or object segmentation (right image). All we need is a simple linear model (we scale, shift, and add the features) to transform these representations into something which is useful for a lot of real-world applications.

Why do we care?

Robotics is a label-deficient space. We can easily scale huge video datasets, but it is difficult and time-consuming to generate fine-grained video labels which capture the full richness of the underlying data. We need to rely on the model itself to learn the underlying features, rather than rely on humans, who are expensive, unreliable, and biased, to write those features down.

Fortunately, there is a clever way to do this: we pose feature extraction as a video compression problem. It turns out most useful robotics problems can be formulated as “move the robot, then have the robot move some objects”. In order to learn the sequence of actions that solves the problem, the model needs to be able to extract motion from training data.

Readers who know a bit about video compression can probably see where this is going. Codecs like H265 which are used to transmit videos over the web compress video 200x or more by estimating the motion between consecutive frames. In the H26x family, this is done by subdividing frames into 8x8 or 16x16 blocks, performing block matching to find the best matched blocks and offset between two frames, differencing the block-estimated frame with the real frame, and compressing the error (which is hopefully close to zero in most places, therefore low in information).

Unfortunately, this motion estimate is not useful for training AI models: it is not variationally stable (small, random errors introduced into the compressed data stream result in catastrophic changes in the decompressed video) and the motion is predicted at the block level, which is not semantically meaningful. The former is important because AI algorithms learn approximations of their inputs, and these approximations may contain small errors. The latter is important because we are applying the motion features to a downstream task (a robot interacting with the real world), not using them to decompress videos, so they need to somehow correspond to reality.

Peeling Bananas and Learning Motion

Consider a man peeling bananas, and imagine for a moment you are a resident of Britain in 1850 who wants to get a head start on the tech industry by building the worlds’ first video sharing platform. The only problem is, the internet doesn’t exist yet, and neither do videos, so you’re stuck distributing stacks of glass plates by horse to your users. Quickly, you discover that the British horse network lacks the bandwidth you need, so you investigate ways to compress the videos.

“Easy!”, you think. “We’ll just send a photo and a description.” You pivot to distributing a photo of a man with a banana, and the description “the man is peeling the banana”. Unfortunately, this is 1850s Britain, and bananas won’t be invented until 1888, so none of your users know what bananas are, let alone how to peel them.

Faced with tanking user retention, you improve the compression algorithm. The compressed video is now a photo and the description “the man grabs the stem at the top, breaks it, and pulls the skin down. he then grabs the remaining flap of skin two times to remove all the peel sections”. This is a success: users now have enough information to visualize the banana peeling process.

With efficient distribution infrastructure in place, users flock to your platform. Everyone is excited about this new ‘banana’ phenomenon, and banana uploads skyrocket. This poses a new problem: the descriptions are not accurate enough to reconstruct different banana videos and capture all of their uniqueness. In order to capitalize on the invention of viral banana peeling videos, you improve the compression algorithm further: the description now reads “the man grabs the stem at the top 4 mm down, holds it for 1 second, snaps it downwards to a 45 degree angle over the course of two seconds. he then grabs the bar and over the course of the next four seconds, pulls the flap of skin down to 40% of the way from the top. the process is repeated twice with the remaining skin”. With these new fine-grained descriptions, users are now able to reconstruct videos with precision and enjoy many distinct banana-peeling experiences.

See what’s happening here? The best way to compress the banana video also generates instructions on how to peel bananas. In a nutshell, that is the link between video compression and robotics: compact ways to compress video have emergent features that correspond to real-world, physical actions.

Astute readers will have observed something else: it is surprisingly difficult to explain how to peel a banana. English is a great way to tell a story, but isn’t the densest or clearest way to describe spatiotemporal features. That’s where the learning comes in: rather than fixing the vocabulary as “all of the words in the English language”, we can learn the “best” vocabulary to represent objects and actions.

Enough about bananas, let’s reduce this to implementation.

Putting it All Together

*stock VQ-VAE image, needs updating

Here’s how we make this all happen in practice. We start with two frames from a video, which we will call “Before” and “After”. Before and After are sent through a black box, the encoder, which outputs a list of feature vectors. These feature vectors are quantized: there is a fixed vocabulary of vectors (the codebook) and we replace each feature with the nearest vector in the codebook. The codebook changes as training progresses, but is shared across all of the training data. The quantization process creates an information bottleneck that forces the encoder to learn a compact representation.

Then, we add a little noise to the quantized vectors to encourage variational stability (actually, the feature vectors represent the mean and standard deviation of a Gaussian distribution from which a random sample is drawn). The resulting noisy features, along with the Before image, are sent through a decoder, which tries to use them to generate the After image. The training process tries to match the real After image and the decoded one as well as possible.

Finally, the feature vectors from the encoder are used as a motion representation for further model training. Because motion extraction looks like video compression, the features we’ve learned from this process end up matching pretty well with real, manually labeled motion collected from real robots, allowing us to scale up training by training on human videos, not just robot videos, which are scarce and very costly to collect.

And that’s it! Hopefully the article was informative and you now know a bit more about features and representations, which are the heart of modern machine learning. Next up, we’ll take a step back, break down the overall architecture of Gr00t-N1, and look at some of the inspiration behind the model.

View full post