Noureddine RAMDI / Action100M: Hierarchical Tree-of-Captions for Multi-Scale Video Understanding

Created Tue, 05 May 2026 13:37:39 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

facebookresearch/Action100M

Action recognition datasets have evolved beyond flat labels to richer, multi-scale annotations that reflect the complexity of real-world activities. Action100M follows this trend by providing a massive video dataset with hierarchical, temporally segmented captions that capture actions at different levels of detail. This hierarchical Tree-of-Captions enables training models that understand not just what happens in a video, but how actions relate across time and granularity.

What Action100M offers: hierarchical captions for large-scale video action data

Action100M is a large-scale video action dataset developed by Meta FAIR (Facebook AI Research). Its defining feature is a Tree-of-Captions annotation structure applied over video segments at multiple temporal scales. Instead of a single caption per video or uniform clips, the dataset segments videos hierarchically and annotates each segment with captions that reflect the action content at that level.

Technically, the dataset uses a multi-level temporal segmentation approach. Each video is split into parent segments and child segments forming a tree hierarchy. Captions are generated at each node of this tree, allowing the representation of actions from coarse (whole video or large segments) to fine-grained (short action steps).
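To make the parent/child structure concrete, here is a minimal sketch of how flat, time-stamped segments could be nested into a tree by temporal containment. The `SegmentNode` class, its field names, and the `build_tree` helper are hypothetical illustrations, not the dataset's actual schema or the authors' segmentation method.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SegmentNode:
    """One node in a hypothetical Tree-of-Captions (field names are assumptions)."""
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    caption: str   # caption describing this segment
    children: List["SegmentNode"] = field(default_factory=list)


def build_tree(segments: List[SegmentNode]) -> Optional[SegmentNode]:
    """Nest flat segments by temporal containment and return the root node."""
    # Sort by start time, longest first on ties, so parents precede children.
    ordered = sorted(segments, key=lambda s: (s.start, -(s.end - s.start)))
    root: Optional[SegmentNode] = None
    stack: List[SegmentNode] = []  # chain of currently "open" ancestors
    for seg in ordered:
        # Drop ancestors whose interval does not contain the current segment.
        while stack and not (stack[-1].start <= seg.start and seg.end <= stack[-1].end):
            stack.pop()
        if stack:
            stack[-1].children.append(seg)
        elif root is None:
            root = seg
        else:
            raise ValueError("segments do not share a single root")
        stack.append(seg)
    return root


if __name__ == "__main__":
    flat = [
        SegmentNode(25, 40, "add pasta"),
        SegmentNode(0, 60, "make pasta"),
        SegmentNode(25, 60, "cook and serve"),
        SegmentNode(0, 25, "boil water"),
    ]
    tree = build_tree(flat)
    print(tree.caption, [child.caption for child in tree.children])
```

The root covers the whole video, and each level below it describes shorter spans, which mirrors the coarse-to-fine reading of the hierarchy described above.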

Annotations are generated using large language and vision-language models rather than human annotators. The dataset includes captions from PLM-3B, detailed middle-frame captions from Llama-3.2-Vision-11B, and GPT-generated summaries that provide structured descriptions including actions, actors, and instructions. This layered annotation enriches the video data with semantic detail at multiple temporal resolutions.

The full dataset is described as containing around 100 million annotations. The publicly accessible preview (about 10% of it) is hosted on HuggingFace in parquet format with streaming support, which makes it practical to explore and experiment with the data without downloading the entire corpus.

The project focuses on data curation and annotation rather than on a modeling framework. However, the hierarchical captions enable research into multi-scale video understanding tasks, such as coarse video summarization, step identification, and action recognition, within a unified framework.

Technical strengths and tradeoffs: hierarchical temporal segmentation and LLM-generated captions

The key technical strength of Action100M is its hierarchical Tree-of-Captions concept. This approach captures the nested structure of human activities, where high-level activities decompose into sequences of smaller action units.

From a technical perspective, representing video annotations as a tree rather than flat labels or isolated captions introduces complexity in data handling and model design. Models trained on this data need to understand temporal relationships and hierarchical dependencies, which is more challenging but also more expressive.
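One way to picture the extra data handling is a traversal that groups captions by their depth in the tree, so that coarse and fine descriptions can be fed to a model separately. The sketch below assumes a nested-dict layout with hypothetical `caption` and `children` keys; the real sample layout may differ.

```python
from collections import defaultdict
from typing import Dict, List


def captions_by_depth(root: dict) -> Dict[int, List[str]]:
    """Collect captions per tree level using an iterative depth-first walk."""
    levels: Dict[int, List[str]] = defaultdict(list)
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        levels[depth].append(node["caption"])
        # Push children in reverse so they are visited in their original order.
        for child in reversed(node.get("children", [])):
            stack.append((child, depth + 1))
    return dict(levels)


if __name__ == "__main__":
    # Hypothetical tree: depth 0 is the whole video, deeper levels are finer steps.
    tree = {
        "caption": "make pasta",
        "children": [
            {"caption": "boil water", "children": []},
            {"caption": "cook and serve", "children": [{"caption": "add pasta"}]},
        ],
    }
    for depth, captions in sorted(captions_by_depth(tree).items()):
        print(depth, captions)
```

Depth 0 holds the coarsest description and larger depths hold progressively finer steps, which is the hierarchical dependency a model would need to respect.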

The use of large language models to generate captions at multiple levels provides rich semantic content. However, this also comes with tradeoffs:

  • LLM-generated captions may reflect biases or inconsistencies inherent in the models.
  • The quality of annotations depends on the prompt engineering and the LLM’s understanding of video frames.
  • The hierarchical segmentation requires more complex data structures and processing pipelines.
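Because the captions and segment boundaries are machine-generated, a light sanity pass before training may be worthwhile. The following is a minimal sketch, assuming flat records with hypothetical `id`, `parent_id`, `start`, `end`, and `caption` keys; it flags empty captions, non-positive durations, and children that spill outside their parent's time span.

```python
from typing import Dict, List


def find_annotation_issues(records: List[dict]) -> List[str]:
    """Return human-readable descriptions of structural problems in the records."""
    by_id: Dict[str, dict] = {r["id"]: r for r in records}
    issues: List[str] = []
    for r in records:
        if r["end"] <= r["start"]:
            issues.append(f"{r['id']}: non-positive duration")
        if not r["caption"].strip():
            issues.append(f"{r['id']}: empty caption")
        parent_id = r.get("parent_id")
        if parent_id is None:
            continue  # root segments have no parent to check against
        parent = by_id.get(parent_id)
        if parent is None:
            issues.append(f"{r['id']}: unknown parent {parent_id}")
        elif r["start"] < parent["start"] or r["end"] > parent["end"]:
            issues.append(f"{r['id']}: extends outside parent {parent_id}")
    return issues


if __name__ == "__main__":
    demo = [
        {"id": "a", "parent_id": None, "start": 0, "end": 10, "caption": "cook a meal"},
        {"id": "b", "parent_id": "a", "start": 5, "end": 12, "caption": "chop onions"},
        {"id": "c", "parent_id": "a", "start": 1, "end": 2, "caption": "  "},
    ]
    for issue in find_annotation_issues(demo):
        print(issue)
```

Checks like these only catch structural problems. Judging whether a caption is biased or factually wrong still needs manual review or a separate evaluation step.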

Tooling around the dataset appears focused on data loading and exploration. The dataset is provided in parquet format, a columnar format that is efficient for large-scale data processing and compatible with popular Python data science tools.

Streaming support through HuggingFace’s datasets library improves scalability and developer experience by allowing users to work with the dataset without local storage bottlenecks.
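The practical benefit of streaming is that only the samples you ask for are produced. A minimal sketch of that pattern, using the standard library's `itertools.islice` on any iterable of samples, is shown below; the `peek` helper and the stand-in generator are hypothetical, and a streamed split should be usable in place of the generator since it is also iterable.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def peek(stream: Iterable[dict], n: int = 3) -> List[dict]:
    """Materialize only the first n samples of a (possibly huge) stream."""
    return list(islice(stream, n))


def fake_stream(total: int = 1_000_000) -> Iterator[dict]:
    """Stand-in for a streamed dataset split; yields samples lazily."""
    for idx in range(total):
        yield {"idx": idx}


if __name__ == "__main__":
    # Only three samples are generated, not one million.
    print(peek(fake_stream(), 3))
```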

Overall, the repo concentrates on providing a robust, semantically rich dataset for video action understanding research rather than end-to-end modeling or training code.

How to get started exploring Action100M

Accessing the Action100M preview dataset is straightforward using the HuggingFace datasets library. The streaming capability means you can iterate over samples without downloading the full dataset upfront.

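Here’s the quickstart snippet from the repo, with comments added:

from datasets import load_dataset

# Stream the preview's parquet shards instead of downloading them first.
dataset = load_dataset(
    "parquet",
    data_files="hf://datasets/facebook/Action100M-preview/data/*.parquet",
    streaming=True,
)
# With no explicit split given, the files are exposed under the "train" key.
it = iter(dataset["train"])

# Fetch the first sample lazily.
sample = next(it)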

This snippet opens the dataset in streaming mode, creates an iterator over the train split, and fetches one sample. Each sample includes hierarchical caption data tied to specific video segments.
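Before writing any processing code, it helps to look at the shape of one sample. The sketch below is a small, generic helper that summarizes the nesting and value types of a dict-like sample; the `describe` function and the example keys are hypothetical, and the actual field names should be taken from the dataset card.

```python
from typing import Any


def describe(value: Any, depth: int = 0, max_depth: int = 3) -> Any:
    """Summarize a nested sample as type names, descending at most max_depth levels."""
    if depth >= max_depth:
        return type(value).__name__
    if isinstance(value, dict):
        return {k: describe(v, depth + 1, max_depth) for k, v in value.items()}
    if isinstance(value, list):
        if not value:
            return "list[empty]"
        # Assume list elements share a layout; inspect only the first one.
        return [describe(value[0], depth + 1, max_depth)]
    return type(value).__name__


if __name__ == "__main__":
    # Hypothetical sample layout, used only to demonstrate the helper.
    sample = {
        "video_uid": "abc",
        "nodes": [{"start": 0.0, "end": 1.5, "caption": "pick up the knife"}],
    }
    print(describe(sample))
```

Calling `describe(sample)` on the object returned by `next(it)` would print a compact outline of its fields without dumping the full captions.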

Practitioners can build on this to train multi-scale video understanding models or perform analysis on the hierarchical annotations. Since the dataset is large and complex, streaming is essential for practical experimentation.

Beyond the code snippet, the repo’s README and dataset card provide useful context on annotation schema, segment hierarchy, and caption types.

Verdict: a resource for multi-scale video understanding research with hierarchical semantic annotations

Action100M is a specialized dataset aimed at advancing video action understanding by providing hierarchical, richly annotated video segments. Its scale and structured temporal segmentation are rare in public datasets, positioning it well for multi-scale learning approaches.

It’s not a plug-and-play model repo but a data resource with clear technical strengths and some complexity. The reliance on LLM-generated annotations is a double-edged sword: it enriches the data but requires scrutiny regarding annotation quality and biases.

If you’re working on video understanding models that need to handle hierarchical temporal structures or want to experiment with LLM-augmented video annotations, Action100M is worth exploring. Its streaming support lowers the barrier to entry for large-scale experiments.

For those new to video datasets or looking for simpler action recognition corpora, this may be overkill. The technical complexity and dataset size suggest it’s best suited for research teams or engineers comfortable with large-scale data processing and multi-level annotations.


→ GitHub Repo: facebookresearch/Action100M ⭐ 461 · Python