Google DeepMind’s representations4d bundles three transformer-based approaches to self-supervised video learning, including a novel object-centric tracking method whose latent tokens move off the pixel grid.
NAS3R estimates 3D geometry and camera parameters in a self-supervised way, without ground-truth annotations, using Gaussian splatting and a VGGT backbone. It supports multi-view setups and optional initialization from pretrained weights.
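
To make "self-supervised" concrete: in splatting-based pipelines of this kind, the training signal typically comes from comparing a rendered view against the observed image rather than against labeled geometry or poses. The sketch below illustrates that idea only. It is not NAS3R's code, the function name `photometric_loss` is hypothetical, and images are represented as flat lists of pixel intensities for simplicity.

```python
from typing import Sequence


def photometric_loss(rendered: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between a rendered view and the observed image.

    Hypothetical helper for illustration: the observed image itself acts as
    the supervision signal, so no ground-truth geometry or pose is needed.
    """
    if len(rendered) != len(target):
        raise ValueError("rendered and target must have the same length")
    if len(target) == 0:
        raise ValueError("images must not be empty")
    return sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(target)


# Toy pixel intensities in [0, 1]; a real pipeline would use full image tensors.
target = [0.2, 0.5, 0.9, 0.1]
exact = list(target)                            # a perfect rendering
shifted = [min(1.0, v + 0.1) for v in target]   # a rendering that is slightly off

loss_exact = photometric_loss(exact, target)
loss_shifted = photometric_loss(shifted, target)
```

A perfect rendering gives zero loss, and any mismatch gives a positive loss. In an actual system the gradient of such a loss would flow back through a differentiable renderer to the predicted Gaussians and camera parameters.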