SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

Authors: *Yue Li1, *Qi Ma2,3, *Runyi Yang3, *Huapeng Li2, *Mengjiao Ma3,4, Bin Ren3,5,6, Nikola Popovic3, Nicu Sebe6, Ender Konukoglu2, Theo Gevers1, Luc Van Gool2,3, Martin R. Oswald1, Danda Pani Paudel3

*Indicates equal contribution, and indicates corresponding author.

Affiliations: 1University of Amsterdam, 2ETH Zürich, 3INSAIT, 4Nanjing University of Aeronautics and Astronautics, 5University of Pisa, 6University of Trento

Links: arXiv | GitHub | SceneSplat-7K

Motivation

Our project originates from the following observations:
  • 3D Gaussian Splatting provides a compact formulation for 2D synthesis, while the underlying Gaussians softly encode rich 3D structures (positions, scales, and opacities). This makes it a unique candidate for vision-language pretraining, as it naturally fuses geometry and appearance cues.
  • Existing language Gaussian Splatting methods all rely on per-scene processing, which limits their scalability and generalization. No existing method processes 3D Gaussians directly and end-to-end.

To this end, we introduce SceneSplat, a generalizable, open-vocabulary encoder that operates natively on 3DGS. Powered by our SceneSplat-7K dataset, the vision-language pretraining and the self-supervised training scheme unlock rich 3DGS semantic learning.

Dataset

Metric         | ScanNet  | ScanNet++ | ScanNet++ v2 | Replica | Hypersim | 3RScan       | ARKitScenes | Matterport3D   | SceneSplat-7K
---------------+----------+-----------+--------------+---------+----------+--------------+-------------+----------------+--------------
Raw Scenes     | 1613     | 380       | 1006         | 8       | 461      | 1482 (scans) | 1970        | 2194 (regions) | 9114
GS Scenes      | 1613     | 330       | 956          | 8       | 448      | 632 (scans)  | 1947        | 1982 (regions) | 7916
RGB Frames     | 2.5 M    | 228 K     | 1.1 M        | 16 K    | 77 K     | 156 K        | 450 K       | 194 K          | 4.72 M
Storage        | 600 GB   | 152 GB    | 447 GB       | 7 GB    | 251 GB   | 235 GB       | 577 GB      | 492 GB         | 2.76 TB
PSNR (dB)      | 29.07    | 29.49     | 29.11        | 41.25   | 25.93    | 27.46        | 29.18       | 32.34          | 29.64
Depth Loss (m) | 0.031    | 0.019     | 0.015        | 0.002   | 0.228    | 0.018        | 0.0131      | 0.033          | 0.035
SSIM           | 0.869    | 0.924     | 0.933        | 0.980   | 0.894    | 0.881        | 0.885       | 0.916          | 0.897
LPIPS          | 0.236    | 0.133     | 0.116        | 0.0396  | 0.157    | 0.335        | 0.294       | 0.145          | 0.212
GS per scene   | 1.50 M   | 1.56 M    | 1.89 M       | 1.50 M  | 2.84 M   | 1.50 M       | 1.19 M      | 1.0 M          | 1.42 M
Total GS       | 2419.5 M | 513.4 M   | 1810.3 M     | 12.0 M  | 1237.5 M | 948.0 M      | 2316.9 M    | 1982 M         | 11.27 B
GPU Time (L4)  | 593 h    | 177 h     | 594 h        | 4 h     | 176 h    | 576 h        | 811 h       | 661 h          | 3592 h

SceneSplat-7K comprises 3D Gaussian Splatting scenes reconstructed from ScanNet, ScanNet++, Replica, Hypersim, 3RScan, ARKitScenes and Matterport3D. In total it contains 7916 scenes and 11.27 billion Gaussians and took about 150 GPU-days (NVIDIA L4) to construct. The scenes attain an average reconstruction quality of 29.64 dB PSNR and a depth L1 loss of 0.035 m.
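
As a quick sanity check, the headline totals can be re-derived from the per-dataset rows of the table. The short Python sketch below only re-adds the published numbers; the dictionary and variable names are illustrative and not part of any released tooling.

```python
# Per-dataset figures copied from the table: number of 3DGS scenes and
# optimization time in GPU hours on an NVIDIA L4.
GS_SCENES = {
    "ScanNet": 1613, "ScanNet++": 330, "ScanNet++ v2": 956, "Replica": 8,
    "Hypersim": 448, "3RScan": 632, "ARKitScenes": 1947, "Matterport3D": 1982,
}
GPU_HOURS = {
    "ScanNet": 593, "ScanNet++": 177, "ScanNet++ v2": 594, "Replica": 4,
    "Hypersim": 176, "3RScan": 576, "ARKitScenes": 811, "Matterport3D": 661,
}
TOTAL_GAUSSIANS = 11.27e9  # reported total for SceneSplat-7K

total_scenes = sum(GS_SCENES.values())
total_gpu_hours = sum(GPU_HOURS.values())
gpu_days = total_gpu_hours / 24
avg_gaussians_millions = TOTAL_GAUSSIANS / total_scenes / 1e6

print(total_scenes)                      # 7916
print(total_gpu_hours)                   # 3592
print(round(gpu_days))                   # 150
print(round(avg_gaussians_millions, 2))  # 1.42
```

The 3592 GPU hours correspond to roughly 150 GPU-days, and dividing the reported 11.27 billion Gaussians by 7916 scenes gives the 1.42 M average per scene listed in the last column.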

Framework

Framework Figure

For vision-language pretraining, we associate each 3D Gaussian primitive with semantic features based on our label collection process and train a generalizable open-vocabulary learner that predicts per-Gaussian embeddings. For self-supervised pretraining, we employ Masked Gaussian Modeling to reconstruct masked primitives, Self-Distillation Learning for augmentation-invariant features, and Language-Gaussian Alignment for scenes with collected labels.
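
To make the idea of per-Gaussian embeddings concrete, here is a minimal sketch of how such embeddings could be turned into open-vocabulary labels: each Gaussian is assigned the class whose text embedding has the highest cosine similarity. This is a common recipe for open-vocabulary models and is an assumption here, not a description of SceneSplat's actual inference code. The function names are hypothetical, and the toy 3-dimensional vectors stand in for the outputs of the encoder and of a text encoder.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_scores(gaussian_embeddings, text_embedding):
    """Cosine similarity between every Gaussian embedding and one text embedding."""
    t = normalize(text_embedding)
    return [sum(a * b for a, b in zip(normalize(g), t)) for g in gaussian_embeddings]


def zero_shot_labels(gaussian_embeddings, text_embeddings, class_names):
    """Assign each Gaussian the class name with the most similar text embedding."""
    per_class = [cosine_scores(gaussian_embeddings, t) for t in text_embeddings]
    labels = []
    for i in range(len(gaussian_embeddings)):
        best = max(range(len(class_names)), key=lambda c: per_class[c][i])
        labels.append(class_names[best])
    return labels


# Toy stand-ins (hypothetical values): three Gaussians and three class prompts.
gaussian_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
class_names = ["chair", "table", "wall"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

labels = zero_shot_labels(gaussian_embeddings, text_embeddings, class_names)
print(labels)  # ['chair', 'table', 'wall']
```

The same `cosine_scores` helper could serve a free-form text query: embed the query words, score every Gaussian, and highlight the highest-scoring ones.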

Qualitative Results

SceneSplat Inference

Panels: 3DGS Render | Inference (PCA) | Zero-shot Sem. Seg. | Sem. Seg. GT

Text-based Scene Query

Query words: vacation / art
Query words: toy / add flavour

Additional Segmentation Results


BibTeX

@article{li2025scenesplat,
  title  = {SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining},
  author = {Li, Yue and Ma, Qi and Yang, Runyi and Li, Huapeng and Ma, Mengjiao and Ren, Bin and Popovic, Nikola and Sebe, Nicu and Konukoglu, Ender and Gevers, Theo and others},
  journal= {arXiv preprint arXiv:2503.18052},
  year   = {2025}
}

Acknowledgement

We sincerely thank all the author teams of the original datasets for their contributions and for making their data publicly available. Our 3DGS scenes are optimized using the gsplat repository.