SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
* indicates equal contribution; ✉ indicates corresponding author.
Affiliations: 1University of Amsterdam, 2ETH Zürich, 3INSAIT, Sofia University “St. Kliment Ohridski”, 4Nanjing University of Aeronautics and Astronautics, 5University of Pisa, 6University of Trento
Motivation
Our project originates from the following observations:
- 3D Gaussian Splatting provides a compact formulation for 2D synthesis, while the underlying Gaussians softly encode rich 3D structure (positions, scales, and opacities). This makes it a unique candidate for vision-language pretraining, as it naturally fuses geometry and appearance cues.
- Existing language Gaussian splatting methods all rely on per-scene processing, which limits their scalability and generalization. A method that directly processes 3D Gaussians end-to-end is still lacking.
To this end, we introduce SceneSplat, a generalizable, open-vocabulary 3DGS encoder that operates natively on 3DGS. Powered by our SceneSplat-7K dataset, the vision-language pretraining and the self-supervised training scheme unlock rich 3DGS semantic learning.
Dataset
Metric | ScanNet | ScanNet++ | ScanNet++ v2 | Replica | Hypersim | 3RScan | ARKitScenes | Matterport3D | SceneSplat-7K |
---|---|---|---|---|---|---|---|---|---|
Raw Scenes | 1613 | 380 | 1006 | 8 | 461 | 1482 (scans) | 1970 | 2194 (regions) | 9114 |
GS Scenes | 1613 | 330 | 956 | 8 | 448 | 632 (scans) | 1947 | 1982 (regions) | 7916 |
RGB Frames | 2.5 M | 228 K | 1.1 M | 16 K | 77 K | 156 K | 450 K | 194 K | 4.72 M |
Storage | 600 GB | 152 GB | 447 GB | 7 GB | 251 GB | 235 GB | 577 GB | 492 GB | 2.76 TB |
PSNR | 29.07 | 29.49 | 29.11 | 41.25 | 25.93 | 27.46 | 29.18 | 32.34 | 29.64 |
Depth Loss | 0.031 | 0.019 | 0.015 | 0.002 | 0.228 | 0.018 | 0.0131 | 0.033 | 0.035 |
SSIM | 0.869 | 0.924 | 0.933 | 0.980 | 0.894 | 0.881 | 0.885 | 0.916 | 0.897 |
LPIPS | 0.236 | 0.133 | 0.116 | 0.0396 | 0.157 | 0.335 | 0.294 | 0.145 | 0.212 |
GS per scene | 1.50 M | 1.56 M | 1.89 M | 1.50 M | 2.84 M | 1.50 M | 1.19 M | 1.0 M | 1.42 M |
Total GS | 2419.5 M | 513.4 M | 1810.3 M | 12.0 M | 1237.5 M | 948.0 M | 2316.9 M | 1982 M | 11.27 B |
GPU Time (L4) | 593 h | 177 h | 594 h | 4 h | 176 h | 576 h | 811 h | 661 h | 3592 h |
SceneSplat-7K generates 3D Gaussian Splatting scenes from ScanNet, ScanNet++, Replica, Hypersim, 3RScan, ARKitScenes and Matterport3D. In total, it contains 7916 scenes and 11.27 billion Gaussians. Constructing it required about 150 GPU-days (3592 hours on NVIDIA L4), and it attains an average reconstruction fidelity of 29.64 dB PSNR and a depth_l1 loss of 0.035 m.
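The aggregate figures can be cross-checked against the per-dataset columns of the table. The following is a minimal sketch in Python, with the values copied from the "GS Scenes" and "GPU Time (L4)" rows; the variable names are illustrative and not part of any released code.

```python
# Per-dataset values copied from the "GS Scenes" and "GPU Time (L4)" rows.
gs_scenes = {
    "ScanNet": 1613, "ScanNet++": 330, "ScanNet++ v2": 956, "Replica": 8,
    "Hypersim": 448, "3RScan": 632, "ARKitScenes": 1947, "Matterport3D": 1982,
}
gpu_hours_l4 = {
    "ScanNet": 593, "ScanNet++": 177, "ScanNet++ v2": 594, "Replica": 4,
    "Hypersim": 176, "3RScan": 576, "ARKitScenes": 811, "Matterport3D": 661,
}

total_scenes = sum(gs_scenes.values())        # number of 3DGS scenes
total_gpu_hours = sum(gpu_hours_l4.values())  # L4 GPU hours
gpu_days = total_gpu_hours / 24               # hours -> days

print(total_scenes, total_gpu_hours, round(gpu_days, 1))
# prints: 7916 3592 149.7
```

The 3592 GPU-hours correspond to roughly 149.7 days, which is the "about 150 GPU-days" quoted in the text.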
Framework

For vision-language pretraining, we associate each 3D Gaussian primitive with semantic features based on our label collection process and train a generalizable open-vocabulary learner that predicts per-Gaussian embeddings. For self-supervised pretraining, we employ Masked Gaussian Modeling to reconstruct masked primitives, Self-Distillation Learning for augmentation-invariant features, and Language-Gaussian Alignment for scenes with collected labels.
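To illustrate how per-Gaussian embeddings can be used for open-vocabulary labeling, here is a minimal sketch in plain Python. It assumes that each Gaussian already has a predicted embedding and that each candidate label has a text embedding in the same space; the cosine-similarity argmax rule, the function names, and the toy 3-dimensional vectors are assumptions made for illustration and are not taken from the SceneSplat implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def query_gaussians(gaussian_embeddings, text_embeddings):
    """Assign each Gaussian the label whose text embedding is most similar.

    Hypothetical helper: the actual model and text encoder are not shown.
    """
    labels = []
    for g in gaussian_embeddings:
        scores = {name: cosine_similarity(g, t)
                  for name, t in text_embeddings.items()}
        labels.append(max(scores, key=scores.get))
    return labels


# Toy data standing in for real text and per-Gaussian embeddings.
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "wall": [0.0, 0.0, 1.0],
}
gaussian_embeddings = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.1, 0.8],
]

predicted = query_gaussians(gaussian_embeddings, text_embeddings)
print(predicted)  # prints: ['chair', 'table', 'wall']
```

In practice the embeddings would be high-dimensional and the comparison would be batched on a GPU; the sketch only shows the matching step.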
Qualitative Results
SceneSplat Inference




Text-based Scene Query


Additional Segmentation Results

Bibtex
@article{li2025scenesplat,
  title   = {SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining},
  author  = {Li, Yue and Ma, Qi and Yang, Runyi and Li, Huapeng and Ma, Mengjiao and Ren, Bin and Popovic, Nikola and Sebe, Nicu and Konukoglu, Ender and Gevers, Theo and others},
  journal = {arXiv preprint arXiv:2503.18052},
  year    = {2025}
}
Acknowledgement
We sincerely thank all the author teams of the original datasets for their contributions and for making their data publicly available. Our 3DGS scenes are optimized using the gsplat repository.