SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
*Indicates equal contribution, and ✉ indicates corresponding author.
Affiliations: ¹University of Amsterdam, ²ETH Zürich, ³INSAIT, ⁴Nanjing University of Aeronautics and Astronautics, ⁵University of Pisa, ⁶University of Trento
Motivation
Our project originates from the following observations:
- 3D Gaussian Splatting provides a compact formulation for 2D synthesis, while the underlying Gaussians softly encode rich 3D structures (positions, scales, and opacities). This makes it a unique candidate for vision-language pretraining, as it naturally fuses geometry and appearance cues.
- Existing language Gaussian Splatting methods all rely on per-scene processing, which limits their scalability and generalization. No existing method processes 3D Gaussians directly end-to-end.
To this end, we introduce SceneSplat, a generalizable, open-vocabulary 3DGS encoder that operates natively on 3DGS. Powered by our SceneSplat-7K dataset, the vision-language pretraining and the self-supervised training scheme unlock rich 3DGS semantic learning.
Dataset
Metric | ScanNet | ScanNet++ | ScanNet++ v2 | Replica | Hypersim | 3RScan | ARKitScenes | Matterport3D | SceneSplat-7K |
---|---|---|---|---|---|---|---|---|---|
Raw Scenes | 1613 | 380 | 1006 | 8 | 461 | 1482 (scans) | 1970 | 2194 (regions) | 9114 |
GS Scenes | 1613 | 330 | 956 | 8 | 448 | 632 (scans) | 1947 | 1982 (regions) | 7916 |
RGB Frames | 2.5 M | 228 K | 1.1 M | 16 K | 77 K | 156 K | 450 K | 194 K | 4.72 M |
Storage | 600 GB | 152 GB | 447 GB | 7 GB | 251 GB | 235 GB | 577 GB | 492 GB | 2.76 TB |
PSNR | 29.07 | 29.49 | 29.11 | 41.25 | 25.93 | 27.46 | 29.18 | 32.34 | 29.64 |
Depth Loss | 0.031 | 0.019 | 0.015 | 0.002 | 0.228 | 0.018 | 0.0131 | 0.033 | 0.035 |
SSIM | 0.869 | 0.924 | 0.933 | 0.980 | 0.894 | 0.881 | 0.885 | 0.916 | 0.897 |
LPIPS | 0.236 | 0.133 | 0.116 | 0.0396 | 0.157 | 0.335 | 0.294 | 0.145 | 0.212 |
GS per scene | 1.50 M | 1.56 M | 1.89 M | 1.50 M | 2.84 M | 1.50 M | 1.19 M | 1.0 M | 1.42 M |
Total GS | 2419.5 M | 513.4 M | 1810.3 M | 12.0 M | 1237.5 M | 948.0 M | 2316.9 M | 1982 M | 11.27 B |
GPU Time (L4) | 593 h | 177 h | 594 h | 4 h | 176 h | 576 h | 811 h | 661 h | 3592 h |
SceneSplat-7K provides 3D Gaussian Splatting scenes generated from ScanNet, ScanNet++, Replica, Hypersim, 3RScan, ARKitScenes and Matterport3D. In total, it contains 7916 scenes and 11.27 billion Gaussians. Constructing it required about 150 GPU-days (3592 GPU-hours on NVIDIA L4), and the scenes attain an average reconstruction fidelity of 29.64 dB PSNR and a depth_l1 loss of 0.035 m.
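The aggregate figures quoted above can be cross-checked against the per-dataset columns of the table. The short sketch below only re-adds numbers copied from the table; the variable names are illustrative and not part of any released tooling.

```python
# Per-dataset values copied from the table above (GS Scenes and GPU Time rows).
gs_scenes = {
    "ScanNet": 1613, "ScanNet++": 330, "ScanNet++ v2": 956, "Replica": 8,
    "Hypersim": 448, "3RScan": 632, "ARKitScenes": 1947, "Matterport3D": 1982,
}
gpu_hours = {
    "ScanNet": 593, "ScanNet++": 177, "ScanNet++ v2": 594, "Replica": 4,
    "Hypersim": 176, "3RScan": 576, "ARKitScenes": 811, "Matterport3D": 661,
}

total_scenes = sum(gs_scenes.values())        # SceneSplat-7K scene count
total_gpu_hours = sum(gpu_hours.values())     # L4 GPU-hours for the whole dataset
gpu_days = total_gpu_hours / 24               # hours -> days

# Average Gaussians per scene, from the reported 11.27 B total.
avg_gs_millions = 11.27e9 / total_scenes / 1e6

print(total_scenes, total_gpu_hours, round(gpu_days, 1), round(avg_gs_millions, 2))
```

The sums reproduce the SceneSplat-7K column (7916 scenes, 3592 GPU-hours), and 3592 h / 24 is where the "about 150 GPU-days" figure comes from.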
Framework

For vision-language pretraining, we associate each 3D Gaussian primitive with semantic features based on our label collection process and train a generalizable open-vocabulary learner that predicts per-Gaussian embeddings. For self-supervised pretraining, we employ Masked Gaussian Modeling to reconstruct masked primitives, Self-Distillation Learning for augmentation-invariant features, and Language-Gaussian Alignment for scenes with collected labels.
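To make the idea of reconstructing masked primitives concrete, here is a minimal sketch of a generic masked-modeling step applied to a list of Gaussians. It is not the SceneSplat implementation: the 7-value attribute layout (position, scale, opacity), the 50% mask ratio, the all-zero mask token, and the squared-error loss are all assumptions chosen for illustration.

```python
import random

def mask_gaussians(gaussians, mask_ratio=0.5, seed=0):
    """Replace a random subset of primitives with a mask token.

    Returns the corrupted scene and the sorted indices that were masked.
    The all-zero mask token is a hypothetical choice.
    """
    rng = random.Random(seed)
    n_masked = int(len(gaussians) * mask_ratio)
    masked = sorted(rng.sample(range(len(gaussians)), n_masked))
    masked_set = set(masked)
    mask_token = [0.0] * len(gaussians[0])
    corrupted = [
        list(mask_token) if i in masked_set else list(g)
        for i, g in enumerate(gaussians)
    ]
    return corrupted, masked

def masked_reconstruction_loss(pred, target, masked):
    """Mean squared error computed over the masked primitives only."""
    errors = [
        (p - t) ** 2
        for i in masked
        for p, t in zip(pred[i], target[i])
    ]
    return sum(errors) / len(errors) if errors else 0.0

# Toy scene: each Gaussian is [x, y, z, sx, sy, sz, opacity] (assumed layout).
scene = [
    [0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.9],
    [1.0, 0.0, 0.0, 0.2, 0.1, 0.1, 0.8],
    [0.0, 1.0, 0.0, 0.1, 0.2, 0.1, 0.7],
    [0.0, 0.0, 1.0, 0.1, 0.1, 0.2, 0.6],
]
corrupted, masked = mask_gaussians(scene, mask_ratio=0.5)
# An encoder-decoder would map `corrupted` to a prediction; scoring the corrupted
# input directly shows the loss an untrained identity model would incur.
baseline_loss = masked_reconstruction_loss(corrupted, scene, masked)
```

Restricting the loss to the masked indices is the usual design choice in masked modeling: the model gets no credit for copying visible primitives and has to infer the hidden ones from context.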
Qualitative Results
SceneSplat Inference




Text-based Scene Query
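One common way to run a text query against per-Gaussian embeddings is to compare each embedding with the text embedding by cosine similarity and keep the Gaussians above a threshold. The sketch below assumes that setup; the function names, the 3-dimensional toy embeddings, and the 0.5 threshold are hypothetical, and in practice the embeddings would come from the pretrained encoder and a matching text encoder.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def query_gaussians(gaussian_embeddings, text_embedding, threshold=0.5):
    """Return indices of Gaussians whose embedding matches the text query."""
    return [
        i
        for i, emb in enumerate(gaussian_embeddings)
        if cosine_similarity(emb, text_embedding) >= threshold
    ]

# Toy per-Gaussian embeddings and a toy text embedding (illustrative values only).
gaussian_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
text_embedding = [1.0, 0.0, 0.0]

selected = query_gaussians(gaussian_embeddings, text_embedding)
```

The selected indices can then be highlighted or rendered on their own to visualize the query result.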


Additional Segmentation Results

BibTeX
@article{li2025scenesplat,
  title   = {SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining},
  author  = {Li, Yue and Ma, Qi and Yang, Runyi and Li, Huapeng and Ma, Mengjiao and Ren, Bin and Popovic, Nikola and Sebe, Nicu and Konukoglu, Ender and Gevers, Theo and others},
  journal = {arXiv preprint arXiv:2503.18052},
  year    = {2025}
}
Acknowledgement
We sincerely thank all the author teams of the original datasets for their contributions and for making their data publicly available. Our 3DGS scenes are optimized using the gsplat repository.