SceneSplat

SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

ICCV 2025 Oral

Authors: Yue Li*¹, Qi Ma*²,³, Runyi Yang³, Huapeng Li², Mengjiao Ma³,⁴, Bin Ren✉³,⁵,⁶, Nikola Popovic³, Nicu Sebe⁶, Ender Konukoglu², Theo Gevers¹, Luc Van Gool²,³, Martin R. Oswald¹, Danda Pani Paudel³

* indicates equal contribution; ✉ indicates the corresponding author.

Affiliations: ¹University of Amsterdam, ²ETH Zürich, ³INSAIT, Sofia University “St. Kliment Ohridski”, ⁴Nanjing University of Aeronautics and Astronautics, ⁵University of Pisa, ⁶University of Trento

Motivation

Our project originates from the following observations:

3D Gaussian Splatting provides a compact formulation for 2D synthesis, while the underlying Gaussians softly encode rich 3D structures (positions, scales, and opacities). This makes it a unique candidate for vision-language pretraining, as it naturally fuses geometry and appearance cues.
Existing language Gaussian Splatting methods all rely on per-scene processing, which limits their scalability and generalization. No existing method processes 3D Gaussians directly, end to end.

To this end, we introduce SceneSplat, a generalizable, open-vocabulary encoder that operates natively on 3DGS. Powered by our SceneSplat-7K dataset, the vision-language pretraining and the self-supervised training scheme unlock rich 3DGS semantic learning.

Dataset

Metric	ScanNet	ScanNet++	ScanNet++ v2	Replica	Hypersim	3RScan	ARKitScenes	Matterport3D	SceneSplat-7K
Raw Scenes	1613	380	1006	8	461	1482 (scans)	1970	2194 (regions)	9114
GS Scenes	1613	330	956	8	448	632 (scans)	1947	1982 (regions)	7916
RGB Frames	2.5 M	228 K	1.1 M	16 K	77 K	156 K	450 K	194 K	4.72 M
Storage	600 GB	152 GB	447 GB	7 GB	251 GB	235 GB	577 GB	492 GB	2.76 TB
PSNR	29.07	29.49	29.11	41.25	25.93	27.46	29.18	32.34	29.64
Depth Loss (m)	0.031	0.019	0.015	0.002	0.228	0.018	0.0131	0.033	0.035
SSIM	0.869	0.924	0.933	0.980	0.894	0.881	0.885	0.916	0.897
LPIPS	0.236	0.133	0.116	0.0396	0.157	0.335	0.294	0.145	0.212
GS per scene	1.50 M	1.56 M	1.89 M	1.50 M	2.84 M	1.50 M	1.19 M	1.0 M	1.42 M
Total GS	2419.5 M	513.4 M	1810.3 M	12.0 M	1237.5 M	948.0 M	2316.9 M	1982 M	11.27 B
GPU Time (L4)	593 h	177 h	594 h	4 h	176 h	576 h	811 h	661 h	3592 h

SceneSplat-7K provides 3D Gaussian Splatting scenes reconstructed from ScanNet, ScanNet++, Replica, Hypersim, 3RScan, ARKitScenes, and Matterport3D. In total it contains 7916 scenes and 11.27 billion Gaussians. Construction required about 150 GPU-days (NVIDIA L4), and the scenes attain an average reconstruction fidelity of 29.64 dB PSNR with a depth L1 loss of 0.035 m.
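As a quick sanity check, the headline numbers can be recomputed from the per-source columns of the table. The sketch below only copies the "GS Scenes" and "GPU Time (L4)" rows; the dictionary and variable names are illustrative and not part of any released tooling.

```python
# Per-source values copied from the "GS Scenes" and "GPU Time (L4)" table rows.
gs_scenes = {
    "ScanNet": 1613, "ScanNet++": 330, "ScanNet++ v2": 956, "Replica": 8,
    "Hypersim": 448, "3RScan": 632, "ARKitScenes": 1947, "Matterport3D": 1982,
}
gpu_hours = {
    "ScanNet": 593, "ScanNet++": 177, "ScanNet++ v2": 594, "Replica": 4,
    "Hypersim": 176, "3RScan": 576, "ARKitScenes": 811, "Matterport3D": 661,
}

total_scenes = sum(gs_scenes.values())
total_gpu_hours = sum(gpu_hours.values())
gpu_days = total_gpu_hours / 24  # convert L4 GPU-hours to GPU-days

print(total_scenes)        # 7916
print(total_gpu_hours)     # 3592
print(round(gpu_days, 1))  # 149.7
```

The sums match the SceneSplat-7K column, and 3592 GPU-hours corresponds to the "about 150 GPU-days" quoted above.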

Framework

For vision-language pretraining, we associate each 3D Gaussian primitive with semantic features obtained through our label collection process and train a generalizable open-vocabulary learner that predicts per-Gaussian embeddings. For self-supervised pretraining, we employ Masked Gaussian Modeling to reconstruct masked primitives, Self-Distillation Learning for augmentation-invariant features, and Language-Gaussian Alignment for scenes with collected labels.
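To make the Masked Gaussian Modeling idea concrete, here is a minimal toy sketch of the general masking pattern: hide a random subset of Gaussians and score a reconstruction only on the hidden ones. It is not the SceneSplat implementation. The function names, the 50% mask ratio, the zero-vector mask token, the five-value Gaussian layout, and the mean-squared-error loss are all assumptions made for illustration.

```python
import random

def mask_gaussians(gaussians, mask_ratio=0.5, seed=0):
    """Hide a random fraction of Gaussians; return (masked inputs, masked indices)."""
    rng = random.Random(seed)
    n = len(gaussians)
    num_masked = int(n * mask_ratio)
    masked = sorted(rng.sample(range(n), num_masked))
    masked_set = set(masked)
    dim = len(gaussians[0])
    # Assumption: a fixed zero vector stands in for a learned mask token.
    inputs = [[0.0] * dim if i in masked_set else list(g)
              for i, g in enumerate(gaussians)]
    return inputs, masked

def masked_reconstruction_loss(predictions, targets, masked):
    """Mean squared error over the masked Gaussians only (assumed loss)."""
    if not masked:
        return 0.0
    total = 0.0
    for i in masked:
        total += sum((p - t) ** 2 for p, t in zip(predictions[i], targets[i]))
    return total / (len(masked) * len(targets[0]))

# Toy scene: four Gaussians, each as (x, y, z, scale, opacity).
scene = [
    [0.0, 0.0, 0.0, 0.10, 0.9],
    [1.0, 0.0, 0.0, 0.20, 0.8],
    [0.0, 1.0, 0.0, 0.15, 0.7],
    [0.0, 0.0, 1.0, 0.05, 0.6],
]
inputs, masked = mask_gaussians(scene, mask_ratio=0.5)

# A perfect reconstruction of the hidden Gaussians gives zero loss.
loss_perfect = masked_reconstruction_loss(scene, scene, masked)
# Echoing the masked input back leaves error on every hidden Gaussian.
loss_trivial = masked_reconstruction_loss(inputs, scene, masked)
```

In a real training loop an encoder would sit between `inputs` and the predictions; here the two loss values only show that the objective rewards recovering the hidden primitives.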

Qualitative Results

SceneSplat Inference

Text-based Scene Query

Query words: vacation / art

Query words: toy / add flavour
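One common way to run a text query against per-Gaussian embeddings is to embed the query text in the same space and keep the Gaussians whose embeddings are most similar. The sketch below assumes cosine similarity with a fixed threshold. The page does not specify the scoring rule, and the function names, the 3-dimensional toy embeddings, and the 0.5 threshold are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def query_gaussians(gaussian_embeddings, text_embedding, threshold=0.5):
    """Return indices of Gaussians whose embedding matches the text embedding."""
    return [i for i, emb in enumerate(gaussian_embeddings)
            if cosine_similarity(emb, text_embedding) >= threshold]

# Toy per-Gaussian embeddings; real ones would come from the trained encoder.
gaussian_embeddings = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]
# Toy text embedding; a real one would come from a text encoder.
toy_query = [1.0, 0.0, 0.0]

selected = query_gaussians(gaussian_embeddings, toy_query, threshold=0.5)
print(selected)  # [0, 2]
```

Because the comparison happens in embedding space, abstract queries such as "vacation" or "add flavour" can select objects without naming them, provided the text encoder places related concepts close together.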

Interactive scene query demo

Additional Segmentation Results

Bibtex

@inproceedings{li2025scenesplat,
  title={SceneSplat: Gaussian Splatting-based Scene Understanding With Vision-Language Pretraining},
  author={Li, Yue and Ma, Qi and Yang, Runyi and Li, Huapeng and Ma, Mengjiao and Ren, Bin and Popovic, Nikola and Sebe, Nicu and Konukoglu, Ender and Gevers, Theo and others},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}

Acknowledgement

We sincerely thank the author teams of the original datasets for their contributions and for making their data publicly available. Our 3DGS scenes are optimized using the gsplat repository.