LUDVIG: Learning-free Uplifting of 2D Visual features to Gaussian Splatting scenes (ICCV 2025)
This repository contains code for LUDVIG: Learning-free Uplifting of 2D Visual features to Gaussian Splatting scenes, published at ICCV 2025. LUDVIG uses a learning-free approach to uplift visual features from models such as DINOv2, SAM, and CLIP into 3D Gaussian Splatting scenes. It refines 3D features, such as coarse segmentation masks, based on a graph diffusion process that incorporates the 3D geometry of the scene and DINOv2 feature similarities. We evaluate on foreground/background and open-vocabulary object segmentation tasks.
Illustration of the inverse and forward rendering between 2D visual features (produced by DINOv2) and a 3D Gaussian Splatting scene. In the inverse rendering (or uplifting) phase, features are created for each 3D Gaussian by aggregating coarse 2D features over all viewing directions. For forward rendering, the 3D features are projected on any given viewing direction as in regular Gaussian Splatting.
git clone git@github.com:naver/ludvig.git
cd ludvig
<details>
<summary>Download instructions for SPIn-NeRF data</summary>
</details>
<details>
<summary>Configuration files for foreground/background segmentation on SPIn-NeRF and NVOS</summary>
</details>
Run the following script to set the paths to your CUDA dependencies, e.g. cuda_path=/usr/local/cuda-11.8:
bash script/set_cuda.sh ${cuda_path}
Modify the pytorch-cuda version in environment.yml to match your CUDA version, and create the ludvig environment:
mamba env create -f environment.yml
Our code has been tested on Ubuntu 22 and CUDA 11.8, with an A6000 Ada GPU (48 GB of memory).
Project Structure
The ludvig/ project directory is organized as follows:
ludvig_*.py: Main scripts for uplifting, graph diffusion and evaluation.
scripts/: Bash scripts calling ludvig_*.py
configs/: Configuration files for different models and evaluation tasks.
diffusion/: Classes for graph diffusion.
evaluation/: Classes for evaluation, including segmentation on NVOS and SPIn-NeRF with SAM and DINOv2.
Additionally, you should have the following folders in ludvig/ (e.g. as symbolic links to storage locations):
For this demo, we use the stump and bonsai scenes from Mip-NeRF 360, with the pretrained Gaussian Splatting representation provided by the authors of Gaussian Splatting. <br>
First, download the scene and weights:
bash script/demo_download.sh
This saves the data in dataset/stump and dataset/bonsai, and the model weights in dataset/stump/gs and dataset/bonsai/gs.
Demo for feature uplifting
The following script will uplift DINOv2 features and save visualizations of the uplifted features:
python demo.py
The script creates an instance of ludvig_uplift.LUDVIGUplift based on paths to the data and on the configuration configs/demo.yaml. <br>
It then runs uplifting through model.uplift() and saves 3D features and visualizations through model.save().
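A minimal sketch of this flow is shown below (the constructor arguments are illustrative placeholders, not the exact signature; see demo.py and configs/demo.yaml for the actual interface):

```python
# Illustrative sketch of the demo pipeline; the argument names are hypothetical,
# see demo.py for the exact LUDVIGUplift constructor signature.
from ludvig_uplift import LUDVIGUplift

model = LUDVIGUplift(
    config="configs/demo.yaml",  # uplifting configuration (e.g. DINOv2 dataset)
    scene_dir="dataset/stump",   # scene images and camera poses
    gs_dir="dataset/stump/gs",   # pretrained Gaussian Splatting representation
)
model.uplift()  # generate 2D feature maps and aggregate them into 3D features
model.save()    # save 3D features and visualizations (e.g. under logs/demo/)
```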
Feature map generation and uplifting
The method model.uplift() first creates a dataset from a subclass of predictors.base.BaseDataset that generates the 2D feature maps to be uplifted, then aggregates these feature maps into per-Gaussian 3D features.
For constructing 2D DINOv2 feature maps, we use the dataset predictors.dino.DINOv2Dataset (as indicated in demo.yaml). <br>
The dataset loads the scene images, predicts DINOv2 features and performs dimensionality reduction. <br>
<br>
Currently supported numbers of features are {1, 2, 3, 10, 20, 30, 40, 50, 100, 200, 256, 512}. If you need to uplift features with another dimension, you can add the option at line 421 here and compile again.
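As an illustration of the dimensionality-reduction step, the sketch below reduces per-pixel features to one of the supported dimensions with a plain PCA; this is a generic example, not the repository's exact implementation:

```python
# Generic sketch (not the repository's exact code): reduce per-pixel 2D features
# to a rasterizer-supported dimension with a PCA fitted over all views.
import numpy as np

def reduce_features(feature_maps: np.ndarray, dim: int = 30) -> np.ndarray:
    """feature_maps: (num_views, H, W, C) array of 2D features."""
    n, h, w, c = feature_maps.shape
    flat = feature_maps.reshape(-1, c)
    flat = flat - flat.mean(axis=0, keepdims=True)
    # Principal directions of the centered features (for large scenes,
    # fitting on a random subset of pixels keeps this tractable).
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    reduced = flat @ vt[:dim].T
    return reduced.reshape(n, h, w, dim)
```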
Directly uplifting existing 2D feature maps. If you already have features or masks to uplift, you can use predictors.base.BaseDataset instead. The path to your features should be given as the directory argument of the dataset, as in configs/demo_rgb.yaml. As a mock example, running python demo.py --rgb will directly uplift and reproject RGB images. <br>
Note that your feature file names should match the camera names (i.e., the names of the RGB images used for training).
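For example, a quick sanity check along these lines (a hypothetical helper with hypothetical paths, assuming features are stored as .npy files) could be:

```python
# Hypothetical sanity check: verify that each training image has a matching
# feature file (here assumed to be saved as <camera name>.npy).
import os

def check_feature_names(image_dir: str, feature_dir: str) -> None:
    cameras = {os.path.splitext(f)[0] for f in os.listdir(image_dir)}
    features = {os.path.splitext(f)[0] for f in os.listdir(feature_dir)}
    missing = cameras - features
    if missing:
        raise ValueError(f"Missing feature files for cameras: {sorted(missing)}")

check_feature_names("dataset/stump/images", "dataset/stump/my_features")
```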
Visualization and evaluation
The method model.save() saves uplifted features and visualizations in logs/demo/. <br>
You can also define your own postprocessing or evaluation procedure by subclassing evaluation.base.EvaluationBase and adding it to the configuration file under evaluation, as done in our experimental setup for SPIn-NeRF and NVOS and in the demo below.
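As a rough sketch only (the method name below is an illustrative assumption; see evaluation/base.py and the SPIn-NeRF/NVOS evaluation classes for the actual interface), a custom evaluation could look like:

```python
# Illustrative sketch only: the actual EvaluationBase interface may differ,
# see evaluation/base.py and the SPIn-NeRF / NVOS evaluation classes.
from evaluation.base import EvaluationBase

class MyEvaluation(EvaluationBase):
    def __call__(self, features_3d, scene):
        # e.g. threshold uplifted features into a binary 3D mask,
        # then render it, compare to ground truth and save metrics under logs/
        mask = features_3d[:, 0] > 0.5
        return mask
```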
Demo for open-vocabulary object removal
In this demo, we will remove the bonsai from the downloaded scene using the text query "bonsai in a ceramic pot".
To this end, we (1) uplift DINOv2 features, (2) uplift CLIP features, and (3) compute 3D CLIP relevancy scores for the text query, refine them by graph diffusion over DINOv2 feature similarities, and remove the selected Gaussians.
Steps 1 and 2 are performed as in the previous demo, calling predictors.dino.DINOv2Dataset and predictors.clip.CLIPDataset. Note that CLIP feature uplifting takes longer due to the sliding window mechanism used to generate CLIP feature maps.
Step 3 calls evaluation.removal.clip_diffusion.CLIPDiffusionRemoval, which computes 3D relevancy scores for the text query "bonsai in a ceramic pot". It then constructs a graph based on DINOv2 feature similarities, using the 3D CLIP relevancies to initialize the node weights and define the regularization term. <br>
The resulting 3D weights are thresholded to obtain a 3D segmentation mask. To remove the object, we simply delete all Gaussians selected by the 3D mask.
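The sketch below illustrates the general idea on a k-nearest-neighbor graph over Gaussians; it is a simplified stand-in, not the repository's exact diffusion, whose graph construction and regularization are defined in diffusion/:

```python
# Simplified illustration, not the repository's exact diffusion: propagate CLIP
# relevancy scores over a k-NN graph built from DINOv2 feature similarities,
# then threshold the diffused scores to select the Gaussians to remove.
import numpy as np

def diffuse_and_select(dino_feats, clip_relevancy, k=16, alpha=0.9, n_iter=50, thresh=0.5):
    """dino_feats: (N, d) per-Gaussian DINOv2 features (L2-normalized);
    clip_relevancy: (N,) per-Gaussian CLIP relevancy for the text query."""
    sims = dino_feats @ dino_feats.T                        # cosine similarities
    nbrs = np.argsort(-sims, axis=1)[:, 1:k + 1]            # k nearest neighbors (skip self)
    w = np.take_along_axis(sims, nbrs, axis=1).clip(min=0)  # non-negative edge weights
    w /= w.sum(axis=1, keepdims=True) + 1e-8                # row-normalize
    scores = clip_relevancy.copy()
    for _ in range(n_iter):
        # diffusion step, with the initial relevancy acting as a regularizer
        scores = alpha * (w * scores[nbrs]).sum(axis=1) + (1 - alpha) * clip_relevancy
    keep = scores < thresh * scores.max()                   # Gaussians outside the object
    return keep
```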
Various visualizations are saved in logs/bonsai, including:
DINOv2 features in logs/bonsai/dinov2/features/ (top-right image)
CLIP features in logs/bonsai/clip/features/ (bottom-right image)
2D RGB images rendered without the object in logs/bonsai/removal (bottom-left image).
Reproducing results
The datasets should be stored in ludvig/dataset. <br>
All experiments require a trained Gaussian Splatting representation of the scene, saved under dataset/${scene_path}/gs as indicated in the structures below.
Foreground/background segmentation
Data
SPIn-NeRF
lego_real_night_radial:
Download from Google Drive.
The LLFF dataset is available here.
The scribbles and test masks are provided here.
<br> The data should have the following structure, with gs/ containing the Gaussian Splatting logs:
scene: The scene to evaluate, e.g., trex, horns_center, etc.
cfg: Configuration file for evaluation, e.g., dif_NVOS, sam_SPIn (see below).
sam_[NVOS|SPIn]: segmentation on NVOS/SPIn-NeRF with SAM.
dif_[NVOS|SPIn]: segmentation on NVOS/SPIn-NeRF with DINOv2 and graph diffusion.
xdif_SPIn: segmentation on SPIn-NeRF with DINOv2, without graph diffusion.
depth_SPIn: segmentation on SPIn-NeRF with mask uplifting and reprojection.
singleview_[sam|dinov2]_SPIn: single-view segmentation with DINOv2 or SAM.
Open-vocabulary object segmentation
Data
We evaluate on the extended version of the LERF dataset introduced by LangSplat.
Download their data and save Gaussian Splatting logs in a gs/ folder as indicated in the structure below.
Run bash scripts/lerf_eval.sh $scene $cfg, with cfg either lerf_eval_sam for segmentation with SAM or lerf_eval otherwise (automatic thresholding). Pass --no_diffusion to disable graph diffusion based on DINOv2 features. <br>
To reproduce our results on object localization, you can run bash scripts/lerf_eval.sh $scene lerf_eval --no_diffusion. <br>
The evaluation results (IoU and localization accuracy) are saved in logs/lerf/$scene/iou.txt, the mask predictions in logs/lerf/$scene/masks*, and the localization heatmaps in logs/lerf/$scene/localization.
Open-vocabulary semantic segmentation
Data
We evaluate on the ScanNet dataset, following OpenGaussian's evaluation protocol, with all data directly provided by them: the images (color), 2D feature maps (language_features), camera poses (transforms_train.json, transforms_test.json), and initial point cloud (points3d.ply).
We train Gaussian Splatting exactly as in OpenGaussian's Stage 0, i.e., with fixed Gaussian positions and the densification process disabled.
After Gaussian Splatting reconstruction, the data should match the following structure:
The evaluated scenes are scene0000_00, scene0062_00, scene0070_00, scene0097_00, scene0140_00, scene0200_00, scene0347_00, scene0400_00, scene0590_00, and scene0645_00.
Uplifting
We replace OpenGaussian's quantization-based feature training with our uplifting, starting from the same 2D feature maps. These feature maps are obtained by assigning a CLIP feature to each instance mask generated using SAM in everything mode.
They are directly provided by OpenGaussian under language_features.
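Conceptually, such a feature map can be assembled as in the sketch below (an illustrative example; OpenGaussian's actual preprocessing and file format may differ):

```python
# Illustrative sketch: paint each SAM instance mask with its CLIP embedding
# to obtain a dense 2D feature map (OpenGaussian's preprocessing may differ).
import numpy as np

def masks_to_feature_map(masks, clip_features, height, width):
    """masks: list of (H, W) boolean arrays from SAM in everything mode;
    clip_features: (num_masks, D) CLIP embeddings of the masked crops."""
    d = clip_features.shape[1]
    feature_map = np.zeros((height, width, d), dtype=np.float32)
    for mask, feat in zip(masks, clip_features):
        feature_map[mask] = feat  # later masks overwrite earlier ones on overlap
    return feature_map
```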
To uplift the features, run:
bash script/scannet.sh $scene
Evaluation
We evaluate 3D semantic segmentation using OpenGaussian's protocol. Specifically, each Gaussian is assigned the textual label with the highest CLIP similarity. Note that OpenGaussian only evaluates Gaussians with opacity greater than 0.1.
python scripts/eval_scannet.py
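The label assignment amounts to the following (an illustrative sketch of the protocol described above; the actual implementation is in scripts/eval_scannet.py):

```python
# Illustrative sketch of the label assignment described above (the actual
# implementation is in scripts/eval_scannet.py).
import numpy as np

def assign_labels(gaussian_feats, text_feats, opacities, opacity_min=0.1):
    """gaussian_feats: (N, D) uplifted CLIP features per Gaussian;
    text_feats: (C, D) CLIP embeddings of the class names;
    opacities: (N,) Gaussian opacities."""
    g = gaussian_feats / (np.linalg.norm(gaussian_feats, axis=1, keepdims=True) + 1e-8)
    t = text_feats / (np.linalg.norm(text_feats, axis=1, keepdims=True) + 1e-8)
    labels = (g @ t.T).argmax(axis=1)  # class with highest CLIP similarity
    valid = opacities > opacity_min    # only Gaussians with opacity > 0.1 are evaluated
    return labels, valid
```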
Citing LUDVIG
If you find our work useful, please consider citing us:
@inproceedings{marrie2025ludvig,
title={LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes},
author={Marrie, Juliette and Menegaux, Romain and Arbel, Michael and Larlus, Diane and Mairal, Julien},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2025}
}
For any inquiries or contributions, please reach out to jltmarrie@gmail.com.