This study proposes a method to enhance the quality of indoor 3D reconstruction based on 3D Gaussian Splatting (3DGS) using data captured with Polycam. The approach generates novel view camera poses by interpolating the training poses, refines the renders from those poses with DIFIX, and incorporates geometry-aware loss terms to further improve reconstruction quality. The geometry-aware loss consists of a perceptual loss applied only to novel views, plus normal and depth consistency losses applied to all views. Together, these improvements increase the accuracy of the reconstructed geometry, strengthen multi-view consistency, and reduce artifacts in the reconstructed scenes. Experimental results show that the proposed method increases PSNR from 20.423 to 21.675 and SSIM from 0.856 to 0.862 compared to the original 3DGS.
We captured the indoor space with the Polycam application on an iPhone 13 Pro and downloaded the raw dataset required for 3D reconstruction. From the exported raw data, `raw.glb`, `corrected_images/`, and `corrected_cameras/` were used for this project.
We propose a method to generate novel view camera poses by interpolating between the training camera poses. Intermediate positions along the paths between the original viewpoints are computed to create new viewpoints, enabling smooth camera transitions so that rendered view changes appear more natural. Positions are interpolated linearly, while orientations (unit quaternions) are interpolated with SLERP:
$$\mathbf{p}(t) = (1 - t)\,\mathbf{p}_a + t\,\mathbf{p}_b,\quad t \in [0,1]$$
$$\mathbf{q}(t)=\frac{\sin((1-t)\theta)}{\sin\theta}\,\mathbf{q}_a +\frac{\sin(t\theta)}{\sin\theta}\,\mathbf{q}_b$$
$$\text{where } \theta=\cos^{-1}(\mathbf{q}_a \cdot \mathbf{q}_b).$$
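The two interpolation formulas above can be sketched in NumPy as follows. This is a minimal illustration, not the project's actual code; the function names and the shorter-arc sign flip (a standard SLERP precaution, since $\mathbf{q}$ and $-\mathbf{q}$ encode the same rotation) are our own additions.

```python
import numpy as np

def lerp(p_a, p_b, t):
    """Linear interpolation of camera positions: p(t) = (1-t) p_a + t p_b."""
    return (1.0 - t) * p_a + t * p_b

def slerp(q_a, q_b, t, eps=1e-8):
    """Spherical linear interpolation of unit quaternions (orientations)."""
    q_a = q_a / np.linalg.norm(q_a)
    q_b = q_b / np.linalg.norm(q_b)
    dot = np.dot(q_a, q_b)
    # Take the shorter arc: q and -q represent the same rotation.
    if dot < 0.0:
        q_b, dot = -q_b, -dot
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    if theta < eps:
        # Nearly identical orientations: fall back to linear interpolation.
        q = (1.0 - t) * q_a + t * q_b
    else:
        q = (np.sin((1.0 - t) * theta) * q_a + np.sin(t * theta) * q_b) / np.sin(theta)
    return q / np.linalg.norm(q)
```

Sampling several values of `t` in $[0,1]$ between each pair of adjacent training poses yields the intermediate novel viewpoints.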
Renders from novel view camera poses often contain artifacts; therefore, we enhance their quality using DIFIX, a diffusion-based model. The improved novel views are then used as additional inputs when retraining 3DGS.
The perceptual loss (LPIPS) is applied only to novel views. While DIFIX removes artifacts from novel view images, it can also smooth away fine details. LPIPS encourages perceptual similarity between the rendered novel view and the DIFIX-enhanced target image.
$$\mathcal{L}_{\text{LPIPS}} = \lambda_{\text{LPIPS}} \cdot \text{LPIPS}(I, \hat{I})$$
$$\text{LPIPS}(I, \hat{I}) = \sum_l \frac{1}{H_l W_l} \sum_{h=1}^{H_l} \sum_{w=1}^{W_l} \| w_l \odot (f_l(I)_{h,w} - f_l(\hat{I})_{h,w}) \|_2^2$$
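The mechanics of the LPIPS formula can be illustrated with a small NumPy sketch. Real LPIPS uses unit-normalized features $f_l$ from a pretrained network (e.g. VGG or AlexNet) with learned channel weights $w_l$; here the feature maps and weights are passed in as plain arrays, so this only demonstrates the distance computation, not the full metric.

```python
import numpy as np

def unit_normalize(feat, eps=1e-10):
    """Channel-wise unit normalization of a (C, H, W) feature map."""
    norm = np.sqrt((feat ** 2).sum(axis=0, keepdims=True)) + eps
    return feat / norm

def lpips_distance(feats_x, feats_y, layer_weights):
    """Sum over layers l of the spatial average of channel-weighted
    squared differences between unit-normalized feature maps."""
    total = 0.0
    for f_x, f_y, w_l in zip(feats_x, feats_y, layer_weights):
        diff = unit_normalize(f_x) - unit_normalize(f_y)     # (C_l, H_l, W_l)
        weighted = (w_l[:, None, None] * diff) ** 2          # w_l ⊙ (f_l(x) - f_l(x̂))
        total += weighted.sum(axis=0).mean()                 # 1/(H_l W_l) Σ_{h,w} ||·||²
    return total
```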
The depth smoothness loss encourages the depth map to vary smoothly, producing more consistent surfaces for objects and scenes:
$$\mathcal{L}^w_{\text{smooth}} = \lambda_{\text{smooth}} \left[ \frac{1}{N_x} \sum_{i,j} w_{i,j} \cdot \left| D_{i,j} - D_{i,j+1} \right| + \frac{1}{N_y} \sum_{i,j} w_{i,j} \cdot \left| D_{i,j} - D_{i+1,j} \right| \right]$$
The normal consistency loss analogously penalizes differences between neighboring surface normals:

$$ \mathcal{L}^w_{\text{normal}} = \lambda_{\text{normal}} \left[ \frac{1}{N_x} \sum_{i,j} w_{i,j} \cdot \left\| \mathbf{n}_{i,j} - \mathbf{n}_{i,j+1} \right\|_1 + \frac{1}{N_y} \sum_{i,j} w_{i,j} \cdot \left\| \mathbf{n}_{i,j} - \mathbf{n}_{i+1,j} \right\|_1 \right] $$
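The two weighted losses above can be sketched in NumPy as follows, assuming a precomputed per-pixel weight map `w` (its construction is described in the next section). The function names and the choice of `mean()` for the $1/N_x$, $1/N_y$ normalization are our own illustrative assumptions.

```python
import numpy as np

def weighted_depth_smoothness(D, w, lam=1.0):
    """L_smooth: weighted L1 penalty on horizontal/vertical depth differences.
    D and w have shape (H, W)."""
    dx = np.abs(D[:, :-1] - D[:, 1:])   # |D_{i,j} - D_{i,j+1}|
    dy = np.abs(D[:-1, :] - D[1:, :])   # |D_{i,j} - D_{i+1,j}|
    return lam * ((w[:, :-1] * dx).mean() + (w[:-1, :] * dy).mean())

def weighted_normal_consistency(n, w, lam=1.0):
    """L_normal: weighted L1 penalty on neighboring surface-normal differences.
    n has shape (3, H, W); w has shape (H, W)."""
    dx = np.abs(n[:, :, :-1] - n[:, :, 1:]).sum(axis=0)  # ||n_{i,j} - n_{i,j+1}||_1
    dy = np.abs(n[:, :-1, :] - n[:, 1:, :]).sum(axis=0)  # ||n_{i,j} - n_{i+1,j}||_1
    return lam * ((w[:, :-1] * dx).mean() + (w[:-1, :] * dy).mean())
```

A perfectly flat depth map or a field of identical normals incurs zero loss, so only genuine depth or orientation changes are penalized.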
1. Inverse Depth Normalization
$$ \tilde{D}_{i,j} = \frac{D_{i,j}}{\mathrm{median}(D) + 10^{-6}} $$
2. Near Weight
$$ w_{\mathrm{near},i,j} = \frac{\tilde{D}_{i,j}}{1 + \tilde{D}_{i,j}} $$
3. Far Weight
$$ w_{\mathrm{far},i,j} = \frac{1}{1 + \tilde{D}_{i,j}} $$
4. Edge-Aware Gating (RGB): Gradient
$$ g_{i,j} = \frac{\left\| I_{:,i,j+1} - I_{:,i,j} \right\|_1 + \left\| I_{:,i+1,j} - I_{:,i,j} \right\|_1}{2} $$
5. Edge Gate
$$ \text{edge\_gate}_{i,j} = \max \left( 0.1,\ e^{-\gamma g_{i,j}} \right) $$
6. Final Weight
$$ w_{i,j} \leftarrow w_{i,j} \cdot \text{edge\_gate}_{i,j} $$

The total training loss combines the photometric, perceptual, and geometric terms:

$$ \mathcal{L}_{\text{total}} = (1 - \lambda) \cdot \mathcal{L}_1 + \lambda \cdot \mathcal{L}_{D\text{-}SSIM} + \mathbf{1}_{\text{novel}} \cdot \mathcal{L}_{\text{LPIPS}} + \mathcal{L}_{\text{smooth}} + \mathcal{L}_{\text{normal}} $$

where $\mathbf{1}_{\text{novel}}$ activates the perceptual term only on novel views, and the LPIPS, smoothness, and normal terms already carry their weights $\lambda_{\text{LPIPS}}$, $\lambda_{\text{smooth}}$, and $\lambda_{\text{normal}}$ from their definitions above.
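Steps 1-6 above can be combined into one weight-map computation. The sketch below is a NumPy illustration under our own assumptions: a `near` flag selects between the near and far weight variants, `gamma` is the gating sharpness $\gamma$, and forward differences are zero-padded at the image border.

```python
import numpy as np

def compute_pixel_weights(D, I, near=True, gamma=10.0):
    """Per-pixel loss weights from depth D (H, W) and RGB image I (3, H, W):
    inverse-depth normalization, near/far weighting, and edge-aware gating."""
    # 1. Normalize depth by its median (robust to scale).
    D_tilde = D / (np.median(D) + 1e-6)
    # 2-3. Near pixels weighted by D~/(1+D~); far pixels by 1/(1+D~).
    w = D_tilde / (1.0 + D_tilde) if near else 1.0 / (1.0 + D_tilde)
    # 4. Average L1 RGB gradient (forward differences, zero at the border).
    gx = np.zeros(D.shape)
    gy = np.zeros(D.shape)
    gx[:, :-1] = np.abs(I[:, :, 1:] - I[:, :, :-1]).sum(axis=0)
    gy[:-1, :] = np.abs(I[:, 1:, :] - I[:, :-1, :]).sum(axis=0)
    g = 0.5 * (gx + gy)
    # 5. Edge gate: down-weight high-gradient pixels, floored at 0.1.
    edge_gate = np.maximum(0.1, np.exp(-gamma * g))
    # 6. Final weight.
    return w * edge_gate
```

The floor of 0.1 in the edge gate keeps some supervision even at strong RGB edges, so textured regions are de-emphasized rather than ignored entirely.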
| Method | Initial points | PSNR↑ | SSIM↑ | Training time | Frames |
|---|---|---|---|---|---|
| 3DGS | 100000 | 20.423 | 0.856 | 2h 13m | 168 |
| 2DGS | 100000 | 19.219 | 0.828 | 2h 1m | 168 |
| 2DGS_novel | 100000 | 20.375 | 0.842 | 1h 59m | 208 |
| Ours_novel | 100000 | 21.605 | 0.861 | 2h 6m | 208 |
| Ours_novel_loss | 100000 | 21.675 | 0.862 | 3h 55m | 208 |