3D Scene Reconstruction with Multi-layer Depth and Epipolar Transformers

University of California, Irvine
In ICCV, 2019

Abstract

We tackle the problem of automatically reconstructing a complete 3D model of a scene from a single RGB image. This challenging task requires inferring the shape of both visible and occluded surfaces. Our approach utilizes a viewer-centered, multi-layer representation of scene geometry adapted from recent methods for single object shape completion. To improve the accuracy of view-centered representations for complex scenes, we introduce a novel "Epipolar Feature Transformer" that transfers convolutional network features from an input view to other virtual camera viewpoints, and thus better covers the 3D scene geometry. Unlike existing approaches that first detect and localize objects in 3D, and then infer object shape using category-specific models, our approach is fully convolutional, end-to-end differentiable, and avoids the resolution and memory limitations of voxel representations.  We demonstrate the advantages of multi-layer depth representations and epipolar feature transformers on the reconstruction of a large database of indoor scenes.
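As a rough, back-of-the-envelope illustration of why a layered, view-centered output stays compact (the layer count and resolutions below are placeholder assumptions, not the paper's settings), a multi-layer depth map stores a handful of depth values per pixel, whereas a dense voxel grid grows cubically with resolution:

import numpy as np

# Placeholder sizes for illustration only (not the paper's settings).
H, W, L = 480, 640, 5                  # image resolution and number of depth layers
voxel_dim = 256                        # side length of a hypothetical dense voxel grid

# The scene is represented as L depth channels per pixel (e.g. the first visible
# surface and the occluded surfaces behind it; exact layer definitions are in the
# paper). The array below only shows the shape of such a prediction.
multi_layer_depth = np.zeros((L, H, W), dtype=np.float32)

print(multi_layer_depth.size)          # 1,536,000 values, aligned with image pixels
print(voxel_dim ** 3)                  # 16,777,216 cells for the dense voxel grid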

Updates

  • 2019-08: Paper title changed to 3D Scene Reconstruction with Multi-layer Depth and Epipolar Transformers

Goal: 3D scene reconstruction from a single RGB image

This challenging task requires inferring the shape of both visible and occluded surfaces.
  1. Depth maps provide an efficient representation of the scene geometry but are incomplete.
  2. We propose a viewer-centered multi-layer representation that enables fully convolutional inference of 3D scene geometry and shape completion.
  3. We introduce epipolar transformer networks that provide geometrically consistent transfer of CNN features and make view-based predictions from virtual viewpoints (a geometric sketch follows below).
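The following is a minimal, illustrative NumPy sketch of the geometry behind this feature transfer: features predicted in the input view are back-projected using a depth layer and splatted into a virtual camera (here an overhead view). The function names (backproject, transfer_features), the intrinsics, the overhead pose, and the hard nearest-pixel splatting are illustrative assumptions only; the Epipolar Feature Transformer in the paper is a differentiable module that performs this transfer inside the network rather than as a post-process.

import numpy as np

def backproject(depth, K):
    """Lift every pixel of an (H, W) depth map to 3D points in the input camera frame."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x HW homogeneous pixels
    rays = np.linalg.inv(K) @ pix                                       # unit-depth viewing rays
    return (rays * depth.reshape(1, -1)).T                              # HW x 3 points

def transfer_features(feat, depth, K_in, K_virt, R, t, out_hw):
    """Splat input-view features into a virtual view using one depth layer.

    feat : (H, W, C) CNN feature map from the input view
    depth: (H, W)    a depth layer aligned with `feat`
    R, t : pose mapping input-camera coordinates to virtual-camera coordinates
    """
    H_out, W_out = out_hw
    pts = backproject(depth, K_in) @ R.T + t            # points in the virtual camera frame
    proj = pts @ K_virt.T                               # perspective projection
    z = proj[:, 2]
    uv = np.round(proj[:, :2] / np.clip(z[:, None], 1e-6, None)).astype(np.int64)
    out = np.zeros((H_out, W_out, feat.shape[-1]), feat.dtype)
    valid = (z > 1e-6) & (uv[:, 0] >= 0) & (uv[:, 0] < W_out) & \
            (uv[:, 1] >= 0) & (uv[:, 1] < H_out)
    out[uv[valid, 1], uv[valid, 0]] = feat.reshape(-1, feat.shape[-1])[valid]
    return out

# Toy usage with made-up numbers: transfer features into a virtual overhead camera
# placed 3 m above the input camera and 2 m into the scene, looking straight down.
H, W, C = 240, 320, 32
features = np.random.randn(H, W, C).astype(np.float32)
depth = np.ones((H, W))                                 # e.g. the first (visible-surface) depth layer
K = np.array([[300., 0., W / 2], [0., 300., H / 2], [0., 0., 1.]])
R_down = np.array([[1., 0., 0.],                        # overhead camera: x stays x,
                   [0., 0., -1.],                       # its y-axis is the input -z,
                   [0., 1., 0.]])                       # its viewing axis is the input y (down)
center = np.array([0., -3., 2.])                        # overhead camera position in the input frame
overhead = transfer_features(features, depth, K, K, R_down, -R_down @ center, (H, W))

In the paper the analogous transfer is applied to intermediate feature maps so that predictions can be made directly in the virtual view while the whole pipeline remains fully convolutional.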

Supplemental Video

Figures and Qualitative Results (please see our paper and supplement for more details)

Overview of our system and evaluation metric:

Results on SUNCG (PBRS renderings):

Voxelization and semantic scene reconstruction on SUNCG:


Real-world evaluation on NYUv2: Our model is trained entirely on synthetically generated images.

Real-world evaluation on ScanNet: Please see our paper for quantitative evaluation of synthetic-to-real transfer of 3D scene geometry prediction.

Citing this work

If you find this work useful in your research, please consider citing:

@inproceedings{shin20193d,
  title={3D Scene Reconstruction with Multi-layer Depth and Epipolar Transformers},
  author={Shin, Daeyun and Ren, Zhile and Sudderth, Erik B and Fowlkes, Charless C},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
  year={2019}
}

Acknowledgements

This research was supported by NSF grants IIS-1618806, IIS-1253538, and CNS-1730158, and by a hardware donation from NVIDIA.