Abstract:
3D scene analysis can play a crucial role in different 3D vision-related applications,
where depth information is pivotal. However, accurate dense depth sensing through active
depth sensors (e.g., laser depth scanners) is costly. One alternative is to employ low-cost depth sensors, which yield noisy depth information. Another alternative, still prevalent, is depth estimation from intensity images using stereo. In this case, however, the correspondences established across the multiple viewpoints are often inaccurate due to issues such as illumination changes and occlusion. Thus, in recent years, learning-based depth estimation from single intensity images has been explored. However, intensity images can be noisy due to sensor characteristics.
Along these lines, we propose an approach to estimate depth from a single intensity image
using a learning-based strategy. We develop a novel convolutional neural network (CNN) encoder-decoder architecture, which learns the depth information from example
pairs of color images and their corresponding depth maps. The proposed model
integrates residual connections within the pooling (down-sampling) and up-sampling
layers with an hourglass module that operates on the encoded features, thus processing them
at various scales. Furthermore, the model is optimized under the constraints of a perceptual
loss as well as the mean squared error loss. The perceptual loss considers the high-level
features, thus operating at a different scale of abstraction, which is complementary to the
mean squared error loss that considers a pixel-to-pixel error.
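
As an illustration of how the two loss terms complement each other, the following is a minimal sketch and not the model's actual training objective. It assumes a hypothetical feature extractor, `toy_features` (2x2 average pooling standing in for the high-level features of a pretrained network), and a hypothetical weighting parameter `perceptual_weight`; neither is specified in this abstract.

```python
def mse(a, b):
    """Mean squared error between two equally sized 2D grids (lists of rows)."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def toy_features(img):
    """Hypothetical stand-in for a pretrained feature extractor:
    2x2 average pooling over non-overlapping blocks."""
    return [
        [(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
         for j in range(0, len(img[0]) - 1, 2)]
        for i in range(0, len(img) - 1, 2)
    ]


def combined_loss(pred, target, perceptual_weight=0.5):
    """Pixel-wise MSE plus a weighted MSE computed in feature space.
    The weight is an assumed hyperparameter, not a value from the source."""
    pixel_term = mse(pred, target)
    perceptual_term = mse(toy_features(pred), toy_features(target))
    return pixel_term + perceptual_weight * perceptual_term
```

In this toy setting, two depth maps that differ pixel by pixel but share the same local averages incur only the pixel-wise term, whereas a global offset is penalized by both terms.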
Because the training and testing data can be noisy, the estimated depth may
not be accurate. Although our depth estimation framework can handle low-level noise in the
intensity test image, a higher level of noise can degrade the estimated depth map. For this
scenario, we propose a denoising algorithm for both intensity images and depth maps that
can address higher levels of noise. It has been shown that non-local similar
patches play an important role in denoising. Noise, however, can make the search for similar patches ambiguous and thereby degrade the results. Yet most non-local similarity-based
approaches do not consider the issue of noisy patch grouping. Hence, we propose to denoise
an image by grouping non-local similar patches in the transform domain, which mitigates the effect of noise on grouping, together with sparsity and edge-preserving constraints. The effectiveness
of the transform-domain grouping of patches is utilized for learning dictionaries
and is further extended to obtain an initial approximation of the sparse coefficient vector
for the clean image patches. The results are further improved by employing edge-preserving
constraints and processing at coarser scales. Our technique helps preserve the surface
discontinuities and prominent details in depth and intensity images while suppressing noise,
and we demonstrate clear benefits of denoising.
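
The idea of grouping similar patches in a transform domain can be sketched as follows. This is only an illustrative example under stated assumptions: the abstract does not name the transform, so the sketch assumes a 1D DCT-II over flattened patches and compares only the first `keep` coefficients, on the assumption that low-frequency coefficients are less disturbed by noise. The function names and parameters are hypothetical.

```python
import math


def dct(signal):
    """Unnormalized 1D DCT-II of a flattened patch."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def group_similar(patches, ref_index, group_size, keep=2):
    """Return the indices of the `group_size` patches closest to the reference,
    with distance measured on the first `keep` DCT coefficients only."""
    coeffs = [dct(p)[:keep] for p in patches]
    ref = coeffs[ref_index]

    def distance(c):
        # Squared Euclidean distance in the truncated transform domain.
        return sum((a - b) ** 2 for a, b in zip(c, ref))

    order = sorted(range(len(patches)), key=lambda i: distance(coeffs[i]))
    return order[:group_size]


patches = [
    [1.0, 1.0, 1.0, 1.0],   # reference patch
    [1.2, 0.8, 1.2, 0.8],   # same structure with high-frequency perturbation
    [5.0, 5.0, 5.0, 5.0],   # different intensity level
]
group = group_similar(patches, ref_index=0, group_size=2)
```

Here the perturbed patch is grouped with the reference because the perturbation mostly affects the discarded high-frequency coefficients.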
Another aspect considered in this work is whether a priori knowledge of the scene
type can benefit depth estimation. We demonstrate an improvement in the estimated
depth map by classifying different indoor scenes and building different depth estimation
models for scene types. Such an approach may be useful in an application involving a
small and fixed number of scenes. To build the classifier, we use a smaller
version of the Residual Convolutional Neural Network (ResNet-18) that discriminates between
different indoor scenes (e.g., bookstore, dining, bathroom, classroom, and kitchen) even
in the presence of noise in the test images. Here, our denoising method can help in the accurate
estimation of the depth map. Such an approach can serve not only as an initial step of depth
estimation but also in scene classification and retrieval applications.
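
The scene-specific pipeline described above, in which a classifier selects which depth model to apply, might be organized as in the following sketch. All names here (`estimate_depth`, `classify_scene`, `depth_models`, `fallback_model`) are hypothetical, and the fallback to a generic model for unrecognized scenes is an assumption rather than something stated in this abstract.

```python
def estimate_depth(image, classify_scene, depth_models, fallback_model):
    """Route an image to the depth model trained for its predicted scene type.

    classify_scene: callable returning a scene label for the image.
    depth_models:   dict mapping scene labels to depth-estimation callables.
    fallback_model: callable used when no scene-specific model exists (assumed).
    """
    scene = classify_scene(image)
    model = depth_models.get(scene, fallback_model)
    return scene, model(image)


# Toy stand-ins for the trained classifier and depth networks.
def toy_classifier(image):
    return image["label_hint"]


toy_models = {
    "kitchen": lambda image: "kitchen-depth",
    "bathroom": lambda image: "bathroom-depth",
}


def toy_generic(image):
    return "generic-depth"


result = estimate_depth({"label_hint": "kitchen"}, toy_classifier, toy_models, toy_generic)
```

In practice, the classifier and the per-scene models would be trained networks, and a denoising step could precede the classification when the test image is noisy.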