Highlights of Research at InnoPeak
Published at: IEEE Transactions on Visualization and Computer Graphics (TVCG), 2022
Augmented Reality (AR) applications aim to blend virtual objects realistically with the real world. One of the key factors for realistic AR is correct lighting estimation. In this paper, we present a method that estimates the real-world lighting condition from a single image in real time, using information from an optional support plane provided by advanced AR frameworks (e.g., ARCore, ARKit). By analyzing the visual appearance of the real scene, our algorithm predicts the lighting condition from the input RGB photo. In the first stage, we use a deep neural network to decompose the scene into several components: lighting, normals, and BRDF. We then introduce differentiable screen-space rendering, a novel approach to providing the supervisory signal for regressing lighting, normals, and BRDF jointly. We recover the most plausible real-world lighting condition as Spherical Harmonics plus a main directional light. Through a variety of experiments, we demonstrate that our method gives better results than prior work, both quantitatively and qualitatively, and that it can enhance real-time AR experiences.
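To make the recovered lighting representation concrete, the sketch below shades a Lambertian surface point under a second-order Spherical Harmonics environment plus one main directional light, i.e., the kind of lighting estimate described above. The function names and the (9, 3) per-channel coefficient layout are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

# Constants for the 9 real SH basis functions (bands l = 0, 1, 2).
SH_C = np.array([0.282095,
                 0.488603, 0.488603, 0.488603,
                 1.092548, 1.092548, 0.315392, 1.092548, 0.546274])

def sh_basis(n):
    """Evaluate the 9 second-order SH basis functions at a unit normal n = (x, y, z)."""
    x, y, z = n
    return SH_C * np.array([1.0,
                            y, z, x,
                            x * y, y * z, 3.0 * z * z - 1.0, x * z, x * x - y * y])

def shade(normal, albedo, sh_coeffs, light_dir, light_color):
    """Lambertian shading under an SH ambient estimate plus one directional light.
    sh_coeffs: (9, 3) per-channel SH coefficients (assumed layout for this sketch)."""
    n = normal / np.linalg.norm(normal)
    ambient = sh_basis(n) @ sh_coeffs                 # (3,) ambient irradiance per channel
    l = light_dir / np.linalg.norm(light_dir)
    direct = max(np.dot(n, l), 0.0) * light_color     # (3,) main directional contribution
    return albedo * (ambient + direct)
```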
Published at: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022, pp. 1240-1249
In this paper, we propose PoP-Net, a real-time method that predicts multi-person 3D poses from a depth image. PoP-Net learns to predict bottom-up part representations and top-down global poses in a single shot. Specifically, we introduce a new part-level representation, the Truncated Part Displacement Field (TPDF), which enables an explicit fusion process that unifies the advantages of bottom-up part detection and global pose detection. Meanwhile, an effective mode selection scheme automatically resolves conflicts between the global pose and the part detections. Finally, given the lack of high-quality depth datasets for multi-person 3D pose estimation, we introduce the Multi-Person 3D Human Pose Dataset (MP-3DHP) as a new benchmark. MP-3DHP is designed to enable effective multi-person and background data augmentation during training and to evaluate 3D human pose estimators under uncontrolled multi-person scenarios. We show that PoP-Net achieves state-of-the-art results on both MP-3DHP and the widely used ITOP dataset, and that it has significant efficiency advantages for multi-person processing. The MP-3DHP dataset and evaluation code are available at: https://github.com/oppo-us-research/PoP-Net.
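As a rough illustration of how a part displacement field can refine a global pose (the exact fusion and mode selection in PoP-Net may differ; the names, window size, and threshold below are hypothetical), one can snap each globally predicted joint to the most confident nearby part response and follow its displacement vector, falling back to the global estimate when no confident part is found:

```python
import numpy as np

def fuse_joint(global_xy, part_conf, part_disp, radius=5, conf_thresh=0.3):
    """Hypothetical fusion of a top-down joint estimate with a bottom-up part map.

    global_xy : (2,) joint location predicted by the global pose branch (pixels)
    part_conf : (H, W) part confidence map
    part_disp : (H, W, 2) displacement field pointing toward the part location
    """
    h, w = part_conf.shape
    x, y = int(round(global_xy[0])), int(round(global_xy[1]))
    x0, x1 = max(0, x - radius), min(w, x + radius + 1)
    y0, y1 = max(0, y - radius), min(h, y + radius + 1)
    window = part_conf[y0:y1, x0:x1]
    if window.size == 0 or window.max() < conf_thresh:
        return np.asarray(global_xy, dtype=float)   # mode selection: keep the global estimate
    iy, ix = np.unravel_index(window.argmax(), window.shape)
    py, px = y0 + iy, x0 + ix
    # Follow the displacement vector stored at the most confident nearby pixel.
    return np.array([px, py], dtype=float) + part_disp[py, px]
```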
Published at: EuroXR 2021: Virtual Reality and Mixed Reality, pp 51-64
The emergence of Augmented Reality (AR) has brought new challenges to the design of text entry interfaces. When wearing a pair of head-mounted AR glasses, a user's visual focus could be anywhere in the 360° space around her. For example, a technician may be looking up at an airplane engine while sharing her view with remote technicians through the sensors on the AR glasses. In such a scenario, the technician has to keep her gaze on the parts and look away from input devices such as a wireless keyboard or a touchscreen, leaving her with limited ability to input text for tasks like taking notes about a certain engine part. In this work, we designed and developed two innovative text entry interfaces: Continuous-touch T9 (CTT9) and Continuous-touch Dual Ring (CTDR). Both employ a smartphone touchscreen and a text entry layout projected in AR space to help users input text without looking at the smartphone. Our user studies suggest the effectiveness of CTT9 and CTDR and provide clues on how to optimize them. Based on the user study results, we provide insights about applying the proposed Continuous-touch (CT) paradigms to text entry for AR glasses.
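As a toy illustration of the continuous-touch idea (the actual CTT9 layout and selection logic are not specified here; the grid and key groups below are assumptions), a touch position on the unseen smartphone screen can be mapped to a T9 key group without any visual attention on the phone:

```python
# Hypothetical T9 key groups laid out on a 3x3 grid of the touchscreen.
T9_KEYS = [["abc", "def",  "ghi"],
           ["jkl", "mno",  "pqrs"],
           ["tuv", "wxyz", " "]]

def touch_to_key(u, v):
    """Map a normalized touch coordinate (u, v) in [0, 1)^2 to a T9 key group,
    so the user can keep her gaze on the AR content projected in space."""
    col = min(int(u * 3), 2)
    row = min(int(v * 3), 2)
    return T9_KEYS[row][col]
```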
Published at: 29th ACM International Conference on Multimedia, 2021, Pages 126–134
Given a set of multi-view videos that record the motion trajectory of an object, we propose to recover the object's kinematic formulas with neural rendering techniques. For example, if the input multi-view videos record the free fall of an object with different initial speeds v, the network aims to learn its kinematics: Δ = vt − (1/2)gt², where Δ, g, and t are the displacement, gravitational acceleration, and time. To achieve this goal, we design a novel framework consisting of a motion network and a differentiable renderer. For the differentiable renderer, we employ a Neural Radiance Field (NeRF), since the geometry is modeled implicitly by querying coordinates in space. The motion network is composed of a series of blending functions and linear weights, which allows us to derive the kinematic formulas analytically after training. The framework is trained end to end and only requires knowledge of the cameras' intrinsic and extrinsic parameters. To validate the proposed framework, we design three experiments that demonstrate its effectiveness and extensibility. The first experiment uses videos of free fall; the framework can easily be combined with the principle of parsimony, yielding the correct free-fall kinematics. The second experiment studies the large-angle pendulum, which has no analytical kinematics; using the differential equation governing pendulum dynamics as a physical prior in the framework, we demonstrate much faster convergence. Finally, we study an explosion animation and demonstrate that our framework handles such black-box-generated motions well.
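A minimal sketch of the idea behind the motion network, with the NeRF rendering loss replaced by direct supervision on displacement for brevity: displacement is modeled as a linear combination of simple basis ("blending") functions of time, so the fitted linear weights can be read off as a kinematic formula after training.

```python
import numpy as np

def basis(t):
    """Candidate blending functions of time; here simply t and t^2."""
    return np.stack([t, t ** 2], axis=-1)

t = np.linspace(0.0, 2.0, 50)             # time stamps of the video frames
v, g = 3.0, 9.81
delta = v * t - 0.5 * g * t ** 2          # synthetic displacements standing in for the
                                          # supervision the differentiable renderer provides

# Fit the linear weights of the motion model.
w, *_ = np.linalg.lstsq(basis(t), delta, rcond=None)
print(f"learned formula: delta = {w[0]:.2f}*t {w[1]:+.2f}*t^2")
# Roughly: delta = 3.00*t - 4.9*t^2, i.e. v ≈ 3.0 and g ≈ 9.8 recovered analytically.
```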
Published at: IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 12787-12796
Self-supervised depth estimation for indoor environments is more challenging than its outdoor counterpart in at least two aspects: (i) the depth range of indoor sequences varies greatly across frames, making it difficult for the depth network to induce consistent depth cues, whereas the maximum distance in outdoor scenes mostly stays the same because the camera usually sees the sky; (ii) indoor sequences contain far more rotational motion, which causes difficulties for the pose network, while the motion in outdoor sequences is predominantly translational, especially for driving datasets such as KITTI. In this paper, we give special consideration to these challenges and consolidate a set of good practices for improving the performance of self-supervised monocular depth estimation in indoor environments. The proposed method consists of two novel modules, a depth factorization module and a residual pose estimation module, each designed to tackle one of the challenges above. The effectiveness of each module is shown through a carefully conducted ablation study and state-of-the-art performance on three indoor datasets: EuRoC, NYUv2, and 7-Scenes.
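A minimal sketch of what a depth factorization head could look like (an assumed design for illustration; the module in the paper may be structured differently): the network predicts a normalized relative depth map plus a positive per-image scale, so the widely varying indoor depth range is absorbed by the scale factor.

```python
import torch
import torch.nn as nn

class DepthFactorization(nn.Module):
    """Hypothetical depth factorization head: relative depth map x per-image scale."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.rel_depth = nn.Sequential(nn.Conv2d(feat_dim, 1, 3, padding=1), nn.Sigmoid())
        self.scale = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(feat_dim, 1), nn.Softplus())

    def forward(self, feat):                  # feat: (B, feat_dim, H, W) encoder features
        rel = self.rel_depth(feat)            # (B, 1, H, W), relative depth in (0, 1)
        scale = self.scale(feat)              # (B, 1), positive per-image depth range
        return scale.view(-1, 1, 1, 1) * rel  # full depth map for this frame
```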
Published at: Computers & Graphics, Volume 95, 2021, Pages 81-91
In this paper, we propose a pipeline that reconstructs a 3D human shape avatar from a single image. Our approach simultaneously reconstructs the three-dimensional human geometry and a whole-body texture map with only a single RGB image as input. (…) Comprehensive experiments demonstrate that our solution is robust and effective on both public and our own datasets. Our human avatars can be easily rigged and animated using MoCap data. We have developed a mobile application that demonstrates this capability for AR applications.
Published at: 2020 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Porto de Galinhas, Brazil, 2020, pp. 156-163
In this paper, we propose a novel approach that combines geometric information from VIO with semantic information from object detectors to improve the performance of object detection on mobile devices. (…) The results show that our approach improves the accuracy of generic object detectors by 12% on our dataset.
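One hedged illustration of combining VIO geometry with detections (the helper below and its inputs are hypothetical, since the full method is elided here): an object center anchored in world coordinates via VIO can be re-projected into the current frame, and a 2D detection whose box contains the re-projected point can be kept or boosted with higher confidence.

```python
import numpy as np

def reproject_object_center(center_3d, T_wc, K):
    """Hypothetical helper: re-project a VIO-anchored 3D object center into the
    current camera frame, so a 2D detection can be checked against geometry.

    center_3d : (3,) object center in world coordinates (e.g., from a previous
                detection back-projected with VIO depth or a plane hit)
    T_wc      : (4, 4) world-to-camera pose from VIO for the current frame
    K         : (3, 3) camera intrinsics
    """
    p_cam = T_wc[:3, :3] @ center_3d + T_wc[:3, 3]
    if p_cam[2] <= 0:                 # behind the camera: no geometric support
        return None
    uv = K @ (p_cam / p_cam[2])
    return uv[:2]                     # expected pixel location of the object
```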
Published at: European Conference on Computer Vision (ECCV) 2020, pp 35-51
We propose a 3D-aware generative network along with a hybrid embedding module and a non-linear composition module. By explicitly modeling head motion and facial expressions (in our setting, facial expression refers to facial movements such as blinks and lip and chin motion), carefully manipulating the 3D animation, and dynamically embedding reference images, our approach achieves controllable, photo-realistic, and temporally coherent talking-head videos with natural head movements.
Published at: IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Sep 2020.
We present an RGBD-based, globally consistent dense 3D reconstruction approach that delivers high-resolution (< 1 cm) geometric reconstruction and high-quality texture mapping (at the spatial resolution of the RGB image), both of which run online using only the CPU of a portable device.
Published at: European Conference on Computer Vision – ECCV 2020 Workshops pp 327-342
In this paper, we propose a global information aware (GIA) module that extracts global information and integrates it into the network to improve low-light imaging. (…) Experimental results show that the proposed GIA-Net outperforms state-of-the-art methods on four metrics, including deep metrics that measure perceptual similarity. Extensive ablation studies verify that exploiting global information is what makes the proposed GIA-Net effective for low-light imaging.
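One common way to realize such a global branch, shown here only as an assumed sketch rather than the exact GIA block: pool the full feature map into a global descriptor and use it to modulate the local features channel-wise.

```python
import torch
import torch.nn as nn

class GlobalInfoModule(nn.Module):
    """Assumed sketch of a global-information branch: global pooling -> MLP ->
    channel-wise modulation of the local features."""

    def __init__(self, channels, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(),
                                 nn.Linear(hidden, channels), nn.Sigmoid())

    def forward(self, x):                          # x: (B, C, H, W) local features
        g = x.mean(dim=(2, 3))                     # (B, C) global descriptor of the image
        w = self.mlp(g).unsqueeze(-1).unsqueeze(-1)
        return x * w                               # inject global context into local features
```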
Published at: European Conference on Computer Vision – ECCV 2020 Workshops pp 699-713
In this paper, we introduce the Gated Clip Fusion Network (GCF-Net), which can greatly boost existing video action classifiers at the cost of a tiny computational overhead. (…) On a large benchmark dataset (Kinetics-600), the proposed GCF-Net improves the accuracy of existing action classifiers by 11.49% (based on the central clip) and 3.67% (based on densely sampled clips), respectively.
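A rough sketch of gated clip fusion (an assumed design; the actual GCF-Net gating is more elaborate): each clip feature receives a learned relevance gate, and the video-level representation is the gated average of the clips instead of a uniform average.

```python
import torch
import torch.nn as nn

class GatedClipFusion(nn.Module):
    """Assumed sketch: weight clip features by learned gates before video-level pooling."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clip_feats):                 # clip_feats: (B, N_clips, feat_dim)
        gates = self.gate(clip_feats)              # (B, N_clips, 1) per-clip relevance
        video_feat = (gates * clip_feats).sum(1) / gates.sum(1).clamp(min=1e-6)
        return self.classifier(video_feat)         # video-level action logits
```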
Published at: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 2020, pp. 1852-1861
The proposed RCA-GAN consistently yields better visual quality, with more detailed and natural textures, than baseline models, and achieves comparable or better performance than state-of-the-art methods for real-world image super-resolution.
Published at: 2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), Atlanta, GA, USA, 2020, pp. 575-576
In this paper, we present a method that estimates the real-world lighting condition from a single RGB image of an indoor scene, using information about a support plane provided by commercial Augmented Reality (AR) frameworks (e.g., ARCore, ARKit).
Published at: VRCAI '19: The 17th International Conference on Virtual-Reality Continuum and its Applications in Industry, November 2019, Article No. 12 (Best Paper Award)
In this paper, we propose a pipeline that reconstructs a 3D human shape avatar at a glance. Our approach simultaneously reconstructs the three-dimensional human geometry and a whole-body texture map with only a single RGB image as input.
Published at: International Symposium on Visual Computing (ISVC) 2019, pp 141-153
In this paper, we propose practical methods to process ToF depth maps in real time, enabling occlusion handling and collision detection for AR applications simultaneously. Our experimental results show real-time performance and good visual quality for both occlusion rendering and collision detection.
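For the occlusion-handling part, a minimal per-pixel depth test illustrates the standard scheme such methods build on (sketched here under the assumption of aligned, metrically consistent depth maps; the paper's processing pipeline is more involved):

```python
import numpy as np

def occlusion_mask(real_depth, virtual_depth, eps=0.01):
    """Per-pixel occlusion test: a virtual fragment is hidden wherever the processed
    ToF depth of the real scene is closer to the camera than the rendered virtual depth.

    real_depth    : (H, W) filtered/upsampled ToF depth map, in meters
    virtual_depth : (H, W) depth buffer of the rendered virtual object, in meters
                    (np.inf where no virtual content is drawn)
    eps           : tolerance in meters to suppress flickering at near-equal depths
    """
    return real_depth + eps < virtual_depth   # True where the real scene occludes
```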