Highlights of Research at InnoPeak
Published at: IEEE Transactions on Visualization and Computer Graphics (TVCG), 2022
Augmented Reality (AR) applications aim to blend virtual objects realistically with the real world. One of the key factors for realistic AR is correct lighting estimation. In this paper, we present a method that estimates the real-world lighting condition from a single image in real time, using information from an optional support plane provided by advanced AR frameworks (e.g., ARCore, ARKit). By analyzing the visual appearance of the real scene, our algorithm predicts the lighting condition from the input RGB photo. In the first stage, we use a deep neural network to decompose the scene into several components: lighting, normals, and BRDF. We then introduce differentiable screen-space rendering, a novel approach to providing the supervisory signal for regressing lighting, normals, and BRDF jointly. We recover the most plausible real-world lighting condition as Spherical Harmonics plus a main directional light. Through a variety of experiments, we demonstrate that our method gives better results than prior work, both quantitatively and qualitatively, and that it can enhance real-time AR experiences.
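To make the recovered lighting representation concrete, the sketch below shades a Lambertian surface point under a second-order Spherical Harmonics environment plus one main directional light, i.e., the kind of lighting estimate described above. The function names and the (9, 3) per-channel coefficient layout are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

# Constants for the 9 real SH basis functions (bands l = 0, 1, 2).
SH_C = np.array([0.282095,
                 0.488603, 0.488603, 0.488603,
                 1.092548, 1.092548, 0.315392, 1.092548, 0.546274])

def sh_basis(n):
    """Evaluate the 9 second-order SH basis functions at a unit normal n = (x, y, z)."""
    x, y, z = n
    return SH_C * np.array([1.0,
                            y, z, x,
                            x * y, y * z, 3.0 * z * z - 1.0, x * z, x * x - y * y])

def shade(normal, albedo, sh_coeffs, light_dir, light_color):
    """Lambertian shading under an SH ambient estimate plus one directional light.
    sh_coeffs: (9, 3) per-channel SH coefficients (assumed layout for this sketch)."""
    n = normal / np.linalg.norm(normal)
    ambient = sh_basis(n) @ sh_coeffs                 # (3,) ambient irradiance per channel
    l = light_dir / np.linalg.norm(light_dir)
    direct = max(np.dot(n, l), 0.0) * light_color     # (3,) main directional contribution
    return albedo * (ambient + direct)
```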
Published at: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022, pp. 1240-1249
In this paper, we propose PoP-Net, a real-time method that predicts multi-person 3D poses from a depth image. PoP-Net learns to predict bottom-up part representations and top-down global poses in a single shot. Specifically, we introduce a new part-level representation, the Truncated Part Displacement Field (TPDF), which enables an explicit fusion process that unifies the advantages of bottom-up part detection and global pose detection. Meanwhile, an effective mode selection scheme automatically resolves conflicts between the global pose and the part detections. Finally, given the lack of high-quality depth datasets for multi-person 3D pose estimation, we introduce the Multi-Person 3D Human Pose Dataset (MP-3DHP) as a new benchmark. MP-3DHP is designed to enable effective multi-person and background data augmentation during training and to evaluate 3D human pose estimators under uncontrolled multi-person scenarios. We show that PoP-Net achieves state-of-the-art results on both MP-3DHP and the widely used ITOP dataset, and that it has significant efficiency advantages for multi-person processing. The MP-3DHP dataset and evaluation code are available at: https://github.com/oppo-us-research/PoP-Net.
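As a rough illustration of how a part displacement field can refine a global pose (the exact fusion and mode selection in PoP-Net may differ; the names, window size, and threshold below are hypothetical), one can snap each globally predicted joint to the most confident nearby part response and follow its displacement vector, falling back to the global estimate when no confident part is found:

```python
import numpy as np

def fuse_joint(global_xy, part_conf, part_disp, radius=5, conf_thresh=0.3):
    """Hypothetical fusion of a top-down joint estimate with a bottom-up part map.

    global_xy : (2,) joint location predicted by the global pose branch (pixels)
    part_conf : (H, W) part confidence map
    part_disp : (H, W, 2) displacement field pointing toward the part location
    """
    h, w = part_conf.shape
    x, y = int(round(global_xy[0])), int(round(global_xy[1]))
    x0, x1 = max(0, x - radius), min(w, x + radius + 1)
    y0, y1 = max(0, y - radius), min(h, y + radius + 1)
    window = part_conf[y0:y1, x0:x1]
    if window.size == 0 or window.max() < conf_thresh:
        return np.asarray(global_xy, dtype=float)   # mode selection: keep the global estimate
    iy, ix = np.unravel_index(window.argmax(), window.shape)
    py, px = y0 + iy, x0 + ix
    # Follow the displacement vector stored at the most confident nearby pixel.
    return np.array([px, py], dtype=float) + part_disp[py, px]
```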
Published at: EuroXR 2021: Virtual Reality and Mixed Reality, pp 51-64
The emergence of Augmented Reality (AR) has brought new challenges to the design of text entry interfaces. When wearing a pair of head-mounted AR glasses, a user's visual focus could be anywhere in the 360° space around her. For example, a technician may be looking up at an airplane engine while sharing her view with remote technicians through the sensors on the AR glasses. In such a scenario, the technician has to keep her gaze on the parts and look away from input devices such as a wireless keyboard or a touchscreen, leaving her with limited ability to input text for tasks like taking notes about a certain engine part. In this work, we designed and developed two innovative text entry interfaces: Continuous-touch T9 (CTT9) and Continuous-touch Dual Ring (CTDR). Both employ a smartphone touchscreen and a text entry layout projected in AR space to help users input text without looking at the smartphone. Our user studies suggest the effectiveness of CTT9 and CTDR and provide clues on how to optimize them. Based on the user study results, we provide insights about applying the proposed Continuous-touch (CT) paradigms to text entry for AR glasses.
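As a toy illustration of the continuous-touch idea (the actual CTT9 layout and selection logic are not specified here; the grid and key groups below are assumptions), a touch position on the unseen smartphone screen can be mapped to a T9 key group without any visual attention on the phone:

```python
# Hypothetical T9 key groups laid out on a 3x3 grid of the touchscreen.
T9_KEYS = [["abc", "def",  "ghi"],
           ["jkl", "mno",  "pqrs"],
           ["tuv", "wxyz", " "]]

def touch_to_key(u, v):
    """Map a normalized touch coordinate (u, v) in [0, 1)^2 to a T9 key group,
    so the user can keep her gaze on the AR content projected in space."""
    col = min(int(u * 3), 2)
    row = min(int(v * 3), 2)
    return T9_KEYS[row][col]
```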
Published at: 29th ACM International Conference on Multimedia, 2021, Pages 126–134
Given a set of multi-view videos that record the motion trajectory of an object, we propose to recover the object's kinematic formulas with neural rendering techniques. For example, if the input multi-view videos record the free fall of an object with different initial speeds v, the network aims to learn its kinematics: Δ = vt − (1/2)gt², where Δ, g, and t are the displacement, gravitational acceleration, and time. To achieve this goal, we design a novel framework consisting of a motion network and a differentiable renderer. For the differentiable renderer, we employ a Neural Radiance Field (NeRF), since the geometry is modeled implicitly by querying coordinates in space. The motion network is composed of a series of blending functions and linear weights, which allows us to derive the kinematic formulas analytically after training. The framework is trained end to end and only requires knowledge of the cameras' intrinsic and extrinsic parameters. To validate the proposed framework, we design three experiments that demonstrate its effectiveness and extensibility. The first experiment uses videos of free fall; the framework can easily be combined with the principle of parsimony, yielding the correct free-fall kinematics. The second experiment studies the large-angle pendulum, which has no analytical kinematics; using the differential equation governing pendulum dynamics as a physical prior in the framework, we demonstrate much faster convergence. Finally, we study an explosion animation and demonstrate that our framework handles such black-box-generated motions well.
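A minimal sketch of the idea behind the motion network, with the NeRF rendering loss replaced by direct supervision on displacement for brevity: displacement is modeled as a linear combination of simple basis ("blending") functions of time, so the fitted linear weights can be read off as a kinematic formula after training.

```python
import numpy as np

def basis(t):
    """Candidate blending functions of time; here simply t and t^2."""
    return np.stack([t, t ** 2], axis=-1)

t = np.linspace(0.0, 2.0, 50)             # time stamps of the video frames
v, g = 3.0, 9.81
delta = v * t - 0.5 * g * t ** 2          # synthetic displacements standing in for the
                                          # supervision the differentiable renderer provides

# Fit the linear weights of the motion model.
w, *_ = np.linalg.lstsq(basis(t), delta, rcond=None)
print(f"learned formula: delta = {w[0]:.2f}*t {w[1]:+.2f}*t^2")
# Roughly: delta = 3.00*t - 4.9*t^2, i.e. v ≈ 3.0 and g ≈ 9.8 recovered analytically.
```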
Published at: IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 12787-12796
Self-supervised depth estimation for indoor environments is more challenging than its outdoor counterpart in at least two aspects: (i) the depth range of indoor sequences varies greatly across frames, making it difficult for the depth network to induce consistent depth cues, whereas the maximum distance in outdoor scenes mostly stays the same because the camera usually sees the sky; (ii) indoor sequences contain far more rotational motion, which causes difficulties for the pose network, while the motion in outdoor sequences is predominantly translational, especially for driving datasets such as KITTI. In this paper, we give special consideration to these challenges and consolidate a set of good practices for improving the performance of self-supervised monocular depth estimation in indoor environments. The proposed method consists of two novel modules, a depth factorization module and a residual pose estimation module, each designed to tackle one of the challenges above. The effectiveness of each module is shown through a carefully conducted ablation study and state-of-the-art performance on three indoor datasets: EuRoC, NYUv2, and 7-Scenes.
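A minimal sketch of what a depth factorization head could look like (an assumed design for illustration; the module in the paper may be structured differently): the network predicts a normalized relative depth map plus a positive per-image scale, so the widely varying indoor depth range is absorbed by the scale factor.

```python
import torch
import torch.nn as nn

class DepthFactorization(nn.Module):
    """Hypothetical depth factorization head: relative depth map x per-image scale."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.rel_depth = nn.Sequential(nn.Conv2d(feat_dim, 1, 3, padding=1), nn.Sigmoid())
        self.scale = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(feat_dim, 1), nn.Softplus())

    def forward(self, feat):                  # feat: (B, feat_dim, H, W) encoder features
        rel = self.rel_depth(feat)            # (B, 1, H, W), relative depth in (0, 1)
        scale = self.scale(feat)              # (B, 1), positive per-image depth range
        return scale.view(-1, 1, 1, 1) * rel  # full depth map for this frame
```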
Published at: Computers & Graphics, Volume 95, 2021, Pages 81-91
In this paper, we propose a pipeline that reconstructs a 3D human shape avatar from a single image. Our approach simultaneously reconstructs the three-dimensional human geometry and a whole-body texture map with only a single RGB image as input. (…) Comprehensive experiments demonstrate that our solution is robust and effective on both public and our own datasets. Our human avatars can be easily rigged and animated using MoCap data. We have developed a mobile application that demonstrates this capability for AR applications.
Published at: 2020 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Porto de Galinhas, Brazil, 2020, pp. 156-163
In this paper, we propose a novel approach that combines geometric information from VIO with semantic information from object detectors to improve the performance of object detection on mobile devices. (…) The results show that our approach improves the accuracy of generic object detectors by 12% on our dataset.
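One hedged illustration of combining VIO geometry with detections (the helper below and its inputs are hypothetical, since the full method is elided here): an object center anchored in world coordinates via VIO can be re-projected into the current frame, and a 2D detection whose box contains the re-projected point can be kept or boosted with higher confidence.

```python
import numpy as np

def reproject_object_center(center_3d, T_wc, K):
    """Hypothetical helper: re-project a VIO-anchored 3D object center into the
    current camera frame, so a 2D detection can be checked against geometry.

    center_3d : (3,) object center in world coordinates (e.g., from a previous
                detection back-projected with VIO depth or a plane hit)
    T_wc      : (4, 4) world-to-camera pose from VIO for the current frame
    K         : (3, 3) camera intrinsics
    """
    p_cam = T_wc[:3, :3] @ center_3d + T_wc[:3, 3]
    if p_cam[2] <= 0:                 # behind the camera: no geometric support
        return None
    uv = K @ (p_cam / p_cam[2])
    return uv[:2]                     # expected pixel location of the object
```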
Published at: European Conference on Computer Vision (ECCV) 2020, pp 35-51
We propose a 3D-aware generative network along with a hybrid embedding module and a non-linear composition module. By explicitly modeling head motion and facial expressions (in our setting, facial expression refers to facial movements such as blinks and lip and chin motion), carefully manipulating the 3D animation, and dynamically embedding reference images, our approach achieves controllable, photo-realistic, and temporally coherent talking-head videos with natural head movements.
Published at: IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Sep 2020.
We present an RGBD-based, globally consistent dense 3D reconstruction approach that delivers high-resolution (< 1 cm) geometric reconstruction and high-quality texture mapping (at the spatial resolution of the RGB image), both of which run online using only the CPU of a portable device.
Published at: European Conference on Computer Vision – ECCV 2020 Workshops pp 327-342
In this paper, we propose a global information aware (GIA) module that extracts global information and integrates it into the network to improve low-light imaging. (…) Experimental results show that the proposed GIA-Net outperforms state-of-the-art methods on four metrics, including deep metrics that measure perceptual similarity. Extensive ablation studies verify that exploiting global information is what makes the proposed GIA-Net effective for low-light imaging.
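One common way to realize such a global branch, shown here only as an assumed sketch rather than the exact GIA block: pool the full feature map into a global descriptor and use it to modulate the local features channel-wise.

```python
import torch
import torch.nn as nn

class GlobalInfoModule(nn.Module):
    """Assumed sketch of a global-information branch: global pooling -> MLP ->
    channel-wise modulation of the local features."""

    def __init__(self, channels, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(),
                                 nn.Linear(hidden, channels), nn.Sigmoid())

    def forward(self, x):                          # x: (B, C, H, W) local features
        g = x.mean(dim=(2, 3))                     # (B, C) global descriptor of the image
        w = self.mlp(g).unsqueeze(-1).unsqueeze(-1)
        return x * w                               # inject global context into local features
```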
Published at: European Conference on Computer Vision – ECCV 2020 Workshops pp 699-713
In this paper, we introduce the Gated Clip Fusion Network (GCF-Net), which can greatly boost existing video action classifiers at the cost of a tiny computational overhead. (…) On a large benchmark dataset (Kinetics-600), the proposed GCF-Net improves the accuracy of existing action classifiers by 11.49% (based on the central clip) and 3.67% (based on densely sampled clips), respectively.
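A rough sketch of gated clip fusion (an assumed design; the actual GCF-Net gating is more elaborate): each clip feature receives a learned relevance gate, and the video-level representation is the gated average of the clips instead of a uniform average.

```python
import torch
import torch.nn as nn

class GatedClipFusion(nn.Module):
    """Assumed sketch: weight clip features by learned gates before video-level pooling."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clip_feats):                 # clip_feats: (B, N_clips, feat_dim)
        gates = self.gate(clip_feats)              # (B, N_clips, 1) per-clip relevance
        video_feat = (gates * clip_feats).sum(1) / gates.sum(1).clamp(min=1e-6)
        return self.classifier(video_feat)         # video-level action logits
```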
Published at: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 2020, pp. 1852-1861
The proposed RCA-GAN consistently yields better visual quality, with more detailed and natural textures, than baseline models, and achieves comparable or better performance than state-of-the-art methods for real-world image super-resolution.
Published at: 2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), Atlanta, GA, USA, 2020, pp. 575-576
In this paper, we present a method that estimates the real-world lighting condition from a single RGB image of an indoor scene, using information about a support plane provided by commercial Augmented Reality (AR) frameworks (e.g., ARCore, ARKit).
Published at: VRCAI '19: The 17th International Conference on Virtual-Reality Continuum and its Applications in Industry, November 2019, Article No. 12 (Best Paper Award)
In this paper, we propose a pipeline that reconstructs a 3D human shape avatar at a glance. Our approach simultaneously reconstructs the three-dimensional human geometry and a whole-body texture map with only a single RGB image as input.
Published at: International Symposium on Visual Computing (ISVC) 2019, pp 141-153
In this paper, we propose practical methods to process ToF depth maps in real time, enabling occlusion handling and collision detection for AR applications simultaneously. Our experimental results show real-time performance and good visual quality for both occlusion rendering and collision detection.
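For the occlusion-handling part, a minimal per-pixel depth test illustrates the standard scheme such methods build on (sketched here under the assumption of aligned, metrically consistent depth maps; the paper's processing pipeline is more involved):

```python
import numpy as np

def occlusion_mask(real_depth, virtual_depth, eps=0.01):
    """Per-pixel occlusion test: a virtual fragment is hidden wherever the processed
    ToF depth of the real scene is closer to the camera than the rendered virtual depth.

    real_depth    : (H, W) filtered/upsampled ToF depth map, in meters
    virtual_depth : (H, W) depth buffer of the rendered virtual object, in meters
                    (np.inf where no virtual content is drawn)
    eps           : tolerance in meters to suppress flickering at near-equal depths
    """
    return real_depth + eps < virtual_depth   # True where the real scene occludes
```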