Information-enhanced Network for Noncontact Heart Rate Estimation from Facial Videos

Remote photoplethysmography (rPPG) is an important noncontact technique for measuring heart rate (HR), an indicator of physical and mental health that is useful in diagnosing cardiovascular and neurological diseases. Many noncontact HR estimation methods have been proposed in recent years, but most rely on a single modality as the source of HR information, so their estimates degrade under noise and insufficient information. To address these problems, this paper proposes a novel information-enhanced network for HR estimation from multimodal sources (e.g., RGB and NIR). In the network, context information and modal difference information are enhanced sequentially, from the spatiotemporal and modal views respectively, to describe HR-aware features accurately, while maximum frequency information is enhanced to suppress heartbeat noise. Specifically, a context-enhanced video Swin-Transformer (CET) module extracts rPPG signal features from visible-light and near-infrared facial videos. A novel modal difference enhanced fusion (MDEF) module then produces a fused rPPG signal, which is fed to the frequency-enhanced estimation (FEE) module to obtain the corresponding HR value. The three modules are integrated and learned jointly end to end, and the multimodal combination provides highly complementary information for HR estimation. Experiments on three multimodal datasets show that the proposed model outperforms state-of-the-art methods.
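
To make the last stage of the pipeline concrete, the sketch below shows the classical way an HR value is read off an rPPG signal: take the frequency with the highest spectral power inside a plausible heart-rate band and convert it to beats per minute. This is not the FEE module, which is learned; it is only the standard peak-picking baseline that frequency-based estimation builds on. The function name `estimate_hr_bpm`, the 0.7 to 4.0 Hz band (42 to 240 bpm), the 30 fps sampling rate, and the synthetic test signal are all assumptions made for illustration.

```python
import math


def estimate_hr_bpm(signal, fs, lo_hz=0.7, hi_hz=4.0):
    """Return the dominant frequency of `signal` inside [lo_hz, hi_hz], in bpm.

    `signal` is a 1-D sequence of rPPG samples and `fs` is the sampling rate
    in Hz. The band limits are an assumed plausible heart-rate range.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC offset

    best_freq, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n  # frequency of DFT bin k
        if freq < lo_hz or freq > hi_hz:
            continue  # ignore bins outside the heart-rate band
        # Real and imaginary parts of the DFT at bin k (plain O(n) sums).
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_freq, best_power = freq, power

    if best_freq is None:
        raise ValueError("signal too short to resolve the heart-rate band")
    return 60.0 * best_freq


# Synthetic 10 s clip at 30 fps: a 1.2 Hz pulse plus a slow 0.2 Hz drift
# that stands in for out-of-band noise such as illumination change.
fs = 30.0
n = 300
signal = [
    math.sin(2 * math.pi * 1.2 * i / fs) + 0.3 * math.sin(2 * math.pi * 0.2 * i / fs)
    for i in range(n)
]
hr = estimate_hr_bpm(signal, fs)
print(f"estimated HR: {hr:.1f} bpm")
```

On this synthetic input the 0.2 Hz drift falls outside the band, so the peak lands on the 1.2 Hz bin, that is, about 72 bpm. Real facial-video signals contain in-band noise as well, which is the case a plain spectral peak handles poorly.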