Bidirectional Long Short-Term Memory Variational Autoencoder
The Variational Autoencoder (VAE) has achieved considerable success since its introduction. In recent years, many variants have been developed, notably those that extend the VAE to sequential data [1, 2, 5, 7]. However, these works either do not generate sequential latent variables or encode latent variables based only on inputs from earlier time-steps. We argue that, in real-world settings, the latent variable at a given time-step should be encoded from both preceding and succeeding observations. Building on this observation, we theoretically derive the bidirectional Long Short-Term Memory Variational Autoencoder (bLSTM-VAE), a novel variant of the VAE whose encoders and decoders are implemented by bidirectional Long Short-Term Memory (bLSTM) networks. The proposed bLSTM-VAE encodes a sequential input as an equal-length sequence of latent variables. The latent variable at a given time-step is encoded by simultaneously processing the observations from the first time-step to the current time-step in forward order and the observations from the current time-step to the last time-step in backward order. As a result, we expect the proposed bLSTM-VAE to learn latent variables reliably by exploiting contextual information from the whole input sequence. To validate the proposed method, we apply it to gesture recognition using 3D skeletal joint data. The evaluation is conducted on the ChaLearn Look at People gesture dataset and the NTU RGB+D dataset. The experimental results show that a classification network combined with the proposed bLSTM-VAE performs better than one combined with a standard VAE, and also outperforms several state-of-the-art methods.
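
The bidirectional encoding described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the weights are random and untrained, the Gaussian parameterization (a mean and a log-variance per time-step, sampled with the usual reparameterization trick) is an assumption, and all names (`make_encoder`, `encode`, `run_lstm`, the layer sizes) are hypothetical. It only shows how one latent variable per time-step can be computed from a forward pass over the earlier frames and a backward pass over the later frames.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def rand_affine(rows, cols, rng):
    """Random weight matrix plus zero bias (untrained, for illustration only)."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return weight, [0.0] * rows

def affine(weight_bias, vec):
    weight, bias = weight_bias
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def lstm_step(params, x, h, c):
    """One standard LSTM cell update with input, forget, candidate and output gates."""
    xh = x + h  # concatenate input and previous hidden state
    i = [sigmoid(v) for v in affine(params["i"], xh)]
    f = [sigmoid(v) for v in affine(params["f"], xh)]
    g = [math.tanh(v) for v in affine(params["g"], xh)]
    o = [sigmoid(v) for v in affine(params["o"], xh)]
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new

def run_lstm(params, xs, hidden_dim):
    """Run the cell over a sequence and return the hidden state at every step."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    states = []
    for x in xs:
        h, c = lstm_step(params, x, h, c)
        states.append(h)
    return states

def make_encoder(input_dim, hidden_dim, latent_dim, seed=0):
    """Hypothetical encoder: two LSTMs plus linear heads for mean and log-variance."""
    rng = random.Random(seed)
    def lstm():
        return {gate: rand_affine(hidden_dim, input_dim + hidden_dim, rng) for gate in "ifgo"}
    return {
        "hidden_dim": hidden_dim,
        "fwd": lstm(),
        "bwd": lstm(),
        "mu": rand_affine(latent_dim, 2 * hidden_dim, rng),
        "log_var": rand_affine(latent_dim, 2 * hidden_dim, rng),
    }

def encode(xs, enc, rng=None):
    """Return one (mu, log_var, z) triple per time-step of the input sequence."""
    fwd = run_lstm(enc["fwd"], xs, enc["hidden_dim"])              # summarizes x_1..x_t
    bwd = run_lstm(enc["bwd"], xs[::-1], enc["hidden_dim"])[::-1]  # summarizes x_T..x_t
    latents = []
    for h_fwd, h_bwd in zip(fwd, bwd):
        context = h_fwd + h_bwd
        mu = affine(enc["mu"], context)
        log_var = affine(enc["log_var"], context)
        # Reparameterization: z = mu + sigma * eps; eps is 0 when no rng is given.
        eps = [rng.gauss(0.0, 1.0) if rng else 0.0 for _ in mu]
        z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
        latents.append((mu, log_var, z))
    return latents

# Toy input: 5 frames with 3 coordinates each (a stand-in for skeletal joint data).
enc = make_encoder(input_dim=3, hidden_dim=4, latent_dim=2)
frames = [[0.1 * t, 0.0, -0.1 * t] for t in range(5)]
latents = encode(frames, enc)
```

In this sketch, `latents` has the same length as `frames`, and altering only the last frame changes the latent mean at the first time-step, which a forward-only encoder would not do. Passing `rng=random.Random(...)` to `encode` draws stochastic samples instead of returning the means.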