Face detection/alignment methods have reached a satisfactory state in static images captured under arbitrary conditions. Such methods typically perform (joint) fitting for each frame and are used in commercial applications; however in the majority of the real-world scenarios the dynamic scenes are of interest. We argue that generic fitting per frame is suboptimal (it discards the informative correlation of sequential frames) and propose to learn person-specific statistics from the video to improve the generic results. To that end, we introduce a meticulously studied pipeline, which we name PD 2 T, that performs person-specific detection and landmark localisation. We carry out extensive experimentation with a diverse set of i) generic fitting results, ii) different objects (human faces, animal faces) that illustrate the powerful properties of our proposed pipeline and experimentally verify that PD 2 T outperforms all the compared methods.