Variation in viewpoints poses significant challenges to action recognition. One popular way of encoding a view-invariant action representation is to exploit the epipolar geometry between different views of the same action. The majority of representative work relies on detecting landmark points and tracking them, assuming that motion trajectories for all landmark points on the human body are available throughout the course of an action. Unfortunately, due to occlusion and noise, detection and tracking of these landmarks is not always robust. To work around this, some approaches assume that such trajectories are manually marked, which is a clear drawback and forfeits the automation that computer vision is meant to provide. In this paper, we address this problem by proposing a view-invariant action matching score based on the epipolar geometry between actor silhouettes, without tracking and without explicit point correspondences. In addition, we explore the multi-body epipolar constraint, which makes it possible to work on the original action volumes without any pre-processing. We show that the multi-body fundamental matrix captures the geometry of dynamic action scenes and helps devise an action matching score across different views without any prior segmentation of actors. Extensive experimentation on challenging view-invariant action datasets shows that our approach not only removes long-standing assumptions but also achieves significant improvement in recognition accuracy and retrieval performance.
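The matching score rests on the epipolar constraint x2ᵀ F x1 = 0 relating points imaged in two views. As a rough illustration only, the sketch below fits a fundamental matrix to sampled silhouette boundary points with RANSAC and turns the symmetric epipolar residual into a similarity. It assumes corresponding point samples, which the proposed method explicitly avoids, and the function names and the OpenCV-based estimation are illustrative assumptions rather than the paper's formulation.

```python
import numpy as np
import cv2


def symmetric_epipolar_residual(pts1, pts2, F):
    """Mean symmetric point-to-epipolar-line distance under x2^T F x1 = 0.

    pts1, pts2: (N, 2) float arrays of corresponding image points.
    """
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])  # homogeneous, (N, 3)
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    l2 = x1 @ F.T                                    # epipolar lines in view 2
    l1 = x2 @ F                                      # epipolar lines in view 1
    num = np.abs(np.sum(x2 * l2, axis=1))            # |x2^T F x1| per point
    d = num / np.sqrt(l2[:, 0] ** 2 + l2[:, 1] ** 2) \
        + num / np.sqrt(l1[:, 0] ** 2 + l1[:, 1] ** 2)
    return d.mean()


def epipolar_match_score(sil_pts_view1, sil_pts_view2):
    """Illustrative score: fit F to silhouette boundary samples with RANSAC
    and map the residual to a similarity (higher = better geometric agreement)."""
    p1 = np.asarray(sil_pts_view1, dtype=np.float32)
    p2 = np.asarray(sil_pts_view2, dtype=np.float32)
    F, mask = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC, 1.0, 0.99)
    if F is None:
        return 0.0
    inliers = mask.ravel().astype(bool)
    res = symmetric_epipolar_residual(p1[inliers], p2[inliers], F)
    return 1.0 / (1.0 + res)
```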
View-invariant action recognition is one of the most challenging problems in computer vision. Various representations have been devised for matching actions across different viewpoints to achieve view invariance. In this paper, we explore the invariance of the temporal order of action instances during action execution and use it to devise a new view-invariant action recognition approach. To ensure temporal order during matching, we use spatiotemporal features, feature fusion, and a temporal order consistency constraint. We start by extracting spatiotemporal cuboid features from video sequences and applying feature fusion to encapsulate within-class similarity for the same viewpoints. For each action class, we construct a feature fusion table to facilitate feature matching across different views. An action matching score is then calculated based on a global temporal order constraint and the number of matching features. Finally, the label of the class with the maximum matching score is assigned to the query action. Experiments are performed on the multi-view INRIA Xmas Motion Acquisition Sequences (IXMAS) and West Virginia University (WVU) action datasets, with encouraging results that are comparable to existing view-invariant action recognition techniques.
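To make the temporal order constraint concrete, the sketch below scores a query against a reference action by counting the largest set of matched spatiotemporal features whose reference frame times preserve the query frame order, computed as a longest increasing subsequence. This is one plausible way to enforce a global temporal order constraint under these assumptions; the function names and the normalization by the number of query features are illustrative, not the exact formulation used in the paper.

```python
import bisect


def temporally_consistent_matches(matches):
    """matches: list of (t_query, t_reference) frame indices of matched
    spatiotemporal features. Returns the size of the largest subset whose
    reference times preserve the query-time order, i.e. the longest
    increasing subsequence of reference times after sorting by query time."""
    matches = sorted(matches)                 # sort by query frame time
    tails = []                                # patience-sorting tails
    for _, t_ref in matches:
        i = bisect.bisect_left(tails, t_ref)  # keep reference times increasing
        if i == len(tails):
            tails.append(t_ref)
        else:
            tails[i] = t_ref
    return len(tails)


def action_matching_score(matches, n_query_features):
    """Illustrative score: fraction of query features matched in a temporally
    consistent order; the query is assigned the class with the highest score."""
    if n_query_features == 0:
        return 0.0
    return temporally_consistent_matches(matches) / float(n_query_features)
```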