Continuous Body and Hand Gesture Recognition for Natural Human-Computer Interaction
`
`YALE SONG, DAVID DEMIRDJIAN, and RANDALL DAVIS,
`Massachusetts Institute of Technology
`
`5
`
`Intelligent gesture recognition systems open a new era of natural human-computer interaction: Gesturing
`is instinctive and a skill we all have, so it requires little or no thought, leaving the focus on the task itself, as
`it should be, not on the interaction modality. We present a new approach to gesture recognition that attends
`to both body and hands, and interprets gestures continuously from an unsegmented and unbounded input
stream. This article describes the whole procedure of continuous body and hand gesture recognition, from signal acquisition and processing to the interpretation of the processed signals.
`Our system takes a vision-based approach, tracking body and hands using a single stereo camera. Body
`postures are reconstructed in 3D space using a generative model-based approach with a particle filter,
`combining both static and dynamic attributes of motion as the input feature to make tracking robust to
`self-occlusion. The reconstructed body postures guide searching for hands. Hand shapes are classified into
`one of several canonical hand shapes using an appearance-based approach with a multiclass support vector
`machine. Finally, the extracted body and hand features are combined and used as the input feature for
`gesture recognition. We consider our task as an online sequence labeling and segmentation problem. A latent-
`dynamic conditional random field is used with a temporal sliding window to perform the task continuously.
`We augment this with a novel technique called multilayered filtering, which performs filtering both on
`the input layer and the prediction layer. Filtering on the input layer allows capturing long-range temporal
`dependencies and reducing input signal noise; filtering on the prediction layer allows taking weighted votes
`of multiple overlapping prediction results as well as reducing estimation noise.
`We tested our system in a scenario of real-world gestural interaction using the NATOPS dataset, an
`official vocabulary of aircraft handling gestures. Our experimental results show that: (1) the use of both
`static and dynamic attributes of motion in body tracking allows statistically significant improvement of
`the recognition performance over using static attributes of motion alone; and (2) the multilayered filtering
`statistically significantly improves recognition performance over the nonfiltering method. We also show that,
`on a set of twenty-four NATOPS gestures, our system achieves a recognition accuracy of 75.37%.
`Categories and Subject Descriptors: I.4.8 [Image Processing and Computer Vision]: Scene Analysis—
`Motion; I.5.5 [Pattern Recognition]: Implementation—Interactive systems
`General Terms: Algorithms, Design, Experimentation
`Additional Key Words and Phrases: Pose tracking, gesture recognition, human-computer interaction, online
`sequence labeling and segmentation, conditional random fields, multilayered filtering
`ACM Reference Format:
`Song, Y., Demirdjian, D., and Davis, R. 2012. Continuous body and hand gesture recognition for natural
`human-computer interaction. ACM Trans. Interact. Intell. Syst. 2, 1, Article 5 (March 2012), 28 pages.
`DOI = 10.1145/2133366.2133371 http://doi.acm.org/10.1145/2133366.2133371
`
`This work was funded in part by the Office of Naval Research Science of Autonomy program, contract no.
`N000140910625, and in part by the National Science Foundation grant no. IIS-1018055.
`Authors’ addresses: Y. Song (corresponding author), D. Demirdjian, and R. Davis, Computer Science and
`Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar St., Cambridge, MA
`02139; email: yalesong@csail.mit.edu.
`Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
`without fee provided that copies are not made or distributed for profit or commercial advantage and that
`copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for
`components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
`To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this
`work in other works requires prior specific permission and/or a fee. Permissions may be requested from
`Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)
`869-0481, or permissions@acm.org.
© 2012 ACM 2160-6455/2012/03-ART5 $10.00
`DOI 10.1145/2133366.2133371 http://doi.acm.org/10.1145/2133366.2133371
`
`
`1. INTRODUCTION
`For more than 40 years, human-computer interaction has been focused on the keyboard
`and mouse. Although this has been successful, as computation becomes increasingly
`mobile, embedded, and ubiquitous, it is far too constraining as a model of interaction.
`Evidence suggests that gesture-based interaction is the wave of the future, with consid-
`erable attention from both the research community (see recent survey articles by Mitra
`and Acharya [2007] and by Weinland et al. [2011]) and from the industry and public
`media (e.g., Microsoft Kinect). Evidence can also be found in a wide range of potential
`application areas, such as medical devices, video gaming, robotics, video surveillance,
`and natural human-computer interaction.
`Gestural interaction has a number of clear advantages. First, it uses equipment
`we always have on hand: there is nothing extra to carry, misplace, or leave behind.
`Second, it can be designed to work from actions that are natural and intuitive, so there
`is little or nothing to learn about the interface. Third, it lowers cognitive overhead, a
`key principle in human-computer interaction: Gesturing is instinctive and a skill we
`all have, so it requires little or no thought, leaving the focus on the task itself, as it
`should be, not on the interaction modality.
`Current gesture recognition is, however, still sharply limited. Most current systems
`concentrate on one source of input signal, for example, body or hand. Yet human
`gesture is most naturally expressed with both body and hands: Examples range from
`the simple gestures we use in everyday conversations, to the more elaborate gestures
`used by baseball coaches giving signals to players, soldiers gesturing for tactical tasks,
`and police giving signals to drivers. Considering only one source of signal (e.g., body
`or hand) severely restricts the expressiveness of the gesture vocabulary and makes
`interaction far less natural.
`Gesture recognition can be viewed as a task of statistical sequence modeling: Given
`example observation sequences, the task is to learn a model that captures spatio-
`temporal patterns in the sequences, so that the model can perform sequence labeling
`and segmentation on new observations. One of the main challenges here is the task
`of online sequence segmentation. Most current systems assume that signal bound-
`aries and/or the length of the whole sequence are known a priori. However, interactive
`gesture understanding should be able to process continuous input seamlessly, that
`is, with no need for awkward transitions, interruptions, or indications of boundaries
`between gestures. We use the terms unsegmented and unbounded to clarify what we
`mean by continuous input. Continuous input is unsegmented, that is, there is no indi-
`cation of signal boundaries, such as the gesture start and end. Continuous input is also
`unbounded, that is, the beginning and the end of the whole sequence are unknown, re-
`gardless of whether the sequence contains a single gesture or multiple gestures. This is
`unlike work in most other areas with continuous input. In speech recognition, for exam-
`ple, most systems rely on having signal segmentation (e.g., by assuming that silence of
`a certain length indicates the end of a sentence) and deal with bounded conversations
`(e.g., making an airline reservation). Interactive gesture understanding from input
`that is continuous (both unsegmented and unbounded) requires that sequence labeling
`and segmentation be done simultaneously with new observations being made.
`This article presents a new approach to gesture recognition that tracks both body
`and hands, and combines the two signals to perform online gesture interpretation
`and segmentation continuously, allowing richer gesture vocabulary and more natural
`human-computer interaction. Our main contributions are threefold: a unified frame-
`work for continuous body and hand gesture recognition; a new error measure, based on
`Motion History Image (MHI) [Bobick and Davis 2001], for body tracking that captures
`dynamic attributes of motion; and a novel technique called multilayered filtering for
`robust online sequence labeling and segmentation.
`
`
`We demonstrate our system on the NATOPS body and hand gesture dataset [Song
`et al. 2011b]. Our extensive experimental results show that examining both static and
`dynamic attributes of motion improves the quality of estimated body features, which
`in turn improves gesture recognition performance by 6.3%. We also show that our
`multilayered filtering significantly improves recognition performance by 15.78% when
`added to the existing latent-dynamic conditional random field model. As we show in
`Section 4, these improvements are statistically significant. We also show that our
`continuous gesture recognition system achieves a recognition accuracy of 75.37% on a
`set of twenty-four NATOPS gestures.
Section 1.1 gives an overview of our system; Section 1.2 reviews the most closely related work in pose tracking and gesture recognition, contrasting it with our approach; Section 2 describes body and hand tracking; Section 3 describes continuous gesture recognition; and Section 4 shows experimental results. Section 5 concludes with a summary of contributions and suggests directions for future work.
`Some of the material presented in this article has appeared in earlier conference
`proceedings [Song et al. 2011a, 2011b]. Song et al. [2011a] described gesture recognition
`of segmented input. This article extends our previous work to the continuous input
`domain and presents a new approach to performing online gesture interpretation and
`segmentation simultaneously (Section 3.2). Body and hand tracking was described in
`Song et al. [2011b]. Here, we include a deeper analysis of the body tracking, evaluating
`the performance of an MHI-based error measure we introduced in Song et al. [2011b]
`(Section 4.4). None of the experimental results reported in this article has appeared in
`any of our earlier work. Song et al. [2011b] also introduced a body and hand gesture
dataset; here we give an experimental protocol for the full set of twenty-four gestures in the
`NATOPS dataset, and report a recognition accuracy of 75.37% (Section 4.7).
`
`1.1. System Overview
`Figure 1 shows an overview of our system. The three main components are a 3D upper-
`body posture estimator, a hand shape classifier, and a continuous gesture recognizer.
`In the first part of the pipeline, image preprocessing (Section 2.1), depth maps are
`calculated using images captured from a stereo camera, and the images are background
`subtracted using a combination of an offline trained codebook background model [Kim
`et al. 2005] and a “depth-cut” method.
`For 3D body posture estimation (Section 2.2), we construct a generative model of the
`human upper-body, and fit the model to observations by comparing various features
`extracted from the model to corresponding features extracted from observations. In
`order to deal with body posture ambiguities that arise from self-occlusion, we examine
`both static and dynamic attributes of motion. The static attributes (i.e., body posture
`features) are extracted from depth images, while the dynamic attributes are extracted
`from MHI [Bobick and Davis 2001]. Poses are then estimated using a particle filter
`[Isard and Blake 1998].
`For hand shape classification (Section 2.3), we use information from body posture
`estimation to make the hand tracking task efficient: Two small search regions are
`defined around estimated wrist joints, and our system searches for hands in only these
`regions. A multiclass SVM classifier [Vapnik 1995] is trained offline using manually-
`segmented images of hands. HOG features [Freeman et al. 1998; Dalal and Triggs
`2005] are extracted from the images and used as an image descriptor.
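
As a rough sketch of this stage, the code below extracts HOG descriptors from hand patches cropped around the wrist search regions and trains a multiclass SVM on them, assuming scikit-image and scikit-learn; the HOG parameters, the RBF kernel, and the function names are illustrative choices, not the settings used in our system.

import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def train_hand_classifier(hand_patches, labels):
    # Fit a multiclass SVM on HOG descriptors of manually segmented hand images.
    # Patches are assumed to be grayscale and resized to a common fixed resolution.
    X = np.array([hog(p, orientations=9, pixels_per_cell=(8, 8),
                      cells_per_block=(2, 2)) for p in hand_patches])
    return SVC(kernel="rbf").fit(X, labels)   # one-vs-one multiclass SVM

def classify_hand(clf, patch):
    # Return the canonical hand shape label for a search-region patch.
    feat = hog(patch, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    return clf.predict([feat])[0]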
`In the last part, continuous gesture recognition (Section 3), we form the input feature
`by combining body and hand information. A Latent-Dynamic Conditional Random
`Field (LDCRF) [Morency et al. 2007] is trained offline using a supervised body and
`hand gesture dataset. The LDCRF with a temporal sliding window is used to perform
`online sequence labeling and segmentation simultaneously. We augment this with our
`
`Fig. 1. A pipeline view of our unified framework for continuous body and hand gesture recognition.
`
`multilayered filtering to make our task more robust. The multilayered filter acts both
`on the input layer and the prediction layer: On the input layer, a Gaussian temporal-
`smoothing filter [Harris 1978] is used to capture long-range temporal dependencies
`and make our system less sensitive to the noise from estimated time-series data,
`while not increasing the dimensionality of input feature vectors and keeping the model
`complexity the same. The prediction layer is further divided into local and global
`prediction layers, where we use a weighted-average filter and a moving-average filter,
`respectively, to take weighted votes of multiple overlapping prediction results as well
`as reduce noise in the prediction results.
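
To make the filtering concrete, the sketch below gives one plausible Python reading of the three filters: a Gaussian window smooths each input feature dimension over time, a weighted-average filter fuses per-frame class probabilities from overlapping sliding windows, and a moving average smooths the fused probabilities before labels are taken. Array shapes, weights, and window sizes are illustrative assumptions, not our implementation.

import numpy as np

def gaussian_kernel(width, sigma):
    # Normalized Gaussian window used for temporal smoothing on the input layer.
    x = np.arange(width) - (width - 1) / 2.0
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_inputs(features, width=9, sigma=2.0):
    # Input-layer filter: smooth each feature dimension (columns of a T x D array)
    # over time without changing the feature dimensionality.
    k = gaussian_kernel(width, sigma)
    return np.apply_along_axis(lambda f: np.convolve(f, k, mode="same"), 0, features)

def fuse_windows(window_probs, window_weights):
    # Local prediction-layer filter: weighted vote over the per-frame class
    # probabilities (W x T x C) produced by W overlapping windows.
    w = np.asarray(window_weights, dtype=float)
    w /= w.sum()
    return np.tensordot(w, np.asarray(window_probs), axes=1)   # -> T x C

def moving_average(probs, width=5):
    # Global prediction-layer filter: moving average over time to reduce
    # estimation noise before the per-frame argmax label is taken.
    k = np.ones(width) / width
    return np.apply_along_axis(lambda p: np.convolve(p, k, mode="same"), 0, probs)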
`
`1.2. Related Work
`The topics covered in this article range broadly from body and hand tracking to gesture
`recognition with online sequence labeling and segmentation. This section reviews some
`of the most relevant work; comprehensive survey articles include Poppe [2007] for
`body tracking, Erol et al. [2007] for hand tracking, and Mitra and Acharya [2007] and
`Weinland et al. [2011] for gesture recognition.
`Gesture-based interfaces typically require robust pose tracking. This is commonly
`done by wearing specially designed markers or devices (e.g., Vicon motion capture
`system or colored gloves [Yin and Davis 2010]). However, the most natural form of
`gestural interaction would not require additional markers or sensors attached to the
`body. We take a vision-based approach and perform motion tracking based on data from
`a single stereo camera, not using any special marker device attached to the body.
`Several successful vision-based pose tracking approaches have been reported, falling
`generally into two categories: model-based methods, which try to reconstruct a pose
`model by fitting a kinematic model to the observed image [Deutscher et al. 2000;
`Sminchisescu and Triggs 2003; Lee and Cohen 2006]; and appearance-based meth-
`ods, which assume a pose vocabulary and try to learn a direct mapping from features
`extracted from images to the vocabulary [Brand 1999; Shakhnarovich et al. 2003;
`Mori and Malik 2006]. Model-based methods are in general not affected by a cam-
`era viewpoint, do not require a training dataset, and are generally more robust in
`3D pose estimation. Appearance-based methods require a large training dataset and
`in general are more sensitive to camera viewpoints, but once a mapping function is
`learned, classification is performed efficiently. Recent works take a hybrid approach,
`combining ideas from the two conventional methods and using advanced depth sensing
`cameras for 3D data acquisition (e.g., Time of Flight (ToF) [Gokturk et al. 2004] or struc-
`tured light [Fofi et al. 2004]). Schwarz et al. [2011] use a ToF camera to obtain depth
`images. They detect anatomical landmarks to fit a skeleton body model, solving con-
`strained inverse kinematics. A graph is constructed from the depth data, and geodesic
`distances between body parts are measured, making the 3D positions of anatomical
`landmarks invariant to pose. Similar to our work, they use optical flow between sub-
`sequent images to make tracking robust to self-occlusion. Shotton et al. [2011] obtain
`depth images from a structured light depth sensing camera (i.e., Microsoft Kinect).
`They take an object recognition approach: A per-pixel body part classifier is trained
`on an extensive training dataset. The results are reprojected onto 3D space, and local
`means are used to generate confidence-scored 3D proposals of body joints.
`In this work, we take a model-based approach for body posture estimation, because
`reconstructing body posture in 3D space provides important information, such as point-
`ing direction. Hand shapes, by contrast, are more categorical, that is, it is typically not
`crucial to distinguish fine-grained details of hand shape in order to understand a body
`and hand gesture. Therefore, we take an appearance-based approach to hand shape
`classification.
`
`
Fig. 2. Input image (left), depth map (middle), and mask image (right). The “T-pose” shown in the figures is used for body tracking initialization.
`
`There have been active efforts to build a principled probabilistic graphical model for
`sequence modeling based on discriminative learning. Lafferty et al. [2001] introduced
`Conditional Random Fields (CRF), a discriminative learning approach that does not
`make conditional independence assumptions. Quattoni et al. [2007] introduced Hidden
`Conditional Random Fields (HCRF), an extension to the CRF that incorporates hidden
`variables. Many other variants of the CRF have been introduced since then [Sutton
`et al. 2004; Gunawardana et al. 2005; Wang and Mori 2009], but most of them could
`not handle continuous input, limiting their use in real-world applications.
`Morency et al. [2007] presented a Latent-Dynamic Conditional Random Field (LD-
`CRF) that is able to perform sequence labeling and segmentation simultaneously. An
`LDCRF assumes a disjoint set of hidden state variables per label, allowing it to do
`parameter estimation and inference efficiently using belief propagation [Pearl 1988].
`They showed that the model is capable of capturing the substructure of a class sequence
`and can learn dynamics between class labels, allowing the model to perform sequence
`labeling and segmentation simultaneously. However, the forward-backward message-
`passing schedule used in belief propagation limited its use to bounded input sequences
`only. In this work, we use an LDCRF with a temporal sliding window to predict se-
`quence labels and perform segmentation online, augmenting the original framework
`with our multilayered filtering, preserving the advantages of belief propagation and
`extending the previous work to the unbounded input domain.
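
The sliding-window mechanism itself can be written compactly. The sketch below assumes a pretrained sequence labeler exposed as a predict_window callable returning per-frame class probabilities; the window size, stride, and buffering are illustrative choices rather than the configuration used in our experiments.

import numpy as np
from collections import deque

def label_stream(frames, predict_window, window_size=30, stride=5):
    # Online labeling of an unbounded stream: buffer the most recent frames,
    # run the pretrained model on each temporal window, and emit per-frame
    # class probabilities as soon as a window closes. Overlapping windows are
    # reconciled downstream (e.g., by the prediction-layer filters of Section 3).
    buffer = deque(maxlen=window_size)
    for t, frame in enumerate(frames):
        buffer.append(frame)
        if len(buffer) == window_size and (t + 1 - window_size) % stride == 0:
            window = np.stack(buffer)
            probs = predict_window(window)         # (window_size, num_classes)
            yield t + 1 - window_size, probs        # window start index, probabilities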
`
`2. OBTAINING BODY AND HAND SIGNALS
`In this section, we describe body and hand tracking, which receives input images from
`a stereo camera and produces body and hand signals by performing 3D body posture
`estimation and hand shape classification.
`We describe image preprocessing in Section 2.1, which produces depth maps and
`mask images (see Figure 2). We describe 3D body posture estimation in Section 2.2 and
`hand shape classification in Section 2.3.
`
`2.1. Image Preprocessing
`The system starts by receiving pairs of time-synchronized images recorded from a
`Bumblebee2 stereo camera, producing 320 x 240 pixel resolution images at 20 FPS.
`While recording video, the system produces depth maps and mask images in real time
`(see Figure 2). Depth maps allow us to reconstruct body postures in 3D space and
`resolve some of the pose ambiguities arising from self-occlusion; mask images allow
`us to concentrate on the objects of interest and ignore the background, optimizing the
use of available computational resources. We obtain depth maps using a manufacturer-provided SDK.1
`
`1http://www.ptgrey.com.
`
`
Fig. 3. Generative model of the human upper body. The model includes 6 body parts (trunk, head, upper and lower arms for both sides) and 9 joints (chest, head, navel, left/right shoulder, elbow, and wrist).
`
`We obtain mask images by performing background subtraction. Ideally, background
`subtraction could be done using depth information alone by the “depth-cut” method:
Filter out pixels that are farther from the camera than the foreground object, assuming there is no object in between the camera and the subject. However, as shown in Figure 2, depth maps typically have lower resolution than color images, so mask images produced this way would have equally low resolution. This motivates
`our approach of performing background subtraction using a codebook approach [Kim
`et al. 2005], then refining the result with the depth-cut method.
`The codebook approach works by learning a per-pixel background model from a his-
`tory of 2D color background images sampled over a period of time, then segmenting out
`the “outlier” pixels in new images as foreground. Since this approach uses RGB images,
`it produces high-resolution mask images. One weakness of the codebook approach is,
`however, its sensitivity to illumination and shadows, arising because the codebook de-
`fines a foreground object as any set of pixels whose color values are noticeably different
`from the previously learned background model. To remedy this, after input images are
`background subtracted using the codebook approach, we refine the result using the
`depth-cut method described before.
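
A compressed sketch of this two-stage masking is given below, treating the offline-trained codebook model as a per-pixel foreground function; the names and the single depth threshold are assumptions made for illustration.

import numpy as np

def depth_cut(mask, depth_map, subject_max_depth):
    # "Depth-cut": drop pixels farther from the camera than the subject,
    # assuming there is no object between the camera and the subject.
    refined = mask.copy()
    refined[depth_map > subject_max_depth] = 0
    return refined

def foreground_mask(rgb_image, depth_map, codebook_is_foreground, subject_max_depth):
    # Stage 1: high-resolution mask from the offline-trained codebook model.
    # Stage 2: depth-cut refinement to suppress illumination/shadow false positives.
    mask = codebook_is_foreground(rgb_image)   # per-pixel 0/255 foreground decision
    return depth_cut(mask, depth_map, subject_max_depth)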
`
`2.2. 3D Body Posture Estimation
`The goal here is to reconstruct upper-body posture in 3D space given the input images.
`We formulate this as a sequential Bayesian filtering problem, that is, having observed
`a sequence of images Zt = {z1, . . . , zt} and knowing the prior state density p(xt), make
`a prediction about a posterior state density p(xt | Zt), where xt = (x1,t . . . xk,t) is a
`k-dimensional vector representing the body posture we are estimating.
`
2.2.1. Generative Upper-Body Model. Our generative model of the human upper-body is
`constructed in 3D space, using a skeletal model represented as a kinematic chain and
`a volumetric model described by superellipsoids [Barr 1981] (see Figure 3). The model
`includes 6 body parts (trunk, head, upper and lower arms for both sides) and 9 joints
`(chest, head, navel, left/right shoulder, elbow, and wrist). The shoulder is modeled as a 3
`DOF ball-and-socket joint, and the elbow is modeled as a 1 DOF revolute joint, resulting
`in 8 model parameters in total. Coordinates of each joint are obtained by solving the
`forward kinematics problem, following the Denavit-Hartenberg convention [Denavit
`and Hartenberg 1955], a compact way of representing n-link kinematic structures.
`We prevent the model from generating anatomically implausible body postures by
`constraining joint angles to known physiological limits [NASA 1995].
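
For readers unfamiliar with the convention, the sketch below chains standard Denavit-Hartenberg link transforms and clamps each joint angle to its limits before use; the parameterization is generic and not the exact joint layout of our upper-body model.

import numpy as np

def dh_transform(theta, d, a, alpha):
    # Homogeneous transform of one link under the standard Denavit-Hartenberg convention.
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(dh_params, joint_angles, joint_limits):
    # Chain the link transforms and return the 3D position of every joint,
    # clamping each joint angle to its physiological limits first.
    T = np.eye(4)
    positions = []
    for (d, a, alpha), theta, (lo, hi) in zip(dh_params, joint_angles, joint_limits):
        theta = np.clip(theta, lo, hi)
        T = T @ dh_transform(theta, d, a, alpha)
        positions.append(T[:3, 3].copy())
    return positions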
`The human shoulder has historically been the most challenging part for human
`body modeling [Engin 1980]. It has a complicated anatomical structure, with bones,
`
`muscles, skin, and ligaments intertwined, making modeling of shoulder movement
`difficult [Feng et al. 2008].
`We improve on our basic model of the human upper-body by building a more pre-
`cise model of the shoulder, while still not increasing the dimensionality of the model
`parameter vector. To capture arm movement more accurately, after a body model is
`generated, the shoulder model is refined analytically using the relative positions of
`other body joints.
`Therefore, the generation of a body model is a two-step procedure: Given the eight
`joint angle values, we first solve the forward kinematics problem, obtaining the coor-
`dinates of each joint. Then we compute the angle ϕ between the chest-to-shoulder line
and the chest-to-elbow line, and update the chest-to-shoulder angle $\theta^{CS}$ as2

$$\theta^{CS\prime} = \begin{cases} \min\left(\theta^{CS} + \varphi,\ \theta^{CS}_{MAX}\right) & \text{if elbow is higher than shoulder,} \\ \max\left(\theta^{CS} - \varphi,\ \theta^{CS}_{MIN}\right) & \text{otherwise,} \end{cases} \qquad (1)$$

where $\theta^{CS}_{MIN}$ and $\theta^{CS}_{MAX}$ are the minimum and maximum joint angle limits for chest-to-shoulder
`joints [NASA 1995]. Figure 3 illustrates our generative model, rendered after the chest-
`to-shoulder angles θ CS are adjusted (note the left/right chest-to-shoulder angles are
`different). This simplified model mimics shoulder movement in only one dimension, up
`and down, but works quite well if the subject is facing the camera, as is commonly true
`for human-computer interaction.
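
Transcribed into code, the refinement is a clamp on the chest-to-shoulder angle, with ϕ computed from the joint positions produced by forward kinematics; this is a minimal sketch of Eq. (1), and the function and argument names are illustrative assumptions (the joint limits are taken to be supplied by the caller from the NASA tables).

import numpy as np

def angle_phi(chest, shoulder, elbow):
    # Angle between the chest-to-shoulder line and the chest-to-elbow line,
    # computed analytically from joint positions (phi is not a model parameter).
    u, v = shoulder - chest, elbow - chest
    cos_phi = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos_phi, -1.0, 1.0))

def refine_chest_to_shoulder(theta_cs, phi, elbow_above_shoulder,
                             theta_cs_min, theta_cs_max):
    # Eq. (1): raise or lower the chest-to-shoulder angle by phi, clipped to the
    # physiological limits, depending on whether the elbow sits above the shoulder.
    if elbow_above_shoulder:
        return min(theta_cs + phi, theta_cs_max)
    return max(theta_cs - phi, theta_cs_min)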
`With these settings, an upper-body posture is parameterized as
$$x = (G\ R)^{T}, \qquad (2)$$
`where G is a 6-dimensional global translation and rotation vector, and R is an 8-
`dimensional joint angle vector (3 for shoulder and 1 for elbow, for each arm). In practice,
`once the parameters are initialized, we fix all but (x, z) translation elements of G,
`making x a 10-dimensional vector.
2.2.2. Particle Filter. Human body movements can be highly unpredictable, so an inference that assumes its random variables form a single Gaussian distribution can fall into a local minimum or completely lose track. A particle filter [Isard and Blake 1998] is particularly well suited to this type of task for its ability to maintain multiple hypotheses during inference, discarding less likely hypotheses only slowly. It represents the posterior state density p(xt | Zt) as a multimodal non-Gaussian distribution, which is approximated by a set of N weighted particles $\{(s_t^{(1)}, \pi_t^{(1)}), \ldots, (s_t^{(N)}, \pi_t^{(N)})\}$. Each sample $s_t^{(n)}$ represents a pose configuration, and the weights $\pi_t^{(n)}$ are obtained by computing the likelihood $p(z_t \mid x_t = s_t^{(n)})$, and normalized so that $\sum_{n=1}^{N} \pi_t^{(n)} = 1$.

The joint angle dynamic model is constructed as a Gaussian process:

$$x_t = x_{t-1} + e, \qquad e \sim \mathcal{N}(0, \sigma^2) \qquad (3)$$
Once N particles are generated, we obtain the estimation result by calculating the Bayesian Least Squares (BLS) estimate:

$$E[f(x_t)] = \sum_{n=1}^{N} \pi_t^{(n)} f\left(s_t^{(n)}\right) \qquad (4)$$

Iterative methods need a good initialization. We initialize our generative model at the first frame: The initial body posture configurations (i.e., joint angles and limb lengths) are obtained by having the subject assume a static “T-pose” (shown in Figure 2), and fitting the model to the image with exhaustive search. This typically requires no more than 0.3 seconds (on an Intel Xeon CPU 2.4 GHz machine with 4GBs of RAM).

2Note that the angle ϕ is not an additional model parameter, because it is computed analytically using joint positions.

Fig. 4. Motion history images of the observation (left) and the estimated model (right). White pixel values indicate an object has appeared in the pixel; gray pixel values indicate there was an object in the pixel but it has moved; black pixel values indicate there has been no change in the pixel.
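
One filtering step, written under the common resample-diffuse-reweight scheme with f taken to be the identity in the BLS estimate of Eq. (4), might look like the sketch below; the resampling strategy, the isotropic noise, and the likelihood callable (standing in for the fitting-error-based likelihood of Section 2.2.3) are illustrative assumptions, not our exact implementation.

import numpy as np

def particle_filter_step(particles, weights, observation, likelihood, sigma):
    # particles: (N, D) posture samples; weights: (N,) normalized weights.
    n = len(particles)
    # Resample proportionally to the previous weights, keeping multiple hypotheses.
    idx = np.random.choice(n, size=n, p=weights)
    particles = particles[idx]
    # Gaussian dynamic model of Eq. (3): x_t = x_{t-1} + e, e ~ N(0, sigma^2).
    particles = particles + np.random.normal(0.0, sigma, particles.shape)
    # Reweight by the likelihood p(z_t | x_t = s_t^(n)) and normalize.
    weights = np.array([likelihood(observation, s) for s in particles])
    weights = weights / weights.sum()
    # BLS estimate of Eq. (4), with f the identity: a weighted mean of the samples.
    estimate = (weights[:, None] * particles).sum(axis=0)
    return particles, weights, estimate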
2.2.3. Likelihood Function. The likelihood function $p(z_t \mid x_t = s_t^{(n)})$ measures the goodness-of-fit of an observation $z_t$ given a sample $s_t^{(n)}$. We define it as an inverse of an exponentiated fitting error $\varepsilon(z_t, s_t^{(n)})$:

$$p\left(z_t \mid x_t = s_t^{(n)}\right) = \frac{1}{\exp\left(\varepsilon\left(z_t, s_t^{(n)}\right)\right)} \qquad (5)$$

The fitting error $\varepsilon(z_t, s_t^{(n)})$ is a weighted sum of three error terms computed by comparing
`features extracted from the generative model to the corresponding features extracted
`from input images. The three features include a 3D visible-surface point cloud, a 3D
`contour point cloud, and a Motion History Image (MHI) [Bobick and Davis 2001]. The
`first two features capture discrepancies in static poses; the third captures discrepancies
`in the dynamics of motion. We chose the weights for each error term empirically.
`The first two features, 3D visible-surface and contour point clouds, are used fre-
`quently in body motion tracking (e.g., Deutscher et al. [2000]) for their ability to evalu-
`ate how well the generated body posture fits the actual pose observed in the image. We
`measure the fitting error by computing the sum-of-squared Euclidean distance errors
`between the point cloud of the model and the point cloud of the input image (i.e., the
`3D data supplied by the image preprocessing step described earlier).
`The third feature, an MHI, is an image where each pixel value is a function of the
`recency of motion in a sequence of images (see Figure 4). This often provides useful
`information about dynamics of motion, as it indicates where and how the motion has
`occurred. We define an MHI-based error term to measure discrepancies in the dynamics
`of motion.
`An MHI is computed from It−1 and It, two time-consecutive 8-bit unsigned integer
`images whose pixel values span from 0 to 255. For the generative model, It is obtained
by rendering the model generated by a sample $s_t^{(n)}$ (i.e., rendering an image of what body posture $s_t^{(n)}$ would look like), and $I_{t-1}$ is obtained by rendering $E[f(x_{t-1})]$, the model
`generated by the estimation result from the previous step (Eq. (4)). For the input
`
`images, It is obtained by converting an RGB input image to YCrCb color space and
`extracting the brightness channel3, and this is stored to be used as It−1 for the next
`time step. Then an MHI is computed as
$$I_{MHI} = \mathrm{thresh}(I_{t-1} - I_t, 0, 127) + \mathrm{thresh}(I_t - I_{t-1}, 0, 255), \qquad (6)$$
`where thresh(I, α, β) is a binary threshold operator that sets each pixel value to β if
`I(x, y) > α, and zero otherwise. The first term captures pixels that were occupied at the
`previous time step but not in the current time step. The second term captures pixels
`that are newly occupied in the current time step. We chose the values 0, 127, and 255 to
`indicate the time information of those pixels: 0 means there has been no change in the
`pixel, regardless of whether or not there was an object; 127 means there was an object
`in the pixel but it has moved; while 255 means an object has appeared in the pixel.
`This allows us to construct an image that concentrates on the moved regions only (e.g.,
`arms), while ignoring the unmoved parts (e.g., trunk, background). The computed MHI
`images are visualized in Figure 4.
`Given the MHIs of the generative model and the observation, one can define various
`error measures. In this work, we define an MHI error as
$$\varepsilon_{MHI} = \mathrm{Count}\left[\mathrm{thresh}(I', 127, 255)\right], \qquad (7)$$

where

$$I' = \mathrm{abs}\left(I_{MHI}(z_t, z_{t-1}) - I_{MHI}\left(s_t^{(n)}, E[f(x_{t-1})]\right)\right). \qquad (8)$$
This error function first subtracts an MHI of the model $I_{MHI}(s_t^{(n)}, E[f(x_{t-1})])$ from an MHI of the observation $I_{MHI}(z_t, z_{t-1})$, and computes an absolute-valued image of it (Eq. (8)). Then it applies the binary threshold operator with the cutoff value and result value (127 and 255, respectively), and counts nonzero pixels with Count[·] (Eq. (7)). We
`set the cutoff value to 127 to penalize the conditions in which two MHIs do not match
`at the current time step, independent of the situation at the previous time step.4
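
Eqs. (6)-(8) can be transcribed almost literally; the sketch below assumes 8-bit grayscale inputs and does the differencing in signed integers so that the two threshold terms behave like the saturating subtraction described above. Names and types are illustrative.

import numpy as np

def motion_history_image(prev_img, cur_img):
    # Eq. (6): 255 where an object newly appears, 127 where an object was present
    # but has moved away, 0 where nothing changed.
    prev_i, cur_i = prev_img.astype(int), cur_img.astype(int)
    moved_away = np.where(prev_i - cur_i > 0, 127, 0)
    appeared = np.where(cur_i - prev_i > 0, 255, 0)
    return (moved_away + appeared).astype(np.uint8)

def mhi_error(mhi_observation, mhi_model):
    # Eqs. (7)-(8): count pixels whose absolute MHI difference exceeds 127,
    # i.e., pixels where the two MHIs disagree at the current time step.
    diff = np.abs(mhi_observation.astype(int) - mhi_model.astype(int))
    return int(np.count_nonzero(diff > 127))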
`We evaluate the effectiveness of the MHI-based error measure in Section 4.4, where
`we compare gesture recognition accuracy of the models trained on features estimated
`with and without the MHI-based error measure.
`
`2.3. Hand Shape Classification
`The goal of hand shape classification is to classify hand shapes made contemporane-
`ously with gestures into one of several cano