Continuous Body and Hand Gesture Recognition for
Natural Human-Computer Interaction

YALE SONG, DAVID DEMIRDJIAN, and RANDALL DAVIS,
Massachusetts Institute of Technology

Intelligent gesture recognition systems open a new era of natural human-computer interaction: Gesturing is instinctive and a skill we all have, so it requires little or no thought, leaving the focus on the task itself, as it should be, not on the interaction modality. We present a new approach to gesture recognition that attends to both body and hands, and interprets gestures continuously from an unsegmented and unbounded input stream. This article describes the entire procedure of continuous body and hand gesture recognition, from signal acquisition and processing to the interpretation of the processed signals.

Our system takes a vision-based approach, tracking body and hands using a single stereo camera. Body postures are reconstructed in 3D space using a generative model-based approach with a particle filter, combining both static and dynamic attributes of motion as the input feature to make tracking robust to self-occlusion. The reconstructed body postures guide searching for hands. Hand shapes are classified into one of several canonical hand shapes using an appearance-based approach with a multiclass support vector machine. Finally, the extracted body and hand features are combined and used as the input feature for gesture recognition. We treat our task as an online sequence labeling and segmentation problem. A latent-dynamic conditional random field is used with a temporal sliding window to perform the task continuously. We augment this with a novel technique called multilayered filtering, which performs filtering both on the input layer and the prediction layer. Filtering on the input layer allows capturing long-range temporal dependencies and reducing input signal noise; filtering on the prediction layer allows taking weighted votes of multiple overlapping prediction results as well as reducing estimation noise.

We tested our system in a scenario of real-world gestural interaction using the NATOPS dataset, an official vocabulary of aircraft handling gestures. Our experimental results show that: (1) the use of both static and dynamic attributes of motion in body tracking allows statistically significant improvement of the recognition performance over using static attributes of motion alone; and (2) the multilayered filtering statistically significantly improves recognition performance over the nonfiltering method. We also show that, on a set of twenty-four NATOPS gestures, our system achieves a recognition accuracy of 75.37%.

Categories and Subject Descriptors: I.4.8 [Image Processing and Computer Vision]: Scene Analysis—Motion; I.5.5 [Pattern Recognition]: Implementation—Interactive systems

General Terms: Algorithms, Design, Experimentation

Additional Key Words and Phrases: Pose tracking, gesture recognition, human-computer interaction, online sequence labeling and segmentation, conditional random fields, multilayered filtering

ACM Reference Format:
Song, Y., Demirdjian, D., and Davis, R. 2012. Continuous body and hand gesture recognition for natural human-computer interaction. ACM Trans. Interact. Intell. Syst. 2, 1, Article 5 (March 2012), 28 pages.
DOI = 10.1145/2133366.2133371 http://doi.acm.org/10.1145/2133366.2133371

This work was funded in part by the Office of Naval Research Science of Autonomy program, contract no. N000140910625, and in part by the National Science Foundation grant no. IIS-1018055.
Authors’ addresses: Y. Song (corresponding author), D. Demirdjian, and R. Davis, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar St., Cambridge, MA 02139; email: yalesong@csail.mit.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2012 ACM 2160-6455/2012/03-ART5 $10.00
DOI 10.1145/2133366.2133371 http://doi.acm.org/10.1145/2133366.2133371

1. INTRODUCTION

For more than 40 years, human-computer interaction has been focused on the keyboard and mouse. Although this has been successful, as computation becomes increasingly mobile, embedded, and ubiquitous, it is far too constraining as a model of interaction. Evidence suggests that gesture-based interaction is the wave of the future, with considerable attention from both the research community (see recent survey articles by Mitra and Acharya [2007] and by Weinland et al. [2011]) and from industry and the public media (e.g., Microsoft Kinect). Evidence can also be found in a wide range of potential application areas, such as medical devices, video gaming, robotics, video surveillance, and natural human-computer interaction.

Gestural interaction has a number of clear advantages. First, it uses equipment we always have on hand: there is nothing extra to carry, misplace, or leave behind. Second, it can be designed to work from actions that are natural and intuitive, so there is little or nothing to learn about the interface. Third, it lowers cognitive overhead, a key principle in human-computer interaction: Gesturing is instinctive and a skill we all have, so it requires little or no thought, leaving the focus on the task itself, as it should be, not on the interaction modality.

Current gesture recognition is, however, still sharply limited. Most current systems concentrate on one source of input signal, for example, body or hand. Yet human gesture is most naturally expressed with both body and hands: Examples range from the simple gestures we use in everyday conversations, to the more elaborate gestures used by baseball coaches giving signals to players, soldiers gesturing for tactical tasks, and police giving signals to drivers. Considering only one source of signal (e.g., body or hand) severely restricts the expressiveness of the gesture vocabulary and makes interaction far less natural.

Gesture recognition can be viewed as a task of statistical sequence modeling: Given example observation sequences, the task is to learn a model that captures spatio-temporal patterns in the sequences, so that the model can perform sequence labeling and segmentation on new observations. One of the main challenges here is the task of online sequence segmentation. Most current systems assume that signal boundaries and/or the length of the whole sequence are known a priori. However, interactive gesture understanding should be able to process continuous input seamlessly, that is, with no need for awkward transitions, interruptions, or indications of boundaries between gestures. We use the terms unsegmented and unbounded to clarify what we mean by continuous input. Continuous input is unsegmented, that is, there is no indication of signal boundaries, such as the gesture start and end. Continuous input is also unbounded, that is, the beginning and the end of the whole sequence are unknown, regardless of whether the sequence contains a single gesture or multiple gestures. This is unlike work in most other areas with continuous input. In speech recognition, for example, most systems rely on having signal segmentation (e.g., by assuming that silence of a certain length indicates the end of a sentence) and deal with bounded conversations (e.g., making an airline reservation). Interactive gesture understanding from input that is continuous (both unsegmented and unbounded) requires that sequence labeling and segmentation be done simultaneously with new observations being made.

This article presents a new approach to gesture recognition that tracks both body and hands, and combines the two signals to perform online gesture interpretation and segmentation continuously, allowing a richer gesture vocabulary and more natural human-computer interaction. Our main contributions are threefold: a unified framework for continuous body and hand gesture recognition; a new error measure, based on the Motion History Image (MHI) [Bobick and Davis 2001], for body tracking that captures dynamic attributes of motion; and a novel technique called multilayered filtering for robust online sequence labeling and segmentation.

We demonstrate our system on the NATOPS body and hand gesture dataset [Song et al. 2011b]. Our extensive experimental results show that examining both static and dynamic attributes of motion improves the quality of estimated body features, which in turn improves gesture recognition performance by 6.3%. We also show that our multilayered filtering significantly improves recognition performance by 15.78% when added to the existing latent-dynamic conditional random field model. As we show in Section 4, these improvements are statistically significant. We also show that our continuous gesture recognition system achieves a recognition accuracy of 75.37% on a set of twenty-four NATOPS gestures.

Section 1.1 gives an overview of our system; Section 1.2 reviews some of the most closely related work in pose tracking and gesture recognition, highlighting distinctions from our work; Section 2 describes body and hand tracking; Section 3 describes continuous gesture recognition; and Section 4 shows experimental results. Section 5 concludes with a summary of contributions and directions for future work.

Some of the material presented in this article has appeared in earlier conference proceedings [Song et al. 2011a, 2011b]. Song et al. [2011a] described gesture recognition of segmented input. This article extends our previous work to the continuous input domain and presents a new approach to performing online gesture interpretation and segmentation simultaneously (Section 3.2). Body and hand tracking was described in Song et al. [2011b]. Here, we include a deeper analysis of the body tracking, evaluating the performance of an MHI-based error measure we introduced in Song et al. [2011b] (Section 4.4). None of the experimental results reported in this article has appeared in any of our earlier work. Song et al. [2011b] also introduced a body and hand gesture dataset; here we give an experimental protocol for the set of all twenty-four gestures in the NATOPS dataset, and report a recognition accuracy of 75.37% (Section 4.7).

1.1. System Overview

Figure 1 shows an overview of our system. The three main components are a 3D upper-body posture estimator, a hand shape classifier, and a continuous gesture recognizer.

Fig. 1. A pipeline view of our unified framework for continuous body and hand gesture recognition.

In the first part of the pipeline, image preprocessing (Section 2.1), depth maps are calculated using images captured from a stereo camera, and the images are background subtracted using a combination of an offline trained codebook background model [Kim et al. 2005] and a “depth-cut” method.

For 3D body posture estimation (Section 2.2), we construct a generative model of the human upper-body, and fit the model to observations by comparing various features extracted from the model to corresponding features extracted from observations. In order to deal with body posture ambiguities that arise from self-occlusion, we examine both static and dynamic attributes of motion. The static attributes (i.e., body posture features) are extracted from depth images, while the dynamic attributes are extracted from MHI [Bobick and Davis 2001]. Poses are then estimated using a particle filter [Isard and Blake 1998].

For hand shape classification (Section 2.3), we use information from body posture estimation to make the hand tracking task efficient: Two small search regions are defined around estimated wrist joints, and our system searches for hands in only these regions. A multiclass SVM classifier [Vapnik 1995] is trained offline using manually segmented images of hands. HOG features [Freeman et al. 1998; Dalal and Triggs 2005] are extracted from the images and used as an image descriptor.
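
As a minimal illustrative sketch (not the implementation used in our experiments), the hand-shape stage can be written with scikit-image's HOG descriptor and scikit-learn's SVC; the patch size, HOG parameters, and kernel below are placeholder assumptions.

```python
import numpy as np
from skimage.feature import hog   # HOG descriptor [Dalal and Triggs 2005]
from sklearn.svm import SVC       # multiclass SVM

PATCH_SIZE = (64, 64)             # assumed size of the wrist-centered search window

def hog_descriptor(patch):
    """patch: grayscale hand image cropped around an estimated wrist joint,
    resized to PATCH_SIZE before calling this function."""
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

def train_hand_classifier(patches, labels):
    """patches: manually segmented hand images; labels: canonical hand-shape ids."""
    feats = np.stack([hog_descriptor(p) for p in patches])
    clf = SVC(kernel="rbf")
    clf.fit(feats, labels)
    return clf

def classify_hand(clf, patch):
    """Return the predicted canonical hand-shape id for a single patch."""
    return clf.predict(hog_descriptor(patch)[None, :])[0]
```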

In the last part, continuous gesture recognition (Section 3), we form the input feature by combining body and hand information. A Latent-Dynamic Conditional Random Field (LDCRF) [Morency et al. 2007] is trained offline using a supervised body and hand gesture dataset. The LDCRF with a temporal sliding window is used to perform online sequence labeling and segmentation simultaneously. We augment this with our multilayered filtering to make our task more robust. The multilayered filter acts both on the input layer and the prediction layer: On the input layer, a Gaussian temporal-smoothing filter [Harris 1978] is used to capture long-range temporal dependencies and make our system less sensitive to the noise from estimated time-series data, while not increasing the dimensionality of input feature vectors and keeping the model complexity the same. The prediction layer is further divided into local and global prediction layers, where we use a weighted-average filter and a moving-average filter, respectively, to take weighted votes of multiple overlapping prediction results as well as reduce noise in the prediction results.
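
As a concrete illustration of these three filters, the sketch below abstracts the LDCRF as something that returns per-frame class probabilities for each sliding window; the Gaussian width, the raised-cosine weighting standing in for the weighted-average filter, and the moving-average width are assumptions rather than the settings used in our experiments.

```python
import numpy as np

def gaussian_smooth(X, sigma=2.0, radius=4):
    """Input-layer filter: Gaussian temporal smoothing of each feature dimension.
    X: (T, D) time series of body and hand features."""
    t = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    k /= k.sum()
    return np.stack([np.convolve(X[:, d], k, mode="same")
                     for d in range(X.shape[1])], axis=1)

def fuse_windows(window_probs, starts, T, C):
    """Local prediction-layer filter: weighted vote over overlapping sliding windows.
    window_probs: list of (W, C) per-frame class probabilities, one per window,
    where window i covers frames starts[i] .. starts[i] + W - 1."""
    votes, norm = np.zeros((T, C)), np.zeros(T)
    for probs, s in zip(window_probs, starts):
        W = probs.shape[0]
        w = np.hanning(W + 2)[1:-1]      # weight the center frames of each window more
        votes[s:s + W] += w[:, None] * probs
        norm[s:s + W] += w
    return votes / np.maximum(norm[:, None], 1e-8)

def moving_average(P, width=5):
    """Global prediction-layer filter: moving average over the fused probabilities."""
    k = np.ones(width) / width
    return np.stack([np.convolve(P[:, c], k, mode="same")
                     for c in range(P.shape[1])], axis=1)

# Per-frame labels are then the argmax over classes of the filtered probabilities.
```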

1.2. Related Work

The topics covered in this article range broadly from body and hand tracking to gesture recognition with online sequence labeling and segmentation. This section reviews some of the most relevant work; comprehensive survey articles include Poppe [2007] for body tracking, Erol et al. [2007] for hand tracking, and Mitra and Acharya [2007] and Weinland et al. [2011] for gesture recognition.

Gesture-based interfaces typically require robust pose tracking. This is commonly done by wearing specially designed markers or devices (e.g., the Vicon motion capture system or colored gloves [Yin and Davis 2010]). However, the most natural form of gestural interaction would not require additional markers or sensors attached to the body. We take a vision-based approach and perform motion tracking based on data from a single stereo camera, not using any special marker or device attached to the body.

Several successful vision-based pose tracking approaches have been reported, falling generally into two categories: model-based methods, which try to reconstruct a pose model by fitting a kinematic model to the observed image [Deutscher et al. 2000; Sminchisescu and Triggs 2003; Lee and Cohen 2006]; and appearance-based methods, which assume a pose vocabulary and try to learn a direct mapping from features extracted from images to the vocabulary [Brand 1999; Shakhnarovich et al. 2003; Mori and Malik 2006]. Model-based methods are in general not affected by a camera viewpoint, do not require a training dataset, and are generally more robust in 3D pose estimation. Appearance-based methods require a large training dataset and in general are more sensitive to camera viewpoints, but once a mapping function is learned, classification is performed efficiently. Recent work takes a hybrid approach, combining ideas from the two conventional methods and using advanced depth-sensing cameras for 3D data acquisition (e.g., Time of Flight (ToF) [Gokturk et al. 2004] or structured light [Fofi et al. 2004]). Schwarz et al. [2011] use a ToF camera to obtain depth images. They detect anatomical landmarks to fit a skeleton body model, solving constrained inverse kinematics. A graph is constructed from the depth data, and geodesic distances between body parts are measured, making the 3D positions of anatomical landmarks invariant to pose. Similar to our work, they use optical flow between subsequent images to make tracking robust to self-occlusion. Shotton et al. [2011] obtain depth images from a structured-light depth-sensing camera (i.e., Microsoft Kinect). They take an object recognition approach: A per-pixel body part classifier is trained on an extensive training dataset. The results are reprojected onto 3D space, and local means are used to generate confidence-scored 3D proposals of body joints.

In this work, we take a model-based approach for body posture estimation, because reconstructing body posture in 3D space provides important information, such as pointing direction. Hand shapes, by contrast, are more categorical, that is, it is typically not crucial to distinguish fine-grained details of hand shape in order to understand a body and hand gesture. Therefore, we take an appearance-based approach to hand shape classification.

Fig. 2. Input image (left), depth map (middle), and mask image (right). The “T-pose” shown in the figures is used for body tracking initialization.

There have been active efforts to build a principled probabilistic graphical model for sequence modeling based on discriminative learning. Lafferty et al. [2001] introduced Conditional Random Fields (CRF), a discriminative learning approach that does not make conditional independence assumptions. Quattoni et al. [2007] introduced Hidden Conditional Random Fields (HCRF), an extension to the CRF that incorporates hidden variables. Many other variants of the CRF have been introduced since then [Sutton et al. 2004; Gunawardana et al. 2005; Wang and Mori 2009], but most of them could not handle continuous input, limiting their use in real-world applications.

Morency et al. [2007] presented a Latent-Dynamic Conditional Random Field (LDCRF) that is able to perform sequence labeling and segmentation simultaneously. An LDCRF assumes a disjoint set of hidden state variables per label, allowing it to do parameter estimation and inference efficiently using belief propagation [Pearl 1988]. They showed that the model is capable of capturing the substructure of a class sequence and can learn dynamics between class labels, allowing the model to perform sequence labeling and segmentation simultaneously. However, the forward-backward message-passing schedule used in belief propagation limited its use to bounded input sequences only. In this work, we use an LDCRF with a temporal sliding window to predict sequence labels and perform segmentation online, augmenting the original framework with our multilayered filtering, preserving the advantages of belief propagation and extending the previous work to the unbounded input domain.

2. OBTAINING BODY AND HAND SIGNALS

In this section, we describe body and hand tracking, which receives input images from a stereo camera and produces body and hand signals by performing 3D body posture estimation and hand shape classification.

We describe image preprocessing in Section 2.1, which produces depth maps and mask images (see Figure 2). We describe 3D body posture estimation in Section 2.2 and hand shape classification in Section 2.3.

2.1. Image Preprocessing

The system starts by receiving pairs of time-synchronized images recorded from a Bumblebee2 stereo camera, producing 320 × 240 pixel resolution images at 20 FPS. While recording video, the system produces depth maps and mask images in real time (see Figure 2). Depth maps allow us to reconstruct body postures in 3D space and resolve some of the pose ambiguities arising from self-occlusion; mask images allow us to concentrate on the objects of interest and ignore the background, optimizing the use of available computational resources. We obtain depth maps using a manufacturer-provided SDK.¹

¹http://www.ptgrey.com.

Fig. 3. Generative model of the human upper body. The model includes 6 body parts (trunk, head, upper and lower arms for both sides) and 9 joints (chest, head, navel, left/right shoulder, elbow, and wrist).

We obtain mask images by performing background subtraction. Ideally, background subtraction could be done using depth information alone by the “depth-cut” method: Filter out pixels whose distance is farther from the camera than a foreground object, assuming there is no object in between the camera and the subject. However, as shown in Figure 2, depth maps typically have lower resolution than color images, which means that the mask images produced this way would have equally low resolution. This motivates our approach of performing background subtraction using a codebook approach [Kim et al. 2005], then refining the result with the depth-cut method.

The codebook approach works by learning a per-pixel background model from a history of 2D color background images sampled over a period of time, then segmenting out the “outlier” pixels in new images as foreground. Since this approach uses RGB images, it produces high-resolution mask images. One weakness of the codebook approach is, however, its sensitivity to illumination and shadows, arising because the codebook defines a foreground object as any set of pixels whose color values are noticeably different from the previously learned background model. To remedy this, after input images are background subtracted using the codebook approach, we refine the result using the depth-cut method described before.
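
A minimal sketch of this refinement step, with the codebook model abstracted away and an assumed depth cutoff, could look as follows.

```python
import numpy as np

def depth_cut(depth, max_depth):
    """Keep pixels closer than a cutoff, assuming nothing sits between camera and subject.
    Zero-valued depth pixels (failed stereo matches) are treated as background."""
    return (depth > 0) & (depth < max_depth)

def refine_mask(codebook_mask, depth, max_depth=2.5):
    """Suppress shadow/illumination false positives in the high-resolution codebook mask
    by intersecting it with the depth-cut mask (upsampled to the color image size
    beforehand, e.g., with nearest-neighbor resizing)."""
    return codebook_mask & depth_cut(depth, max_depth)
```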

2.2. 3D Body Posture Estimation

The goal here is to reconstruct upper-body posture in 3D space given the input images. We formulate this as a sequential Bayesian filtering problem, that is, having observed a sequence of images Z_t = {z_1, ..., z_t} and knowing the prior state density p(x_t), make a prediction about the posterior state density p(x_t | Z_t), where x_t = (x_{1,t}, ..., x_{k,t}) is a k-dimensional vector representing the body posture we are estimating.

2.2.1. Generative Upper-Body Model. Our generative model of the human upper-body is constructed in 3D space, using a skeletal model represented as a kinematic chain and a volumetric model described by superellipsoids [Barr 1981] (see Figure 3). The model includes 6 body parts (trunk, head, upper and lower arms for both sides) and 9 joints (chest, head, navel, left/right shoulder, elbow, and wrist). The shoulder is modeled as a 3 DOF ball-and-socket joint, and the elbow is modeled as a 1 DOF revolute joint, resulting in 8 model parameters in total. Coordinates of each joint are obtained by solving the forward kinematics problem, following the Denavit-Hartenberg convention [Denavit and Hartenberg 1955], a compact way of representing n-link kinematic structures. We prevent the model from generating anatomically implausible body postures by constraining joint angles to known physiological limits [NASA 1995].

The human shoulder has historically been the most challenging part for human body modeling [Engin 1980]. It has a complicated anatomical structure, with bones, muscles, skin, and ligaments intertwined, making modeling of shoulder movement difficult [Feng et al. 2008].

We improve on our basic model of the human upper body by building a more precise model of the shoulder, while still not increasing the dimensionality of the model parameter vector. To capture arm movement more accurately, after a body model is generated, the shoulder model is refined analytically using the relative positions of other body joints.

Therefore, the generation of a body model is a two-step procedure: Given the eight joint angle values, we first solve the forward kinematics problem, obtaining the coordinates of each joint. Then we compute the angle ϕ between the chest-to-shoulder line and the chest-to-elbow line, and update the chest-to-shoulder angle θ^CS as²

    θ^CS′ = min(θ^CS + ϕ, θ^CS_MAX)   if the elbow is higher than the shoulder,
    θ^CS′ = max(θ^CS − ϕ, θ^CS_MIN)   otherwise,                                    (1)

where θ^CS_MIN and θ^CS_MAX are minimum and maximum joint angle limits for chest-to-shoulder joints [NASA 1995]. Figure 3 illustrates our generative model, rendered after the chest-to-shoulder angles θ^CS are adjusted (note the left/right chest-to-shoulder angles are different). This simplified model mimics shoulder movement in only one dimension, up and down, but works quite well if the subject is facing the camera, as is commonly true for human-computer interaction.
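
In code, the refinement of Eq. (1) is a clamped update; the sketch below assumes joint heights are expressed in a coordinate frame whose vertical axis points upward.

```python
def refine_chest_to_shoulder(theta_cs, phi, theta_min, theta_max,
                             elbow_height, shoulder_height):
    """Analytic shoulder refinement (Eq. (1)): raise or lower the chest-to-shoulder
    angle by phi, clamped to the physiological limits from NASA [1995].
    Heights are vertical joint coordinates in a frame where up is positive."""
    if elbow_height > shoulder_height:      # elbow higher than shoulder
        return min(theta_cs + phi, theta_max)
    return max(theta_cs - phi, theta_min)
```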

With these settings, an upper-body posture is parameterized as

    x = (G R)^T,                                                                    (2)

where G is a 6-dimensional global translation and rotation vector, and R is an 8-dimensional joint angle vector (3 for the shoulder and 1 for the elbow, for each arm). In practice, once the parameters are initialized, we fix all but the (x, z) translation elements of G, making x a 10-dimensional vector.

2.2.2. Particle Filter. Human body movements can be highly unpredictable, so an inference method that assumes its random variables form a single Gaussian distribution can fall into local minima or completely lose track. A particle filter [Isard and Blake 1998] is particularly well suited to this type of task for its ability to maintain multiple hypotheses during inference, discarding less likely hypotheses only slowly. It represents the posterior state density p(x_t | Z_t) as a multimodal non-Gaussian distribution, which is approximated by a set of N weighted particles {(s_t^(1), π_t^(1)), ..., (s_t^(N), π_t^(N))}. Each sample s_t^(n) represents a pose configuration, and the weights π_t^(n) are obtained by computing the likelihood p(z_t | x_t = s_t^(n)) and normalizing so that Σ_{n=1}^{N} π_t^(n) = 1.

The joint angle dynamic model is constructed as a Gaussian process:

    x_t = x_{t−1} + e,    e ∼ N(0, σ²).                                             (3)

Once N particles are generated, we obtain the estimation result by calculating the Bayesian Least Squares (BLS) estimate:

    E[f(x_t)] = Σ_{n=1}^{N} π_t^(n) f(s_t^(n)).                                     (4)
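
A compact sketch of one tracking step built from Eqs. (3) and (4) is shown below; the likelihood argument stands in for the fitting-error term of Section 2.2.3, σ is an assumed diffusion scale, and the resampling shown is the generic SIR variant rather than necessarily the exact factored-sampling scheme of Isard and Blake [1998].

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood, sigma=0.05,
                         rng=np.random.default_rng()):
    """One step of the body-tracking particle filter.
    particles: (N, K) pose parameter vectors s_t^(n); weights: (N,) normalized pi_t^(n);
    likelihood: callable s -> p(z_t | x_t = s), e.g., 1 / exp(fitting error)."""
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights)                   # resample by previous weights
    particles = particles[idx]
    particles = particles + rng.normal(0.0, sigma, size=particles.shape)  # Eq. (3)
    weights = np.array([likelihood(s) for s in particles])
    weights = weights / weights.sum()                        # normalize so they sum to 1
    estimate = (weights[:, None] * particles).sum(axis=0)    # Eq. (4) with f = identity
    return particles, weights, estimate
```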

Iterative methods need a good initialization. We initialize our generative model at the first frame: The initial body posture configurations (i.e., joint angles and limb lengths) are obtained by having the subject assume a static “T-pose” (shown in Figure 2) and fitting the model to the image with exhaustive search. This typically requires no more than 0.3 seconds (on an Intel Xeon CPU 2.4 GHz machine with 4 GB of RAM).

²Note that the angle ϕ is not an additional model parameter, because it is computed analytically using joint positions.

Fig. 4. Motion history images of the observation (left) and the estimated model (right). White pixel values indicate an object has appeared in the pixel; gray pixel values indicate there was an object in the pixel but it has moved; black pixel values indicate there has been no change in the pixel.

2.2.3. Likelihood Function. The likelihood function p(z_t | x_t = s_t^(n)) measures the goodness-of-fit of an observation z_t given a sample s_t^(n). We define it as an inverse of an exponentiated fitting error ε(z_t, s_t^(n)):

    p(z_t | x_t = s_t^(n)) = 1 / exp(ε(z_t, s_t^(n))).                              (5)

The fitting error ε(z_t, s_t^(n)) is a weighted sum of three error terms computed by comparing features extracted from the generative model to the corresponding features extracted from input images. The three features include a 3D visible-surface point cloud, a 3D contour point cloud, and a Motion History Image (MHI) [Bobick and Davis 2001]. The first two features capture discrepancies in static poses; the third captures discrepancies in the dynamics of motion. We chose the weights for each error term empirically.

The first two features, 3D visible-surface and contour point clouds, are used frequently in body motion tracking (e.g., Deutscher et al. [2000]) for their ability to evaluate how well the generated body posture fits the actual pose observed in the image. We measure the fitting error by computing the sum of squared Euclidean distance errors between the point cloud of the model and the point cloud of the input image (i.e., the 3D data supplied by the image preprocessing step described earlier).
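
The correspondence scheme between the two clouds is not spelled out here; assuming nearest-neighbor matching, the static error terms could be computed as in the following sketch (SciPy's k-d tree is used purely for illustration).

```python
import numpy as np
from scipy.spatial import cKDTree

def point_cloud_error(model_points, observed_points):
    """Sum of squared Euclidean distances from each model point to its nearest
    observed point; applied to both the visible-surface and the contour clouds.
    model_points: (P, 3) rendered from a sample; observed_points: (M, 3) from depth."""
    dists, _ = cKDTree(observed_points).query(model_points)
    return float(np.sum(dists ** 2))
```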

The third feature, an MHI, is an image in which each pixel value is a function of the recency of motion in a sequence of images (see Figure 4). This often provides useful information about the dynamics of motion, as it indicates where and how the motion has occurred. We define an MHI-based error term to measure discrepancies in the dynamics of motion.

An MHI is computed from I_{t−1} and I_t, two time-consecutive 8-bit unsigned integer images whose pixel values span from 0 to 255. For the generative model, I_t is obtained by rendering the model generated by a sample s_t^(n) (i.e., rendering an image of what body posture s_t^(n) would look like), and I_{t−1} is obtained by rendering E[f(x_{t−1})], the model generated by the estimation result from the previous step (Eq. (4)). For the input images, I_t is obtained by converting an RGB input image to YCrCb color space and extracting the brightness channel³, and this is stored to be used as I_{t−1} for the next time step. Then an MHI is computed as

    I_MHI = thresh(I_{t−1} − I_t, 0, 127) + thresh(I_t − I_{t−1}, 0, 255),          (6)

where thresh(I, α, β) is a binary threshold operator that sets each pixel value to β if I(x, y) > α, and zero otherwise. The first term captures pixels that were occupied at the previous time step but not in the current time step. The second term captures pixels that are newly occupied in the current time step. We chose the values 0, 127, and 255 to indicate the time information of those pixels: 0 means there has been no change in the pixel, regardless of whether or not there was an object; 127 means there was an object in the pixel but it has moved; and 255 means an object has appeared in the pixel. This allows us to construct an image that concentrates on the moved regions only (e.g., arms), while ignoring the unmoved parts (e.g., trunk, background). The computed MHI images are visualized in Figure 4.

Given the MHIs of the generative model and the observation, one can define various error measures. In this work, we define an MHI error as

    ε_MHI = Count[thresh(I′, 127, 255)],                                            (7)

where

    I′ = abs(I_MHI(z_t, z_{t−1}) − I_MHI(s_t^(n), E[f(x_{t−1})])).                  (8)

This error function first subtracts an MHI of the model, I_MHI(s_t^(n), E[f(x_{t−1})]), from an MHI of the observation, I_MHI(z_t, z_{t−1}), and computes an absolute-valued image of it (Eq. (8)). Then it applies the binary threshold operator with the cutoff value and result value (127 and 255, respectively), and counts nonzero pixels with Count[·] (Eq. (7)). We set the cutoff value to 127 to penalize the conditions in which two MHIs do not match at the current time step, independent of the situation at the previous time step.⁴
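
Eqs. (6)-(8) translate almost directly into code; the sketch below assumes equally sized 8-bit grayscale arrays and widens them to a signed type so the subtractions do not wrap around.

```python
import numpy as np

def thresh(img, alpha, beta):
    """Binary threshold operator: beta where img(x, y) > alpha, zero elsewhere."""
    return np.where(img > alpha, beta, 0).astype(np.int16)

def motion_history_image(prev, curr):
    """Eq. (6): 255 where an object newly appeared, 127 where an object moved away."""
    prev, curr = prev.astype(np.int16), curr.astype(np.int16)
    return thresh(prev - curr, 0, 127) + thresh(curr - prev, 0, 255)

def mhi_error(mhi_obs, mhi_model):
    """Eqs. (7)-(8): count pixels where the two MHIs disagree about the current step.
    A difference of exactly 127 (disagreement only about the previous step) is ignored."""
    diff = np.abs(mhi_obs - mhi_model)
    return int(np.count_nonzero(thresh(diff, 127, 255)))
```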

We evaluate the effectiveness of the MHI-based error measure in Section 4.4, where we compare gesture recognition accuracy of the models trained on features estimated with and without the MHI-based error measure.

2.3. Hand Shape Classification

The goal of hand shape classification is to classify hand shapes made contemporaneously with gestures into one of several canonical hand shapes.