Gender Perception From Gait: A Comparison Between Biological, Biomimetic and Non-biomimetic Learning Paradigms

Sarangi, Viswadeep; Pelah, Adar; Hahn, William Edward; Barenholtz, Elan

doi:10.3389/fnhum.2020.00320

ORIGINAL RESEARCH article

Front. Hum. Neurosci., 27 August 2020

Sec. Motor Neuroscience

Volume 14 - 2020 | https://doi.org/10.3389/fnhum.2020.00320

This article is part of the Research Topic Machine Learning in Neuroscience

Viswadeep Sarangi¹

Adar Pelah¹*

William Edward Hahn²

Elan Barenholtz²

¹Department of Electronic Engineering, University of York, York, United Kingdom
²Center for Complex Systems and Brain Sciences, Florida Atlantic University, Boca Raton, FL, United States

This paper explores in parallel the underlying mechanisms in human perception of biological motion and the best approaches for automatic classification of gait. The experiments tested three different learning paradigms, namely, biological, biomimetic, and non-biomimetic models for gender identification from human gait. Psychophysical experiments with twenty-one observers were conducted along with computational experiments, without applying any gender-specific modifications to the models or the stimuli. Results demonstrate the utilization of a generic memory-based learning system in humans for gait perception, thus reducing ambiguity between two opposing learning systems proposed for biological motion perception. Results also support the biomimetic nature of memory-based artificial neural networks (ANN) in their ability to emulate biological neural networks, as opposed to non-biomimetic models. In addition, the comparison between biological and computational learning approaches establishes a memory-based biomimetic model as the best candidate for a generic artificial gait classifier (83% accuracy, p < 0.001), compared to human observers (66%, p < 0.005) or non-biomimetic models (83%, p < 0.001), while adhering to human-like sensitivity to gender identification. This promises potential for application of the model to any given non-gender-based gait perception objective with superhuman performance.

Introduction

A person’s gait carries information about the individual along multiple dimensions. In addition to indicating biologically intrinsic properties, like gender and identity, the gait of a person changes dynamically based on their emotional state (Pollick et al., 2002) and state of health (Cesari et al., 2005). Humans are adept at identifying whether a given sparse motion pattern is biological or not (Johansson, 1973, 1976) as well as detecting properties such as gender or mood. However, the origin of these abilities remains unclear. On the one hand, the ability to distinguish biological from non-biological motion appears at a very young age (Fox and McDaniel, 1982), suggesting there may be some expert-system capacities present at birth. Indeed, some theorists have suggested that biological motion perception served as an evolutionary and developmental precursor to the theory of mind (Frith, 1999). On the other hand, recognition of biological motion could also be attributed to generic learning systems that are trained with experience (Fox and McDaniel, 1982; Bertenthal and Pinto, 1994; Thompson and Parasuraman, 2012): adults have been shown to be able to identify biological motion that was synthetically created using machines, while infants could not, indicating a learning system that tunes itself based on experience. One way to address this question is to compare the behavior of human observers to computational learning models of different types that can be trained “from scratch,” i.e., without specialized mechanisms pre-tuned to the properties of biological motion. This provides the opportunity to assess whether generic learning models trained on biological motion stimuli serve as a reasonable model of human behavior or whether additional mechanisms, such as pre-tuned expert systems, should be posited.
In addition, we classify the computational models into two groups: (1) biomimetic models, which functionally replicate the neural learning systems in humans, especially biological memory, and (2) non-biomimetic models, which utilize statistical techniques to identify discerning features in data for classification. For clarification, we use the term “biomimetic” to mean the study of the structure and function of living things as models for the creation of materials or products by reverse engineering (Farber, 2010). A strong resemblance between humans and biomimetic models, but not between humans and non-biomimetic models, would provide further evidence that the mechanisms underlying human biological motion perception are well captured by a generic learning model. To test this, we compared human performance in a simple binary biological motion classification task (gender recognition) with that of a range of computational learning models.

Biological Models

Johansson (1973, 1976) first demonstrated that human observers were sensitive to biological motion through point-light displays representing the joints of human walkers. Despite their sparsity, human observers readily interpreted the stimuli as human gait. Subsequent research with point-light walkers demonstrated the ability of humans to identify familiar people (Loula et al., 2005). When the walker was unfamiliar, observers could still extract certain general categories such as approximate age and gender (Kozlowski and Cutting, 1977; Barclay et al., 1978; George and Murdoch, 1994; Lee and Grimson, 2002; Pollick et al., 2005) with significantly higher than chance accuracy. For gender identification, humans achieved the best performance when presented with the stimuli in the coronal plane, due to the prevalence of dynamic cues (George and Murdoch, 1994). However, the biological nature of human perception hinders its replicability and transferability. The learning is highly variable, volatile and susceptible to fatigue, illness and mortality, leading to the need for automation of gait classification. Automation of gait classification has been extensively studied, with a strong focus on performance measured through classification accuracy. However, mimicking human perception closely would ensure versatility of the artificial classifier, enabling the same classifier to be used in other non-gender-related gait classification tasks.

Machine Learning Models

Machine learning (ML) models can be trained to identify relevant attributes in gait, such as gender, with high speed and fidelity. The models can broadly be divided into two categories: (1) memory-based models comprised of artificial neural networks (ANN), such as Long Short-Term Memory (LSTM) cells (Graves and Schmidhuber, 2005), which operate on time series data, and (2) static models, such as Random Decision Forests (RDFs) (Kaur and Bawa, 2017) and Support Vector Machines (SVMs) (Huang et al., 2014), which operate on static data. The LSTM model is referred to as the “biomimetic” model, reflecting its functional implementation of biological neural networks and memory using artificial neurons, while the SVM and RDF are referred to as the “non-biomimetic” models. Prior studies have shown promising results regarding the ability of biomimetic ML to mimic human observation of gait (Pelah et al., 2019; Sarangi et al., 2019; Stone et al., 2019). However, an in-depth exploration of the above-mentioned models and a direct comparison to human observers on the same stimuli have not been conducted.

Biomimetic Models

Artificial neural networks aim to mimic the flow of information in the biological brain by creating a network of neurons, based on the perceptron model (Rosenblatt, 1958). Recurrent neural networks (RNNs), an implementation of the ANN, simulate the memory capabilities of the human brain by creating an additional feedback loop for processing the latent network state along with new data (Mikolov and Zweig, 2012). RNNs operate on a sequence of vectors as input data. The sequence represents the time series information and each vector represents the features of the input at one timestamp. However, RNNs suffer from vanishing and exploding gradients, rendering them ineffective in processing long sequences (Graves and Schmidhuber, 2005). LSTM cells overcome this problem by introducing additional gates in the network to regulate the flow of information, enabling them to remember relevant temporal patterns over long periods of time (Graves and Schmidhuber, 2005).
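
For reference, the gating described above is commonly written as follows. This is the widely used textbook formulation of an LSTM cell without peephole connections; the symbols (input x_t, hidden state h_t, cell state c_t, weights W and U, biases b) are notation assumed here, and the text does not state which exact LSTM variant was implemented.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

Because the cell state c_t is updated additively, with the forget gate f_t scaling the previous state, gradients can propagate across many time steps more stably than through the repeated matrix multiplications of a plain RNN.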

Non-biomimetic Models

The non-biomimetic ML techniques considered in this paper learn to classify information based on (1) linear separability, following non-linear projections, and (2) reduction of information entropy based on feature thresholds. SVM models learn to fit a linear hyperplane that maximizes the separation between the classes in the training dataset. Decision trees learn to classify information based on learned numerical thresholds of features. An RDF is a collection of randomly initialized decision trees, with the majority vote of the cohort taken as the predicted class. SVMs and RDFs accept static representations of data as input and thus cannot process temporal sequences of information, unlike humans and LSTMs.
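
The entropy-reduction and majority-vote ideas on the decision-tree side can be illustrated with a minimal sketch. It is not the implementation used in the study: the function names and the toy feature values are hypothetical, only a single feature is searched, and the SVM is not covered.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) of the best split on one feature."""
    parent = entropy(labels)
    best = (None, 0.0)
    # Candidate thresholds: every distinct value except the largest one.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        children = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(labels)
        gain = parent - children
        if gain > best[1]:
            best = (t, gain)
    return best


def majority_vote(predictions):
    """Class predicted by most trees in the forest."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical toy feature (for illustration only) with binary labels.
toy_values = [1.0, 2.0, 3.0, 4.0]
toy_labels = ["F", "F", "M", "M"]
threshold, gain = best_threshold(toy_values, toy_labels)
```

A full forest would repeat the threshold search over random subsets of features and samples, then combine the per-tree outputs with `majority_vote`.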

Data Collection

Forty-one consenting healthy adults (26 male, 15 female) between the ages of 18 and 50 years old were recorded walking on a treadmill. Participants volunteered and received credit toward a participation grade for their class. Appropriate consent forms were signed and anonymity was maintained. Gait data was recorded as spatiotemporal three-dimensional joint trajectories for 20 tracked joints of the body. The tracked points on the walker’s skeleton included the head, neck, shoulders, elbows, wrists, fingertips, mid spine, back, hips, knees, ankles, and toes. The collection of the joint positions formed a static frame. Data was captured at 24 frames per second, each frame represented by 60 numbers (3D coordinates of 20 joints) and a corresponding timestamp of capture of the frame. Data was recorded for 6 sessions per participant. Each session consisted of a minute of walking on the treadmill at a self-selected speed followed by a minute’s rest. The joints were extracted utilizing a consumer-level time-of-flight based RGB-D sensor, the Microsoft Kinect v2. The sensor provides an anthropomorphic representation of the human skeleton through 3D joint coordinates. The sensor was placed approximately 1.5 m in front of the treadmill, with the front board removed to avoid issues with occlusion. The ML-based skeletal motion capture method described in Shi et al. (2018) was used for capturing the point-light display (PLD) representation of the biological motion of the walkers. When compared with state-of-the-art optical motion tracking methods [such as Vicon (Clark et al., 2012; Pfister et al., 2014)], the anatomical landmarks from the Kinect-generated point clouds can be measured with high test-retest reliability, and the differences in the interclass correlation coefficient between Kinect and Vicon are <0.16 (Clark et al., 2012, 2013; Pfister et al., 2014; van Diest et al., 2014).
Both systems have been shown to effectively capture >90% variance in full-body segment movements during exergaming (van Diest et al., 2014). The validity of biological motion captured using the Kinect v2 sensor is established in Shi et al. (2018) with human observers through reflexive attentional orientation and extraction of emotional information from the upright and inverted PLD.

Experiment 1: Biological Models

Studies have shown that humans require no more than two complete gait cycles to correctly identify gender from human gait (Huang et al., 2014). In terms of duration, this translates to less than 2.7 s of walking animation. Although humans can decipher biological motion from a point-light animation of a walking human figure within 200 ms, at least 1.6 s of stimulus is required for significantly above chance performance. This experiment aims to establish the change in gender identification performance in humans as a function of increasing duration of stimulus exposure.

Method

Biological Model

Fifteen female and six male healthy observers, with ages ranging from 20 to 43 years, participated in the experiment. All had some experience of biological motion displays, although none had been required to make judgments about gender.

Stimuli

A PC-compatible computer with a high performance raster graphics system displayed stimuli on an Iiyama ProLite B2283HS color monitor (1920 × 1080 resolution, 60 Hz refresh rate). Human figures were defined by 20 circular white dots of 5 pixel radius overlaid on a black background, located on the head, neck, shoulders, elbows, wrists, fingertips, back, spine, hips, knees, ankles, and toes. None of the dots were occluded by other subjective parts of the figure. Animated sequences were created by placing the dots at the three-dimensional trajectory of each of the 20 tracked joints, and temporally sampling the coordinates to produce 24 static frames per second, as shown in Figure 1. The stimulus size was 6 degrees wide and 8 degrees long for the whole frame, including zero (black) padding. A degree of visual angle is defined as the subtended angle at the nodal point of the eye. The actual walking clip was 2.5 degrees wide and 4 degrees long. The presented stimuli were height-normalized to fit the given aspect ratio, to prevent the observers from identifying gender based on the height of the animated walker.
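
Converting a visual angle into an on-screen size follows from basic trigonometry, as the sketch below shows. The viewing distance and pixel density are not reported in the text, so the values of 57 cm and 40 pixels per cm used here are purely hypothetical placeholders, as are the function names.

```python
import math


def visual_angle_to_size(angle_deg, viewing_distance):
    """Physical size that subtends angle_deg at viewing_distance (same units)."""
    return 2.0 * viewing_distance * math.tan(math.radians(angle_deg) / 2.0)


def size_to_pixels(size_cm, pixels_per_cm):
    """On-screen extent in pixels for a physical size in centimetres."""
    return size_cm * pixels_per_cm


# Hypothetical setup values (NOT reported in the text).
ASSUMED_DISTANCE_CM = 57.0
ASSUMED_PIXELS_PER_CM = 40.0

# Frame size stated in the text: 6 degrees wide, 8 degrees long.
frame_width_px = size_to_pixels(
    visual_angle_to_size(6.0, ASSUMED_DISTANCE_CM), ASSUMED_PIXELS_PER_CM)
frame_height_px = size_to_pixels(
    visual_angle_to_size(8.0, ASSUMED_DISTANCE_CM), ASSUMED_PIXELS_PER_CM)
```

At a viewing distance of about 57 cm, one degree of visual angle corresponds to roughly 1 cm on the screen, which is why that distance is a common convenience choice in psychophysics.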

Figure 1. Point light representation of a walking stimulus at eight different stages of a gait cycle.

When the static frames were played in quick succession, a vivid impression of a walking person emerged. There was no progressive component to the walking animation; thus, the human figure appeared to walk on an unseen treadmill with the walking direction oriented toward the observer. The height range for males was 144–208 cm, and for females was 129–152 cm. None were notably over- or underweight (see Table 1). The x and z components were sampled to display the walker in the coronal plane, to emphasize lateral sway and maximize the provision of dynamic cues to the observer (Barclay et al., 1978; Troje, 2002). The recorded gait sequences were converted into animation sequences in the same fashion to be presented as visual stimuli. The observers were seated in a well-lit room in front of the monitor and had access to a standard computer mouse for interaction. Randomly chosen walker stimuli were presented for exposure durations of 0.4, 1.5, 2.5, and 3.8 s, each followed by an on-screen prompt in the form of two labeled buttons requesting the observer’s binary gender prediction through a mouse click. Following the response from the observer, the next stimulus was presented. A total of 200 walking clips were shown per observer per exposure duration and the responses were recorded for each.
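
The projection and height normalization described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, the text does not say which retained component is vertical on screen (the second one is assumed here), and a single scale factor per clip is assumed so that the figure does not change size between frames.

```python
def project_frame(frame, axes=(0, 2)):
    """Keep two coordinate axes of each (x, y, z) joint, giving 2D dots.

    axes=(0, 2) selects the x and z components named in the text.
    """
    a, b = axes
    return [(joint[a], joint[b]) for joint in frame]


def normalize_clip_height(clip, target_height=1.0):
    """Scale every frame by one factor so the tallest figure spans target_height.

    Assumes the second value of each dot is the vertical screen coordinate.
    """
    extents = [max(v for _, v in dots) - min(v for _, v in dots)
               for dots in clip]
    tallest = max(extents)
    if tallest <= 0:
        raise ValueError("clip has no vertical extent")
    scale = target_height / tallest
    return [[(h * scale, v * scale) for h, v in dots] for dots in clip]


# Hypothetical two-joint frame, for illustration only.
toy_frame = [(0.1, 9.0, 0.0), (-0.2, 9.0, 1.8)]
toy_clip = normalize_clip_height([project_frame(toy_frame)])
```

Applying the same target height to every walker removes absolute height as a cue while preserving each walker's body proportions.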

Table 1. Description of the walking subjects taking part in the stimulus set.

Results

Human observers correctly identified gender in 63% of all trials across all exposure durations, t(20) = 7.8, p < 0.001, two-tailed. All t-tests reported in this paper are two-tailed, unless otherwise indicated. Chance performance for the binary gender classification is 50%. Correct identification at 0.4 s, which consisted of a quarter of a step cycle, was above chance at 60%, t(20) = 3.7, p < 0.01, conforming with (Barclay et al., 1978; Troje, 2002), although in disagreement with (Barclay et al., 1978). This could be attributed to the presentation of the stimulus in the coronal plane as opposed to the sagittal plane (Sarangi et al., 2019), leading to higher emphasis on the dynamic cues. Performance at 1.5 s was 66%, t(20) = 3.8, p < 0.005, which is higher than the performance at 2.5 s of 61%, t(20) = 4.8, p < 0.001. Barclay et al. (1978) explain this anomalous drop as the result of an additional partial step at 2.5 s, highlighting the preferred perception of velocity over positional cues, where sensitivity to gender identification decreases mid-swing in the gait cycle. Humans identified gender with the highest accuracy, 69%, at 3.8 s, t(20) = 3.4, p < 0.01. Details of the results are listed in Table 2. Overall, the performance of the human observers taking part in the experiment conforms to the results of perceptual experiments in the literature (Barclay et al., 1978; Troje, 2002), providing a reliable baseline for comparison with the biomimetic perception on the same stimulus set.
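
Statistics of the form t(20) are consistent with a one-sample t-test of per-observer accuracy against the 50% chance level, with 21 observers giving 20 degrees of freedom. A minimal sketch of that computation is given below, assuming this is the test that was used; the accuracy values are invented for illustration and are not data from the study.

```python
import math
from statistics import mean, stdev


def one_sample_t(scores, chance=0.5):
    """Return (t statistic, degrees of freedom) for scores tested against chance."""
    n = len(scores)
    standard_error = stdev(scores) / math.sqrt(n)  # stdev uses the n - 1 denominator
    return (mean(scores) - chance) / standard_error, n - 1


# Hypothetical per-observer accuracies (NOT data from the study).
toy_scores = [0.60, 0.70, 0.65, 0.55, 0.75]
t_value, dof = one_sample_t(toy_scores)
```

Turning the statistic into a two-tailed p-value additionally requires the cumulative distribution function of Student's t distribution, which a statistics package would normally supply.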

Table 2. Gender identification accuracy in % as a function of exposure duration of the stimulus.

In summary, human observers were able to identify gender from gait with significantly above chance performance from moving-dot presentations of joints, in line with the existing human perception literature. There is a significant increase in gender identification performance between 0.4 and 3.8 s of stimulus exposure duration. The increased gender sensitivity at 1.5 s is attributed to the prevalence of dynamic, velocity-based cues at the phase of the step cycle corresponding to that time (George and Murdoch, 1994), thus demonstrating the preference of humans toward dynamic velocity-based cues over structural position-based cues for gender identification.

Experiment 2: Biomimetic Models

The capability of long short-term memory networks to process a temporal sequence of data is intended to mimic the temporal pattern recognition capabilities of humans. The learning gates inherent in the network parallel the short- and long-term memory of the human brain, enabling the network to remember the relevant temporal patterns while ignoring patterns that do not contribute toward the classification objective. This experiment presents an LSTM network with the temporal evolution of joint trajectories during human gait and trains it for gender identification, in order to evaluate its resemblance to human observers.

Method

Biomimetic Model

A standard LSTM model consisting of 128 hidden states in the cell (as shown in Figure 2) was initialized. The cell state weights were initialized from a random normal distribution. The final cell state was ReLU activated (Maas et al., 2013) and connected to an affine output layer, which represented the one-hot labeled gender identity of the walker during training. During testing, the output layer represented the prediction values. The prediction error was evaluated using a cross-entropy function (de Brébisson and Vincent, 2015), and the weights were updated using an Adam optimizer (Kingma and Ba, 2014) based on the error differentials and a learning rate of 0.001. The most probable output was taken as the class label during prediction. Ten LSTM models, mimicking 10 random human perceptions, were generated for inferring gender from the gait input.
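
The output stage and weight update described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the text specifies only the learning rate of 0.001, so the remaining Adam hyperparameters are assumed to be the defaults proposed by Kingma and Ba (2014), the use of a softmax to obtain probabilities is an assumption, and the class order (index 0 or 1) is hypothetical.

```python
import math


def softmax(logits):
    """Convert output-layer values into class probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot) if t > 0)


def predict(logits):
    """Index of the most probable class."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def adam_step(param, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of a scalar parameter; t is the 1-based step count.

    beta1, beta2 and eps are assumed defaults, not values from the text.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# Hypothetical output-layer values for one gait sequence.
toy_loss = cross_entropy(softmax([0.0, 0.0]), [1, 0])
toy_class = predict([0.2, 1.5])
```

In a deep learning framework these pieces are normally provided as built-in loss and optimizer objects; they are spelled out here only to make the computation explicit.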

Figure 2. Implementation of the LSTM network architecture for processing gait sequences.

Data Input

The three-dimensional trajectories of each of the 20 tracked joints were concatenated to form a vector representation of a static frame with a cardinality of 60, representing the location of the head, neck, shoulders, elbows, wrists, fingertips, mid-back, hips, knees, ankles, and toes. Gait input to the model consisted of a sequence of vector representations of subsequent static frames, sampled at 24 frames per second. Joint trajectories were size normalized (Troje, 2002) and standardized to zero mean and unit standard deviation. Model training sessions included initialization of the model weights, prediction of the output probabilities based on the gait input, propagation of the prediction error, and updating of the network weights. Model training was executed in batches of 50 and repeated for 100 epochs. Input sequence durations mirrored the exposure durations in the corresponding human perception experiment and varied incrementally for 10 durations from 0.4 to 3.8 s in steps of 0.4 s (10 static frames). Ten-fold cross-validation was carried out to ensure model generalizability, and a total of 250 gender predictions were obtained per input sequence duration. The models trained per session per duration were stored locally for future analyses.
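
The input preparation described above can be sketched as follows. The function names are hypothetical, and the text does not say whether standardization statistics were computed per sequence or over the whole training set; per-sequence, per-feature statistics are assumed here for simplicity.

```python
import math

FPS = 24  # frame rate stated in the text


def flatten_frame(joints):
    """Concatenate 20 (x, y, z) joint positions into one 60-element vector."""
    return [coordinate for joint in joints for coordinate in joint]


def frames_for_duration(seconds, fps=FPS):
    """Number of static frames covering the given duration."""
    return int(round(seconds * fps))


def standardize(sequence):
    """Scale each feature of a frame sequence to zero mean and unit std."""
    n = len(sequence)
    dims = len(sequence[0])
    means = [sum(frame[d] for frame in sequence) / n for d in range(dims)]
    stds = [math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in sequence) / n)
            for d in range(dims)]
    # Constant features (zero spread) are mapped to 0.0 to avoid division by zero.
    return [[(frame[d] - means[d]) / stds[d] if stds[d] > 0 else 0.0
             for d in range(dims)]
            for frame in sequence]


# Hypothetical frame with all joints at the origin, for illustration only.
toy_vector = flatten_frame([(0.0, 0.0, 0.0)] * 20)
shortest_input = frames_for_duration(0.4)
```

With these assumptions, the shortest input of 0.4 s corresponds to the 10 static frames mentioned in the text, and longer inputs are obtained by keeping more consecutive frames of the same sequence.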

Results

The LSTM models correctly identified gender in 76% of all the gait inputs presented across all the input durations, t(9) = 9.2, p < 0.001. Chance performance remains the same at 50%. Correct identification at a quarter of a step cycle, at 0.4 s, was 71%, t(9) = 5, p < 0.001, higher than that of human observers at the same duration, F(9,20) = 3.6, p < 0.1. The difference in performance indicates a higher inference capacity from a limited amount of available data. The inference performance increases slightly with the amount of information available from 0.4 to 3.8 s, F(9,9) = 2, p < 0.1. At 3.8 s, the model correctly identified gender with 81% accuracy, t(9) = 9.6, p < 0.001, considerably higher than human observers, F(9,20) = 9, p < 0.01. Generalizing across all the input (or exposure) durations, the biomimetic model identified gender with a significantly higher accuracy than the human observers, F(9,20) = 39.9, p < 0.001. Details of the results obtained for the LSTM model are presented in Table 3, with the corresponding trend plotted in Figure 3. As shown in the figure, mean performance peaks temporarily at 1.6 s (halfway completion of one gait step) with 79% accuracy, t(9) = 10.1, p < 0.001, suggesting a dependence on dynamic and velocity cues similar to that of humans at 1.5 s.

Table 3. Gender identification accuracy as a function of exposure duration of the stimulus with p < 0.001 for all the durations.

Figure 3. Gender identification performance in mean ± standard error % by the models as a function of exposure duration in seconds.

In summary, the biomimetic LSTM model performed significantly better than chance in gender classification from 3D moving point representations of human gait. There was a significant increase in gender identification accuracy from 0.4 to 3.8 s of gait information exposure, corresponding to humans. The increased gender sensitivity at 1.6 s could be attributed to an inherent sensitivity to dynamic velocity based cues in LSTM networks for gender identification, similar to humans. One could argue that the presentation of the skeleton stimulus as facing toward the camera could potentially limit real-world applications. However, although specific deployments would need to be assessed, the availability of 3D data could be leveraged to apply a simple preprocessing rotational step to the skeleton to correct for any misalignment in global skeletal configuration.

Experiment 3: Non-Biomimetic Models

The non-biomimetic models, unlike humans and LSTMs, are capable of analyzing static data only. Their reliance on the principles of linear separability and information entropy to create rules for classification resembles expert systems. The models require a static representation of the spatiotemporal gait data for gait classification. Thus, data was represented as (1) static descriptions of the temporal signals, and (2) extracted metrics used in a clinical setting to describe gait for diagnosis and rehabilitation monitoring. In this experiment, we evaluate the SVMs and RDFs on the two static representations of gait data for resemblance to human observation.