AUTHOR=Haider Fasih, Koutsombogera Maria, Conlan Owen, Vogel Carl, Campbell Nick, Luz Saturnino

TITLE=An Active Data Representation of Videos for Automatic Scoring of Oral Presentation Delivery Skills and Feedback Generation

JOURNAL=Frontiers in Computer Science

VOLUME=Volume 2 - 2020

YEAR=2020

URL=https://www.frontiersin.org/journals/computer-science/articles/10.3389/fcomp.2020.00001

DOI=10.3389/fcomp.2020.00001

ISSN=2624-9898

ABSTRACT=Public speaking is an important skill, the acquisition of which requires dedicated and time-consuming training. In recent years, researchers have started to investigate automatic methods to support public speaking skills training. These methods include assessment of the trainee's oral presentation delivery skills, which may be accomplished through automatic understanding and processing of social and behavioural cues displayed by the presenter. In this study, we propose an automatic scoring system for presentation delivery skills using a novel active data representation method to automatically rate segments of a full video presentation. While most approaches have employed a two-step strategy consisting of detecting multiple events followed by classification, which involves annotating data to build the different event detectors and generating a data representation from their output for classification, our method does not require event detectors. The proposed data representation is generated in an unsupervised manner using low-level audiovisual descriptors and self-organizing mapping, and is used for video classification. This representation is also used to analyse video segments within a full video presentation in terms of several characteristics of the presenter's performance. The audio representation provides the best prediction results for self-confidence and enthusiasm, posture and body language, structure and connection of ideas, and overall presentation delivery. The video data representation provides the best results for presentation of relevant information with good pronunciation, usage of language according to audience, and maintenance of adequate voice volume for the audience. The fusion of audio and video data provides the best results for eye contact. Applications of the method to the provision of feedback to teachers and trainees are discussed.