- 1Departamento de Psicología, Universidad de Deusto, Bilbao, Spain
- 2Laboratorio de intervención, Bikolabs/Biko, Pamplona, Spain
Artificial Intelligence (AI) is currently present in areas that were, until recently, reserved for humans, such as art. However, to the best of our knowledge, there is little empirical evidence on how people perceive the skills of AI in these domains. In Experiment 1, participants were exposed to AI-generated audiovisual artwork and were asked to evaluate it. We told half of the participants that the artist was a human, and we disclosed to the other half that it was an AI. Although all of them were exposed to the same artwork, the results showed that people attributed lower sensitivity, a lower ability to evoke their emotions, and lower quality to the artwork when they thought the artist was an AI than when they believed the artist was human. Experiment 2 reproduced these results and extended them to a slightly different setting and a different, exclusively auditory, piece of artwork, and added some additional measures. The results show that the evaluation of art seems to be modulated, at least in part, by prior stereotypes and biases about the creative skills of AI. The data and materials for these experiments are freely available at the Open Science Framework: https://osf.io/3r7xg/. Experiment 2 was preregistered at AsPredicted: https://aspredicted.org/fh2u2.pdf.
Introduction
In recent years, Artificial Intelligence (AI) has started to contribute to areas and domains that until now were associated solely with human abilities (Wegner and Gray, 2017), such as writing novels (Jozuka, 2016), painting pictures (Christie’s, 2018), devising magic tricks (Williams and McOwan, 2014), or composing music (Adams, 2010; Deah, 2018). However, there is little empirical evidence on how people perceive the skills of AI in these domains, particularly in the field of art, which is the focus of the present research.
For example, in the case of music, critics and audiences seem not to have received the contribution of AI very well. Let us take the case of David Cope as an example. He is a Professor at the University of California who has been generating musical compositions through Artificial Intelligence for more than two decades. His first public performances, a piece of music in the style of Bach at a contest at the University of Oregon and another piece in the style of Mozart at the Santa Cruz Baroque Festival (Johnson, 1997), were received with rejection, contempt, and even wrath (Friedel, 2018). Even years later, Cope was unable to get recognized musicians to play his compositions publicly (Saenz, 2009). The critics described his work as mere imitation, lacking in meaning and soul (Johnson, 1997). Technological advancements since then do not seem to have changed the perception of the artistic ability of AI, at least in the context of classical music. The public’s dissatisfied reactions and the negative reviews that greeted the AI completions of the unfinished symphonies of Mahler (Zappei, 2019) and Schubert (Mantilla, 2019) confirm this general rejection.
This rejection of the artwork of the machine has also been found in the laboratory, as indicated by the few existing studies on the subject. For instance, Ragot et al. (2020) asked a large sample of participants to evaluate a series of paintings created by humans or by AI. Those created by humans were evaluated more positively than those created by AI in terms of liking, beauty, novelty, and meaning.
Several studies that use a modified form of the Turing Test as their procedure also report undervaluation of AI artwork. For example, Moffat and Kelly (2006) showed some musical pieces to a small sample of participants. They asked their participants to evaluate the musical pieces and to guess whether they had been composed by humans or by computers. Regardless of the genre that they listened to, the participants preferred the works that they guessed had been composed by humans. Similarly, Chamberlain et al. (2018) presented several works of visual art created by humans or by computers and asked their participants to guess their authorship and to evaluate them. When participants liked the artworks, they assumed that the artist was a human. These studies suggest that the general preference for artwork created by humans might not rest on the objective quality of the artwork but on the prejudices that people may hold against art created by machines.
Importantly, in the aforementioned studies, it is not possible to conclude whether the undervaluation of artworks is due to the artwork itself or to prejudices against the capacity of AI in its role as an artist. According to Sundar (2020), people associate certain negative traits and stereotypes with machines, such as inflexibility, emotionlessness, and coldness. For example, in a study concerned with how people value the authenticity of artwork created by AI, Jago (2019) observed that evaluations of authenticity were better when the experimental participants believed that a human had created the artwork. The author used two different measures of authenticity: (a) type authenticity (i.e., whether the artwork was considered authentic enough to be classified as art); and (b) moral authenticity (i.e., whether it reflected the values or motivations of its creator). According to this research, when participants believed that the artist was a human, they rated the artwork as more authentic than when they knew that it was the work of an AI algorithm, but only in terms of moral authenticity, not type authenticity. That is, participants accepted that the AI algorithm’s work was authentic and could be classified as art, but did not consider that the artwork was authentic in the sense of reflecting the artist’s values, motivation, or essence.
Along these lines, in a recent study, Hong et al. (2020) measured the perceived quality of musical pieces composed by AI. After being told whether the piece they were about to hear had been composed by an AI or by a human, the participants listened to it. Then they rated its quality in terms of aesthetic appeal, creativity, and craftsmanship. The participants also indicated their attitudes toward creative AI and the extent to which their pre-listening expectations had been violated. Even though the design of this study included both AI music and human music, the authors did not address whether the artwork of AI and that of humans was valued differently. Nevertheless, they concluded that acceptance of the creative skills of AI would be a necessary requirement for a positive evaluation of its artistic performance.
The discomfort produced by the inclusion of machines in the artistic context could be related to a more general phenomenon known as “algorithm aversion,” observed in decision-making (Shaffer et al., 2013; Dietvorst et al., 2015; Castelo et al., 2019; Yeomans et al., 2019). According to the literature on aversion, people distrust the recommendations of AI algorithms even in cases when their advice is better than that of humans. However, this is not a simple phenomenon. For example, Agudo and Matute (2021) observed that explicit algorithmic recommendations were able to influence voting preferences, but not dating preferences. Quite possibly, people see political decisions as something rational, and therefore susceptible to improvement through algorithmic recommendation. Dating preferences, by contrast, may be regarded as something more subjective and less governed by rationality, which might explain the participants’ resistance to explicit recommendations from machines in this domain. Indeed, the algorithms used by Agudo and Matute were actually able to influence dating preferences, but only when the recommendations were covert rather than explicit (e.g., presenting certain candidates more often than others in order to make them look more familiar).
As a counterpoint, the work of other composing AI artists has received better reviews. This is the case of the death metal band Dadabots, whose music is composed by an artificial neural network and which has a total of 10 albums on the market (Merino, 2019). That is, in some areas, the performance of machines seems to be valued more highly than that of humans. For example, Liu and Wei (2019) found that news articles written by an algorithm were considered more objective and less emotionally involved than those written by humans. This could be because machines are associated not only with negative stereotypes but also with positive traits, such as objectivity, lack of bias, and neutrality (Sundar, 2008). These stereotypes would produce an overvaluation of machine performance over that of humans in some areas, a phenomenon known as algorithm appreciation (Logg et al., 2018), which is the opposite of the aforementioned algorithm aversion.
Despite these contradictory views on the human response to the performance of AI in art, and despite AI’s wide penetration into art, as Chamberlain et al. (2018) state, there is little understanding of how society reacts to AI in the arts, and there is not enough research that addresses, in psychological terms, the relationship between human–computer interaction (HCI) and aesthetics.
For these reasons, the purpose of our research was to test, first, whether people actually report a different experience of art when they know it was created by an AI than when they believe it was created by a human; and second, whether that differential assessment could be attributed to a differential quality of the artwork, or merely to prejudices or biases about its authorship. To this end, Experiment 1 was designed to test whether people exposed to an identical piece of art, composed by an AI, attributed different sensitivity and emotion to it when they knew the artist was an AI than when they believed the artist was human. The purpose of Experiment 2 was twofold. First, it was designed to replicate Experiment 1: even though it reproduced the basic features of Experiment 1, we consider it good practice to obtain convergent results in more than one experiment. In addition, Experiment 2 extended the results of Experiment 1 by testing a slightly different setting and a different artwork, and by adding some additional measures.
Unlike previous studies, in which participants compared artworks made by humans vs. artworks made by AIs, in both of our experiments all participants were exposed only to artwork created by an AI. The critical manipulation was that some of them were told that the artist was a human while the others were told that it was an AI. We predicted that people would attribute to the AI a poorer ability than human artists to perform a piece of artwork with sensitivity, and a weaker ability to evoke emotions in the audience. To the best of our knowledge, this hypothesis had not yet been tested.
Experiment 1
Method
Participants and Materials
We recruited a sample of 249 participants (55% women, 8% unknown) through a snowball procedure, using a WhatsApp message sent to several groups in Spain. These groups also contributed to spreading the message. The WhatsApp message was an invitation to participate in a study on “music and feelings.” It included a link to an online study that was conducted in Spanish on the Qualtrics platform.
The computer program randomly assigned each participant to one of two groups: AI artist (n = 115) and human artist (n = 134). All participants watched the same video (see Footnote 1), in which an AI improvised piano melodies while painting on canvas following the rhythm of the music; the author of the work (whether AI or human) was not visible.
Design and Procedure
Table 1 shows a summary of the experimental design. After accepting the online informed consent, the participants read different instructions for each group before watching the video. These instructions were our experimental manipulation. We told the AI artist group that the artist was an AI (“We introduce you to WCMM, an Artificial Intelligence that improvises at the piano while painting on canvas”/in Spanish: “Te presentamos a WCMM, una Inteligencia Artificial que improvisa al piano mientras pinta sobre un lienzo”), and we told the human artist group that the artists were humans.
To create the artwork needed for this research, we used a type of recurrent neural network known as LSTM (Long Short-Term Memory), which specializes in learning from sequences and allowed us to generate a polyphonic musical improvisation with expressive tempos and dynamics. We did not specify to the participants the model or the type of AI that we used to create the artwork. We simply referred to it with the term artificial intelligence (AI), rather than other terms such as neural network or algorithm, because AI is the term most similar to those used in the aforementioned experiments with human participants.
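For readers unfamiliar with this kind of model, the following is a minimal sketch of an LSTM next-event music model in Python (Keras). It is an illustration only: the paper reports neither the architecture, the event vocabulary, nor the training corpus, so the vocabulary size, layer sizes, and the token-based encoding assumed here are all hypothetical.

```python
# Minimal, hypothetical sketch of an LSTM next-event model for music.
# Events are integer tokens (e.g., note-on/off, velocity, time-shift), a
# common MIDI-style encoding; none of these choices come from the paper.
import numpy as np
import tensorflow as tf

VOCAB = 388    # assumed size of the event vocabulary
SEQ_LEN = 64   # assumed context window of past events

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB, 128),
    tf.keras.layers.LSTM(256),                      # learns sequential structure
    tf.keras.layers.Dense(VOCAB, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Random placeholder data stands in for a real corpus of performances.
x = np.random.randint(0, VOCAB, size=(1000, SEQ_LEN))
y = np.random.randint(0, VOCAB, size=(1000,))
model.fit(x, y, epochs=1, verbose=0)

def improvise(seed, n_events=200, temperature=1.0):
    """Sample events autoregressively; temperature scales expressive variety."""
    events = list(seed)
    for _ in range(n_events):
        context = np.array(events[-SEQ_LEN:])[None, :]
        probs = model.predict(context, verbose=0)[0]
        logits = np.log(probs + 1e-9) / temperature
        probs = np.exp(logits) / np.exp(logits).sum()
        events.append(int(np.random.choice(VOCAB, p=probs)))
    return events
```

Trained on real performance data rather than random tokens, a model of this kind can capture the long-range structure that makes generated improvisations sound coherent.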
The true authorship of the work was hidden from the other group and attributed to human artists. To control for the gender of the human artists, half of the participants in the human artist group were told that the composer and the painter were men (“We introduce you to Javier Aldaz and Miguel Beltrán, two artists improvising on the piano while painting on canvas”/“Te presentamos a Javier Aldaz y Miguel Beltrán, dos artistas que improvisan al piano mientras pintan sobre el lienzo”), and the other half were told that the composer and the painter were women (“We introduce you to Ana Aldaz and María Beltrán, two artists improvising on the piano while painting on canvas”/“Te presentamos a Ana Aldaz y María Beltrán, dos artistas que improvisan al piano mientras pintan sobre lienzo”).
As the artwork to be evaluated, we decided to use an audiovisual format combining musical and visual composition, rather than an audio-only format, to showcase the creative capacity of today’s AI. Using an audiovisual work implied that participants would evaluate the artwork at a multisensory level, which could have different effects than an exclusively auditory or visual piece. Nevertheless, we considered that a video in which an AI improvises at the piano while painting on canvas would adequately show the current potential of AI, and would better match the expectations of participants in the AI artist group, who are likely to associate AI performance in art with multimedia works. In any case, an exclusively auditory piece was tested in Experiment 2.
After watching the video, all participants were asked about the artists’ performance. Previous experiments had focused on the assessment of the AI-generated artwork (Hong et al., 2020) or on the audience’s experience (i.e., whether they liked the performance; e.g., Moffat and Kelly, 2006). We, instead, decided to find out whether there were differences in the emotion evoked during the performance and in the artists’ sensitivity when a non-expert audience thought the work had been performed by an AI or by humans. To this end, we used two simple questions. One question asked them to rate the emotion that the performance had evoked in them (“Now that you have seen the video of this Artificial Intelligence/of these artists, to what degree would you say that it aroused your emotion?”/“Ahora que has visto el vídeo de esta Inteligencia Artificial/de estas artistas/de estos artistas, ¿hasta qué punto dirías que te ha emocionado?”). The other question asked them to rate the sensitivity that they attributed to the artist (“And how would you rate the artist’s sensitivity?”/“¿Y cómo calificarías su sensibilidad?”). As we were interested in the subjective assessment of a non-expert audience, we did not specify how participants should or should not understand the terms “emotion” and “sensitivity”; we simply let them provide their default, subjective answers. Participants answered each of these two questions on a scale labeled from 0 to 10. These two ratings were our dependent variables.
We are aware that Internet-based research can, in principle, raise concerns regarding important aspects, such as the quality of the audio received by each participant, or the conditions under which participants performed the experiment (e.g., at home vs. at a cafeteria). It should be noted, however, that our experimental design ensures that any effect of random or unknown factors should affect both groups of participants equally. That is, because participants were randomly assigned to each group, the only difference between the groups was the independent variable, that is, the instructions they received. Therefore, any differential results observed between the two groups in the sensitivity or emotion ratings can be attributed to the differential instructions. We predicted that the ratings would be lower when participants were instructed that the artist was an AI.
Results and Discussion
First, we verified that there were no differences between the conditions in which the artists were described as men or as women, either with respect to the evoked emotion [men: M = 4.19, SD = 2.66; women: M = 3.99, SD = 2.81; t(132) = 0.44, p = 0.659, d = 0.08] or with respect to the attributed sensitivity [men: M = 5.84, SD = 2.44; women: M = 5.84, SD = 2.79; t(132) = 0.00, p = 1.00, d = 0.00]. Therefore, in the following analyses we collapsed the data for men and women artists in the human artist group.
Consistent with our hypothesis, and as can be seen in Figure 1, the participants reported stronger emotional arousal when they thought the artist was a human than when they thought it was an AI. This was confirmed by a Student’s t-test (see Footnote 2) for independent samples: human artists, M = 4.09, SD = 2.73; AI artist, M = 3.17, SD = 2.45; t(247) = 2.76, p = 0.003, d = 0.35.
Figure 1. Judgment of artists’ evoked emotion and sensitivity by group in Experiment 1. Error bars represent 95% CI.
Likewise, this figure also shows that participants attributed stronger sensitivity to the artist when they thought it was a human than when they thought it was an AI [human artists: M = 5.84, SD = 2.61; AI artist: M = 4.03, SD = 2.66; t(247) = 5.38, p < 0.001, d = 0.68], which is also consistent with our predictions.
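As a worked illustration of this kind of analysis, a minimal sketch in Python follows. The samples are random placeholders generated from the reported sensitivity means and SDs (the real data are at https://osf.io/3r7xg/), and the Welch variant corresponds to the robustness check reported in Footnote 2.

```python
# Sketch of an independent-samples comparison: Student's t, Welch's t,
# and Cohen's d. The arrays are placeholders, not the actual ratings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
human = rng.normal(5.84, 2.61, 134)  # placeholder: human artist group
ai = rng.normal(4.03, 2.66, 115)     # placeholder: AI artist group

t_student = stats.ttest_ind(human, ai)                  # assumes equal variances
t_welch = stats.ttest_ind(human, ai, equal_var=False)   # Welch's correction

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

print(t_student.statistic, t_student.pvalue, cohens_d(human, ai))
```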
The results of this experiment show that knowing that an artificial intelligence authored an audiovisual artwork seems to diminish both how the artwork is experienced and how the artist is valued. These results replicate and extend previous findings on the differential appreciation of art created by humans or by AIs. Importantly, given that the artwork in our experiment was exactly the same in both groups, the results show that the participants’ assessments may rest not on the artwork itself but on their prior prejudices about the artists’ abilities.
Experiment 2
This experiment aimed to replicate the results of Experiment 1, as well as to collect some additional variables that had been included in previous studies, in order to further facilitate the comparison of results and the implications of the present research.
Not many studies have been conducted on how people judge art generated by artificial intelligence. Most of them focus on testing whether the machine can exhibit composing behavior that is indistinguishable from that of a human, using a test similar to the Turing test but in the context of art (Yang et al., 2017). Other studies focus on evaluating whether there are differences in the quality of the musical compositions created by computer models (Pearce and Wiggins, 2007; Chu et al., 2017). The few studies that do focus on evaluating people’s experience with AI-created art, such as our Experiment 1, differ considerably in their purposes and methods.
While some studies reveal the authorship of the AI before the participants judge the artwork (Hong and Curran, 2019; Hong et al., 2020; Ragot et al., 2020), as is the case in our experiments, other studies report it after collecting the participants’ judgments, therefore using a procedure similar to the aforementioned Turing test (Moffat and Kelly, 2006; Chamberlain et al., 2018).
Moreover, there is no uniformity in the variables collected in these studies. Some studies use judgments of the quality of the artwork as their main dependent variable. They may use measures defined by the authors themselves (e.g., originality and aesthetic value in Hong and Curran, 2019; perceived beauty and meaning in Ragot et al., 2020). Hong et al. (2020), in turn, used a validated 9-item scale based on the original 18-item scale from Hickey (1999). That scale was designed to help music teachers evaluate their students’ compositions; thus, its items require a certain professional knowledge of the subject and are very different from the items used by other researchers for the same variable. Furthermore, other studies focus on evaluating the experience of the participants rather than the quality of the artwork itself, with more subjective measures, such as attractiveness (Chamberlain et al., 2018), liking (Moffat and Kelly, 2006; Ragot et al., 2020), or enjoyment of the artwork (Moffat and Kelly, 2006). This is also the case in our experiments, as we assess the emotion experienced and the sensitivity attributed to the artist.
There are also differences among studies in the type of art evaluated. Hong et al. (2020) and Moffat and Kelly (2006) collected the participants’ judgments on AI music, while Chamberlain et al. (2018), Hong and Curran (2019), and Ragot et al. (2020) collected judgments on AI painting. These investigations share some aspects of their procedure as well: they all show artworks created by AIs to one group of participants while the other group evaluates artworks created by human artists. We believe that this design does not allow researchers to know whether it is prejudices about the authorship of the artwork that cause the differences in judgments, or whether it is the artwork itself that is qualitatively different. It is for this reason that in our Experiment 1 both groups evaluated the artwork of an AI, and we used this procedure again in the present experiment. In addition, we incorporated in this experiment some of the aforementioned measures, in order to facilitate comparisons between studies and to better consolidate the results obtained.
In this new experiment, we simplified the artistic stimuli. Instead of using the video artwork, which combined music and painting, we used a purely musical artwork. In addition, we added a new, final phase at the end of the experiment. Its purpose was to evaluate whether the participants would hold on to their judgments when we told them that, contrary to what they were initially told, the author was actually a human (or an AI, depending on the group; see details in the procedure section).
As in Experiment 1, we expected that emotionality and sensitivity would receive lower ratings when participants knew that the artist was an AI than when they were told that the artist was a human.
Method
Participants and Materials
We recruited 250 participants (47.6% women, 0.8% non-binary) over 18 years of age (M = 26.6, SD = 8.62) through the Prolific Academic platform. The sensitivity analysis for this sample size, very similar to that of the previous experiment, showed that we had a power of 80% to detect small effects (d = 0.31). The participants were randomly assigned to one of two groups: AI artist (n = 125) and human artist (n = 125). As in the previous experiment, the artwork shown to all participants was identical, but this time it was a purely musical piece performed by an AI (see Footnote 3). This experiment was conducted in English and was preregistered at AsPredicted: https://aspredicted.org/fh2u2.pdf.
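This kind of sensitivity analysis can be reproduced with standard power tools; a sketch follows. Note that a one-sided (directional) test is assumed here because it recovers the reported d = 0.31; the paper does not state the sidedness of the power computation.

```python
# Sketch of the sensitivity analysis: smallest effect detectable with
# 80% power at alpha = .05, given n = 125 per group (statsmodels).
from statsmodels.stats.power import TTestIndPower

d_min = TTestIndPower().solve_power(effect_size=None, nobs1=125, alpha=0.05,
                                    power=0.80, ratio=1.0,
                                    alternative="larger")  # one-sided: assumed
print(round(d_min, 2))  # -> 0.31 (a two-sided test would give ~0.36)
```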
Design and Procedure
After accepting the online informed consent, participants read different instructions for each group. The AI artist group was told that they would listen to a piece of music composed and performed by an artificial intelligence, while the human artist group was simply told that the piece was composed and performed by an artist (without specifying the gender of the artist or explicitly stating that the artist was human). We did not explicitly mention the term “human” because it might raise participants’ suspicions about the purpose of the study, as suggested by Hong and Curran (2019), and because if people are not warned about potential AI authorship, they assume by default that the artist is human, as can be seen in Chamberlain et al. (2018).
In order to prevent participants from advancing through the questionnaire by mistake if the audio file did not load immediately, and also to ensure that at least part of the piece of music was heard, the button to move on to the next page did not appear until 35 s had elapsed.
After listening to the artwork, all participants were asked to rate the emotion evoked by the music and the sensitivity they attributed to the artist, using the same questions as in Experiment 1. In addition, in this experiment we collected an extra measure of emotion, using the prototypical aesthetic emotions scale from Schindler et al. (2017). This is a 5-point scale that evaluates the intensity with which an emotion is felt, and it includes items, such as fascination, awe, and liking, that were also used in Moffat and Kelly (2006) and Ragot et al. (2020). This allowed us to complement the one-question emotion measure used in Experiment 1 and in this experiment. We also asked participants to rate the quality of the artwork (“What was the quality level of the artwork?”) on a scale labeled from 0 to 10.
Then, participants indicated their degree of agreement with Hong et al.’s (2020) questions on attitudes toward creative AI. That is, their discomfort with the presence of AI in art (“Artificial intelligence that can perform artworks better than humans makes me uncomfortable”; and “I feel bad about myself if I consume art performed by artificial intelligence”), and their judgment of how necessary possessing certain human skills is to compose music (“Composing music is a task that requires the possession of human emotions”; and “Composing music is a task related to, and a very important part of, what it means to be human”).
Next, we asked about participants’ gender and age, and presented two binary questions, one about their experience with artificial intelligence and technology (“Are your work or studies related to technology, artificial intelligence, robots or algorithms?”) and one about their experience with music (“Are your work or studies related to music?”). In addition, participants rated the extent to which they liked classical music because, according to Hong et al. (2020), preference for the musical genre used in the experiment could condition their ratings.
Finally, before debriefing the participants, we added a new phase with respect to Experiment 1. In this new phase, we introduced a cover story opposite to the one that each group had received at the start of the experiment. The AI artist group was now told that the author of the musical piece was actually a human; the human artist group was told that the author of the artwork was actually an AI. Then, we asked participants to rate again the emotion evoked by the artwork and the sensitivity attributed to the artist, using the same one-item questions they had filled out before. The purpose of this phase was to evaluate whether changing the cover story would induce a difference in the subjective ratings of the experience.
Results and Discussion
The results of this experiment replicated those of Experiment 1 with respect to the artist’s sensitivity. Student’s t-tests (see Footnote 4) for independent samples confirmed that participants attributed stronger sensitivity to the artist when they believed that the artist was a human (M = 6.90, SD = 1.73) than in the AI artist group [M = 5.53, SD = 2.39; t(248) = 5.19, p < 0.001, d = 0.66]. However, unlike in Experiment 1, participants did not report less emotion when they knew the artwork was performed by an AI (M = 5.18, SD = 2.43) than when they believed it was performed by a human artist [M = 5.58, SD = 2.02; t(248) = 1.39, p = 0.083, d = 0.18]. It is noteworthy that the mean emotion reported by participants in both groups increased with respect to the previous experiment (human artists: M = 4.09, SD = 2.73; AI artist: M = 3.17, SD = 2.45 in Experiment 1). It is possible that this better reception of the current artwork affected the differences between groups, although many different factors could be responsible.
Nevertheless, the differences in emotion became significant when we analyzed the scores on the scale of prototypical aesthetic emotions. The scores on this scale, which had good internal consistency (α = 0.91), indicated higher emotion in the human artist group (M = 3.09, SD = 0.85) than in the AI artist group [M = 2.90, SD = 0.90; t(248) = 1.72, p = 0.043, d = 0.22]. Along the same lines, we also found differences between the two groups in the assessments of quality. Again, participants in the human artist group (M = 7.33, SD = 1.66) rated the quality of the artwork higher than those in the AI artist group [M = 6.33, SD = 1.99; t(248) = 4.31, p < 0.001, d = 0.55].
As we previously mentioned, according to Hong et al. (2020), the acceptance of creative abilities in AI would be a necessary requirement for its positive evaluation. We therefore analyzed whether this measure correlated with the variables reported previously: emotion, sensitivity, and quality of the artwork. We found a positive correlation between the acceptance of creative abilities in AI and emotion (as measured with the one-item question, r = 0.27, p < 0.001; and as measured with the scale of prototypical aesthetic emotions, r = 0.29, p < 0.001). We also observed a positive correlation between acceptance of creativity and sensitivity attributed to the artist (r = 0.17, p = 0.004); and between acceptance of creativity and the assessment of the quality of the artwork (r = 0.21, p < 0.001).
Furthermore, participants did not express high discomfort with the presence of AI in art (M = 3.45 out of 10, SD = 2.49; internal consistency of the two discomfort items, α = 0.70), although they did feel that composing music involves possessing human skills, such as emotion (M = 6.52 out of 10, SD = 2.43; consistency of the two skills items, α = 0.75). Although these two variables did not correlate with our variables of emotion, sensitivity, or quality of the artwork (ps > 0.05), we found a negative correlation between acceptance of creativity in AIs and feeling discomfort with them (r = −0.34, p < 0.001), and between acceptance of creativity and understanding art as essentially human (r = −0.24, p < 0.001). In summary, the more creative AIs are considered to be, the less discomfort people feel with AIs performing such work, and the less strongly art is associated with exclusively human skills.
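For concreteness, here is a minimal sketch of how these correlation and internal-consistency statistics might be computed in Python. The column names and the randomly generated values are placeholders; the actual dataset is available at https://osf.io/3r7xg/.

```python
# Sketch of the correlational analyses: Pearson's r (scipy) and
# Cronbach's alpha (pingouin). All values below are random placeholders.
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "acceptance": rng.integers(0, 11, 250),    # acceptance of creative AI
    "emotion": rng.integers(0, 11, 250),       # one-item emotion rating
    "discomfort_1": rng.integers(0, 11, 250),  # the two discomfort items
    "discomfort_2": rng.integers(0, 11, 250),
})

r, p = stats.pearsonr(df["acceptance"], df["emotion"])  # reported: r = .27
alpha, ci = pg.cronbach_alpha(df[["discomfort_1", "discomfort_2"]])  # reported: .70
print(round(r, 2), round(p, 3), round(alpha, 2))
```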
Finally, we analyzed whether switching the information about the authorship of the artwork at the end of the experiment (telling the AI artist group that the artist was a human, and the human artist group that the artist was an AI) affected the reported emotion and sensitivity scores. We performed two mixed ANOVAs with the emotion and sensitivity scores as dependent variables, moment of measurement (i.e., before and after the original information was changed to the opposite information about authorship) as the within-subjects factor, and group as the between-subjects factor. With respect to emotion, we found no main effect of moment of measurement [F(1, 248) = 0.78, p = 0.378, ηp² = 0.003] or group [F(1, 248) = 1.75, p = 0.187, ηp² = 0.007], nor a Moment of measurement × Group interaction [F(1, 248) = 0.09, p = 0.769, ηp² = 0.000]. That is, the reported emotion did not change after the participants were told that the author was different from the one they had initially been told about (i.e., human or AI).
We did observe differences in sensitivity (see Figure 2). There was a main effect of moment of assessment [F(1, 248) = 29.3, p < 0.001, ηp² = 0.106] and of group [F(1, 248) = 7.82, p = 0.006, ηp² = 0.031], as well as a Moment of assessment × Group interaction [F(1, 248) = 37.7, p < 0.001, ηp² = 0.132]. Pairwise comparisons showed that participants in the human artist group attributed significantly more sensitivity to the artist before the opposite information was introduced. As can be seen in Figure 2, participants in this group attributed more sensitivity at first, when they believed the artist was human, than once the cover story was changed and they were told that the author was an AI [t(248) = 8.17, p < 0.001, d = 0.65]. There were no statistically significant differences in the AI artist group [t(248) = −0.51, p = 0.956, d = −0.03].
Figure 2. Judgment of the artist’s sensitivity before and after receiving contradictory authorship information. Error bars represent 95% CI. The figure shows the sensitivity attributed to the artist in each group at baseline, as well as the sensitivity attributed after receiving contradictory information (AI authorship in the human group and human authorship in the AI group).
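A 2 × 2 mixed ANOVA of this kind can be sketched as follows in Python with pingouin; the long-format layout and column names are hypothetical, and the random scores are placeholders for the actual ratings.

```python
# Sketch of the 2 (group, between) x 2 (moment, within) mixed ANOVA on
# sensitivity scores, followed by pairwise comparisons (pingouin).
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
n = 250
long_df = pd.DataFrame({
    "participant": np.repeat(np.arange(n), 2),
    "group": np.repeat(np.where(np.arange(n) < 125, "AI artist", "human artist"), 2),
    "moment": np.tile(["before", "after"], n),  # before/after the switched story
    "sensitivity": rng.integers(0, 11, 2 * n).astype(float),
})

aov = pg.mixed_anova(data=long_df, dv="sensitivity", within="moment",
                     between="group", subject="participant")
print(aov[["Source", "F", "p-unc", "np2"]])  # np2 = partial eta squared

post = pg.pairwise_tests(data=long_df, dv="sensitivity", within="moment",
                         between="group", subject="participant")
```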
As there were not many music experts in the sample (n = 23), we were not able to analyze whether their expertise influenced their judgments of emotion and sensitivity. What we did find was that, as Hong et al. (2020) noted, liking for the musical genre used in the experiment correlated positively with emotion (as measured with the one-item question, r = 0.41, p < 0.001; and with the scale of prototypical aesthetic emotions, r = 0.31, p < 0.001). It also correlated with the sensitivity attributed to the artist (r = 0.25, p < 0.001) and with the reported quality of the artwork (r = 0.15, p = 0.008).
In sum, our results replicate and extend those of Experiment 1. Participants attributed less sensitivity to the artist when they knew it was an AI than when they believed it was a human. Although we did not replicate the effect found in Experiment 1 on emotion as captured with the one-question measure, we did observe differences in emotion between groups when it was measured with a more sensitive instrument, the 8-item scale of prototypical aesthetic emotions. The absence of a between-group difference on the one-question emotion measure in this experiment may be related to our having used a different artistic piece (only musical in this experiment vs. musical and pictorial in the previous one). In fact, the piece in this experiment obtained higher ratings than the piece in the previous experiment, which is not surprising, given that AI techniques for musical composition improved substantially in the time between the experiments (1 year and 10 months).
The participants’ reactions to our modification of the cover story at the end of the experiment support the idea that prior expectations about AI influence the assessment of its performance. Participants modified their attribution of sensitivity when they “discovered” that the artwork had actually been created by a human. Furthermore, according to our data, participants who attributed creative abilities to AIs also showed less discomfort with the presence of AI in the arts and regarded human skills as less necessary in this area.
Our results also replicate Hong et al.’s (2020) finding that the attribution of creative abilities to AI and liking for the genre of the music evaluated affect the quality assessment of the artwork. We additionally extend these findings by showing that these factors also affect the reported emotion and the sensitivity attributed to the artist, and we do so with an experimental design in which it is evident that the participants’ judgments are due to the information provided about the authorship of the artwork and not to the artwork itself.
General Discussion
Our results offer empirical evidence that, as already noted by David Cope (Johnson, 1997), the value of an artwork does not rest on the piece itself (which in our experiments was the same for both groups), but on the subjective perceptions and attributions of the audience and on their prior beliefs about the artistic skills of the author. Knowing that an artificial intelligence authored a piece of music seems to reduce the emotion experienced with it, the assessment of the artist’s sensitivity, and the assessment of the quality of the work.
Our results revealed something that can actually be described as a form of cognitive bias affecting the evaluation of music. Cognitive biases are irrational modes of thought that occur in most people under similar circumstances and that affect decisions and preferences in all areas of our life (Kahneman, 2011). In this case, our results show a negative bias in the assessment of art created by AIs and a default preference for human artists. Perhaps knowing about the AI authorship could have activated certain negative traits or stereotypes about AI in the participants (Sundar, 2020), leading them to consider that AI does not have the capabilities necessary to express emotion and convey sensitivity with its art; in short, that AI cannot perform the subjective task of composing music. Similarly, and as we previously noted, in the area of decision-making, Agudo and Matute (2021) observed that people did not accept the explicit recommendations of algorithms in a subjective and emotional decision, such as whom to date, but they did accept explicit algorithmic recommendations on an apparently rational decision, such as whom to vote for in a (fictitious) election.
In any case, it would be interesting for future research to explore additional avenues for this potential undervaluation. For instance, Castelo et al. (2019) suggest that the rejection of artificial intelligence might also be triggered by the scarce presence of algorithms in some environments, by previous beliefs and prejudices about the skills of AI, and by the unease caused by the idea that AI can perform tasks that up to now had only been performed by humans. With respect to this last point, two categories of human skills that can be projected (or not) onto machines have already been proposed (Haslam, 2006; Loughnan and Haslam, 2007). First, attributes of human uniqueness, which distinguish humans from other animals but which are accepted to be shared with machines (i.e., usually of a cognitive nature, such as rationality and logic). Second, skills of human nature, which are assumed to be shared with other animals but not with machines (i.e., those that are emotional in nature, as well as intuition and imagination).
In the field of music, certain attributes, such as creativity, sensitivity, emotion, and even “soul” (Adams, 2010), which are usually assumed to be exclusive to humans and absent from machines, are considered essential requirements for a quality performance. Therefore, the fact that machines show these abilities could lead to a worse assessment of the musical compositions generated by AI, as observed in our experiments.
The present study contributes to our understanding of the sensitivity and the types of creative and emotional abilities that people attribute to AI when valuing its artwork. As Hong et al. (2020) pointed out, prior attitudes toward creative AI could be affecting appraisals of the artwork and the artist. Our experiments have replicated this with an experimental design which clearly demonstrates that the effect found is due to preconceptions about the artist’s skills and not to the artwork, given that the artwork shown was the same for all participants. Art is probably not a field in which the expected objectivity and neutrality of AI may add any positive or extra value to the art piece over what humans already contribute. However, we feel that there is often a tendency to turn the issue of AI’s creative abilities into a dichotomous question: either AI is creative or it is not. We believe, instead, that creative capacity, both in humans and in machines, should be treated as a gradient. Moreover, the growing presence of AI in art, generating creative pieces, often in collaboration with human artists, is proof that we are not facing a question with a dichotomous answer. Soon, this creative capacity of AI will probably not even be questioned, or, if it is, it will quite possibly be discussed in terms of a gradient.
Even so, it appears that there is still a way to go before the active presence of artificial intelligence in such exclusively human domains as art and music becomes familiar and valued. Perhaps the inflection point will come when “authentic humanity” (Adams, 2010) ceases to be considered an indispensable requirement for artistic creation.
Pre-registration
Experiment 2 was pre-registered at AsPredicted.org: https://aspredicted.org/fh2u2.pdf.
Data Availability Statement
The data and materials for this study can be found at Open Science Framework at: https://osf.io/3r7xg/.
Ethics Statement
The Ethics Review Board of the University of Deusto approved the procedure for these experiments as part of a larger project on The Influence of Algorithms on Human Decisions and Judgements. Written informed consent was not requested because the research was online and harmless, participation was anonymous, and participants submitted their responses to the questionnaires voluntarily. No personal information was collected.
Author Contributions
UA, MA, KL, and HM conceived and designed the experiments and revised the manuscript. UA contributed the experimental software. KL contributed the artwork and AI. UA and MA conducted the experiments. UA and HM analyzed the data and wrote the manuscript. All authors contributed to the article and approved the submitted version.
Funding
Support for this research was provided by Grant PSI2016-78818-R from Agencia Estatal de Investigación of the Spanish Government, as well as Grant IT955-16 from the Basque Government. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s Note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Footnotes
1. ^The video shows the digital artistic installation with artificial intelligence, “Water Color Melody Machine,” by one of the authors (KL.): https://player.vimeo.com/video/325421701.
2. ^We also report the results of Welch’s t-test because the assumption of normality was violated according to the corresponding tests. However, since the sample is sufficiently large, the differences are minimal: emotion, t(246) = 2.79, p = 0.003, d = 0.35; sensitivity, t(240) = 5.37, p < 0.001, d = 0.68.
3. ^The listened artwork is available at http://bit.ly/2Onl4DF. It is a shorter version of the original piece a1_98137.mid, generated by the AI Music Transformer (https://magenta.github.io/listen-to-transformer/#a1_98137.mid). The artworks used in the experiment are also available in the Open Science Framework project for this research, at: https://osf.io/3r7xg/.
4. ^As in Experiment 1, we also report the results of Welch’s t-test because the assumptions of normality and homoscedasticity were violated according to the corresponding tests. Again, the differences are minimal: sensitivity, t(226) = 5.19, p < 0.001, d = 0.66; emotion, t(240) = 1.39, p = 0.083, d = 0.18.
References
Adams, T. (2010). David Cope: ‘You pushed the button and out came hundreds and thousands of sonatas’. The guardian. Available at: https://www.theguardian.com/technology/2010/jul/11/david-cope-computer-composer (Accessed July 11, 2010).
Agudo, U., and Matute, H. (2021). The influence of algorithms on political and dating decisions. PLoS One 16:e0249454. doi: 10.1371/journal.pone.0249454
Castelo, N., Bos, M. W., and Lehmann, D. R. (2019). Task-dependent algorithm aversion. J. Mark. Res. 56, 809–825. doi: 10.1177/0022243719851788
Chamberlain, R., Mullin, C., Scheerlinck, B., and Wagemans, J. (2018). Putting the art in artificial: aesthetic responses to computer-generated art. Psychol. Aesthet. Creat. Arts 12, 177–192. doi: 10.1037/aca0000136
Christie’s. (2018). The first piece of AI-generated art to come to auction. Christie’s. Available at: https://www.christies.com/features/A-collaboration-between-two-artists-one-human-one-a-machine-9332-1.aspx (Accessed December 11, 2019).
Chu, H., Urtasun, R., and Fidler, S. (2017). Song from PI: a musically plausible network for pop music generation. ArXiv. [Preprint]. doi: 10.48550/arXiv.1611.03477
Deah, D. (2018). How AI-generated music is changing the way hits are made. The Verge. Available at: https://www.theverge.com/2018/8/31/17777008/artificial-intelligence-taryn-southern-amper-music (Accessed December 9, 2019).
Dietvorst, B. J., Simmons, J. P., and Massey, C. (2015). Algorithm aversion: people erroneously avoid algorithms after seeing them err. J. Exp. Psychol. Gen. 144, 114–126. doi: 10.1037/xge0000033
Friedel, F. (2018). Machines who play (and compose) music. The Friedel Chronicles. Available at: https://medium.com/@frederic_38110/machines-that-play-and-compose-music-70abfe9a8549 (Accessed November 26, 2019).
Haslam, N. (2006). Dehumanization: an integrative review. Pers. Soc. Psychol. Rev. 10, 252–264. doi: 10.1207/s15327957pspr1003_4
Hickey, M. (1999). Assessment rubrics for music composition. Music. Educ. J. 85, 26–52. doi: 10.2307/3399530
Hong, J. W., and Curran, N. M. (2019). Artificial intelligence, artists, and art: attitudes toward artwork produced by humans vs. artificial intelligence. ACM Trans. Multimed. Comput. Commun. Appl. 15, 1–16. doi: 10.1145/3326337
Hong, J. W., Peng, Q., and Williams, D. (2020). Are you ready for artificial Mozart and Skrillex? An experiment testing expectancy violation theory and AI music. New Media Soc. 23, 1920–1935. doi: 10.1177/1461444820925798
Jago, A. S. (2019). Algorithms and authenticity. Acad. Manage. Dis. 5, 38–56. doi: 10.5465/amd.2017.0002
Johnson, G. (1997). Undiscovered Bach? No, a computer wrote it. The New York Times. Available at: https://www.nytimes.com/1997/11/11/science/undiscovered-bach-no-a-computer-wrote-it.html (Accessed November 27, 2019).
Jozuka, E. (2016). A Japanese AI almost won a literary prize. VICE. Available at: https://www.vice.com/en_us/article/wnxnjn/a-japanese-ai-almost-won-a-literary-prize (Accessed December 11, 2019).
Liu, B., and Wei, L. (2019). Machine authorship In situ: effect of news organization and news genre on news credibility. Digit. Journal. 7, 635–657. doi: 10.1080/21670811.2018.1510740
Logg, J. M., Minson, J. A., and Moore, D. A. (2018). Do people trust algorithms more than companies realize? Harv. Bus. Rev. Available at: https://hbr.org/2018/10/do-people-trust-algorithms-more-than-companies-realize (Accessed May 2, 2019).
Loughnan, S., and Haslam, N. (2007). Animals and androids. Psychol. Sci. 18, 116–121. doi: 10.1111/j.1467-9280.2007.01858.x
Mantilla, J. R. (2019). Un algoritmo completa la misteriosa ‘Sinfonía inacabada’ de Schubert. EL PAÍS. Available at: https://elpais.com/cultura/2019/02/04/actualidad/1549284459_079024.html (Accessed November 28, 2019).
Merino, M. (2019). Un canal de Youtube emite de forma constante música ‘death metal’ generada sobre la marcha por una inteligencia artificial. Xataka. Available at: https://www.xataka.com/inteligencia-artificial/canal-youtube-emite-forma-constante-musica-death-metal-generada-marcha-inteligencia-artificial (Accessed December 16, 2019).
Moffat, D. C., and Kelly, M. (2006). “An investigation into people’s bias against computational creativity in music composition,” in Proceedings of the Third Joint Workshop on Computational Creativity. Riva del Garda, Italy.
Pearce, M. T., and Wiggins, G. A. (2007). Evaluating Cognitive Models of Musical Composition. Proceedings of the 4th International Joint Workshop on Computational Creativity. University of London, Goldsmiths, 73–80.
Ragot, M., Martin, N., and Cojean, S. (2020). AI-generated vs. human artworks. A perception bias towards artificial intelligence? Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, 1–10.
Saenz, A. (2009). Music created by learning computer getting better. Singularity Hub. Available at: https://singularityhub.com/2009/10/09/music-created-by-learning-computer-getting-better/ (Accessed November 28, 2019).
Schindler, I., Hosoya, G., Menninghaus, W., Beermann, U., Wagner, V., Eid, M., et al. (2017). Measuring aesthetic emotions: a review of the literature and a new assessment tool. PLoS One 12:e0178899. doi: 10.1371/journal.pone.0178899
Shaffer, V. A., Probst, C. A., Merkle, E. C., Arkes, H. R., and Medow, M. A. (2013). Why do patients derogate physicians who use a computer-based diagnostic support system? Med. Decis. Mak. 33, 108–118. doi: 10.1177/0272989X12453501
Sundar, S. S. (2008). “The MAIN model: a heuristic approach to understanding technology effects on credibility”, in Digital Media, Youth, and Credibility. eds. M. J. Metzger and A. J. Flanagin (Cambridge, MA: MIT Press), 73–100
Sundar, S. S. (2020). Rise of machine agency: a framework for studying the psychology of human–AI interaction (HAII). J. Comput.-Mediat. Commun. 25, 74–88. doi: 10.1093/jcmc/zmz026
Wegner, D. M., and Gray, K. J. (2017). The Mind Club: Who Thinks, What Feels, and Why It Matters (Reprint ed.). United Kingdom: Penguin Books.
Williams, H., and McOwan, P. W. (2014). Magic in the machine: a computational magician’s assistant. Front. Psychol. 5:1283. doi: 10.3389/fpsyg.2014.01283
Yang, L.-C., Chou, S.-Y., and Yang, Y.-H. (2017). MidiNet: a convolutional generative adversarial network for symbolic-domain music generation. ArXiv. [Preprint]. doi: 10.48550/arXiv.1703.10847
Yeomans, M., Shah, A., Mullainathan, S., and Kleinberg, J. (2019). Making sense of recommendations. J. Behav. Decis. Mak. 32, 403–414. doi: 10.1002/bdm.2118
Zappei, J. (2019). AI As Good as Mahler? Austrian orchestra performs symphony with twist. AFP. Available at: https://news.yahoo.com/ai-good-mahler-austrian-orchestra-performs-symphony-twist-102216114.html (Accessed September 9, 2019).
Keywords: human–computer interaction, bias, stereotype, music, art, artificial intelligence
Citation: Agudo U, Arrese M, Liberal KG and Matute H (2022) Assessing Emotion and Sensitivity of AI Artwork. Front. Psychol. 13:879088. doi: 10.3389/fpsyg.2022.879088
Edited by:
Nicola Bruno, University of Parma, Italy
Reviewed by:
Alessandra Cecili Jacomuzzi, Ca’ Foscari University of Venice, Italy
Marius Hans Raab, University of Bamberg, Germany
Copyright © 2022 Agudo, Arrese, Liberal and Matute. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Helena Matute, matute@deusto.es