
SYSTEMATIC REVIEW article

Front. Educ., 12 October 2023
Sec. Educational Psychology
This article is part of the Research Topic Education Reimagined: The Impact of Advanced Technologies on Learning

Dynamics of automatized measures of creativity: mapping the landscape to quantify creative ideation

  • Faculty of Education, Psychology and Social Work, University of Lleida, Lleida, Spain

The growing body of creativity research involves Artificial Intelligence (AI) and Machine Learning (ML) approaches to automatically evaluate creative solutions. However, numerous challenges persist in evaluating creativity dimensions and in the methodologies employed for automatic evaluation. This paper addresses this research gap with a scoping review that maps Natural Language Processing (NLP) approaches to the computation of different creativity dimensions. The review has two research objectives covering the scope of automatic creativity evaluation: to identify the different computational approaches and techniques used in creativity evaluation, and to analyze the automatic evaluation of different creativity dimensions. As a first result, the scoping review categorizes the automatic creativity research in the reviewed papers into three NLP approaches, namely: text similarity, text classification, and text mining. This categorization, together with a compilation of the computational techniques used in these NLP approaches, helps clarify their application scenarios, research gaps, limitations, and alternative solutions. As a second result, a thorough analysis of the automatic evaluation of creativity differentiated 25 distinct creativity dimensions. Attending to similarities in definitions and computations, we characterized seven core creativity dimensions, namely: novelty, value, flexibility, elaboration, fluency, feasibility, and others related to playful aspects of creativity. We hope this scoping review provides valuable insights for researchers from psychology, education, AI, and other fields to make evidence-based decisions when developing automated creativity evaluation.

1. Introduction

Creativity as a 21st century skill is increasingly becoming an explicit part of educational policy initiatives and curricula (Plucker et al., 2023). Creativity is a multifaceted concept, and research in this area has made remarkable progress in understanding the different components embedded in creativity phenomena, such as idea generation through collaborative creative (co-creative) processes (Sawyer, 2011, 2022). Furthermore, research also revealed the significance of another important component of creativity: creativity evaluation (Guo et al., 2023), which is the ability to accurately identify creative ideas, solutions, or characteristics among individuals to understand their creative strengths and potential (Kim et al., 2019). In the educational context, creativity evaluation is an essential step for teachers and students because it is helpful to monitor, refine, and implement creative ideas, which could improve students' creative performance in the creative process (Rominger et al., 2022).

Creativity evaluation poses a challenging problem in creativity research. It mainly involves four dimensions: fluency (number of meaningful ideas), flexibility (number of different categories), elaboration (detailed ideas), and novelty (uniqueness of ideas) (Bozkurt Altan and Tan, 2021). To evaluate these creativity dimensions, various manual (paper-based) creativity evaluations and psychological tests have been commonly used (Rafner et al., 2022). Examples are the Torrance Tests of Creative Thinking (Torrance, 2008), the Creativity Assessment Packet (CAP) (Williams, 1980), and Divergent Production abilities (DP) (Guilford, 1967). Other ways to evaluate creativity include rating scales (Gong and Zhang, 2017; Birkey and Hausserman, 2019), surveys and questionnaires (De Stobbeleir et al., 2011; Gong et al., 2019), grading rubrics (Vo and Asojo, 2018), and subjective scoring of creativity dimensions (George and Wiley, 2020). However, these manual creativity evaluations face some challenges, e.g., being error-prone (experts' ratings do not always agree on what is creative) and time-consuming (Said-Metwaly et al., 2017; Doboli et al., 2020). These challenges can be tackled with automated creativity evaluation supported by AI techniques, which can also enrich co-creation by providing real-time feedback to guide students toward novel solutions (George and Wiley, 2020; Kenworthy et al., 2023).

Artificial intelligence (AI) focuses on enabling machines to perform tasks that typically demand human intelligence. Within AI, machine learning (ML) algorithms learn from data to make predictions. Notably, computer vision is used for analyzing figural data, and NLP for analyzing textual data. Given our focus on textual ideas, we concentrate on NLP, which enables machines to comprehend, interpret, analyze, and generate human language (Braun et al., 2017). NLP comprises a variety of approaches and techniques, such as text similarity, text classification, topic modeling, information extraction, and text generation, each with computational techniques spanning from statistical methods to predictive and deep learning models. NLP provides different opportunities to compute variables related to creativity dimensions. Among these, the following five variables can be computed in the vector space provided by NLP: (1) contextual and semantic similarity is applied to measure the uniqueness of ideas and originality (Hass, 2017; Doboli et al., 2020); (2) text clustering can identify different categories in the text; (3) text classification is used to compute novelty (Simpson et al., 2019); (4) keyword searching is mainly used to compute elaboration (Dumas et al., 2021); and (5) information retrieval can be applied to score the level of idea elaboration (Vartanian et al., 2020). These applications of NLP in co-creative processes can be used to automatically evaluate creativity and support co-creation by providing feedback (Bae et al., 2020; Kang et al., 2021; Kovalkov et al., 2021).
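To make the vector-space idea concrete, the following minimal sketch (our illustration, not code from any reviewed study) scores the uniqueness of a new idea as its distance from the nearest existing idea; the random vectors stand in for the embeddings an NLP model would produce.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two idea vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def uniqueness(new_idea: np.ndarray, existing: list) -> float:
    # An idea is more unique the less similar it is to its closest neighbor.
    return 1.0 - max(cosine_similarity(new_idea, e) for e in existing)

rng = np.random.default_rng(0)
existing_ideas = [rng.normal(size=50) for _ in range(10)]  # stand-in embeddings
print(uniqueness(rng.normal(size=50), existing_ideas))
```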

Considering the above implications of NLP, current research focuses on studying how different computational techniques can measure creativity dimensions (Doboli et al., 2020). Research on this topic has been very productive and has produced a range of computational techniques for measuring creativity dimensions, e.g., (1) novelty is measured by keyword similarity (Prasch et al., 2020), part-of-speech tagging (Karampiperis et al., 2014; Camburn et al., 2019), and different ML classifiers, such as Bayesian classifiers, random trees, and Support Vector Machines (SVM) (Manske and Hoppe, 2014; Simpson et al., 2019; Doboli et al., 2020); (2) originality is measured by Latent Semantic Analysis (LSA) (Dunbar and Forster, 2009), Global Vectors for word representation (GloVe) (Dumas et al., 2021), and part-of-speech tagging (Georgiev and Casakin, 2019); (3) fluency is measured by LSA (Dumas and Dunbar, 2014; LaVoie et al., 2020); (4) elaboration is measured by part-of-speech tagging (Dumas et al., 2021); and (5) level of detail is measured by text-mining methods (Camburn et al., 2019).

This study aims to tackle the following four main challenges that current research faces when designing computational techniques to measure creativity: (1) a wide and heterogeneous range of computational techniques is used to evaluate various creativity dimensions; (2) there is no consensus on the use of a specific technique for computing a specific creativity dimension; (3) some studies do not expose and argue the rationale that supports the use of a specific technique to compute a specific creativity dimension, e.g., the evaluation of the category-switch dimension of creativity using LSA (Dunbar and Forster, 2009); and (4) the limitations of computational techniques that could affect the evaluation of creativity dimensions need to be considered (Olivares-Rodríguez et al., 2017; Doboli et al., 2020). To the best of our knowledge, no existing literature review addresses these four challenges. Therefore, this exploration led us to two research questions: (1) What NLP approaches and techniques are used to automatically measure creativity? and (2) What creativity dimensions are computed automatically, and how? These research questions enable us to address the previous four challenges in automatic creativity evaluation. Furthermore, they help to understand the concepts behind NLP approaches and creativity dimensions and their applications in evaluating creativity dimensions, to identify research gaps and limitations, and to propose alternative solutions for advancing the evaluation and promotion of creativity. We chose a scoping review because it helps to understand key concepts and identify knowledge gaps (Munn et al., 2018), with the aim of inspiring innovation and improving the education of future generations through advanced technologies.

2. Research objectives

This scoping review aims to meet the following two objectives.

1. To identify and categorize the different ML approaches used in automatic creativity evaluation, highlighting the application scenarios and limitations of their computational approaches and techniques. This categorization could contribute to a deeper understanding of the contribution that different ML approaches can make to automatic creativity evaluation.

2. To analyze the definition and computation of different creativity dimensions used in automatic creativity evaluation research. This analysis can help establish a joint agreement on creativity dimensions and their computation, which will pave the way for advancements in automatic creativity evaluation.

3. Method

This section describes the sampling method we used to collect and compile state-of-the-art approaches to automatic creativity evaluation. Our methodological framework follows the PRISMA technique (Dickson and Yeung, 2022) for conducting a scoping review, locating relevant and significant research papers by identifying the following four core concepts.

1. Creativity: The articles must be related to creativity, especially the creative process (Sawyer, 2011).

2. Measurement/evaluation/assessment of creativity dimensions.

3. Technology: We selected those studies that are assisted or evaluated with technology support. This core concept aims to review the technological support for creativity evaluation and explore future research in the creative process.

4. Domain: We focused on the creative process as applicable in the educational sector, which helps to enhance students' creativity. Other fields such as medicine, finance, and business were excluded from the search query.

Exploring the current literature in light of the above four core concepts, peer-reviewed journal and conference papers are included in this mapping study. Regarding the time span, we searched from 2005 to 2021; interestingly, according to our inclusion-exclusion criteria, the oldest study included is from 2009, and most are from recent years. This indicates that automatic creativity evaluation has only recently grabbed researchers' attention and is still an open and active research problem.

We excluded articles focused on evaluating a person's or organization's creativity, as well as domains other than education, e.g., medicine and finance. Articles not written in English, articles published before 2005, and articles in which technology plays no role in creativity were also excluded.

For this mapping study, we extracted articles published in Scopus with the search query: [(creativ* OR “Creative Process” OR “Novelty” OR “Flexibility” OR “Fluency” OR “Elaboration” OR “Originality”) AND (Measur* OR Evaluat* OR Asses* OR Calcul* OR Analys* OR Scor* OR Quant*) AND (Automat* OR Comput* OR Machin* OR Natural* OR Artificial* OR Deep learning OR Mathemat* OR Mining) AND (E-learning OR educa* OR Learn* OR School OR students*)].

The search query returned 364 research articles. Applying the inclusion and exclusion criteria while reading the title, abstract, keywords, and conclusion filtered the search to 65 articles. Furthermore, the authors read, checked, and discussed the selected articles and conducted all the screening stages to answer the two research questions. Consensus among the authors was developed by resolving discrepancies, since member checking is a well-established procedure to build up “trustworthiness” in qualitative research (Toma, 2011). After this process, a total of 26 articles were finally included in this scoping review. The overall article selection procedure through the PRISMA technique is depicted in Figure 1.

Figure 1. Screening procedure of the articles using the PRISMA technique.

4. Results

4.1. Approaches and techniques used in automatic creativity evaluation (RQ1)

The compilation of computational approaches and techniques in automatic creativity evaluation research to answer the first research question yields the following three results:

The first result reveals that creativity evaluation research spreads over three different NLP approaches, namely, (1) text similarity, which measures the relatedness and closeness among words, sentences, or paragraphs represented in a numerical space; (2) text classification, a supervised learning approach (it requires training data) that uses ML algorithms [such as the K-nearest neighbor (KNN) algorithm and random forest] to analyze text automatically and assign a set of predefined tags or categories; and (3) text mining, which uses NLP to examine and transform extensive unstructured text data to discover new information and patterns. These three NLP approaches and the computational techniques identified in the studies included in this review are displayed in Figure 2.

Figure 2. Different NLP approaches in creativity evaluation.

As a second result, the scoping review shows that text similarity is the most common approach (69% of the reviewed studies), followed by text classification (27%), and text mining is less commonly used (only 4% of the studies), as shown in Figure 2.

As a third result, our scoping review has identified and categorized the computation techniques used in the three NLP approaches (text similarity, text classification, and text mining) and the creativity dimensions that were evaluated automatically. In the following sections, we present the mapping that we have built after a thorough analysis of all the studies included in the scoping review.

Regarding the text similarity approach, NLP converts textual ideas into a numerical vector space. To do this conversion, the reviewed studies use a wide range of techniques that can be classified into the following three categories: string-based similarity, corpus-based similarity, and knowledge-based similarity. These three categories and their computational techniques identified in the reviewed studies are shown in Figure 3, and Table 1 maps automatic creativity evaluation studies onto the three categories and the techniques used.

Figure 3. Text similarity approaches, categories, sub-categories, and their computational techniques.

Table 1. Categorization of reviewed studies into text similarity approaches and percentages of studies included in the review that use each approach.

In the first category, string-based similarity (6% of the studies using the text similarity approach) matches exact keywords or character strings, e.g., the Longest Common Substring (LCS) or N-grams (subsequences of n items from a given sequence of text). For example, the string similarity of new ideas to existing ideas in a database has been computed using keyword matching (Prasch et al., 2020).
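As an illustration (our own toy example, not code from the reviewed studies), the sketch below implements the two string-based techniques named above, Longest Common Substring and n-gram overlap; because both match surface form only, differently worded but semantically related ideas score as dissimilar.

```python
from difflib import SequenceMatcher

def longest_common_substring(a: str, b: str) -> str:
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of character n-grams."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if (ga | gb) else 0.0

print(longest_common_substring("solar panel roof", "solar panel array"))
print(ngram_overlap("install solar panels", "put photovoltaics on the roof"))  # low despite related meaning
```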

In the second category, corpus-based similarity is the most used (72% of the studies applying text similarity), and the results are presented in Table 1. Corpus-based similarity splits into two sub-categories. On the one hand, statistical models, e.g., LSA, represent the corpus as a word-document matrix, with words as row vectors and each document as a column vector; weighting and dimension-reduction schemes are applied before calculating the cosine similarity among word vectors (Martin and Berry, 2007; Wagire et al., 2020). On the other hand, deep learning models (both word and sentence embeddings) use supervised (they need to be trained on data), semi-supervised, or unsupervised (no prior training) methods trained on a large corpus, e.g., Wikipedia and the Common Crawl dataset. Deep learning models such as Word2Vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014) use knowledge from large datasets, encode the data, and find similarities among words or sentences. The GloVe model showed reliable results compared with experts' scores, especially for single-word creativity tasks (Beaty and Johnson, 2021; Johnson and Hass, 2022).
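The following minimal sketch, assuming scikit-learn and toy ideas of our own, illustrates the statistical sub-category: an LSA-style pipeline that builds a TF-IDF word-document matrix, reduces its dimensionality with truncated SVD, and compares ideas by cosine similarity in the latent space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

ideas = [
    "store energy in home batteries",
    "generate power with rooftop solar panels",
    "capture wind energy with small turbines",
    "battery storage for solar power at home",
]
tfidf = TfidfVectorizer().fit_transform(ideas)              # word-document matrix
latent = TruncatedSVD(n_components=2).fit_transform(tfidf)  # dimension reduction
print(cosine_similarity(latent).round(2))                   # pairwise idea similarity
```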

In the third category, knowledge-based similarity (used in 22% of the studies applying text similarity, as presented in Table 1) uses the knowledge encoded in ontologies to represent textual data on a semantic network graph, whose nodes represent concepts in semantic memory and whose edges represent their relations. Ontologies are dictionaries of millions of lexically associated words, e.g., WordNet, Wikipedia, and DBpedia.
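A hedged sketch of knowledge-based similarity using WordNet (one of the ontologies named above) through NLTK: here, similarity is derived from the path between concepts in the lexical network rather than from corpus statistics.

```python
import nltk
nltk.download("wordnet", quiet=True)  # fetch the WordNet ontology once
from nltk.corpus import wordnet as wn

car = wn.synsets("car")[0]
bicycle = wn.synsets("bicycle")[0]
banana = wn.synsets("banana")[0]

print(car.path_similarity(bicycle))  # nearby concepts -> higher score
print(car.path_similarity(banana))   # distant concepts -> lower score
```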

Text classification is the second NLP approach, used by 27% of the reviewed studies in automatic creativity evaluation, as depicted in Figure 2. Classification is an ML technique that categorizes text into predefined categories. It consists of four main steps: (1) data collection, pre-processing (data acquisition, cleaning, and labeling), and data presentation (feature selection, division into training and testing datasets); (2) applying classifier models; (3) evaluating the classifiers; and (4) prediction (output on the testing data). These four steps are influential factors when applying text classification in automatic creativity evaluation. Table 2 gives an overview of the classification approach, the datasets, classifiers, evaluations, and creativity dimensions in creativity evaluation research.

Table 2. Text classification-based creativity evaluation studies.
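A compact sketch of the four steps, assuming scikit-learn and an invented toy dataset (the studies in Table 2 use far larger labeled datasets): preparation and splitting, a classifier model, evaluation, and prediction.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["use it as a plant pot", "throw it away", "build a lamp from it",
         "recycle it normally", "turn it into a bird feeder", "put it in a drawer"]
labels = ["creative", "common", "creative", "common", "creative", "common"]

# Step 1: data preparation and train/test split
X_train, X_test, y_train, y_test = train_test_split(texts, labels,
                                                    test_size=0.33, random_state=42)
# Step 2: classifier model (TF-IDF features + naive Bayes)
model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(X_train, y_train)
# Step 3: evaluation on held-out data
print(accuracy_score(y_test, model.predict(X_test)))
# Step 4: prediction on a new idea
print(model.predict(["make a sculpture out of it"]))
```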

Text mining is the third approach in automatic creativity evaluation; it is the practice of analyzing a vast collection of textual data to capture key concepts, trends, patterns, and hidden relationships. In this scoping review, a single study used text mining (Dumas et al., 2021). That study used four mining techniques: counting all words, stop-list inclusion (excluding defined terms that are not meaningful), counting parts of speech, and applying inverse document frequency (a weighting technique that emphasizes rare and informative terms).

4.2. Creativity dimensions computed automatically (RQ2)

In the studies included in this scoping review of automatic creativity evaluation, we differentiated 25 different creativity dimensions. These 25 dimensions of creativity are displayed in the second column (Manifestation) of Table 3. We analyzed the similarities in the conceptual definitions and computational approaches employed across studies that consider different dimensions for assessing creativity. This analysis allowed us to categorize these 25 manifestations of creativity into seven core creativity dimensions, namely, novelty, value, flexibility, elaboration, fluency, feasibility, and others related to playful aspects of creativity such as humor or recreational effort, which are displayed in the first column of Table 3 (Core Dimension).

Table 3. Characterization of 25 creativity dimensions into seven core creativity dimensions (first column) and creativity dimensions manifested (second column) based on similarities in definitions (third column) and computation (fourth column).

Furthermore, the results obtained to answer research question two are illustrated in Figure 4, which displays the percentage of the seven core creativity dimensions identified in this review. These results show that novelty is the most evaluated dimension in the studies compiled in this scoping review.

Figure 4. Percentage distribution of each core creativity dimension in the reviewed studies.

5. Discussion

5.1. Approaches and techniques used in automatic creativity evaluation

The scoping review identified three main NLP approaches used in automatic creativity evaluation, namely, (1) text similarity, (2) text classification, and (3) text mining. In the next sections, we discuss the contribution of each computational approach to automatic creativity evaluation, argue their applications, discuss their limitations, identify research gaps, and make further recommendations for automatic creativity evaluations.

Regarding the text similarity approach, the scoping review revealed that it is used in 69% of the studies; text similarity is a measure that helps in understanding creative thinking (Li et al., 2023). Our analysis concluded that the widespread use of textual similarity stems from automatic creativity evaluation being focused mostly on evaluating the originality, novelty, similarity, or diversity dimensions of creativity. Computing these dimensions involves assessing the similarity of an idea to the existing ideas. The text similarity approach provides a variety of computational techniques to measure the similarity of ideas, as shown in Figure 3.

Concerning the three categories of text similarity, namely, string-based similarity, corpus-based similarity, and knowledge-based similarity as set out in Table 1, the scoping review shows differences in the similarity computation process that affect how each category is applied. On the one hand, string-based and knowledge-based similarities have limited application in automatic creativity evaluation because string-based similarity only considers syntactic (not semantic) similarity, and knowledge-based similarity only extracts specific entities from text, such as a person's name, a place, or an amount of money (Camburn et al., 2019). During ideation, the knowledge-based approach might focus on such entities rather than the technical terms or scientific jargon used in sentences solving a scientific challenge. For example, when brainstorming about renewable energy solutions, the knowledge-based approach might not capture specific terms such as “photovoltaics” or “wind turbines.” On the other hand, corpus-based techniques are widely used, so in the following, we elaborate on them.

Corpus-based similarity has been commonly used in automatic evaluation because it provides a wide range of computational techniques, from simple statistical to deep learning models, as shown in Figure 3. A statistical model such as LSA has been applied to examine semantic similarity, memory, and creativity (Beaty and Johnson, 2021), and it has been shown to score originality on divergent thinking tasks more reliably than human raters (Dunbar and Forster, 2009; Dumas and Dunbar, 2014; LaVoie et al., 2020), as shown in Table 1. We argue that LSA and related statistical techniques, including Probabilistic Latent Semantic Analysis (Hofmann, 1999), Latent Dirichlet Allocation (Blei et al., 2003), and Non-Negative Matrix Factorization (Lee and Seung, 1999), have limited applicability because they consider word statistics (e.g., co-occurrence of words) instead of the contextual and semantic meaning of words. These limitations are addressed by deep learning models, which we discuss below.

Recently, drastic changes in NLP research with the development of deep learning models based on deep neural architectures have unlocked ways to model text with more nuance and complexity. This advancement started with word embedding models such as GloVe or Word2Vec, pre-trained on large corpora including Wikipedia, news articles, and web pages. These predictive models use a neural network with one or more hidden layers to learn the vector representations of words. GloVe showed results comparable to human experts' scores in single-word creativity tasks (Beaty and Johnson, 2021; Olson et al., 2021). However, word embedding models do not differentiate between a list of keywords and a meaningful sentence; hence, they cannot capture the semantic and contextual meaning of a whole sentence (idea) in the vector space. The vectorization of the whole sentence is one major innovation in text modeling: the transformer architecture, which utilizes a concept called attention (Vaswani et al., 2017), generally outperforms word embedding models on standard tasks, often by large margins (Wang et al., 2018, 2019). Attention makes it computationally tractable for a transformer model to consider a long sequence of text by selecting the most important parts of the sequence; it allows the training of large models on words and the complex contexts in which those words occur. This development resulted mainly in two categories of models, pre-trained sentence embedding models and text generation models, which are discussed below.
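As a hedged sketch of single-word scoring with word embeddings (the vector source and scoring rule are our assumptions, following the semantic-distance logic above): pretrained GloVe vectors are loaded through gensim's downloader, and the originality of a response to a cue word is scored as one minus their cosine similarity.

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # pretrained GloVe word vectors

def originality(cue: str, response: str) -> float:
    # More semantically distant responses count as more original.
    return 1.0 - float(vectors.similarity(cue, response))

print(originality("brick", "wall"))       # common association -> low score
print(originality("brick", "spaceship"))  # remote association -> higher score
```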

Sentence embedding models vectorize a whole sentence into a vector space that keeps the semantic and contextual meaning of the entire sentence. Some sentence embedding models are unsupervised techniques that do not require external data, e.g., Unsupervised Smooth Inverse Frequency (uSIF) (Ethayarajh, 2018) and Geometric Sentence Embedding (GEM) (Yang et al., 2018). Other transformer-based models allow the tuning of parameters or training on one's own datasets to improve performance (if a large dataset is available), e.g., Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), Sentence Transformers (Reimers and Gurevych, 2019), MPNet (Song et al., 2020), Skip-Thought (ST) (Kiros et al., 2015), InferSent (Conneau et al., 2017), and the Universal Sentence Encoder (USE) (Cer et al., 2018). In creativity research, the USE model has been used to evaluate the novelty of ideas (Kenworthy et al., 2023). We argue that more exploration is needed to apply different sentence embedding models, or combinations of them, to evaluate creative ideas in open-ended co-creation.
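In that spirit, a minimal sketch assuming the sentence-transformers library (the model name is our choice, not one prescribed by the reviewed studies): whole ideas are embedded as sentences, and a new idea's novelty is scored against all existing ideas.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
existing = ["collect rainwater on the roof", "install solar panels on the roof"]
new_idea = "grow an insulating garden on the roof"

emb_existing = model.encode(existing, convert_to_tensor=True)
emb_new = model.encode(new_idea, convert_to_tensor=True)
# Novelty is higher when the idea is semantically far from every existing one.
novelty = 1.0 - float(util.cos_sim(emb_new, emb_existing).max())
print(novelty)
```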

Text generation models generate new text that is similar to a given text prompt, such as the Generative Pre-trained Transformer (GPT-3) (Brown et al., 2020), the Text-to-Text Transfer Transformer (T5) (Raffel et al., 2020), and Long Short-Term Memory (LSTM) networks (Huang et al., 2022). In creativity research, one such generative model, the Generative Adversarial Network (GAN) (Aggarwal et al., 2021), is used by Franceschelli and Musolesi (2022) to evaluate novelty, surprise, and relevance. We present two criticisms regarding the use of text generation models for evaluating open-ended ideas. First, text generation is specialized in generating text from a given text, which is useful for dialogue generation, machine translation, chatbots, and prompt-based learning (Liu et al., 2023). Second, as a model becomes better at generating text with an improved understanding of language, it is more likely to generate text that closely resembles the input data rather than producing more novel or creative outputs. However, text generation models have not been tested at a larger scale in creativity research, so future investigations could help understand these limits.

Finally, two conclusions are drawn from the above discussion. First, for single-word tasks in creativity research, word embedding models can be applied, especially the widely used GloVe model. Word embedding models represent words in a high-dimensional vector space, enabling the computation of their contextual and semantic similarity with other words. Second, for open-ended co-creation resulting in ideas with sentence structure, sentence embedding models can be useful in three ways: (a) in open-ended ideation, most ideas have sentence structure, so these models represent the whole sentence in a vector space, capturing its semantic and contextual meaning; (b) sentence embedding models outperform word embedding models on textual similarity tasks; and (c) sentence embedding models can also be applied to small datasets and open-ended problems because they are pre-trained over large corpora. We therefore recommend not only validating sentence embedding models but also applying text generation models within a broader context of co-creation.

We concluded that sentence embedding models offer a powerful measure that can be used alongside statistical (Acar et al., 2021), word embedding models (Organisciak et al., 2023), and standard subjective scoring methods of the creative process and its output (Kenett, 2019).

The text classification approach refers to the automated categorization or labeling of textual data into predetermined classes or categories using ML classifiers. A large dataset is used for text classification, divided into training and testing sets (the usual ratio is 70% training and 30% testing). An ML classifier learns from the training dataset and then uses the knowledge learned during training to categorize the testing dataset. Therefore, integrating text classification into automatic creativity evaluation depends on four key factors: the dataset, the selection of appropriate ML classifiers, the accuracy of the classifier, and the creativity dimensions being evaluated. These factors in the reviewed studies using the text classification approach are highlighted in Table 2.

When using text classification, it is essential to consider the dataset factor for three reasons. First, the datasets used for classification need pre-processing and labeling: pre-processing includes removing noisy or irrelevant information, and labeling includes giving a class label to each idea. Second, a large dataset is required to train the ML classifiers, since the prediction capability of ML classifiers increases with the amount of training data. All studies reviewed in Table 2 except Stella and Kenett (2019) use more than a thousand ideas or solutions for the classification problem; a smaller dataset may yield poorer or less balanced results. Third, ML classifiers trained on one type of data cannot be applied to another kind of data. For example, classifiers trained on datasets from the linguistic domain cannot be used to test data from the scientific domain.

Furthermore, classifier selection and accuracy are also critical. Regarding classifier selection, the working methods of ML classifiers differ and depend on the nature of the dataset: e.g., SVM works well for multiclass classification, random forest excels in scenarios involving numerical and categorical features, logistic regression works on linear problems, the K-nearest neighbor classifier performs well on text, and the Bayesian approach is a simple and fast algorithm. The reviewed studies lack arguments for their choice of a specific classifier. Regarding accuracy, there is a risk that a chosen classifier will not achieve high accuracy. Different evaluation measures are used to assess model performance, such as the confusion matrix, entropy, and sensitivity, as shown in Table 2. It is advisable to apply several classifiers and use the one with the highest accuracy for prediction in a similar domain, as sketched below.
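A sketch of that suggestion, assuming scikit-learn; the texts and labels are placeholders for a real labeled idea dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["reuse the bottle as a vase", "throw the bottle away"] * 20  # placeholders
labels = ["creative", "common"] * 20

# Compare classifiers by cross-validated accuracy and keep the best one.
for clf in (MultinomialNB(), LogisticRegression(max_iter=1000), LinearSVC()):
    pipe = make_pipeline(TfidfVectorizer(), clf)
    print(type(clf).__name__, cross_val_score(pipe, texts, labels, cv=5).mean())
```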

Finally, the text classification approach can be applied to evaluate different dimensions of creativity; however, it requires a large, labeled dataset, which limits its application in creativity research. We also argue that preparing and labeling the dataset might be expensive, which erodes the advantages of automatic evaluation over manual creativity evaluation, e.g., accuracy, cost, and time. Furthermore, text classification problems are domain-dependent. For creativity tasks such as object use tasks and alternate use tasks, some public datasets are available that could apply to similar tasks. However, the approach is not useful for small and open-ended creative tasks, because the available data are insufficient to train an ML classifier and do not transfer across domains. In short, large dataset preparation, labeling, and domain dependence make the text classification approach less reliable and more expensive than manual creativity evaluation.

Text mining employs statistical NLP computations to discover new information and patterns, using statistical indicators such as word frequency, word patterns, and correlations among words. Dumas et al. (2021) implemented four text-mining techniques and measured the elaboration score in Alternate Use Tasks (AUT). Elaboration was computed in four different ways: (1) the unweighted word count method: counting the number of words; (2) stop-list inclusion: excluding a previously agreed list of stop words; (3) parts-of-speech inclusion: counting verbs, nouns, adjectives, and adverbs; and (4) inverse frequency weighting: weighting words by how common they are in a reference corpus of text.
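The following minimal sketch mirrors those four measures on a toy idea; the stop list, the part-of-speech lookup, and the corpus frequencies are invented stand-ins (a real implementation would use a POS tagger and a reference corpus).

```python
import math

idea = "use the old brick as a very sturdy and decorative doorstop"
tokens = idea.split()

# (1) unweighted word count: every token counts toward elaboration
unweighted = len(tokens)

# (2) stop-list inclusion: drop words agreed in advance to carry no meaning
stop_list = {"the", "a", "as", "and", "very"}
stop_listed = len([t for t in tokens if t not in stop_list])

# (3) parts-of-speech inclusion: count nouns, verbs, adjectives, and adverbs
toy_pos = {"use": "VB", "brick": "NN", "sturdy": "JJ",
           "decorative": "JJ", "doorstop": "NN"}  # toy tagger stand-in
pos_count = sum(1 for t in tokens if t in toy_pos)

# (4) inverse frequency weighting: rare words in a reference corpus weigh more
# (unknown words are treated as maximally common here)
corpus_size = 1000
corpus_freq = {"use": 900, "old": 400, "brick": 30, "doorstop": 2}
idf_score = sum(math.log(corpus_size / corpus_freq.get(t, corpus_size))
                for t in tokens)

print(unweighted, stop_listed, pos_count, round(idf_score, 2))
```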

The above text-mining techniques are the basic statistical operations in NLP. Text mining holds the potential to handle a massive amount of data to discover new information, patterns, trends, relationships, etc., that could be useful in creativity research. Text-mining applications include search engines, product suggestion analysis, social media analytics, and trend analysis.

5.2. Automatically computed creativity dimensions

The scoping review identified 25 creativity dimensions computed automatically. However, our analysis reveals that these creativity dimensions are not sufficiently grounded in previous creativity research and theory; we therefore found some theoretical and methodological inconsistencies that should be tackled in future research. First, some of the creativity dimensions studied in the scoping review are defined and computed by building links with the challenges or the creativity tasks designed for the experiment, rather than with a strong theoretical framework. For example, category switch is defined as the similarity difference between two successive responses in object use tasks (Dunbar and Forster, 2009). Another example is the creativity dimensions of quality (reusability) and usefulness (degree of completion), which are defined and computed in the context of programming problems (Manske and Hoppe, 2014). Second, another source of inconsistency among the dimensions of creativity is the variation in manifestations employed across the reviewed articles. Specifically, dimensions such as novelty (Prasch et al., 2020), similarity (LaVoie et al., 2020), and originality (Beaty and Johnson, 2021) are defined in a similar manner, with a strong focus on the similarity between ideas or solutions. Moreover, these dimensions are often measured using semantic textual similarity, although different computational techniques are employed.

To mitigate these shortcomings, this scoping review has thoroughly analyzed the conceptual and computational framework used in each study, contributing to the emergence of seven core creativity dimensions that could be automatically evaluated and bring more consistency to this research area. These seven core creativity dimensions are novelty, elaboration, flexibility, value, feasibility, fluency, and others related to playful aspects of creativity, such as humor and recreational effort. In the following, we discuss each core creativity dimension identified and highlight the key aspects of its conceptual definition and computational approach.

Novelty is the first core dimension in automatic creativity research and the most evaluated, appearing in 59% of the reviewed studies. Despite this high interest, our review indicates great variety in how novelty is defined and measured. As a consequence, the reviewed studies refer to novelty using the following different words or manifestations: (1) uniqueness: the uniqueness of a concept relative to other concepts (Camburn et al., 2019); (2) originality: how different the outcome is from standard/other solutions (Georgiev and Casakin, 2019) or the semantic distance among ideas (Beaty and Johnson, 2021); (3) similarity: the similarity of meaning between multiple texts (LaVoie et al., 2020) or the similarity distance between texts (Olson et al., 2021); (4) diversity: the diversity of users' entered queries; (5) rarity: rare combinations or rare ideas (Karampiperis et al., 2014) or unique solutions (Doboli et al., 2020); (6) common use: the difference between common and uncommon solutions; (7) surprise: how much an artifact differs from existing attributes (Shrivastava et al., 2017); and (8) influence: the comparison of an artifact with other artifacts (Shrivastava et al., 2017).

Notwithstanding the diversity in labeling and defining the novelty dimension, our analysis identified the following six characteristics that could be included in defining novelty and assist its automatic evaluation: (1) deviation from the standard, routine way of solving a given problem (Manske and Hoppe, 2014); (2) semantic distance between ideas (Beaty and Johnson, 2021); (3) similarity of meaning between multiple texts (LaVoie et al., 2020); (4) semantic similarity of the user query to the concepts in the challenge; (5) combination of properties (Karampiperis et al., 2014); and (6) surprising and unexpected ideas (Shrivastava et al., 2017). These six characteristics involved in the definitions of novelty in the reviewed studies give an account of the complexity of defining the novelty dimension and acknowledge the challenges in developing automatic measures for novelty.

Despite these challenges, the scoping review has highlighted some common computing approaches and techniques to measure novelty as a core dimension, which can be synthesized into the following five characteristics: (1) distance of the new solution to existing solutions (Manske and Hoppe, 2014); (2) semantic distance among ideas (Beaty and Johnson, 2021; Olson et al., 2021); (3) semantic similarity of user queries to relevant concepts in Wikipedia; (4) semantic distance between the clusters in a story; and (5) semantic distance between consecutive fragments of a story (Karampiperis et al., 2014). We conclude that, when developing an automatic evaluation of novelty, the semantic distance of a solution to existing solutions should be considered.

Value is the second core dimension identified in automatic creativity evaluation. The scoping review identified four concepts related to value (Shrivastava et al., 2017; Franceschelli and Musolesi, 2022): (1) overall value, which relates to how an artifact is perceived by society (Georgiev and Casakin, 2019); (2) quality, a concept mainly used for programming solutions that embody specific attributes such as reliability, characterized by error-free operation; maintainability, denoting ease of maintenance; extensibility, encompassing scalability and simplified modification; and adaptability, reflecting the flexibility to integrate new technologies seamlessly (Manske and Hoppe, 2014); (3) usefulness, which is linked to the notion of correctness; and (4) adaptiveness, which pertains to useful solutions that effectively address specific problems (Jimenez-Mavillard and Suarez, 2022). In sum, these four concepts share a common meaning of usefulness and quality that could be considered the value dimension of creativity. Furthermore, from a computer science perspective, value, quality, usefulness, adaptiveness, and style are non-functional characteristics related to quality attributes. These quality attributes have different computations depending on the nature of the task, e.g., the quality and usefulness of programming solutions are computed as the reusability and scalability of computer programs (Manske and Hoppe, 2014), and usefulness as the degree of completing the task (Prasch et al., 2020). Therefore, the value dimension, like the other dimensions, needs clear definitions and computation metrics.

The third core dimension used in automatic creativity evaluation is flexibility. Flexibility refers to one of the key executive functions of creative thinking (Boot et al., 2017); it drives individuals to follow diverse directions, dimensions, and pathways (Acar et al., 2021), making them more likely to produce highly creative ideas (Zhang et al., 2020). Creativity research defines flexibility in two distinct ways. First, it involves category switching (Dunbar and Forster, 2009; Acar et al., 2019; Mastria et al., 2021), which refers to the ability to transition from one semantic concept to another. Second, flexibility is also measured by the number of semantic categories, varieties (Dunbar and Forster, 2009), or topics generated during the creative process. Owing to variations in the definition of flexibility across creativity research, different computational approaches are employed to compute this dimension. On the one hand, flexibility as category switch is a measure of the similarity of one idea to the existing ideas; therefore, semantic similarity approaches are used to evaluate it, such as LSA (Dunbar and Forster, 2009), network graphs (Cosgrove et al., 2021), and sentence embedding models. On the other hand, flexibility as the number of semantic categories, varieties, or topics can be evaluated using text clustering (Sung et al., 2022) or topic modeling techniques [e.g., Latent Dirichlet Allocation (LDA); Chauhan and Shah, 2021] to categorize or extract different topics from the textual ideas, as sketched below. We argue that flexibility as category switch could be the easiest to compute because it requires simple text similarities rather than identifying categories in the text, which involves more variables and algorithms.
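A minimal sketch of the category-count variant, assuming scikit-learn (the ideas, cluster count, and TF-IDF features are our illustrative choices; embeddings would work the same way): cluster a participant's responses and report how many clusters they span.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

ideas = ["use the brick as a paperweight", "use the brick as a doorstop",
         "build a small wall with bricks", "grind the brick into red pigment",
         "use brick dust as paint dye", "stack bricks into a bookshelf"]
X = TfidfVectorizer().fit_transform(ideas)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
flexibility = len(set(labels))  # number of distinct semantic categories used
print(labels, flexibility)
```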

Regarding elaboration as a core creativity dimension in automatic creativity evaluation, it is defined as the degree to which participants embellish their responses (Camburn et al., 2019; Dumas et al., 2021) or add further details, reasoning, or causes to an idea. Automatic creativity evaluation captures the level of detail of an idea by counting the number of words used in the idea (Camburn et al., 2019). The scoping review has identified four different methods for evaluating the level of idea elaboration: (1) counting all words in an idea (unweighted count measures); (2) counting stop words (words that do not carry semantic meaning); (3) counting nouns, verbs, and adverbs; and (4) specifying and counting adjectives (parts-of-speech inclusion) and weighting uncommon words highly (inverse frequency weighting). An idea with more words is considered a more elaborated idea. We argue that these computations of elaboration may not capture conjunctions (Tuzcu, 2021) or reasoning words (Sedova et al., 2019; Hennessy et al., 2020) that add more explanation to the ideas. Therefore, we suggest semantic search to identify words that express reasoning or give a reason for the idea, such as “because,” “therefore,” and “since.”
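A toy sketch of that suggestion (the marker list is ours and merely illustrative; a semantic search would generalize beyond exact matches): flag ideas that contain reasoning connectives signaling an added cause or justification.

```python
REASONING_MARKERS = {"because", "therefore", "since", "so that", "as a result"}

def has_reasoning(idea: str) -> bool:
    text = idea.lower()
    return any(marker in text for marker in REASONING_MARKERS)

print(has_reasoning("use the brick as a doorstop because it is heavy"))  # True
print(has_reasoning("use the brick as a doorstop"))                      # False
```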

Fluency is defined as the number of ideas generated during an ideation process. This scoping review showed that fluency is the core dimension with the greatest consensus on its conceptual definition (number of ideas) and computational approach (counting ideas) (Dumas and Dunbar, 2014; Stella and Kenett, 2019). Creativity research claims that when there are more ideas, there is a greater chance of producing original ideas or products (Dumas and Dunbar, 2014). Fluency measurement is easy to implement and is independent of other dimensions such as elaboration. Compared to novelty and flexibility, which require comparisons with other ideas, fluency can be computed easily on its own.

Feasibility is defined as the extent to which a solution is achievable in real practice (Georgiev and Casakin, 2019). The scoping review found that transcendence and realization have been used as manifestations of feasibility, as they refer to achievement in real practice or transformation into reality (Jimenez-Mavillard and Suarez, 2022). These dimensions share the characteristic of transforming an idea or solution into real practice, which is significant in creativity research. Creativity research highlights the significance of putting ideas into practice; however, the automatic computation of feasibility (Georgiev and Casakin, 2019), transcendence, and realization (Jimenez-Mavillard and Suarez, 2022) does not provide any rationale grounded in that research. Feasibility is mostly a product-oriented dimension used in the ideation process, but identifying ideas that can be transformed into real practice is still a challenge to address. Therefore, this dimension needs further research to automatically measure feasible, transcendent, and realistic ideas.

Finally, other dimensions associated with the playful aspects of creativity, such as humor (Simpson et al., 2019) and recreational effort (Karampiperis et al., 2014), were identified in the reviewed articles. Humor, representing the funniness of ideas, is typically measured through pairwise text comparison techniques, while recreational effort is defined as a solution that is difficult to achieve and is measured using clustering methods. These dimensions contribute to the playful nature of creativity, so it is essential to establish clear definitions and develop suitable computational approaches for them from both psychological and computer science perspectives.

6. Conclusion

This article has the objective of conducting a scoping review of automatic creativity evaluation from creativity and computer science perspectives. To meet this objective, we defined two research questions: The first identifies the NLP approaches and techniques used in automatic creativity, and the second analyzes which and how different creativity dimensions are computed.

The first research question's contributions are multi-fold: (1) identifying the existing ML approaches and techniques in automatic creativity evaluation; (2) categorizing the approaches into different groups for deeper analysis, i.e., text similarity, text classification, and text mining, among which text similarity is the most commonly used; (3) classifying creativity evaluation studies by technique accordingly, e.g., classifying studies in the text similarity approach into string-based, corpus-based, and knowledge-based similarity. Our results showed that corpus-based methods are the most widely used for automatic creativity evaluation; corpus-based techniques such as LSA (Dunbar and Forster, 2009; Dumas and Dunbar, 2014; LaVoie et al., 2020) and the GloVe algorithm (Beaty and Johnson, 2021; Olson et al., 2021) have shown a positive correlation with human experts' similarity scores; (4) identifying the limitations of the techniques and alternative options; for example, statistical and word embedding techniques are generally used, but they cannot capture the semantic and contextual meaning of a whole sentence; and (5) providing a broad overview of existing automatic creativity evaluation to give a deeper understanding of all the approaches. We concluded that word embedding models, especially GloVe, work better for single-word tasks, while for open-ended ideas with sentence structure, sentence embedding models could provide promising results.

The second research question's contributions are also multi-fold: first, we have examined what creativity dimensions are automatically evaluated in the different articles analyzed in this scoping review. In contrast to creativity research, which has standardized tests that evaluate four specific dimensions, 25 different creativity dimensions are found in automatic creativity evaluation. Second, the scoping review has analyzed how these dimensions are defined and measured in automatic creativity evaluation. We found similarities in the definitions and computations of different creativity dimensions. Finally, based on a thorough analysis of the definitions and computations used in the studies, we characterized the 25 dimensions into seven core dimensions. This analysis helps elaborate a coherent and consistent framework about core creativity dimensions and their computation.

The overall contributions of this scoping review bridge the realms of computer science and education. For computer scientists, this review provides insights to refine existing NLP approaches and provides opportunities for developing more novel NLP methods for evaluating and promoting creativity. Meanwhile, educators can use these automatic evaluations as pedagogical tools in real-world classroom practices. The implications of automatic creativity evaluation could help assess and nurture creativity, which is becoming an explicit part of educational policy initiatives and curricula. Ultimately, this scoping review leverages AI as a valuable tool in evaluating and enhancing creativity capable of equipping future citizens with the necessary competencies to generate innovative solutions to the world's complex economic, environmental, and social challenges.

6.1. Limitations and future work

This scoping review has two limitations, which may have influenced our results. The first is the search keyword strategy, which may have been insufficient to include key articles in our field of study. The second is that the exclusion and inclusion criteria may have led to the omission of relevant studies that could have answered our research questions. We tried to mitigate these risks by carefully constructing an inclusive search string and providing explicit inclusion and exclusion criteria agreed upon by consensus among the co-authors.

In future work, building on this scoping review, we intend to design experimental research to evaluate the reliability of deep learning models, such as sentence embedding models, for measuring the novelty of ideas in an open-ended co-creative process. Furthermore, we also suggest using text generation models to recommend diverse hints to improve divergent thinking in the creative process. Regarding the automatic evaluation of creativity dimensions, our review highlighted a remaining research gap in studies that fully automate the main core dimensions of creativity, so we plan to simultaneously measure different core creativity dimensions by evaluating idea datasets with ML techniques. Finally, the development of reliable automatic evaluation of the different dimensions of creativity could be the seed for designing and delivering real-time recommendations during the creative process that could trigger students' creativity.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Author contributions

IU contributed to the conceptualization of the paper, the methodology, and the investigation; he participated in writing the original manuscript, revision, and editing. MP is the principal investigator of the research project and designed the project; she also contributed to the conceptualization of the paper, the methodology, and the investigation, and participated in writing the manuscript, revision, and editing. Both authors contributed to the article and approved the submitted version.

Funding

This research has been funded by the Ministry of Science and Innovation of the Government of Spain under Grants EDU2019-107399RB-I00 and PID2022-139060OB-I00.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Acar, S., Berthiaume, K., Grajzel, K., Dumas, D., Flemister, C., and Organisciak, P. (2021). Applying automated originality scoring to the verbal form of Torrance Tests of Creative Thinking. Gifted Child Quart. 67, 3–17. doi: 10.1177/00169862211061874

Acar, S., Runco, M. A., and Ogurlu, U. (2019). The moderating influence of idea sequence: A re-analysis of the relationship between category switch and latency. Person. Indiv. Differ. 142, 214–217. doi: 10.1016/j.paid.2018.06.013

Aggarwal, A., Mittal, M., and Battineni, G. (2021). Generative adversarial network: An overview of theory and applications. Int. J. Inform. Manage. Data Insights 1, 100004. doi: 10.1016/j.jjimei.2020.100004

Bae, S. S., Kwon, O.-H., Chandrasegaran, S., and Ma, K.-L. (2020). “Spinneret: aiding creative ideation through non-obvious concept associations,” in Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems 1–13. doi: 10.1145/3313831.3376746

Beaty, R. E., and Johnson, D. R. (2021). Automating creativity assessment with SemDis: An open platform for computing semantic distance. Behav. Res. Methods 53, 757–780. doi: 10.3758/s13428-020-01453-w

Birkey, R., and Hausserman, C. (2019). “Inducing creativity in accountants' task performance: The effects of background, environment, and feedback,” in Advances in Accounting Education: Teaching and Curriculum Innovations (Emerald Publishing Limited) 109–133. doi: 10.1108/S1085-462220190000022006

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022. doi: 10.5555/944919.944937

Boot, N., Baas, M., Mühlfeld, E., de Dreu, C. K., and van Gaal, S. (2017). Widespread neural oscillations in the delta band dissociate rule convergence from rule divergence during creative idea generation. Neuropsychologia 104, 8–17. doi: 10.1016/j.neuropsychologia.2017.07.033

Bozkurt Altan, E., and Tan, S. (2021). Concepts of creativity in design-based learning in STEM education. Int. J. Technol. Design Educ. 31, 503–529. doi: 10.1007/s10798-020-09569-y

Braun, D., Hernandez Mendez, A., Matthes, F., and Langen, M. (2017). “Evaluating natural language understanding services for conversational question answering systems,” in Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue (Saarbrucken, Germany: Association for Computational Linguistics) 174–185. doi: 10.18653/v1/W17-5522

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., et al. (2020). Language models are few-shot learners. Adv. Neural Inf. Proc. Syst. 33, 1877–1901. doi: 10.48550/arXiv.2005.14165

Camburn, B., He, Y., Raviselvam, S., Luo, J., and Wood, K. (2019). “Evaluating crowdsourced design concepts with machine learning,” in International Design Engineering Technical Conferences and Computers and Information in Engineering Conference 7. doi: 10.1115/DETC2019-97285

Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., John, R. S., et al. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Chauhan, U., and Shah, A. (2021). Topic modeling using latent dirichlet allocation: A survey. ACM Comput. Surv. 54, 1–35. doi: 10.1145/3462478

Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

Cosgrove, A. L., Kenett, Y. N., Beaty, R. E., and Diaz, M. T. (2021). Quantifying flexibility in thought: The resiliency of semantic networks differs across the lifespan. Cognition 211, 104631. doi: 10.1016/j.cognition.2021.104631

De Stobbeleir, K. E., Ashford, S. J., and Buyens, D. (2011). Self-regulation of creativity at work: The role of feedback-seeking behavior in creative performance. Acad. Manage. J. 54, 811–831. doi: 10.5465/amj.2011.64870144

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dickson, K., and Yeung, C. A. (2022). PRISMA 2020 updated guideline. Br. Dental J. 232, 760–761. doi: 10.1038/s41415-022-4359-7

Doboli, S., Kenworthy, J., Paulus, P., Minai, A., and Doboli, A. (2020). “A cognitive inspired method for assessing novelty of short-text ideas,” in 2020 International Joint Conference on Neural Networks (IJCNN) (IEEE), 1–8. doi: 10.1109/IJCNN48605.2020.9206788

Dumas, D., and Dunbar, K. N. (2014). Understanding fluency and originality: A latent variable perspective. Think. Skills Creat. 14, 56–67. doi: 10.1016/j.tsc.2014.09.003

Dumas, D., Organisciak, P., Maio, S., and Doherty, M. (2021). Four text-mining methods for measuring elaboration. J. Creat. Behav. 55, 517–531. doi: 10.1002/jocb.471

Dunbar, K., and Forster, E. (2009). “Creativity evaluation through latent semantic analysis,” in Proceedings of the Annual Meeting of the Cognitive Science Society, 31.

Ethayarajh, K. (2018). “Unsupervised random walk sentence embeddings: A strong but simple baseline,” in Proceedings of The Third Workshop on Representation Learning for NLP 91–100. doi: 10.18653/v1/W18-3012

Franceschelli, G., and Musolesi, M. (2022). DeepCreativity: measuring creativity with deep learning techniques. Intell. Artif. 16, 151–163. doi: 10.3233/IA-220136

George, T., and Wiley, J. (2020). Need something different? Here's what's been done: Effects of examples and task instructions on creative idea generation. Memory Cogn. 48, 226–243. doi: 10.3758/s13421-019-01005-4

Georgiev, G. V., and Casakin, H. (2019). “Semantic measures for enhancing creativity in design education,” in Proceedings of the Design Society: International Conference on Engineering Design (Cambridge: Cambridge University Press), 369–378. doi: 10.1017/dsi.2019.40

Gong, Z., Shan, C., and Yu, H. (2019). The relationship between the feedback environment and creativity: a self-motives perspective. Psychol. Res. Behav. Manag. 12, 825–837. doi: 10.2147/PRBM.S221670

Gong, Z., and Zhang, N. (2017). Using a feedback environment to improve creative performance: a dynamic affect perspective. Front. Psychol. 8, 1398. doi: 10.3389/fpsyg.2017.01398

Guilford, J. P. (1967). Creativity: Yesterday, today and tomorrow. J. Creat. Behav. 1, 3–14. doi: 10.1002/j.2162-6057.1967.tb00002.x

Guo, Y., Lin, S., Williams, Z. J., Zeng, Y., and Clark, L. Q. C. (2023). Evaluative skill in the creative process: A cross-cultural study. Think. Skills Creativ. 47, 101240. doi: 10.1016/j.tsc.2023.101240

Hass, R. W. (2017). Tracking the dynamics of divergent thinking via semantic distance: Analytic methods and theoretical implications. Memory Cogn. 45, 233–244. doi: 10.3758/s13421-016-0659-y

Hennessy, S., Howe, C., Mercer, N., and Vrikki, M. (2020). Coding classroom dialogue: Methodological considerations for researchers. Learn. Cult. Soc. Interact. 25, 100404. doi: 10.1016/j.lcsi.2020.100404

Hofmann, T. (1999). “Probabilistic latent semantic indexing,” in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval SIGIR'99 (New York, NY, USA: Association for Computing Machinery), 50–57. doi: 10.1145/312624.312649

Huang, R., Wei, C., Wang, B., Yang, J., Xu, X., Wu, S., et al. (2022). Well performance prediction based on long short-term memory (LSTM) neural network. J. Petroleum Sci. Eng. 208, 109686. doi: 10.1016/j.petrol.2021.109686

Jimenez-Mavillard, A., and Suarez, J. L. (2022). A computational approach for creativity assessment of culinary products: the case of elBulli. AI Soc. 37, 331–353. doi: 10.1007/s00146-021-01183-3

Johnson, D. R., and Hass, R. W. (2022). Semantic context search in creative idea generation. J. Creat. Behav. 56, 362–381. doi: 10.1002/jocb.534

Kang, Y., Sun, Z., Wang, S., Huang, Z., Wu, Z., and Ma, X. (2021). “Metamap: Supporting visual metaphor ideation through multi-dimensional example-based exploration,” in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems 1–15. doi: 10.1145/3411764.3445325

Karampiperis, P., Koukourikos, A., and Koliopoulou, E. (2014). “Towards machines for measuring creativity: The use of computational tools in storytelling activities,” in 2014 IEEE 14th International Conference on Advanced Learning Technologies 508–512. doi: 10.1109/ICALT.2014.150

Kenett, Y. N. (2019). What can quantitative measures of semantic distance tell us about creativity? Curr. Opin. Behav. Sci. 27, 11–16. doi: 10.1016/j.cobeha.2018.08.010

Kenworthy, J. B., Doboli, S., Alsayed, O., Choudhary, R., Jaed, A., Minai, A. A., et al. (2023). Toward the development of a computer-assisted, real-time assessment of ideational dynamics in collaborative creative groups. Creativ. Res. J. 35, 396–411. doi: 10.1080/10400419.2022.2157589

Kim, S., Choe, I., and Kaufman, J. C. (2019). The development and evaluation of the effect of creative problem-solving program on young children's creativity and character. Think. Skills Creativ. 33, 100590. doi: 10.1016/j.tsc.2019.100590

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., et al. (2015). “Skip-thought vectors,” in Advances in Neural Information Processing Systems 28.

Kovalkov, A., Paaßen, B., Segal, A., Pinkwart, N., and Gal, K. (2021). Automatic creativity measurement in scratch programs across modalities. IEEE Trans. Learn. Technol. 14, 740–753. doi: 10.1109/TLT.2022.3144442

LaVoie, N., Parker, J., Legree, P. J., Ardison, S., and Kilcullen, R. N. (2020). Using latent semantic analysis to score short answer constructed responses: Automated scoring of the consequences test. Educ. Psychol. Measur. 80, 399–414. doi: 10.1177/0013164419860575

Lee, D. D., and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791. doi: 10.1038/44565

Li, Y., Du, Y., Xie, C., Liu, C., Yang, Y., Li, Y., and Qiu, J. (2023). A meta-analysis of the relationship between semantic distance and creative thinking. Adv. Psychol. Sci. 31, 519. doi: 10.3724/SP.J.1042.2023.00519

Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55, 1–35. doi: 10.1145/3560815

Manske, S., and Hoppe, H. U. (2014). “Automated indicators to assess the creativity of solutions to programming exercises,” in 2014 IEEE 14th International Conference on Advanced Learning Technologies 497–501. doi: 10.1109/ICALT.2014.147

Marrone, R., Cropley, D. H., and Wang, Z. (2022). Automatic assessment of mathematical creativity using natural language processing. Creat. Res. J. 2022, 1–16. doi: 10.1080/10400419.2022.2131209

Martin, D. I., and Berry, M. W. (2007). “Mathematical foundations behind latent semantic analysis,” in Handbook of Latent Semantic Analysis 35–56.

Mastria, S., Agnoli, S., Zanon, M., Acar, S., Runco, M. A., and Corazza, G. E. (2021). Clustering and switching in divergent thinking: Neurophysiological correlates underlying flexibility during idea generation. Neuropsychologia 158, 107890. doi: 10.1016/j.neuropsychologia.2021.107890

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Munn, Z., Peters, M. D., Stern, C., Tufanaru, C., McArthur, A., and Aromataris, E. (2018). Systematic review or scoping review? Guidance for authors when choosing between a systematic or scoping review approach. BMC Med. Res. Methodol. 18, 1–7. doi: 10.1186/s12874-018-0611-x

Olivares-Rodríguez, C., Guenaga, M., and Garaizar, P. (2017). Automatic assessment of creativity in heuristic problem-solving based on query diversity. DYNA 92, 449–455. doi: 10.6036/8243

Olson, J. A., Nahas, J., Chmoulevitch, D., Cropper, S. J., and Webb, M. E. (2021). Naming unrelated words predicts creativity. Proc. Nat. Acad. Sci. 118, e2022340118. doi: 10.1073/pnas.2022340118

Organisciak, P., Newman, M., Eby, D., Acar, S., and Dumas, D. (2023). How do the kids speak? Improving educational use of text mining with child-directed language models. Inf. Learn. Sci. 124, 25–47. doi: 10.1108/ILS-06-2022-0082

Pennington, J., Socher, R., and Manning, C. D. (2014). “GloVe: global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. doi: 10.3115/v1/D14-1162

Plucker, J. A., Meyer, M. S., Karami, S., and Ghahremani, M. (2023). “Room to run: Using technology to move creativity into the classroom,” in Creative Provocations: Speculations on the Future of Creativity, Technology and Learning (Springer) 65–80. doi: 10.1007/978-3-031-14549-0_5

Prasch, L., Maruhn, P., Brünn, M., and Bengler, K. (2020). “Creativity assessment via novelty and usefulness (CANU) – approach to an easy to use objective test tool,” in Proceedings of the Sixth International Conference on Design Creativity (ICDC) 019–026. doi: 10.35199/ICDC.2020.03

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., et al. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 5485–5551. doi: 10.48550/arXiv.1910.10683

Rafner, J., Biskjær, M. M., Zana, B., Langsford, S., Bergenholtz, C., Rahimi, S., et al. (2022). Digital games for creativity assessment: strengths, weaknesses and opportunities. Creat. Res. J. 34, 28–54. doi: 10.1080/10400419.2021.1971447

Reimers, N., and Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Rominger, C., Benedek, M., Lebuda, I., Perchtold-Stefan, C. M., Schwerdtfeger, A. R., Papousek, I., et al. (2022). Functional brain activation patterns of creative metacognitive monitoring. Neuropsychologia 177, 108416. doi: 10.1016/j.neuropsychologia.2022.108416

Said-Metwaly, S., Van den Noortgate, W., and Kyndt, E. (2017). Approaches to measuring creativity: A systematic literature review. Creativity. 4, 238–275. doi: 10.1515/ctra-2017-0013

Sawyer, R. K. (2011). Explaining Creativity: The Science of Human Innovation. Oxford University Press.

Sawyer, R. K. (2021). The iterative and improvisational nature of the creative process. J. Creat. 31, 100002. doi: 10.1016/j.yjoc.2021.100002

Sawyer, R. K. (2022). The dialogue of creativity: Teaching the creative process by animating student work as a collaborating creative agent. Cogn. Instruct. 40, 459–487. doi: 10.1080/07370008.2021.1958219

Sedova, K., Sedlacek, M., Svaricek, R., Majcik, M., Navratilova, J., Drexlerova, A., et al. (2019). Do those who talk more learn more? The relationship between student classroom talk and student achievement. Learn. Instruct. 63, 101217. doi: 10.1016/j.learninstruc.2019.101217

Shrivastava, D., Ahmed, C. G. S., Laha, A., and Sankaranarayanan, K. (2017). A machine learning approach for evaluating creative artifacts. arXiv preprint arXiv:1707.05499.

Simpson, E., Do Dinh, E.-L., Miller, T., and Gurevych, I. (2019). “Predicting humorousness and metaphor novelty with Gaussian process preference learning,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 5716–5728. doi: 10.18653/v1/P19-1572

Song, K., Tan, X., Qin, T., Lu, J., and Liu, T.-Y. (2020). MPNet: Masked and permuted pre-training for language understanding. Adv. Neural Inf. Proc. Syst. 33, 16857–16867. doi: 10.48550/arXiv.2004.09297

Stella, M., and Kenett, Y. N. (2019). Viability in multiplex lexical networks and machine learning characterizes human creativity. Big Data Cogn. Comput. 3, 45. doi: 10.3390/bdcc3030045

Sung, Y.-T., Cheng, H.-H., Tseng, H.-C., Chang, K.-E., and Lin, S.-Y. (2022). Construction and validation of a computerized creativity assessment tool with automated scoring based on deep-learning techniques. Psychol. Aesthet. Creat. Arts. doi: 10.1037/aca0000450

Toma, J. D. (2011). “Approaching rigor in applied qualitative research,” in The SAGE Handbook for Research in Education: Pursuing Ideas as the Keystone of Exemplary Inquiry 263–281. doi: 10.4135/9781483351377.n17

Torrance, E. P. (2008). The Torrance Tests of Creative Thinking Norms—Technical Manual Figural (Streamlined) Forms A and B. Bensenville, IL: Scholastic Testing Service.

Tuzcu, A. (2021). The impact of Google Translate on creativity in writing activities. Lang. Educ. Technol. 1, 40–52.

Vartanian, O., Smith, I., Lam, T. K., King, K., Lam, Q., and Beatty, E. L. (2020). The relationship between methods of scoring the alternate uses task and the neural correlates of divergent thinking: Evidence from voxel-based morphometry. NeuroImage 223, 117325. doi: 10.1016/j.neuroimage.2020.117325

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). “Attention is all you need,” in Advances in Neural Information Processing Systems 30.

Vo, H., and Asojo, A. (2018). Feedback responsiveness and students' creativity. Acad. Exch. Quart. 1, 53–57.

Wagire, A. A., Rathore, A., and Jain, R. (2020). Analysis and synthesis of industry 4.0 research landscape: Using latent semantic analysis approach. J. Manuf. Technol. Manag. 31, 31–51. doi: 10.1108/JMTM-10-2018-0349

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., et al. (2019). “SuperGLUE: A stickier benchmark for general-purpose language understanding systems,” in Advances in Neural Information Processing Systems 32.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Williams, F. (1980). Creativity Assessment Packet (CAP). Buffalo, NY: D. O. K. Publishers Inc.

Yang, Z., Zhu, C., and Chen, W. (2018). Parameter-free sentence embedding via orthogonal basis. arXiv preprint arXiv:1810.00438.

Zhang, W., Sjoerds, Z., and Hommel, B. (2020). Metacontrol of human creativity: The neurocognitive mechanisms of convergent and divergent thinking. NeuroImage 210, 116572. doi: 10.1016/j.neuroimage.2020.116572

Zuñiga, D., Amido, T., and Camargo, J. (2017). “Communications in computer and information science,” in Colombian Conference on Computing (Cham: Springer).

Keywords: review, creativity process, ideation, evaluation, artificial intelligence

Citation: Ul Haq I and Pifarré M (2023) Dynamics of automatized measures of creativity: mapping the landscape to quantify creative ideation. Front. Educ. 8:1240962. doi: 10.3389/feduc.2023.1240962

Received: 15 June 2023; Accepted: 18 September 2023;
Published: 12 October 2023.

Edited by: Mohammad Khalil, University of Bergen, Norway

Reviewed by: Chen-Yao Kao, National University of Tainan, Taiwan; Faisal Saeed, Kyungpook National University, Republic of Korea

Copyright © 2023 Ul Haq and Pifarré. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Manoli Pifarré, manoli.pifarre@udl.cat

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.