AUTHOR=Goikoetxea Josu, San Martin Itziar, Arantzeta Miren

TITLE=Bridging Natural Language Processing and psycholinguistics: computationally grounded semantic similarity datasets for Basque and Spanish

JOURNAL=Frontiers in Language Sciences

VOLUME=Volume 3 - 2024

YEAR=2024

URL=https://www.frontiersin.org/journals/language-sciences/articles/10.3389/flang.2024.1458887

DOI=10.3389/flang.2024.1458887

ISSN=2813-4605

ABSTRACT=We present a computationally grounded word similarity dataset built on two well-known Natural Language Processing resources: text corpora and knowledge bases. The dataset aims to fill a gap in psycholinguistic research by providing several quantifications of semantic similarity for an extensive set of noun pairs, controlled for variables that play a significant role in lexical processing. Dataset creation consisted of three steps: (1) computing four key psycholinguistic features for each noun, namely concreteness, frequency, semantic neighbourhood density, and phonological neighbourhood density; (2) pairing nouns across these four variables; and (3) assigning each noun pair three types of word similarity measurements, computed from text, WordNet, and hybrid embeddings. The present dataset covers noun pairs in Basque and European Spanish, and future work intends to extend it to more languages.
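
The third step in the abstract, assigning each noun pair one similarity value per embedding type, can be illustrated with a minimal sketch. The abstract does not state which similarity function was used; cosine similarity is assumed here because it is the common choice for embeddings. The function names, the dictionary layout, and the two-dimensional toy vectors for the Basque nouns "etxe" and "eraikin" are all hypothetical and are not taken from the dataset.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pair_similarities(word_a, word_b, embedding_spaces):
    """Return one similarity score per embedding space for a noun pair.

    embedding_spaces maps a space name (e.g. "text") to a dict that maps
    each word to its vector. This layout is an assumption for illustration.
    """
    return {
        name: cosine_similarity(space[word_a], space[word_b])
        for name, space in embedding_spaces.items()
    }


# Toy, made-up vectors: real embeddings have hundreds of dimensions.
toy_spaces = {
    "text": {"etxe": [0.9, 0.1], "eraikin": [0.8, 0.3]},
    "wordnet": {"etxe": [0.2, 0.7], "eraikin": [0.3, 0.6]},
    "hybrid": {"etxe": [0.6, 0.4], "eraikin": [0.5, 0.5]},
}

scores = pair_similarities("etxe", "eraikin", toy_spaces)
for space_name, score in scores.items():
    print(f"{space_name}: {score:.3f}")
```

Keeping the three embedding spaces in one mapping means each noun pair yields a record with one score per space, which mirrors the three measurements per pair that the abstract describes.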