ORIGINAL RESEARCH article

Front. Lang. Sci.
Sec. Language Processing
Volume 3 - 2024 | doi: 10.3389/flang.2024.1458887
This article is part of the Research Topic Community Series: Spanish Psycholinguistics - Volume II

Bridging Natural Language Processing and Psycholinguistics: computationally grounded semantic similarity datasets for Basque and Spanish

Provisionally accepted
  • University of the Basque Country, Bilbao, Spain

The final, formatted version of the article will be published soon.

    We present a computationally grounded word similarity dataset based on two well-known Natural Language Processing resources: text corpora and knowledge bases. The dataset aims to fill a gap in psycholinguistic research by providing several quantifications of semantic similarity for an extensive set of noun pairs, controlled for variables that play a significant role in lexical processing. Dataset creation consisted of three steps: 1) computing four key psycholinguistic features for each noun (concreteness, frequency, semantic neighbourhood density and phonological neighbourhood density); 2) pairing nouns across these four variables; and 3) assigning each noun pair three types of word similarity measurements, computed from text, WordNet and hybrid embeddings. The dataset currently covers noun pairs in Basque and European Spanish; future work aims to extend it to more languages.
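
As a rough illustration of the third step, the sketch below assigns one similarity score per embedding type to a noun pair. It is only a sketch under stated assumptions: the abstract does not say which similarity function is used, so cosine similarity is assumed here, and the vectors, the `TOY_SPACES` table and the `pair_similarities` helper are invented for the example and are not taken from the dataset.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings for two Basque nouns ("txakur" = dog,
# "katu" = cat). Real embeddings would have hundreds of dimensions.
TOY_SPACES = {
    "text":    {"txakur": [0.9, 0.1, 0.3], "katu": [0.8, 0.2, 0.4]},
    "wordnet": {"txakur": [0.5, 0.5, 0.0], "katu": [0.4, 0.6, 0.1]},
    "hybrid":  {"txakur": [0.7, 0.3, 0.2], "katu": [0.6, 0.4, 0.2]},
}

def pair_similarities(noun_a, noun_b, spaces):
    """Return one similarity score per embedding space for a noun pair."""
    return {
        name: cosine(vectors[noun_a], vectors[noun_b])
        for name, vectors in spaces.items()
    }

if __name__ == "__main__":
    for source, score in pair_similarities("txakur", "katu", TOY_SPACES).items():
        print(f"{source}: {score:.3f}")
```

Keeping the three scores side by side for each pair reflects the idea described in the abstract: the same noun pair can be compared under corpus-based, knowledge-based and combined representations.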

    Keywords: WordNet, text, psycholinguistic features, word similarity, embeddings, nouns

    Received: 10 Aug 2024; Accepted: 28 Oct 2024.

    Copyright: © 2024 Goikoetxea Salutregi, Arantzeta and SanMartin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Josu Goikoetxea Salutregi, University of the Basque Country, Bilbao, Spain

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.