AUTHOR=Le Ngoc D., Nguyen Nhung T. H.
TITLE=A metric learning-based method for biomedical entity linking
JOURNAL=Frontiers in Research Metrics and Analytics
VOLUME=8
YEAR=2023
URL=https://www.frontiersin.org/journals/research-metrics-and-analytics/articles/10.3389/frma.2023.1247094
DOI=10.3389/frma.2023.1247094
ISSN=2504-0537
ABSTRACT=
Biomedical entity linking is the task of mapping mentions that occur in a particular textual context to a unique concept or entity in a knowledge base, e.g., the Unified Medical Language System (UMLS). One of the most challenging aspects of entity linking is the ambiguity of mentions, i.e., (1) mentions whose surface forms are very similar but which map to different entities in different contexts, and (2) entities that can be expressed using diverse types of mentions. Recent studies have used BERT-based encoders to encode mentions and entities into distinguishable representations so that their similarity can be measured using distance metrics. However, most real-world biomedical datasets suffer from severe imbalance, i.e., some classes have many instances while others appear only once or are completely absent from the training data. A common way to address this issue is to down-sample the dataset, i.e., to reduce the number of instances of the majority classes to make the dataset more balanced. In the context of entity linking, however, down-sampling reduces the ability of the model to learn comprehensive representations of mentions across different contexts, which is essential for resolving ambiguous mentions. To tackle this issue, we propose a metric-based learning method that treats a given entity and its mentions as a whole, regardless of the number of mentions in the training set. Specifically, our method uses a triplet loss-based function in conjunction with a clustering technique to learn the representations of mentions and entities. Through evaluations on two challenging biomedical datasets, MedMentions and BC5CDR, we show that our proposed method is able to address the issue of imbalanced data and to perform competitively with other state-of-the-art models. Moreover, our method significantly reduces computational cost in both the training and inference steps. Our source code is publicly available.
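
The abstract refers to a triplet loss-based function and to measuring similarity with distance metrics. As an illustration only, the following minimal sketch shows the standard triplet margin loss and nearest-entity linking over toy vectors. It is not the authors' implementation: the function names, the margin value, the Euclidean distance, and the entity identifiers are assumptions made for this example, and the clustering step is not shown.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (an assumption, not the paper's exact loss).

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin`.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def link_mention(mention_vec, entity_vecs):
    """Return the id of the entity whose vector is nearest to the mention."""
    return min(entity_vecs, key=lambda eid: euclidean(mention_vec, entity_vecs[eid]))


# Toy 2-d embeddings; real systems would use encoder outputs instead.
entities = {"ENTITY_A": [0.0, 1.0], "ENTITY_B": [3.0, 4.0]}  # hypothetical ids
mention = [0.0, 0.0]

easy = triplet_loss(mention, entities["ENTITY_A"], entities["ENTITY_B"])
hard = triplet_loss(mention, entities["ENTITY_A"], [1.0, 0.0])
linked = link_mention([0.0, 0.9], entities)
```

In this toy setting, a negative that is already far from the anchor contributes no loss, while a negative at the same distance as the positive is penalized by the full margin, which is what pushes mentions of different entities apart during training.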