BRIEF RESEARCH REPORT article

Front. Digit. Health

Sec. Health Informatics

Volume 7 - 2025 | doi: 10.3389/fdgth.2025.1495040

This article is part of the Research Topic Healthcare Text Analytics: Unlocking the Evidence from Free Text, Volume V

A Simplified Retriever to Improve Accuracy of Phenotype Normalizations by Large Language Models

Provisionally accepted
  • 1 Missouri University of Science and Technology, Rolla, United States
  • 2 Missouri State University, Springfield, Missouri, United States

The final, formatted version of the article will be published soon.

    Large language models have shown improved accuracy in phenotype term normalization tasks when augmented with retrievers that suggest candidate normalizations based on term definitions. In this work, we introduce a simplified retriever that enhances large language model accuracy by searching the Human Phenotype Ontology (HPO) for candidate matches using contextual word embeddings from BioBERT without the need for explicit term definitions. Testing this method on terms derived from the clinical synopses of Online Mendelian Inheritance in Man (OMIM®), we demonstrate that the normalization accuracy of GPT-4o increases from a baseline of 62% without augmentation to 85% with retriever augmentation. This approach is potentially generalizable to other biomedical term normalization tasks and offers an efficient alternative to more complex retrieval methods.
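
The retrieval step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the three-dimensional vectors are toy stand-ins for BioBERT embeddings, and the small dictionary is a placeholder for the full HPO vocabulary. The sketch only shows the general idea of ranking ontology terms by cosine similarity to a query term and passing the top candidates to a language model in a prompt.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_candidates(query_vec, ontology_vecs, k=3):
    """Return the k ontology terms most similar to the query, best first."""
    scored = [
        (term, cosine_similarity(query_vec, vec))
        for term, vec in ontology_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_prompt(term, candidates):
    """Assemble a hypothetical prompt that offers the retrieved candidates to an LLM."""
    options = "\n".join(f"- {label}" for label, _ in candidates)
    return (
        f"Normalize the phenotype term '{term}' to the best matching "
        f"ontology term. Candidate terms:\n{options}"
    )


# Toy vectors (assumption): real embeddings would come from an encoder such as BioBERT.
TOY_ONTOLOGY = {
    "Seizure": [0.9, 0.1, 0.0],
    "Ataxia": [0.1, 0.9, 0.1],
    "Hypotonia": [0.0, 0.2, 0.9],
}

query_vector = [0.8, 0.2, 0.1]  # toy embedding for the query term "convulsions"
candidates = retrieve_candidates(query_vector, TOY_ONTOLOGY, k=2)
prompt = build_prompt("convulsions", candidates)
```

In this sketch the language model still makes the final choice; the retriever only narrows the options, which matches the augmentation role the abstract describes.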

    Keywords: phenotype normalization, large language model, small language model, cosine similarity, HPO, OMIM, retrieval-augmented generation

    Received: 11 Sep 2024; Accepted: 12 Feb 2025.

    Copyright: © 2025 Hier, Do and Obafemi-Ajayi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Daniel B Hier, Missouri University of Science and Technology, Rolla, United States

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
