Discrepancies in ICD-9/ICD-10-based codes used to identify three common diseases in cancer patients in real-world settings and their implications for disease classification in breast cancer patients and patients without cancer: a literature review and descriptive study

Tu, Nora; Henderson, Mackenzie; Sundararajan, Meera; Salas, Maribel

doi:10.3389/fonc.2023.1016389

METHODS article

Front. Oncol. , 06 September 2023

Sec. Breast Cancer

Volume 13 - 2023 | https://doi.org/10.3389/fonc.2023.1016389

This article is part of the Research Topic Methods in Breast Cancer View all 17 articles

Discrepancies in ICD-9/ICD-10-based codes used to identify three common diseases in cancer patients in real-world settings and their implications for disease classification in breast cancer patients and patients without cancer: a literature review and descriptive study

Nora Tu¹

Mackenzie Henderson^1*

Meera Sundararajan¹

Maribel Salas^1,2

¹Epidemiology, Clinical Safety and Pharmacovigilance, Daiichi Sankyo, Inc., Basking Ridge, NJ, United States
²Center for Real-World Effectiveness and Safety of Therapeutics (CREST), University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, United States

Background: International Classification of Diseases, Ninth/Tenth revisions, clinical modification (ICD-9-CM, ICD-10-CM) are frequently used in the U.S. by health insurers and disease registries, and are often recorded in electronic medical records. Due to their widespread use, ICD-based codes are a valuable source of data for epidemiology studies, but there are challenges related to their accuracy and reliability. This study aims to 1) identify ICD-9/ICD-10-based codes reported in literature/web sources to identify three common diseases in elderly patients with cancer (anemia, hypertension, arthritis), 2) compare codes identified in the literature/web search to SEER-Medicare’s 27 CCW Chronic Conditions Algorithm (“gold-standard”) to determine their discordance, and 3) determine sensitivity of the literature/web search codes compared to the gold standard.

Methods: A literature search was performed (Embase, Medline) to find sources reporting ICD codes for at least one disease of interest. Articles were screened in two levels (title/abstract; full text). Analysis was performed in SAS Version 9.4.

Results: Of 106 references identified, 29 were included that reported 884 codes (155 anemia, 80 hypertension, 649 arthritis). Overall discordance between the gold standard and literature/web search code list was 32.9% (22.2% for ICD-9; 35.7% for ICD-10). The gold standard contained codes not found in literature/web sources, including codes for hypertensive retinopathy/encephalopathy, Page Kidney, spondylosis/spondylitis, juvenile arthritis, thalassemia, sickle cell disorder, autoimmune anemias, and erythroblastopenia. Among a cohort of non-cancer patients (N=684,376), the gold standard identified an additional 129 patients with anemia, 33,683 with arthritis, and 510 with hypertension compared to the literature/web search. Among a cohort of breast cancer patients (N=303,103), the gold standard identified an additional 59 patients with anemia, 10,993 with arthritis, and 163 with hypertension. Sensitivity of the literature/web search code list was 91.38-99.96% for non-cancer patients, and 93.01-99.96% for breast cancer patients.

Conclusion: Discrepancies in codes used to identify three common diseases resulted in variable differences in disease classification. In all cases, the gold standard captured patients missed using the literature/web search codes. Researchers should use standardized, validated coding algorithms when available to increase consistency in research and reduce risk of misclassification, which can significantly alter the findings of a study.

1 Introduction

International Classification of Diseases (ICD) coding is one of the oldest efforts to systematically classify and track diseases and mortality (1). Its first edition (the International List of Causes of Death) was released in 1893, and there have since been many revisions to ICD coding led by the World Health Organization (2). The ICD Ninth Revision, Clinical Modification (ICD-9-CM) was adopted in the United States in 1979, and the Tenth Revision, Clinical Modification (ICD-10-CM) was adopted in the United States in 2015 (2). ICD-9-CM and ICD-10-CM coding are modified versions of the WHO’s ICD-9 and ICD-10 coding systems, and are used in a variety of healthcare settings in the United States. They are frequently used by health insurers for the reimbursement of claims related to health care services. They are also recorded in patients’ electronic medical records and are used by many disease registries to record disease state information (3, 4). The presence of ICD-9-CM and ICD-10-CM codes in such a variety of U.S. healthcare data sources has introduced an invaluable source of information for epidemiology studies (5, 6). In research settings, ICD-9-CM and ICD-10-CM codes have been used for many purposes, including classifying patients’ disease status, studying the natural history and outcomes of diseases, and documenting comorbidities (6, 7).

However, the use of ICD-9-CM and ICD-10-CM coding for research is also associated with challenges related to the accuracy and consistency of their use, largely due to widespread and variable usage of the codes in administrative claims in the United States. O’Malley et al. (2005) found several sources of error in their coding, including coder training and experience, quality-control processes in place at healthcare facilities, and intentional or unintentional coding errors (8). Similarly, Liebovitz and Fahrenbach (2018) suggested limitations due to physician time constraints, inability to find codes, and lack of coverage warnings leading physicians to choose different codes, among other limitations (9). Some studies have reported error rates in ICD-9-based coding up to 80% (8). Thus, researchers’ decisions regarding which codes to include in research can potentially have a large impact on study results.

There have been many approaches to address this issue. Some researchers have attempted to create and validate standardized coding algorithms that can be used to identify diseases accurately and reliably in a variety of databases. For example, in 2005, Quan and colleagues (10) created and evaluated several ICD-based coding algorithms to identify common comorbidities such as diabetes and chronic pulmonary disease. In the years since these results were published, many researchers have used these coding algorithms in their own research to accurately identify comorbid diseases (10). Alternatively, some organizations that create or maintain databases provide researchers with their own coding algorithms that researchers can use to identify diseases specifically in their database.

One example of this is the Surveillance, Epidemiology, and End Results (SEER)-Medicare database. SEER-Medicare is a linked database that includes claims data for patients enrolled in Medicare who have a cancer diagnosis. SEER-Medicare provides researchers with a code list (the 27 CCW Chronic Conditions Algorithm) that was developed within SEER-Medicare data and can be used to identify common comorbidities within these data (11). This code list includes not only ICD-9-CM and ICD-10-CM codes, but other codes as well, such as Healthcare Common Procedure Coding System (HCPCS) and Current Procedural Terminology (CPT) codes.

In this study, we utilized a SEER-Medicare breast cancer (BC) dataset to understand the implications of using different coding algorithms to identify common comorbidities in patients with BC. Using the 27 CCW Chronic Conditions algorithm provided by SEER-Medicare as the gold standard to identify comorbidities, we were able to evaluate the implications of using different, often simpler, algorithms that are commonly used for identification of comorbidities in research. For this study, we chose to focus on identification of three common comorbidities in elderly patients with BC: anemia, hypertension, and arthritis.

There were three primary objectives of this literature review and descriptive study. The first objective was to use published literature and online sources to identify and summarize ICD-based codes used to identify anemia, hypertension, and/or arthritis. The second objective was to systematically compare the ICD-based codes identified from the literature/web search to the ICD-9-CM and ICD-10-CM codes included in the SEER-Medicare 27 CCW Chronic Conditions Algorithm (gold standard) to evaluate their discordance. The third objective was to evaluate numerical differences in disease classification in cohorts of breast cancer and non-cancer SEER-Medicare patients using the literature/web search codes compared to the gold standard and determine sensitivity of the literature/web search code list.

2 Materials and methods

2.1 Study design

A literature search was performed in Embase (1980 – 22 February 2021) and Medline (1946 – 22 February 2021) to find literature that reported ICD-9/ICD-10-based codes used to identify at least one of three diseases of interest: anemia, hypertension, and/or arthritis (including both osteoarthritis, OA, and rheumatoid arthritis, RA). The search was limited to articles in English. The full literature search strategy is reported in Supplementary Table 1. Additional sources were evaluated for articles, including PubMed, references of articles retrieved in the literature search, the American Medical Association’s (AMA) official 2019 ICD-10-CM codebook (9), healthcare institution guidance publications (12, 13), and online ICD code look-up tools (14–16).

Publications were eligible for inclusion if they reported ICD-9/ICD-10-based codes used to identify at least one disease of interest, regardless of the primary objectives and methods of the publication. We did not limit inclusion of articles to ICD-9-CM and ICD-10-CM only; other modifications of ICD-coding were included as well. If a publication reported both ICD-based codes and other types of codes (e.g., HCPCS, CPT, or National Drug Codes), it was eligible for inclusion. However, only ICD-based codes were evaluated in this study and all other types of codes were excluded (due to feasibility concerns, inconsistencies in use, and limited usefulness in some databases).

Two levels of article screening were performed by one researcher. In level 1 screening, the titles and abstracts of identified publications were reviewed. Articles that were selected to move on after level 1 screening were then reviewed in level 2 screening, in which the full texts of the articles were reviewed. If there was uncertainty about the decision to include a specific publication, a second researcher was consulted.

The following data were extracted from all included articles: ICD-9/ICD-10-based codes for anemia, hypertension, and arthritis, and code descriptions when reported. If descriptions were not reported, they were extracted from ICD code look-up tools. One researcher performed the data extraction in Microsoft Excel and a second researcher performed quality control on the extracted data. Statistical analysis was performed in SAS Version 9.4.

2.2 Statistical methods

To address the first objective, we summarized the ICD-9/ICD-10-based codes identified from the literature/web search for each disease state and provided brief descriptions of these codes.

To address the second objective, we evaluated and summarized the extent to which the ICD-based codes identified in the literature/web search differed from the ICD-9-CM/ICD-10-CM codes in the SEER Medicare 27 CCW Chronic Conditions Algorithm. This was measured using percent discordance. Concordant codes were defined as ICD-based codes that were in both the 27 CCW Chronic Conditions Algorithm and the literature/web search code list. Discordant codes were defined as ICD-based codes found only in either the 27 CCW Chronic Conditions Algorithm or the literature/web search code lists, but not both. Total codes were defined as any codes found in either the 27 CCW Chronic Conditions Algorithm or the literature/web search (including both concordant and discordant codes). The percent discordant was defined as:

p e r c e n t d i s c o r d a n t = \frac{n u m b e r o f d i s c o r d a n t c o d e s}{t o t a l c o d e s} x 100 %

To address the third objective, we classified cohorts of non-cancer and BC SEER-Medicare patients (2008 – 2016) using the ICD-based codes found in the literature/web search and separately using the 27 CCW Chronic Conditions Algorithm to determine the difference in overall patient counts with each disease when using the different ICD-based code lists. For this analysis, one comprehensive literature/web search code list was created that included all ICD-based codes for any of the three diseases of interest found in any of the 29 references included herein from the literature review. The 27 CCW Chronic Conditions Algorithm was considered the gold standard for this study for multiple reasons, including that it was developed specifically for use in the dataset that we used for this study, and because the literature/web search code list was an aggregated list, and thus did not represent one specific list of codes and has not undergone any validation. Using the 27 CCW Chronic Conditions Algorithm as the gold standard, we calculated sensitivity for the literature/web search code lists for each of the three diseases.

3 Results

3.1 Literature search results

After all duplicates were removed, the literature search retrieved a total of 84 references. Twenty-two additional references were identified through other means, such as searching PubMed and reviewing references of articles identified in the literature search (12–33). Out of a total of 106 references identified, 29 references met the inclusion criteria and were included in this study (34–40). All ICD-9/ICD-10-based codes extracted from the included literature/web search are reported in Tables 1A and B. All tables report a lowercase x in a code to indicate a wildcard, meaning this digit can be replaced with any number. Unless otherwise noted, a code with n wildcard places after a base code includes all codes with up to n digits after the base code (e.g., M16.xx includes M16.x).

TABLE 1A

Table 1A All ICD-9-based codes extracted from the literature/web search and SEER Medicare 27 CCW chronic conditions algorithm.

TABLE 1B

Table 1B All ICD-10-based codes extracted from the literature/web search and SEER Medicare 27 CCW chronic conditions algorithm.

3.2 Discordant code findings

3.2.1 Overall discordance

Overall, 884 total codes were identified from either the literature/web search or SEER Medicare 27 CCW Chronic Conditions Algorithm: 180 were ICD-9-based and 704 were ICD-10-based codes. Of the total codes, 155 (17.5%) were for anemia, 80 (9.1%) were for hypertension, and 649 (73.4%) were for arthritis. There were 291 discordant codes found between the literature/web search code lists and 27 CCW Chronic Conditions Algorithm: 40 discordant ICD-9-based codes and 251 discordant ICD-10-based codes. This resulted in an overall discordance of 32.9% (22.2% for ICD-9-based codes and 35.7% for ICD-10-based codes) between the literature/web codes and the 27 CCW Chronic Conditions Algorithm. Discordant code findings are reported in Tables 2A and B.

TABLE 2A

Table 2A Discordant ICD-9-based codes with code descriptions.

TABLE 2B

Table 2B Discordant ICD-10-based codes with code descriptions.

3.2.2 Anemia discordance

A total of 59 ICD-9-based anemia codes were identified from either the literature/web search or SEER Medicare 27 CCW Chronic Conditions Algorithm. Of these, there was one discordant code that was found in the literature/web search but not the 27 CCW Chronic Conditions Algorithm (Table 2A). This resulted in an overall discordance of 1.7% for ICD-9-based anemia codes. A total of 96 ICD-10-based anemia codes were identified from either the literature/web search or 27 CCW Chronic Conditions Algorithm. Of these, there were 35 discordant codes (30 of which were found only in the 27 CCW Chronic Conditions Algorithm and 5 of which were found only in the literature/web search; Table 2B). This resulted in an overall discordance of 36.5% for ICD-10-based anemia codes.

3.2.3 Hypertension discordance

A total of 60 ICD-9-based hypertension codes were identified from either the literature/web search or the SEER Medicare 27 CCW Chronic Conditions Algorithm. Of these, there were 26 discordant codes (1 of which was found only in the 27 CCW Chronic Conditions Algorithm and 25 of which were only found in the literature/web search; Table 2A). This resulted in an overall discordance of 43.3% for ICD-9-based hypertension codes. A total of 20 ICD-10-based hypertension codes were identified from either the literature/web search or the 27 CCW Chronic Conditions Algorithm. Of these, there were 6 discordant codes (all of which were only found in the 27 CCW Chronic Conditions Algorithm; Table 2B). This resulted in an overall discordance of 30% for ICD-10-based hypertension codes.

3.2.4 Arthritis discordance

For the arthritis code analysis, RA and OA were grouped together to be consistent with the SEER Medicare 27 CCW Chronic Conditions Algorithm, which includes only one overall group for arthritis. A total of 61 ICD-9-based arthritis codes were identified from either the literature/web search or the 27 CCW Chronic Conditions Algorithm. Of these, there were 13 discordant codes (7 were found only in the 27 CCW Chronic Conditions Algorithm and 6 were found only in the literature/web search; Table 2A). This resulted in an overall discordance of 21.3% for ICD-9-based arthritis codes. A total of 588 ICD-10-based arthritis codes were identified from either the literature/web search or the 27 CCW Chronic Conditions Algorithm. Of these, there were 210 discordant codes (182 were found only in the 27 CCW Chronic Conditions Algorithm and 28 were found only in the literature/web search; Table 2B). This resulted in an overall discordance of 35.7% for ICD-10-based arthritis codes.

3.3 Most frequently identified codes

The most frequent concordant ICD-9/ICD-10-based codes overall (i.e., those identified in both the literature/web search and the SEER-Medicare 27 CCW Chronic Conditions Algorithm), are reported in Supplementary Table 2. The most frequently identified anemia codes were for unspecified anemia, anemia of chronic illness or blood loss, and deficiency anemias (including iron and vitamin B12). The most frequently identified hypertension codes were for malignant or benign essential/primary hypertension, hypertensive heart disease, hypertensive chronic kidney disease (CKD), hypertensive heart disease and CKD, and secondary hypertension. The most frequently identified arthritis codes were for rheumatoid arthritis and variations thereof (e.g., with visceral involvement, with rheumatoid myopathy; Supplementary Table 2), osteoarthritis and variations thereof (e.g., of the hip, of the knee; Supplementary Table 2), rheumatoid bursitis or nodules, Felty’s syndrome, and adult-onset Still’s Disease.

The most frequently identified discordant codes found only in literature/web sources are listed in Supplementary Table 3. The most commonly found discordant ICD-9-based codes in the literature/web search included certain hypertensive disorders associated with pregnancy and childbirth and certain arthropathies/polyarthropathies (Supplementary Table 3). The most common discordant ICD-10-based codes in the literature/web search included certain codes for rheumatoid lung disease with RA, RA of unspecified sites, inflammatory polyarthropathy, and certain codes for OA of unspecified sites.

3.4 Classification of non-cancer and breast cancer patient cohorts in the SEER-Medicare database

Finally, to address the third objective of this study we evaluated the numerical differences in disease classification in two cohorts of patients in SEER-Medicare (non-cancer patients and BC patients) using the literature/web search codes compared to the SEER-Medicare 27 CCW Chronic Conditions Algorithm codes. These results are presented in Tables 3A, B. For non-cancer patients, the 27 CCW Chronic Conditions Algorithm identified 129 additional patients with anemia (p=0.83), 510 additional patients with hypertension (p=0.27), and 33,683 additional patients with arthritis (p<0.0001) that were not identified using the literature/web code list. Using the 27 CCW Chronic Conditions Algorithm as the gold standard, the comprehensive literature/web search code list had a 99.96% sensitivity to identify anemia in non-cancer patients, 99.91% sensitivity to identify hypertension in non-cancer patients, and 91.38% sensitivity to identify arthritis (including both OA and RA) in non-cancer patients. For BC patients, the 27 CCW Chronic Conditions Algorithm identified 59 additional patients with anemia (p=0.88), 163 additional patients with hypertension (p=0.66), and 10,993 additional patients with arthritis (p<0.0001) that were not identified using the literature/web code list. Using the 27 CCW Chronic Conditions Algorithm as the gold standard, the comprehensive literature/web search code list had a 99.96% sensitivity to identify anemia in BC patients, 99.92% sensitivity to identify hypertension in BC patients, and 93.01% sensitivity to identify arthritis in BC patients.

TABLE 3A

Table 3A Total number of non-cancer SEER-Medicare patients (N=684,376) classified with each disease of interest using codes found in the literature/web search code list compared to the SEER-Medicare 27 CCW chronic conditions algorithm.

TABLE 3B

Table 3B Total number of SEER-Medicare breast cancer patients N=303,103 classified with each disease of interest using codes found in the literature/web search code list compared to the SEER-Medicare 27 CCW chronic conditions algorithm.

4 Discussion

A total of 884 codes were identified for anemia, hypertension, and arthritis. The majority of these codes were ICD-10-based codes (n=704), and the remainder were ICD-9-based codes (n=180). The discrepancy between number of codes in the ninth and tenth revisions was expected, given that there are almost five times as many ICD-10-CM codes as there are ICD-9-CM codes, largely due to differences in grouping and specificity between the ICD versions (6). The most common codes identified for anemia were for anemias of chronic illness or blood loss, unspecified anemias, and deficiency anemias. The most common codes identified for hypertension were for malignant or benign essential/primary hypertension, secondary hypertension, and hypertensive heart disease and/or hypertensive CKD. Finally, the most common codes for arthritis were for OA and variations thereof, RA and variations thereof, rheumatoid bursitis or nodules, Felty’s syndrome, and adult-onset Still’s Disease.

When the literature/web search code lists were compared to the SEER-Medicare 27 CCW Chronic Conditions Algorithm, there was variable discordance. Discordance for all codes was less than 50% (overall discordance was 32.9%), and higher discordance was observed for hypertension compared to either anemia or arthritis. Discordance for ICD-9-based codes ranged from 1.7% - 43.3% and discordance for ICD-10-based codes ranged from 30% - 36.5%. There were several codes included in the 27 CCW Chronic Conditions Algorithm that were not found in literature/web sources. These included certain codes for hypertensive retinopathy/encephalopathy, Page kidney, thalassemia, sickle cell disorders, autoimmune hemolytic anemia, erythroblastopenia, spondylitis/spondylosis, and juvenile arthritis conditions. On the other hand, the most common codes found only in the literature/web search included certain codes related to hypertensive disorders of pregnancy/childbirth, certain arthropathies/polyarthropathies, rheumatoid lung disease with RA, and RA of unspecified sites, (Supplementary Table 3).

There are many possible reasons for the differences between the codes included in the literature/web search code list and the 27 CCW Chronic Conditions Algorithm. Specific codes included in any given study may be driven largely by the population of interest. This is demonstrated clearly by the fact that pregnancy-related hypertensive disorders were not found in the 27 CCW Chronic Conditions Algorithm. Because the SEER-Medicare database primarily contains information about older adults (≥65 years old), codes related to pregnancy are less relevant in these patients, which may be why they were excluded. Interestingly, when examining codes found only in the 27 CCW Chronic Conditions Algorithm, there were several codes related to juvenile arthritis. As previously noted, SEER-Medicare includes data on mostly older individuals, so the rationale for including these codes in the code list is unclear. It is possible that since some types of juvenile arthritis are chronic diseases that persist into adulthood, they may remain relevant in older populations (41).

Furthermore, the exact codes used in a study may be based on the specific database being used, or based on previous research that has validated the use of specific codes to identify the disease of interest. As an example, a 2011 article by Kim et al. (19) performed a validation of several code lists to identify RA in Medicare claims data. Since this initial validation, this paper has been cited by over 150 articles, many of which used one of Kim et al.’s code lists to identify RA in their own research (19, 42–45). This indicates that a researcher’s decision about which codes to use may be based on previous work done to validate those codes in the same or similar databases.

A third potential reason for the differences seen may be due to variable consultation of clinical or coding experts when developing codes lists for specific diseases. When examining the codes found in the literature/web search and the SEER-Medicare 27 CCW Chronic Conditions Algorithm, each contained codes that did not explicitly match the disease name, but may have been included because a clinical expert deemed them appropriate. For example, under the scope of arthritis, the 27 CCW Chronic Conditions Algorithm includes codes for spondylosis and adult-onset Still’s disease. Professionals in medical coding and clinicians who specialize in a particular area of practice may be knowledgeable about common coding practices and diseases that share common features and may be able to use this knowledge to ensure face validity of code lists (46, 47).

Finally, another possible reason for the differences observed may be due to variation in the use of specific codes over time. Common ICD-9-/ICD-10-based coding practices or reimbursement policies for any given disease state may change over time, and this would in turn necessitate a change in the codes used to identify the given disease in a healthcare database. In addition, the code version used in the United States changed in 2015 from ICD-9-CM to ICD-10-CM. Thus, depending on the years included in a specific study, it may be necessary to include one or both of these code versions. These issues may also account for some of the differences in code lists observed in this study.

Regardless of the specific reasons for the variation in coding algorithms, the differences can result in important differences in patient classification. When we classified two cohorts of patients in SEER-Medicare, the literature/web search code list had between 91.38% – 99.96% sensitivity in identifying non-cancer patients and 93.01% - 99.96% sensitivity in identifying BC patients with the three diseases of interest. While the overall sensitivity was high, it should be noted that the sensitivity for the code lists used in individual studies may have been significantly lower than the overall sensitivity, given that we combined all 29 literature/web search code lists into one list for analysis. Interestingly, percent discordance did not necessarily correspond to lower sensitivity. While the highest discordance was identified for hypertension, the lowest overall sensitivity was seen for arthritis: the SEER-Medicare 27 CCW Chronic Conditions Algorithm identified a significant additional number of patients with arthritis in the non-cancer cohort (33,683 additional patients; p<0.0001) and in the BC cohort (10,993 additional patients p<0.0001) that were not identified with the literature/web search codes.

Using the 27 CCW Chronic Conditions Algorithm as the gold standard, the literature/web search code list misclassified a significant number of patients with arthritis. Because these codes are often used to assign patients to exposure or outcome groups, or used for subgroup analyses in epidemiology studies, issues of misclassification can affect the clinical interpretation of a study’s results. Whether this misclassification is differential or non-differential may depend on the study design and data source used. If misclassification occurs proportionally between the groups being compared to each other, this will result in non-differential misclassification and will bias the study results towards the null. The extent to which this is an issue for any particular study will depend largely on the disease of interest, its common coding practices, and the database used. However, the differences can be substantial, and this example offers a clear illustration of why researchers must carefully evaluate and determine which codes to include in their research.

This study has a few limitations. The gold standard (the 27 CCW Chronic Conditions Algorithm) and the ICD-10-CM coding system are frequently updated. We used the version of the 27 CCW Chronic Conditions Algorithm that was developed using data through 2016, aligning with the specific dataset used in this study. For this reason, we were able to use it as a gold standard for this study, but this algorithm has since been updated and our results may not reflect the most recent algorithms or ICD-10-CM coding. This study focused on the evaluation of how the literature/web search code list performed against the gold standard, but we were unable to evaluate the performance of the gold standard itself. It should also be noted that this algorithm undergoes continual updating to reflect the most current coding practices and understanding of the relevant disease states. In addition, definitions used to determine disease status in epidemiology studies are not limited to codes (e.g., ICD-9-CM or ICD-10-CM codes), but may incorporate other rules. These can include requirements for patients to have more than one code recorded for the disease, potentially at prespecified time intervals. Though not directly evaluated in the current study, articles that were included in our literature review varied extensively in their definitions of arthritis. Kim et al. (19) required patients to have at least two or three diagnosis codes for RA. Lacaille et al. (21) required at least two physician visits more than two months apart with a diagnosis code for RA. In contrast, French et al. (31) required only one diagnosis code for OA. Finally, Postler et al. (24) required an outpatient diagnosis of OA in at least two quarters of a single calendar year. These differences in coding algorithms must also be considered when determining the appropriate way to identify and classify patients’ disease status.

5 Conclusions

Although it may not be feasible to develop one coding algorithm to identify a specific disease for use across all databases, there is considerable room for improvement in the development of valid coding algorithms and increased consistency of their use in research. Researchers should carefully evaluate what codes to include in their research, and consider the potential implications of these decisions. If significant misclassification occurs because invalid coding algorithms are used to identify patients, this may bias the results of a study and call into question their clinical utility. It is advisable that researchers provide justification for their inclusion and exclusion of certain codes in their publications. Finally, if validated coding algorithms or validated ICD-9/ICD-10-based codes are available for use, researchers should use them in their research. Future work is needed to develop and validate coding algorithms for use in specific databases.

Data availability statement

The data analyzed in this study is subject to the following licenses/restrictions: SEER-Medicare data are not public use data files. Requests to access these datasets should be directed to SEER-Medicare, U0VFUk1lZGljYXJlQGltc3dlYi5jb20=.

Ethics statement

The studies involving humans were approved by SEER-Medicare Review Committee. The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and institutional requirements.

Author contributions

NT contributed to the development of the research project, contributed to drafting of study materials including statistical analysis plan and study protocol, contributed to statistical analysis of data and quality control, critically reviewed drafts and approved the final manuscript. MH contributed to the development of the research project, contributed to drafting of all study materials, provided input into the statistical analysis plan, contributed to drafting of manuscript and critical review, approved final manuscript. MeS contributed to drafting of study materials, conducted statistical analysis of the data and quality control, contributed to the drafting of the manuscript, and critically reviewed drafts and approved final manuscript. MaS contributed to the development of the research project, provided input into the statistical analysis plan and interpretation of results, critically reviewed drafts and approved the final manuscript. All authors contributed to the article and approved the submitted version.

Conflict of interest

All authors are employees of Daiichi Sankyo, Inc., the funder of this research. Authors MaS, MeS, and NT own Daiichi Sankyo stock.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fonc.2023.1016389/full#supplementary-material

References

1. International Statistical Classification of Diseases and Related Health Problems (ICD). Geneva, Switzerland: World Health Organization (2021). Available at: https://www.who.int/standards/classifications/classification-of-diseases.

Discrepancies in ICD-9/ICD-10-based codes used to identify three common diseases in cancer patients in real-world settings and their implications for disease classification in breast cancer patients and patients without cancer: a literature review and descriptive study

1 Introduction

2 Materials and methods

2.1 Study design

2.2 Statistical methods

3 Results

3.1 Literature search results

3.2 Discordant code findings

3.2.1 Overall discordance

3.2.2 Anemia discordance

3.2.3 Hypertension discordance

3.2.4 Arthritis discordance

3.3 Most frequently identified codes

3.4 Classification of non-cancer and breast cancer patient cohorts in the SEER-Medicare database

4 Discussion

5 Conclusions

Data availability statement

Ethics statement

Author contributions

Conflict of interest

Publisher’s note

Supplementary material

References

95% of researchers rate our articles as excellent or good