A Decentralized Kidney Transplant Biopsy Classifier for Transplant Rejection Developed Using Genes of the Banff-Human Organ Transplant Panel

van Baardwijk, Myrthe; Cristoferi, Iacopo; Ju, Jie; Varol, Hilal; Minnee, Robert C.; Reinders, Marlies E. J.; Li, Yunlei; Stubbs, Andrew P.; Clahsen-van Groningen, Marian C.

doi:10.3389/fimmu.2022.841519

ORIGINAL RESEARCH article

Front. Immunol., 10 May 2022

Sec. Alloimmunity and Transplantation

Volume 13 - 2022 | https://doi.org/10.3389/fimmu.2022.841519

This article is part of the Research Topic: T-cell and Antibody-mediated rejection after organ transplantation in the post-COVID-19 era – diagnosis, immunological risk evaluation and therapy

Myrthe van Baardwijk^1,2,3,4*†

Iacopo Cristoferi^1,2,3†

Jie Ju¹

Hilal Varol^1,3

Robert C. Minnee^2,3

Marlies E. J. Reinders^3,5

Yunlei Li¹

Andrew P. Stubbs^1‡

Marian C. Clahsen-van Groningen^1,3,6‡

¹Department of Pathology and Clinical Bioinformatics, Erasmus MC, University Medical Center Rotterdam, Rotterdam, Netherlands
²Division of HPB and Transplant Surgery, Department of Surgery, Erasmus MC, University Medical Center Rotterdam, Rotterdam, Netherlands
³Erasmus MC Transplant Institute, Erasmus MC, University Medical Center Rotterdam, Rotterdam, Netherlands
⁴Companion Diagnostics and Personalised Healthcare, Omnigen BV, Delft, Netherlands
⁵Department of Internal Medicine, Erasmus MC, University Medical Center Rotterdam, Rotterdam, Netherlands
⁶Institute of Experimental Medicine and Systems Biology, Rhenish-Westphalian Technical University Aachen University (RWTH), Aachen, Germany

Introduction: A decentralized and multi-platform-compatible molecular diagnostic tool for kidney transplant biopsies could improve the dissemination and exploitation of this technology, increasing its clinical impact. As a first step towards this molecular diagnostic tool, we developed and validated a classifier using the genes of the Banff-Human Organ Transplant (B-HOT) panel extracted from a historical Molecular Microscope® Diagnostic System microarray dataset. Furthermore, we evaluated the discriminative power of the B-HOT panel in a clinical scenario.

Materials and Methods: Gene expression data from 1,181 kidney transplant biopsies were used as training data for three random forest models to predict kidney transplant biopsy Banff categories, including non-rejection (NR), antibody-mediated rejection (ABMR), and T-cell-mediated rejection (TCMR). Performance was evaluated using nested cross-validation. The three models used different sets of input features: the first model (B-HOT Model) was trained on only the genes included in the B-HOT panel, the second model (Feature Selection Model) was based on sequential forward feature selection from all available genes, and the third model (B-HOT+ Model) was based on the combination of the two models, i.e. B-HOT panel genes plus highly predictive genes from the sequential forward feature selection. After performance assessment on cross-validation, the best-performing model was validated on an external independent dataset based on a different microarray version.

Results: The best performance was achieved by the B-HOT+ Model, a multilabel random forest model trained on B-HOT panel genes with the addition of the 6 most predictive genes of the Feature Selection Model (ST7, KLRC4-KLRK1, TRBC1, TRBV6-5, TRBV19, and ZFX), with a mean accuracy of 92.1% during cross-validation. On the validation set, the same model achieved an Area Under the ROC Curve (AUC) of 0.965 and 0.982 for NR and ABMR, respectively.

Discussion: This kidney transplant biopsy classifier brings us one step closer to a decentralized kidney transplant biopsy classifier that is effective on data derived from different gene expression platforms. The B-HOT panel proved to be a reliable, highly predictive panel for kidney transplant rejection classification. Furthermore, we propose to include the aforementioned 6 genes in the B-HOT panel for further optimization of this commercially available panel.

Introduction

A reliable diagnostic system for rejection is needed for an optimal therapeutic approach to kidney transplant recipients (1). Currently, kidney transplant rejection is commonly classified following the Banff consensus criteria, a classification that relies on the evaluation of canonical traits (i-, t-, and v-lesions for T-cell-mediated rejection (TCMR); ptc-, g-, and cg-lesions, staining for C4d, and circulating donor-specific antibody for antibody-mediated rejection (ABMR)) (2). A grading system to evaluate continuous biological processes should be reproducible by one observer (with low intraobserver error) and between observers (with low interobserver error) (3). However, the histological diagnosis and classification of biopsies from patients with suspected graft rejection using the Banff criteria is characterized by high interobserver disagreement (Cohen’s kappa coefficient ranging between 0.2 and 0.4) (4).

Over the past years, multiple study groups have focused on the development of more reliable diagnostic systems, such as molecular systems for allograft pathology. A centralized rejection diagnostic system called “The Molecular Microscope® Diagnostic System” (MMDx) was developed based on the microarray assessment of messenger RNA levels in post-transplant kidney biopsies and their relationship with histologically determined clinical phenotypes (5). This system estimates the probability that a sample has features of TCMR, ABMR, any type of rejection, tubular atrophy/interstitial fibrosis, or progression to failure. It also provides clinically valuable predictions for samples with difficult histological diagnoses. However, even though this centralized, microarray-limited approach could minimize the impact of variation in measurements between laboratories (6), it also limits the availability and the impact of the diagnostic tool for other centers.

A decentralized, open-access system to diagnose rejection and classify TCMR and ABMR that is compatible with different gene expression assessment platforms would be crucial for optimal dissemination and exploitation of this technology. In an attempt to move the research community towards a decentralized diagnostic system, the Banff Molecular Diagnostics Working Group (MDWG), in association with the industry partner NanoString®, developed a non-proprietary 770-gene panel called the Banff-Human Organ Transplant (B-HOT) Panel. The B-HOT panel includes the genes most relevant to transplant rejection, tolerance, viral infections, and innate and adaptive immune responses according to peer-reviewed literature (7). The development of a classifier based on a smaller, standardized subset of genes that can be measured with different techniques could be the first step towards a decentralized and multiplatform-compatible kidney transplant biopsy classifier.

The present study aims to develop a decentralized molecular kidney transplant biopsy classifier to diagnose transplant rejection and ultimately improve and fine-tune this classification. Achieving a more precise diagnosis will aid transplant clinicians in quickly drawing up a tailored therapeutic plan. In this study, a classifier based on the random forest algorithm was developed using microarray data from an online public dataset (8). The available data were filtered to contain only those probes included in the B-HOT Panel. Moreover, we compared the discriminating power of the B-HOT panel to that of sequential forward feature selection applied to the whole microarray gene set in order to assess its performance in a clinical scenario. Subsequently, the system was validated on another publicly available dataset based on a different microarray version (9).

Materials and Methods

All code presented in this section is available in the following GitHub repository: https://github.com/ErasmusMC-Bioinformatics/KidneyRejectionClassifier.

Data Collection and Preprocessing

An overview of data collection and preprocessing is presented in Figure 1. Two public gene expression datasets (GSE98320 and GSE129166) from the Gene Expression Omnibus database were collected as raw data matrices to serve as the training and independent validation datasets, respectively. A summary of the composition of both sets is shown in Table 1. A summary of the demographics and the clinical characteristics of the two datasets is shown in Table 2 (8, 9).

As previously reported (8), GSE98320 was obtained by running 1,208 biopsy samples from 1,045 patients at 13 international centers on Affymetrix hgu219 PrimeView microarray chips. Diagnoses derived from the annotations of GSE98320 were MMDx archetypes, specifically three types of ABMR (early-stage, fully-developed, and late-stage), TCMR, NR, and mixed rejection. The cohort is composed of 774 biopsies classified as NR and 434 biopsies classified as rejection (275 as ABMR, 51 as late-ABMR, 27 as Mixed, and 81 as TCMR).

As previously described (9), GSE129166 was obtained by running 117 peripheral blood samples and 95 kidney biopsy samples on Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray chips. Diagnoses derived from the annotations of GSE129166 were histologically assessed categorical classes, specifically ABMR, TCMR, NR, mixed rejection, and borderline rejection. After discarding the peripheral blood samples, the resulting cohort was composed of 60 biopsies classified as NR and 35 biopsies classified as rejection (15 as ABMR, 2 as TCMR, 2 as Mixed, and 16 as Borderline).

The main clinical interest in the use of this classifier is to distinguish between rejection and NR and, subsequently, to identify the presence of ABMR, as this is not always clear-cut when evaluating the biopsy. For this reason, and to make the predicted categories homogeneous between the two datasets, ABMR and late-ABMR samples from GSE98320 were grouped under the ABMR category, and all samples classified as Mixed or Borderline were removed. For the same reason, GSE129166 was considered a valid dataset for validation. The limitation concerning the scarcity of TCMR samples is addressed in the Discussion section. After samples labeled as Mixed and Borderline were removed, 1,181 and 77 samples were left in the training set and in the validation set, respectively.
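The label harmonization described above can be checked with a few lines of Python. This is only an illustrative sketch: the function name harmonize_labels and the label strings are assumptions, and the counts are those reported in the text rather than values parsed from the GEO annotations.

```python
from collections import Counter

# Class counts as reported in the text (not parsed from GEO).
GSE98320 = {"NR": 774, "ABMR": 275, "late-ABMR": 51, "Mixed": 27, "TCMR": 81}
GSE129166 = {"NR": 60, "ABMR": 15, "TCMR": 2, "Mixed": 2, "Borderline": 16}

MERGE = {"late-ABMR": "ABMR"}    # late-ABMR is grouped under ABMR
DROP = {"Mixed", "Borderline"}   # removed from both datasets


def harmonize_labels(counts):
    """Merge ABMR subtypes and drop Mixed/Borderline samples (hypothetical helper)."""
    out = Counter()
    for label, n in counts.items():
        if label in DROP:
            continue
        out[MERGE.get(label, label)] += n
    return dict(out)


train = harmonize_labels(GSE98320)
valid = harmonize_labels(GSE129166)
print(train, sum(train.values()))
print(valid, sum(valid.values()))
```

The totals reproduce the 1,181 training samples and 77 validation samples stated above, and make the small number of TCMR samples in the validation set explicit.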

Figure 1 Overview of Data Collection and Preprocessing. Data were retrieved from the GEO dataset repository. Probes were matched using raw annotation files, aggregated based on their median values, and scaled per feature using robustscale. Finally, the data were filtered to the genes of the Banff-Human Organ Transplant (B-HOT) panel. B-HOT, Banff-Human Organ Transplant; GEO, Gene Expression Omnibus; KNN, K-nearest neighbors.

Table 1 Overview of datasets composition.

Table 2 Demographics and clinical characteristics of the GSE datasets.

Preprocessing of expression datasets was done using R version 4.1 (10). The probes of the different microarrays were mapped to Entrez IDs as stable gene identifiers using the annotation files provided by the manufacturer. When multiple probes were annotated to the same gene, their expression levels were aggregated using the median value to produce expression values less sensitive to outlier transcripts. Genes not measured on both microarray versions, or with ambiguous mapping (e.g. probes assigned to multiple genes), were removed from the dataset. ComBat was used to adjust for possible batch effects introduced by the difference between the two microarray platforms (9). Afterward, the gene expression values of the two datasets were scaled separately using the robustscale function from the quantable package (11). This method removes the median and scales according to the quantile range, thereby reducing the influence of outliers on the scaled values.
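The aggregation and scaling steps can be illustrated with a minimal Python sketch. The study performed these steps in R; the pandas code below is an analogue written for illustration only. The helper names aggregate_probes and robust_scale and the toy data are assumptions, and the scaling follows the description in the text (median and interquartile range) without claiming to reproduce the exact behavior of the R function.

```python
import numpy as np
import pandas as pd


def aggregate_probes(expr, probe_to_gene):
    """Collapse probe rows to gene rows using the per-sample median.

    expr: DataFrame with probes as rows and samples as columns.
    probe_to_gene: Series mapping probe ID to a single gene ID.
    """
    gene_ids = probe_to_gene.reindex(expr.index)
    keep = gene_ids.notna()                     # drop probes without a gene
    return expr[keep].groupby(gene_ids[keep]).median()


def robust_scale(gene_by_sample):
    """Per gene: subtract the median and divide by the interquartile range."""
    med = gene_by_sample.median(axis=1)
    q75 = gene_by_sample.quantile(0.75, axis=1)
    q25 = gene_by_sample.quantile(0.25, axis=1)
    iqr = (q75 - q25).replace(0, np.nan)        # avoid division by zero
    return gene_by_sample.sub(med, axis=0).div(iqr, axis=0)


# Toy data: two probes map to geneA, one probe maps to geneB.
expr = pd.DataFrame(
    {"s1": [1.0, 3.0, 10.0], "s2": [2.0, 4.0, 20.0], "s3": [3.0, 8.0, 30.0]},
    index=["p1", "p2", "p3"],
)
probe_to_gene = pd.Series({"p1": "geneA", "p2": "geneA", "p3": "geneB"})

genes = aggregate_probes(expr, probe_to_gene)
scaled = robust_scale(genes)
print(scaled)
```

Because each dataset is scaled separately in the study, a function like robust_scale would be applied once to the training matrix and once to the validation matrix.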

Principal Component Analysis

Principal component analysis was executed before and after the application of the above-described preprocessing steps using the PCAtools package in R (12). Biplots of the first and second principal components were analyzed to evaluate the distinction of the different patient groups, as well as to compare the variation in expression introduced by differences in gene expression measurement techniques. The biplots before and after transformation were compared to determine if the batch effect removal was successful.
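The idea behind this check can be sketched in Python on synthetic data. The example below is an assumption-laden illustration, not the authors' analysis: it uses scikit-learn's PCA instead of PCAtools, simulates a purely additive platform shift, and replaces ComBat with simple per-batch centering as a crude stand-in.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_samples, n_genes = 60, 10
batch = np.repeat([0, 1], n_samples // 2)       # two hypothetical platforms
# Random expression values plus an additive shift for the second platform.
X = rng.normal(size=(n_samples, n_genes)) + 3.0 * batch[:, None]


def pc1_batch_correlation(data):
    """Absolute correlation between the first principal component and batch."""
    pc1 = PCA(n_components=2).fit_transform(data)[:, 0]
    return abs(np.corrcoef(pc1, batch)[0, 1])


# Crude stand-in for batch correction (NOT ComBat): center each gene per batch.
X_corrected = X.copy()
for b in (0, 1):
    X_corrected[batch == b] -= X_corrected[batch == b].mean(axis=0)

before = pc1_batch_correlation(X)
after = pc1_batch_correlation(X_corrected)
print(f"PC1 vs batch before: {before:.2f}, after: {after:.2f}")
```

A first principal component that tracks the platform before correction but not afterwards is the numerical counterpart of what the biplots are inspected for.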

Rejection Classifier Development

All modeling was executed using Python 3.9 (13) with the scikit-learn module (14). An overview of the model development strategy is presented in Figure 2. For each of the models described below, a nested cross-validation (CV) procedure using ten outer folds and three inner folds was implemented. The folds were stratified for the different classes and the same split was used for each of the developed models. The selection of parameters and features was executed within the inner CV folds to prevent overfitting the training dataset. Based on the outer CV prediction probabilities for the different classes, the overall accuracy was calculated, as well as the precision, recall, Area Under the ROC Curve (AUC), and F1 scores for each of the three classes. The AUC scores were determined for the prediction of each class against all others.
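A nested cross-validation of this shape can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions: the data are synthetic, the parameter grid is hypothetical (the grid actually tested is in the authors' GitHub repository), and the random seeds are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the expression matrix with three classes.
X, y = make_classification(n_samples=150, n_features=20, n_informative=6,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
grid = {"n_estimators": [20, 40], "max_depth": [3, None]}   # hypothetical grid

proba = np.zeros((len(y), 3))
for train_idx, test_idx in outer.split(X, y):
    search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                          cv=inner, scoring="accuracy")
    # Tuning only sees the outer-training part of the data.
    search.fit(X[train_idx], y[train_idx])
    proba[test_idx] = search.predict_proba(X[test_idx])

accuracy = accuracy_score(y, proba.argmax(axis=1))
# One-vs-rest AUC: each class against all others.
auc = {c: roc_auc_score(y == c, proba[:, c]) for c in range(3)}
print(accuracy, auc)
```

Keeping the grid search inside the outer loop means the held-out fold never influences parameter choice, which is the property the nested design is meant to guarantee.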

Figure 2 Overview of model development workflow. B-HOT, Banff-Human Organ Transplant; CV, Cross-Validation; KNN, K-nearest neighbors.

B-HOT Model

The B-HOT panel was selected as a suitable feature set for rejection classification. The annotation file from NanoString® was subjected to manual curation, which included the removal of viral genes undetectable by microarray and the correction of ambiguous nomenclature. Gene expression data were filtered to the panel genes after the data collection and preprocessing steps. A first random forest model was trained on the B-HOT panel genes within the nested cross-validation scheme. Parameters of the model were tuned within the inner loops using a grid search algorithm. Tested classification parameters of the model are available on GitHub. Based on the overall cross-validation accuracy of the models trained on the outer folds, the best model was selected and fitted to the entire training set. This model was identified as the B-HOT Model.

Feature Selection Model

To test the validity of the B-HOT panel feature set, a second random forest model was trained using a wrapper feature selection technique on the complete set of genes measured by both microarray versions. For this purpose, a sequential forward feature selector was implemented that used a k-nearest neighbors classifier to sequentially select the best features. K was set to three to limit the computational load, and the number of features to select was limited to one hundred to match the maximum number of features selected by the random forest model. The feature selection was implemented within the inner CV loop and the random forest model was then trained with the same classification parameters as the B-HOT Model. This model was identified as the Feature Selection Model.
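Sequential forward selection with a k-nearest neighbors classifier can be sketched as below. This is an assumption-based example: it uses scikit-learn's SequentialFeatureSelector (the text does not state which implementation was used), synthetic data, and a much smaller feature budget than the one hundred features mentioned above. For brevity the selector is fitted on the whole toy dataset, whereas the study ran the selection inside the inner CV loop.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 15 candidate "genes", 4 of which are informative.
X, y = make_classification(n_samples=120, n_features=15, n_informative=4,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=1)

knn = KNeighborsClassifier(n_neighbors=3)        # K = 3, as in the text
selector = SequentialFeatureSelector(
    knn,
    n_features_to_select=5,                      # toy budget, not the study's 100
    direction="forward",
    cv=3,
)
selector.fit(X, y)

selected = selector.get_support(indices=True)    # column indices kept
X_reduced = selector.transform(X)
print(selected, X_reduced.shape)
```

At each step the selector adds the single feature that most improves the cross-validated score of the k-nearest neighbors classifier, which is why the cost grows quickly with the number of candidate genes.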

B-HOT+ Model

To test potential new candidate genes for the B-HOT panel, a third model was trained by sequentially adding the most important features from the Feature Selection Model to the B-HOT genes until the average cross-validation accuracy could not be improved further. The most important Feature Selection Model features were defined as those with the highest mean decrease in Gini impurity, the feature importance measure built into the scikit-learn module. This third model was developed using the same classification parameters as before and identified as the B-HOT+ Model. The cross-validation metrics of the three strategies were compared to determine the most suitable feature set. Based on the average accuracy of the cross-validation folds, the best-performing model was selected and fitted to the complete training dataset.
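The procedure of adding candidate genes to a fixed panel can be sketched as follows. The example is hypothetical throughout: the data are synthetic, the variables panel and candidates merely stand in for the B-HOT genes and the Feature Selection Model genes, and the stopping rule (stop at the first candidate that does not improve accuracy) is one possible reading of the description above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=150, n_features=12, n_informative=6,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           shuffle=False, random_state=2)
panel = [0, 1, 2]                  # hypothetical stand-in for the panel genes
candidates = list(range(3, 12))    # hypothetical stand-in for selected genes

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def cv_accuracy(columns):
    """Mean cross-validated accuracy of a random forest on the given columns."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(model, X[:, columns], y, cv=cv,
                           scoring="accuracy").mean()


# Rank candidates by mean decrease in Gini impurity (feature_importances_).
ranker = RandomForestClassifier(n_estimators=50, random_state=0)
ranker.fit(X[:, candidates], y)
order = np.argsort(ranker.feature_importances_)[::-1]
ranked = [candidates[i] for i in order]

features, best = list(panel), cv_accuracy(panel)
for gene in ranked:
    score = cv_accuracy(features + [gene])
    if score <= best:              # stop once accuracy no longer improves
        break
    features.append(gene)
    best = score

print(features, round(best, 3))
```

The panel genes are never removed in this scheme, so the resulting feature set is always a superset of the original panel.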

Measuring Performance Using Independent Microarray Data

Out of the three developed models, only the best-performing model was tested on the independent validation set GSE129166 to determine the validity of the model independently from the training data and microarray version. Predictions of ABMR, TCMR, and NR classes were made for the samples of the GSE129166 dataset. Based on the prediction probabilities for the different classes, the overall accuracy was calculated, as well as the precision, recall, AUC, and F1 scores for the ABMR and NR classes. The AUC scores were determined for each class against all others.
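The per-class metrics described above can be computed from a matrix of predicted probabilities as in the sketch below. The function name evaluate, the assumed column order in CLASSES, and the toy labels and probabilities are illustrative assumptions and are not results from the study.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)

CLASSES = ["ABMR", "NR", "TCMR"]   # assumed column order of the probabilities


def evaluate(y_true, proba, report=("ABMR", "NR")):
    """Overall accuracy plus precision, recall, F1 and one-vs-rest AUC."""
    y_true = np.asarray(y_true)
    y_pred = np.array(CLASSES)[proba.argmax(axis=1)]
    results = {"accuracy": accuracy_score(y_true, y_pred)}
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=list(report), zero_division=0)
    for i, label in enumerate(report):
        results[label] = {
            "precision": prec[i],
            "recall": rec[i],
            "f1": f1[i],
            # AUC of this class against all others.
            "auc": roc_auc_score(y_true == label,
                                 proba[:, CLASSES.index(label)]),
        }
    return results


# Toy example with made-up probabilities.
y_true = ["NR", "NR", "ABMR", "ABMR", "TCMR", "NR"]
proba = np.array([
    [0.1, 0.8, 0.1],
    [0.3, 0.6, 0.1],
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.2, 0.6],
    [0.2, 0.7, 0.1],
])
metrics = evaluate(y_true, proba)
print(metrics)
```

Reporting only ABMR and NR mirrors the text; with very few TCMR samples in a validation set, per-class estimates for TCMR would be unstable.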

Results

Preprocessing

Batch effect correction is necessary when combining datasets measured using different microarray platforms. ComBat batch effect correction can only be applied to genes that are present in both datasets. After probe matching and aggregation at the gene level, 18,945 genes overlapped between the two datasets. An overview of principal component analysis and batch effect correction is presented in Figure 3. Before the application of ComBat, most variation is observed between the two datasets (Figures 3A, C), while after the application of ComBat, most variation is observed between the different class labels (Figures 3B, D). This indicates that ComBat successfully removed the batch effect between the two datasets.

Figure 3 Principal Component Analysis and Batch Correction. (A, B) Principal component analysis biplots of samples labeled based on their origin dataset before (A) and after (B) batch effect removal using ComBat. (C, D) Principal component analysis biplots of samples labeled based on their diagnosis before (C) and after (D) batch effect removal using ComBat. ABMR, Antibody-Mediated Rejection; NR, Non-Rejection; PC, Principal Component; TCMR, T-Cell-Mediated Rejection.