- 1Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, China
- 2Department of Paediatrics and Adolescent Medicine, The University of Hong Kong, Hong Kong SAR, China
- 3Centre for PanorOmic Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
Polygenic risk scores (PRS) leverage the genetic contribution of an individual’s genotype to a complex trait by estimating disease risk. Traditional PRS prediction methods are predominantly for the European population. The accuracy of PRS prediction in non-European populations is diminished due to much smaller sample size of genome-wide association studies (GWAS). In this article, we introduced a novel method to construct PRS for non-European populations, abbreviated as TL-Multi, by conducting a transfer learning framework to learn useful knowledge from the European population to correct the bias for non-European populations. We considered non-European GWAS data as the target data and European GWAS data as the informative auxiliary data. TL-Multi borrows useful information from the auxiliary data to improve the learning accuracy of the target data while preserving the efficiency and accuracy. To demonstrate the practical applicability of the proposed method, we applied TL-Multi to predict the risk of systemic lupus erythematosus (SLE) in the Asian population and the risk of asthma in the Indian population by borrowing information from the European population. TL-Multi achieved better prediction accuracy than the competing methods, including Lassosum and meta-analysis in both simulations and real applications.
1 Introduction
Genetic risk prediction is an important methodology for understanding the underlying genetic architecture and the inclusion of information on complex traits, such as estimating the genetic risk of complex traits or diseases (for example, coronary artery disease) (Chatterjee et al., 2016; Ge et al., 2019). Polygenic risk scores (PRS) are one of the approaches to reflect a mathematical aggregation of risk by variants such as single nucleotide polymorphisms (SNPs) (Peterson et al., 2019). With the application of the best linear unbiased predictor to estimate PRS, some methods use summary association statistics as training data (Consortium, 2009; Vilhjálmsson et al., 2015; Shi et al., 2016), and others require individual-level data, such as genotype data and phenotypes (De Los Campos et al., 2010; Speed and Balding, 2014; Maier et al., 2015; Moser et al., 2015; Coram et al., 2017). As an implementation, PRS have become a widely used statistical tool to estimate the genetic risk of certain diseases or phenotypes (Mak et al., 2017). Specifically, PRS for a particular disease demonstrates the risk index for people to suffer from the disease. A remarkable study of five common diseases (coronary artery disease, atrial fibrillation, type 2 diabetes, inflammatory bowel disease, and breast cancer) found that people with top 8.0, 6.1, 3.5, 3.2, and 1.5% highest PRS had a three-fold higher risk to develop these diseases than people with average PRS (Khera et al., 2018).
However, the majority of public genome-wide association studies (GWAS) data has been conducted on the European population (Popejoy and Fullerton, 2016). Due to the limited availability of non-European ancestral data and the diversity of linkage disequilibrium (LD) architectures among distinct populations, previous studies showed that the genetic architectures of specific phenotypes or diseases were highly consistent between populations (single-variant level and genome-wide level) (Huang et al., 2021). Hence, using PRS derived from the European population can result in disease associations being under- or over-estimated in other populations (Kim et al., 2018). Traditional approaches are insufficient to address this challenge when multiple populations are involved. Recent genetic statistical studies have indicated that diverse population variants share the same underlying causal variants (Brown et al., 2016; Shi et al., 2020), which raises the possibility of transferability of PRS across distinct ethnic groups. However, existing studies focus mostly on the application with one homogeneous population. For example, LDpred (Vilhjálmsson et al., 2015) and PRS-CS (Ge et al., 2019) improve the prediction accuracy by enhancing LD modelling. As an alternative, a penalized regression framework based on summary statistics, namely Lassosum, was proposed by Mak et al. (2017), whereas these methods are limited to GWAS data from one homogeneous population. Current multiethnic PRS construction approaches that incorporate training data from both the European and target populations can leverage trans-ethnic GWAS information and stratify squared trans-ethnic genetic correlation in the explanation of environmental effects on genes (Coram et al., 2017; Mak et al., 2017; Shi et al., 2021). Moreover, Márquez-Luna et al. (2017) proposed PT-Multi for multiethnic PRS prediction by performing LD-informed pruning and p-value thresholding (PT) (Consortium, 2009) on each homogeneous population and linearly combining the optimal PRS from each specific population.
However, previous studies ignored the information gap among diverse populations. Li et al. (2022) proposed a high-dimensional linear regression model to transfer knowledge between informative samples and target samples to improve the learning performance of target samples. By using GWAS summary statistics from different ancestries and incorporating the idea of transfer learning (Li et al., 2022), we propose a novel statistical method called TL-Multi to enhance the transferability of polygenic risk prediction across diverse populations. TL-Multi assumes most causal variants are shared among diverse populations. There is a difference between the target samples and the informative auxiliary samples in the genetic architecture, which causes estimation bias. TL-Multi further corrects this bias and estimates the PRS using Lassosum (Mak et al., 2017). Additionally, TL-Multi inherits the advantages of Lassosum, ensuring that it has more accurate performance in all circumstances than initial PT and circumvents convex optimization challenges in LDpred. Moreover, TL-Multi extends the application to estimate the genetic risk from unmatched ancestral populations and employs all available data without pruning or discarding. For practical analysis, we investigated TL-Multi prediction performance with informative auxiliary European samples from UK Biobank (https://www.ukbiobank.ac.uk), and European summary statistics and Hong Kong target samples from previous studies to predict PRS in systemic lupus erythematosus (SLE) (Morris et al., 2016; Julià et al., 2018; Wang et al., 2021). We obtained a greater than 125% relative improvement in prediction accuracy compared to only using GWAS data from the Hong Kong population. Furthermore, TL-Multi performs more accurately in PRS prediction in most scenarios in comparison with the recent multiethnic methods, meta-analysis, and PT-Multi.
Additionally, we refer to Huang et al. (2021) to classify the PRS methods into two categories: single-discovery methods and multi-discovery methods. Single-discovery methods use GWAS data from a single homogeneous population, and multi-discovery methods apply the combined GWAS data of multiple populations.
2 Materials and methods
2.1 Data overview
In this study, we requested the individual-level genotyped data for a previous SLE GWAS in Hong Kong (Wang et al., 2021) as the testing dataset, which included 1,604 SLE cases and 3,324 controls. We used GWAS summary statistics of SLE from both East Asian and European populations to train the models. The data for East Asians were collected from Guangzhou (GZ) and Central China (CC), including 2,618 SLE cases and 5,107 controls (Wang et al., 2021). The data for Europeans were obtained from previous studies (Morris et al., 2016; Julià et al., 2018; Wang et al., 2021), involving a total of 4,576 cases and 8,039 controls. Variants with minor allele frequency greater than 1% and imputed INFO scores greater than 0.7 in respective ancestral groups were reserved for the following analyses.
In our analysis of asthma, we requested the genotyped data of Indian and European individuals for asthma from UK Biobank. The UK Biobank data consisted of 4,160 unrelated Indian samples genotyped at 1,175,469 SNPs after QC and mapping HapMap 3 SNPs, and we further sampled 48,362 unrelated British samples genotyped at 1,189,752 SNPs after QC and mapping HapMap 3 SNPs. We divided the Indian samples into two groups: 3,160 samples as a training data set and 1,000 samples as a testing data set. As stated previously, the final data set comprises 3,160 (408 cases and 2,752 controls) unrelated Indian samples for training, 1,000 (127 cases and 873 controls) unrelated Indian samples for testing, and 48,362 (6,555 cases and 41,807 controls) unrelated British samples for training. Variants with minor allele frequency greater than 1% and p-values of Hardy–Weinberg equilibrium Fisher’s exact test
2.2 Lassosum
Lassosum is a statistical approach introduced by Mak et al. (2017) which enables to tune parameters without validation datasets and phenotype data via pseudovalidation and outperforms PT and LDpred in prediction (Consortium, 2009; Vilhjálmsson et al., 2015). It refers to the idea of Tibshirani (1996) to deal with sparse matrices and calculate PRS only by using summary statistics and an external LD reference panel. In this article, the ancestry-matched LD block is generally estimated by the 1000 Genome project (https://www.internationalgenome.org). Additionally, we keep the reference panel’s ancestry consistent with that of our target population. Furthermore, if the SNP-wise correlation ri is not available, we can estimate ri following Mak et al. (2017):
2.3 PT-Multi
PT-Multi assumes the multiethnic PRS is a linear combination of the most predictive PRS from each population. First, it applies LD-pruning and p-value thresholding (PT) (Consortium, 2009) to each single ethnic summary statistic and gets the most predictive PRS. Second, it fits marginal linear regression models to get weights for each population, respectively. We apply the R package ‘bigspnr’ (Privé et al., 2018) to validation data for LD-informed clumping with an r2 threshold of 0.1. The p-value thresholds are among 1, 0.3, 0.1, 3 × 10–2, 10–2, 3 × 10–3, 10–3, 3 × 10–4, and 10–4. We conducted 10-fold cross-validation to determine the optimal p-value threshold for each population. We used an independent validation data set to compute the final PRS and the average value of R2 across the 10 folds.
This study used the single-discovery method (Lassosum) to regress European, Asian, and multi-discovery methods (meta-analysis, TL-Multi, and PT-Multi) to determine the most predictive PRS with the highest R2. For ease of notations, let PRSa, PRSe, PRSma, PRStl, and PRSpt represent PRS for Asian, European, meta-analysis, TL-Multi, and PT-Multi, respectively.
2.4 Meta-analysis of two diverse ancestries
We generated the estimates of effect sizes of joint GWAS data by
where βa and βe are the effect sizes obtained from Asian and European GWAS data, respectively, and sea and see are the standard errors obtained directly from ancestry-matched GWAS data. Furthermore, the estimate of the standard error in meta-analysis is defined as follows:
and the estimate of z-statistic is obtained from the following:
The p-value is converted from
where Φ(⋅) is the cumulative distribution function of the standard normal distribution N(0, 1). In this meta-analysis, the ancestry of the reference panel is consistent with the ancestry of the target population. Furthermore, due to the majority of the total sample being of European ancestry, the LD block is estimated from the European population in the 1000 Genome Project.
2.5 Multiethnic polygenic risk scores prediction using TL-Multi
In this article, we used European population data as our informative auxiliary data, owing to its large sample size and relative accessibility. Additionally, we treated East Asians as the target population due to the scarcity of public data (Brown et al., 2016; Shi et al., 2020). The fundamental framework we used for genetic architecture and phenotype is a linear combination with effect sizes β and an n-by-p genotype matrix X, where p is the number of columns containing marker genotype codes corresponding to the number of reference alleles on the sample-specific SNP (for example, 0, 1, and 2) and n is the sample size:
where y is a vector of clinical outcomes. Tibshirani (1996) proposed Lasso, which is commonly used to estimate coefficients
where ya is the vector of Asian phenotypes, Xa is the genotype matrix of the Asian population, L(⋅) is an optimizing function, ‖βa‖1 is the L1 norm of βa, and λ is a data-dependent parameter determining the proportion of βa to be estimated to 0. It can be widely extended in scenarios in which only the summary statistics are available (Mak et al., 2017).
Motivated by Lassosum, we further proposed a novel method, namely TL-Multi, to extend its application to multiethnic polygenic prediction. We observed additional samples from auxiliary studies (for example, the European population). The estimate of the marginal effect sizes of the European population,
where ye is the vector of European phenotypes, and Xe is the genotype matrix of the European population. For illustration, we denoted the auxiliary studies, in which informative auxiliary samples can be transferred, and the target model and auxiliary model are similar at certain levels (for example, similar genetic architectures). Furthermore, we assumed that the difference between auxiliary samples and target samples is denoted as follows (Li et al., 2022):
where
where q ∈ [0, 1], and
Moreover, we assumed that Aq is informative to improve the prediction performance of the target population while h is relatively small compared to
Our goal was to correct the bias between these populations and improve prediction performance in Asian population. First, we estimated the marginal effect sizes of the European population,
where
Specifically, TL-Multi estimates the PRS of the Asian population by correcting the bias between European and Asian populations. We further denoted the bias as δ, which is the difference between European and Asian populations in genetic architecture. The new estimate of effect sizes of the Asian population can be presented as βtl = βe + δ, in which δ is estimated by
According to pseudovalidation proposed by Mak et al. (2017), the optimal single-discovery PRS for European and Asian populations can be determined directly by the highest R2 without the phenotypes. The optimal estimates of effect sizes of Asian and European populations that we applied to TL-Multi are the ancestry-matched optimal PRS, respectively. Algorithm 1 describes our proposed TL-Multi algorithm, and we further developed an R package, which is publicly available at https://github.com/mxxptian/TLMulti.
Algorithm 1. Algorithm for TL-Multi.
2.6 Simulation studies
We performed a wide range of simulation studies to evaluate the performance of TL-Multi. We used real genotypes of the European population from the UK Biobank and the Asian population from a previous SLE study (Wang et al., 2021). Following the quality control procedure provided in Chang et al. (2015), we utilized the UK Biobank and Asian lupus genotype data whose p-values of Hardy–Weinberg equilibrium Fisher’s exact test
where X is the training set of the standardized genotype matrix, and ϵ represents the random error which was generated from N(0, 1 − h2). In addition, GWAS was implemented using the R package ‘bigsnpr’ to obtain the summary statistics for each simulated phenotype.
Due to the possibility that sample size affects performance, we investigated 25:1 and 50:1 proportions of European samples to Asian samples. Additionally, we observed that the number of variants has a significant influence on the prediction performance, and the majority of variants are located on chromosomes 1–11. Motivated by previous works (Vilhjálmsson et al., 2015; Márquez-Luna et al., 2017), we further extrapolated the performance at a large sample size by conducting simulations with different subsets of chromosomes to increase
3 Results
3.1 Simulations
We performed simulations with real genotypes and simulated continuous phenotypes. We split the data from the Hong Kong population into two groups: 1,000 samples as a training data set and 3,049 samples as testing data and drew 50,000 samples from European samples. The training data set was used to simulate phenotypes, and the testing data were applied to performance assessments. The prediction accuracy was assessed by R2, which was based on the simulated phenotypes generated from the test data. Specifically, LD blocks for the single-discovery method were ancestry-matched as the reference panels, and they were in correspondence with the ancestry of the target population for multi-discovery methods.
In Figure 1, we displayed the average values with a 95% upper bound of each simulation setting under scenario 1) over 20 replicates. We conducted single-discovery analyses for Asian and European populations by Lassosum and multi-discovery analyses by meta-analysis, TL-Multi, and PT-Multi. Lassosum adopted the PRS with the maximum R2 by 10-fold cross-validation. We observed that meta-analysis could not improve the prediction accuracy when a single-discovery analysis of the European population did not perform better than the Asian one. Particularly, when the genetic architecture correlation was quite low (ρ = 0.2), meta-analysis and European prediction performances were comparably inferior. In this case, it was explained that the shared information between the Asian and European populations would be limited, preventing prediction improvement from being achieved by directly integrating the European data. It also reflected the consistent relationship between meta-analysis and single-discovery analysis of the informative population. Moreover, the meta-analysis could hardly outperform the European one. The performance of Lassosum for the European population dominated the performance of the meta-analysis since the sample size of the European population is significantly larger than that of the Hong Kong population. Additionally, we observed that TL-Multi could always improve the accuracy compared to Lassosum for the Hong Kong population. If the genetic architecture correlation was not too high (for example, ρ = 0.4 or 0.6), TL-Multi attained the highest prediction accuracy compared to competing approaches. However, when the genetic architecture was high (ρ = 0.8), we noticed that TL-Multi performed slightly worse than other approaches. In this example, the results might be explained by the remarkable similarity of the genetic architecture. When the genetic correlation reaches 0.8, the majority of information about the Asian population would be directly explained by that of the European population. Combining these two groups in the meta-analysis might increase the accuracy of estimated effect sizes.
FIGURE 1. Prediction accuracy of Lassosum, meta-analysis, PT-Multi, and TL-Multi over 20 replications in simulations. Lassosum_HK is Lassosum for the Hong Kong population, and Lassosum_Eur is Lassosum for the European population. Heritability was fixed at 0.5 and different genetic correlations (0.2, 0.4, 0.6, and 0.8) with different causal variant proportions (1%, 1.5%, 2%, and 5%) were generated. A total of 50,000 European samples and 1,000 Hong Kong samples were simulated. The variants were generated from the common variants of the first four chromosomes (21,477 SNPs). The prediction accuracy was measured by R2 between the simulated and true phenotypes. The error bar indicated the upper bound of a 95% confidence interval over 20 replications.
In most scenarios, TL-Multi outperformed PT-Multi. Specifically, TL-Multi substantially improved multiethnic prediction accuracy for the instances with 1, 1.5, and 2% causal proportions. PT-Multi conducted PT, which caused information loss in the data. However, TL-Multi could take all the data information into account. We found TL-Multi performed poorly at a 5% causal proportion. We noted that under this situation, the result of Lassosum for the Hong Kong population was significantly inferior to that of the European population. We referred to assumption (2) to cast doubt on the breach of our assumption. If the assumption did not hold, the European population could not be denoted as auxiliary informative data because the useful information was limited. Due to it, TL-Multi would fail to borrow the information to improve the learning performance of the target population. Alternatively, we considered that the effect sizes were simulated depending on the number of causal variations, m. As the proportion of causality rose, the effect sizes tended to approach zero. Limited by the small sample size of the Asian population, the bias between the estimated effect sizes derived from the simulated phenotypes and the actual effect sizes would be even larger. Some causal variants with relatively small signals more likely erroneously failed to be captured, which resulted in the restricted TL-Multi’s performance. However, it is noteworthy that TL-Multi still enhanced Hong Kong’s prediction accuracy in this scenario. We discovered that the performance of meta-analysis and PT-Multi for Hong Kong were nearly identical to that of Lassosum for Europeans when we attributed the huge disparity in multiethnic sample sizes. To summarize, the European population dominated the performance of meta-analysis and PT-Multi. In particular, TL-Multi could be employed to moderate genetic architecture correlations (for example, ρ = 0.4 and 0.6) when the informative auxiliary population (for example, the European population) outperformed the target population (for example, the Hong Kong population). Referring to assumption (2), the performance of the European population was supposed to be more accurate than that of the target population; therefore, it would be appropriate to borrow information from it. Moreover, if the proportion of casuals increased, the estimated effect sizes of the target population would be relatively biased. We found that the precision of the effect sizes of the target population would have a substantial effect on TL-Multi.
Alternatively, we generated phenotypes using different chromosome subsets and sample sizes of the European population while maintaining a fixed Hong Kong sample size. Over 20 replicates, we took the performance using a fixed genetic correlation of 0.4 and 1.5% causal variants as an example. In Figure 2, we drew 25,000 European subjects and 1,000 Hong Kong subjects. We observed that TL-Multi performed much better than the competing approaches. While the performance of the Hong Kong population was superior to that of the European population, the performance of the meta-analysis was poor compared to that of Hong Kong. As the total number of SNPs increased, the prediction accuracy of Hong Kong dramatically decreased. However, the prediction accuracy of Europeans decreased relatively slowly. Specifically, under scenario (4), TL-Multi was inferior to the other two multi-discovery methods. This could be explained by fact that for this case, there were 1,000 subjects from Hong Kong with 49,909 SNPs which resulted in a significant bias while estimating the effect sizes by applying GWAS. In this case, TL-Multi thus failed to improve the accuracy of the forecast compared to the previous scenarios, as the bias in the estimates of Hong Kong’s effect sizes was larger. Moreover, the consistent trend in European meta-analysis and PT-Multi supported our previous extrapolation that the performance of the European population could determine the primary contribution of the other two. In Figure 2, we simulated 50,000 European subjects. We further observed that the performance of PT-Multi was inferior to TL-Multi under scenarios (1)–(3), and both of them outperformed the single-discovery method and meta-analysis. Furthermore, the performance of the meta-analysis was consistent with that of European. As a result, even though the prediction accuracy of TL-Multi went down, it was still better than the meta-analysis’s prediction accuracy under all the scenarios.
FIGURE 2. Prediction accuracy of Lassosum, meta-analysis, PT-Multi, and TL-Multi over 20 replications in simulations. The selected ratio of SNPs was the ratio of the actual numbers of SNPs simulated to the total number of common SNPs (69,398). The actual numbers of SNPs simulated in the four scenarios were 21,477 (chromosomes 1–4), 32,151 (chromosomes 1–6), 39,682 (chromosomes 1–8), and 49,909 (chromosomes 1–11), respectively. The average of R2 was plotted. Upper: The sample size of the European population is 25,000, and the sample size of the Hong Kong population is 1,000. Lower: The sample size of the European population is 50,000, and the sample size of the Hong Kong population is 1,000.
3.2 Analysis of SLE in the Hong Kong population
We further applied the four abovementioned approaches to predict SLE risk in the Hong Kong population to evaluate the performance in real data analysis. We used European SLE GWAS summary statistics from previous studies (Morris et al., 2016; Julià et al., 2018; Wang et al., 2021) (4,576 cases, and 8,039 controls) and the ancestry-matched GWAS summary statistics (Wang et al., 2021) (2,618 cases and 5,107 controls). The validation data for the Hong Kong population were from Wang et al. (2021) (1,604 cases and 3,324 controls) employing 10-fold cross-validation following Mak et al. (2017).
We reported the area under the receiver operating characteristic curve (AUC) to assess the prediction accuracy of derived PRS. The ethnicity of the LD block is consistent with that of the majority population in GWAS data, and the LD block was derived from Berisa and Pickrell (2016). Furthermore, the reference panel was obtained from the 1000 Genome Project, and its ethnicity was consistent with the target populations. We set the p-value thresholds to be the same as the values in simulation studies, and r2 = 0.1. In real data analysis, TL-Multi outperformed the competing methods. The optimal PRS from European GWAS data yielded AUC of 0.6872 and 0.6943 from East Asian GWAS data. We further obtained the optimal PRS of meta-analysis, TL-Multi, and PT-Multi, with AUC values of 0.7098, 0.7131, and 0.5447, and the corresponding ROC curves are depicted in Figure 3. For binary classification, we used logistic regression to obtain the mixing weights in PT-Multi. Consistent with the evaluations in simulation studies, we observed that TL-Multi improved 2.7% in prediction accuracy compared to Lassosum for the Hong Kong population, and meta-analysis improved 2.2% compared to Lassosum. However, PT-Multi performed even worse than the single-discovery method in real data analysis.
FIGURE 3. Receiver operating characteristic curve of Lassosum, meta-analysis, and TL-Multi in the analysis of SLE study. Lassosum_HK is Lassosum for the Hong Kong population, and Lassosum_Eur is Lassosum for the European population. The corresponding AUC values with the optimal PRS of Lassosum for the Hong Kong population and European population, meta-analysis, and TL-Multi are 0.6872, 0.6943, 0.7098, and 0.7131, respectively.
Moreover, we reported the case prevalence of the bottom 2, 5, and 10% and top 2, 5, and 10% of PRS distribution, constructed by single-discovery method, meta-analysis, and TL-Multi in Table 1. This summary report demonstrated the case prevalence under different PRS conditions. For instance, the bottom numbers indicate the prevalence of SLE among individuals with low PRS. We observed that TL-Multi had satisfactory performance and showed 10.66-, 7.50-, and 5.80-fold increases comparing the top 2, 5, and 10% with the bottom 2, 5, and 10% of the PRS distribution, respectively.
TABLE 1. Case prevalence of 2, 5, and 10% for the top and bottom quantiles of the PRS distribution in the analysis of SLE study with the target Hong Kong population and auxiliary European population, generated by Lassosum, meta-analysis, and TL-Multi.
3.3 Analysis of asthma in the Indian population
We applied the same methods to the Indian and European samples from UK computing associated summary statistics by the ‘bigsnpr’ R package. We split 1,000 (127 cases and 873 controls) unrelated Indian samples as validation data, and 3,160 (408 cases and 2,752 controls) unrelated samples as training data, and further sampled 48,362 (6,555 cases and 41,807 controls) unrelated European samples. We further reported the AUC of the above four methods to evaluate the optimal prediction method. The ancestry of LD blocks matches that of the data’s predominant population. We used training data as a reference panel whose ancestry was always identical to that of the target population. During pruning and clumping, the p-value thresholds were set to be equal to simulation with r2 = 0.1.
The ROC curves for binary classification are depicted in Figure 4. The optimal PRS from European and Indian samples revealed AUC values of 0.5657 and 0.5441, respectively. In addition, for the multiethnic PRS construction methods, the optimal PRS of meta-analysis, TL-Multi, and PT-Multi resulted in AUC values of 0.5705, 0.5721, and 0.6427, respectively. We found that TL-Multi was superior to the all singe-discovery methods and meta-analysis. For binary classification, TL-Multi improved 5.15% in prediction accuracy compared to Lassosum for the Indian population, and meta-analysis improved 4.85% compared to Lassosum for the Indian population. We noted that PT-Multi performed better than ours. However, the comparison of the PT-Multi method with other methods might not be fair since PT-Multi required individual-level data, whereas the other four approaches solely relied on summary statistics. Moreover, access to individual-level data was typically difficult.
FIGURE 4. Receiver operating characteristic curve of Lassosum, meta-analysis, and TL-Multi in the analysis asthma study. Lassosum_Ind is Lassosum for the Indian population, and Lassosum_Eur is Lassosum for the European population. The corresponding AUC values with the optimal PRS of Lassosum for the Indian population and European population, meta-analysis, and TL-Multi are 0.5657, 0.5441, 0.5705, and 0.5721, respectively.
Additionally, the case prevalence of the bottom 2, 5, and 10% and top 2, 5, and 10% of PRS distribution, conducted by Lassosum, meta-analysis, and TL-Multi for Indian and European populations, is reported in Table 2. We observed that TL-Multi would also perform with more accuracy in terms of case prevalence than the competing methods.
TABLE 2. Case prevalence of 2, 5, and 10% for the top and bottom quantiles of the PRS distribution in the analysis of asthma study with the target Indian population and auxiliary European population, generated by Lassosum, meta-analysis, and TL-Multi.
4 Discussion
In this article, we proposed a novel approach named TL-Multi to improve the accuracy of PRS prediction for non-European populations. Our proposed method leverages summary statistics and makes complete use of all available data without clumping. We have shown that transferring the information from the informative auxiliary populations (for example, European) to the target populations (for example, East Asian) can indeed improve learning performance and the prediction accuracy of the target populations compared to the single-discovery methods. Particularly, TL-Multi shows a higher AUC compared to meta-analysis and PT-Multi in the analysis of SLE in the Hong Kong population. In our analysis of asthma in the Indian population, TL-Multi outperforms Lassosum and meta-analysis in terms of prediction performance and case prevalence prediction accuracy. Moreover, we noted that in the field of PRS prediction, there is no particular method that outperforms all the others. It depends on the specific situation to select an appropriate method. For instance, PRS-CS can always outperform PT (Huang et al., 2021), but PRS-CS may be inferior to PT (Weissbrod et al., 2022) in some circumstances. Therefore, we provided some potential circumstances in which TL-Multi would be an appropriate choice. First, TL-Multi is implemented using summary statistics and performs well under the moderate genetic architecture correlation (for example, ρ = 0.4 and 0.6) and moderate causal proportions (for example,
Compared to the single-discovery methods, we showed that the performance of TL-Multi was always more accurate with an acceptable running time (for example, 2 min) than the performance of Lassosum for the Hong Kong population, especially under moderate genetic correlation (for example, ρ = 0.6). When the sample size of the target data set is limited, increasing the sample size of the informative data set can enhance the prediction accuracy of TL-Multi. In the simulation studies, we found that the performances of meta-analysis and PT-Multi were dominated by the performance of Lassosum for the European population. As the genetic architecture correlation was rather high (ρ = 0.8), TL-Multi may perform poorly, and it would be more prudent to consider approaches that integrate the whole data set across distinct populations. Therefore, the performances of PT-Multi and meta-analysis were unsatisfactory, while the performance of the European population was worse than that of the Hong Kong population.
Another advantage of TL-Multi is its powerful transferability, which corrects the bias in estimation between European and non-European populations. De Candia et al. (2013) showed that the cross-population genetic correlation could leverage the causal effect sizes in different populations. In simulation studies, TL-Multi performed better when the genetic correlations were 0.4 and 0.6. It indicated that TL-Multi could be widely applied to two different populations which share some common genetic architecture information. Moreover, TL-Multi retained the pseudo validation proposed by Mak et al. (2017). It extended the application of TL-Multi to fit the data without a validation data set and phenotype data.
Despite these advantages, some limitations of TL-Multi still remain for future work. For example, if the difference between two populations is too enormous, our proposed approach’s assumptions will fail to hold. It is worth bearing in mind to deal with this scenario. In this article, we did not consider the X chromosome, whose information could also contribute to prediction accuracy (Tukiainen et al., 2014). In recent years, some approaches have fitted multiple diseases simultaneously (Maier et al., 2015; Turley et al., 2018; Chung et al., 2019; Musliner et al., 2019; Graff et al., 2021). These studies inspire us to investigate other TL-Multi extensions that bridge not only the gap between populations but also the gap between illnesses in the interim.
Data availability statement
The 1000 Genome project data were downloaded from https://www.internationalgenome.org/. The UK Biobank data were accessed under Application Number 58942. Request to access the dataset of SLE summary association statistics should be directed to the corresponding author YDZ, doraz@hku.hk.
Author contributions
YDZ conceived and supervised the study. PT and THC processed the data, implemented the software and conducted the analysis. YW and WY provided the lupus data and offered insights of the interpretation of the results. PT and YDZ wrote the first manuscript. All authors contributed to the revision of the manuscript.
Funding
This work was supported, in part, by the Hong Kong Research Grants Council (RGC) Early Career Scheme 2021/22 (project number 27305221).
Acknowledgments
The data analysis was conducted using data from the UK Biobank Resource accessed under Application Number 58942.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Berisa, T., and Pickrell, J. K. (2016). Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics 32, 283–285. doi:10.1093/bioinformatics/btv546
Brown, B. C., Asian Genetic Epidemiology Network Type 2 Diabetes Consortium, Ye, C. J., Price, A. L., and Zaitlen, N. (2016). Transethnic genetic-correlation estimates from summary statistics. Am. J. Hum. Genet. 99, 76–88. doi:10.1016/j.ajhg.2016.05.001
Bulik-Sullivan, B. K., Loh, P.-R., Finucane, H. K., Ripke, S., Yang, J., Patterson, N., et al. (2015). Ld score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295. doi:10.1038/ng.3211
Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M., Lee, J. J., et al. (2015). Second-generation plink: Rising to the challenge of larger and richer datasets. Gigascience 4, 7. doi:10.1186/s13742-015-0047-8
Chatterjee, N., Shi, J., and García-Closas, M. (2016). Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat. Rev. Genet. 17, 392–406. doi:10.1038/nrg.2016.27
Chung, W., Chen, J., Turman, C., Lindstrom, S., Zhu, Z., Loh, P.-R., et al. (2019). Efficient cross-trait penalized regression increases prediction accuracy in large cohorts using secondary phenotypes. Nat. Commun. 10, 569. doi:10.1038/s41467-019-08535-0
Consortium, I. S. (2009). Common polygenic variation contributes to risk of schizophrenia that overlaps with bipolar disorder. Nature 460, 748.
Coram, M. A., Fang, H., Candille, S. I., Assimes, T. L., and Tang, H. (2017). Leveraging multi-ethnic evidence for risk assessment of quantitative traits in minority populations. Am. J. Hum. Genet. 101, 638. doi:10.1016/j.ajhg.2017.09.005
De Candia, T. R., Lee, S. H., Yang, J., Browning, B. L., Gejman, P. V., Levinson, D. F., et al. (2013). Additive genetic variation in schizophrenia risk is shared by populations of african and European descent. Am. J. Hum. Genet. 93, 463–470. doi:10.1016/j.ajhg.2013.07.007
De Los Campos, G., Gianola, D., and Allison, D. B. (2010). Predicting genetic predisposition in humans: The promise of whole-genome markers. Nat. Rev. Genet. 11, 880–886. doi:10.1038/nrg2898
Ge, T., Chen, C.-Y., Ni, Y., Feng, Y.-C. A., and Smoller, J. W. (2019). Polygenic prediction via bayesian regression and continuous shrinkage priors. Nat. Commun. 10, 1776. doi:10.1038/s41467-019-09718-5
Graff, R. E., Cavazos, T. B., Thai, K. K., Kachuri, L., Rashkin, S. R., Hoffman, J. D., et al. (2021). Cross-cancer evaluation of polygenic risk scores for 16 cancer types in two large cohorts. Nat. Commun. 12, 970. doi:10.1038/s41467-021-21288-z
Huang, H., Ruan, Y., Feng, Y.-C. A., Chen, C.-Y., Lam, M., Sawa, A., et al. (2021). Improving polygenic prediction in ancestrally diverse populations. Nat. Genet. 54 (5), 573–580. doi:10.1038/s41588-022-01054-7
Julià, A., López-Longo, F. J., Venegas, J. J. P., Bonàs-Guarch, S., Olivé, À., Andreu, J. L., et al. (2018). Genome-wide association study meta-analysis identifies five new loci for systemic lupus erythematosus. Arthritis Res. Ther. 20, 100. doi:10.1186/s13075-018-1604-1
Khera, A. V., Chaffin, M., Aragam, K. G., Haas, M. E., Roselli, C., Choi, S. H., et al. (2018). Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224. doi:10.1038/s41588-018-0183-z
Kim, M. S., Patel, K. P., Teng, A. K., Berens, A. J., and Lachance, J. (2018). Genetic disease risks can be misestimated across global populations. Genome Biol. 19, 179. doi:10.1186/s13059-018-1561-7
Li, S., Cai, T. T., and Li, H. (2022). Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. J. R. Stat. Soc. Ser. B Methodol. 84 (1), 149–173. doi:10.1111/rssb.12479
Maier, R., Moser, G., Chen, G.-B., Ripke, S., Absher, D., Agartz, I., et al. (2015). Joint analysis of psychiatric disorders increases accuracy of risk prediction for schizophrenia, bipolar disorder, and major depressive disorder. Am. J. Hum. Genet. 96, 283–294. doi:10.1016/j.ajhg.2014.12.006
Mak, T. S. H., Porsch, R. M., Choi, S. W., Zhou, X., and Sham, P. C. (2017). Polygenic scores via penalized regression on summary statistics. Genet. Epidemiol. 41, 469–480. doi:10.1002/gepi.22050
Márquez-Luna, C., Loh, P.-R., Consortium, S. A. T., Consortium, S. T., and Price, A. L. (2017). Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genet. Epidemiol. 41, 811–823. doi:10.1002/gepi.22083
Morris, D. L., Sheng, Y., Zhang, Y., Wang, Y.-F., Zhu, Z., Tombleson, P., et al. (2016). Genome-wide association meta-analysis in Chinese and European individuals identifies ten new loci associated with systemic lupus erythematosus. Nat. Genet. 48, 940–946. doi:10.1038/ng.3603
Moser, G., Lee, S. H., Hayes, B. J., Goddard, M. E., Wray, N. R., Visscher, P. M., et al. (2015). Simultaneous discovery, estimation and prediction analysis of complex traits using a bayesian mixture model. PLoS Genet. 11, e1004969. doi:10.1371/journal.pgen.1004969
Musliner, K. L., Mortensen, P. B., McGrath, J. J., Suppli, N. P., Hougaard, D. M., Bybjerg-Grauholm, J., et al. (2019). Association of polygenic liabilities for major depression, bipolar disorder, and schizophrenia with risk for depression in the Danish population. JAMA psychiatry 76, 516–525. doi:10.1001/jamapsychiatry.2018.4166
Peterson, R. E., Kuchenbaecker, K., Walters, R. K., Chen, C.-Y., Popejoy, A. B., Periyasamy, S., et al. (2019). Genome-wide association studies in ancestrally diverse populations: Opportunities, methods, pitfalls, and recommendations. Cell 179, 589–603. doi:10.1016/j.cell.2019.08.051
Popejoy, A. B., and Fullerton, S. M. (2016). Genomics is failing on diversity. Nature 538, 161–164. doi:10.1038/538161a
Privé, F., Aschard, H., Ziyatdinov, A., and Blum, M. G. (2018). Efficient analysis of large-scale genome-wide data with two r packages: Bigstatsr and bigsnpr. Bioinformatics 34, 2781–2787. doi:10.1093/bioinformatics/bty185
Shi, H., Burch, K. S., Johnson, R., Freund, M. K., Kichaev, G., Mancuso, N., et al. (2020). Localizing components of shared transethnic genetic architecture of complex traits from gwas summary data. Am. J. Hum. Genet. 106, 805–817. doi:10.1016/j.ajhg.2020.04.012
Shi, H., Gazal, S., Kanai, M., Koch, E. M., Schoech, A. P., Siewert, K. M., et al. (2021). Population-specific causal disease effect sizes in functionally important regions impacted by selection. Nat. Commun. 12, 1098. doi:10.1038/s41467-021-21286-1
Shi, J., Park, J.-H., Duan, J., Berndt, S. T., Moy, W., Yu, K., et al. (2016). Winner’s curse correction and variable thresholding improve performance of polygenic risk modeling based on genome-wide association study summary-level data. PLoS Genet. 12, e1006493. doi:10.1371/journal.pgen.1006493
Speed, D., and Balding, D. J. (2014). Multiblup: Improved snp-based prediction for complex traits. Genome Res. 24, 1550–1557. doi:10.1101/gr.169375.113
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58, 267–288. doi:10.1111/j.2517-6161.1996.tb02080.x
Tukiainen, T., Pirinen, M., Sarin, A.-P., Ladenvall, C., Kettunen, J., Lehtimäki, T., et al. (2014). Chromosome x-wide association study identifies loci for fasting insulin and height and evidence for incomplete dosage compensation. PLoS Genet. 10, e1004127. doi:10.1371/journal.pgen.1004127
Turley, P., Walters, R. K., Maghzian, O., Okbay, A., Lee, J. J., Fontana, M. A., et al. (2018). Multi-trait analysis of genome-wide association summary statistics using mtag. Nat. Genet. 50, 229–237. doi:10.1038/s41588-017-0009-4
Vilhjálmsson, B. J., Yang, J., Finucane, H. K., Gusev, A., Lindström, S., Ripke, S., et al. (2015). Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am. J. Hum. Genet. 97, 576–592. doi:10.1016/j.ajhg.2015.09.001
Wang, Y.-F., Zhang, Y., Lin, Z., Zhang, H., Wang, T.-Y., Cao, Y., et al. (2021). Identification of 38 novel loci for systemic lupus erythematosus and genetic heterogeneity between ancestral groups. Nat. Commun. 12, 772. doi:10.1038/s41467-021-21049-y
Keywords: GWAS, polygenic risk score, transfer learning, multiethnic populations, systemic lupus erythematosus, asthma
Citation: Tian P, Chan TH, Wang Y-F, Yang W, Yin G and Zhang YD (2022) Multiethnic polygenic risk prediction in diverse populations through transfer learning. Front. Genet. 13:906965. doi: 10.3389/fgene.2022.906965
Received: 29 March 2022; Accepted: 27 June 2022;
Published: 19 August 2022.
Edited by:
Ryan Sun, University of Texas MD Anderson Cancer Center, United StatesCopyright © 2022 Tian, Chan, Wang, Yang, Yin and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Yan Dora Zhang, doraz@hku.hk