Machine learning in computational NMR-aided structural elucidation

Cortés, Iván; Cuadrado, Cristina; Hernández Daranas, Antonio; Sarotti, Ariel M.

doi:10.3389/fntpr.2023.1122426

REVIEW article

Front. Nat. Prod., 27 January 2023

Sec. Structural and Stereochemical Analysis

Volume 2 - 2023 | https://doi.org/10.3389/fntpr.2023.1122426

This article is part of the Research Topic: Insights in Structural and Stereochemical Analysis: 2022

Iván Cortés¹†

Cristina Cuadrado²†

Antonio Hernández Daranas²*

Ariel M. Sarotti¹*

¹Instituto de Química Rosario (CONICET), Facultad de Ciencias Bioquímicas y Farmacéuticas, Universidad Nacional de Rosario, Rosario, Argentina
²Instituto de Productos Naturales y Agrobiología, Consejo Superior de Investigaciones Científicas (IPNA-CSIC), San Cristóbal de La Laguna, Spain

Structure elucidation is a stage of paramount importance in the discovery of novel compounds because molecular structure determines their physical, chemical and biological properties. Computational prediction of spectroscopic data, mainly NMR, has become a widely used tool to help in such tasks due to its increasing ease of use and reliability. However, despite the continuous growth in CPU power, classical quantum mechanics simulations remain computationally demanding, so simulations of large or conformationally complex molecules are often impractical. In this context, a growing number of research groups have explored the capabilities of machine learning (ML) algorithms in computational NMR prediction. In parallel, important advances have been made in the development of machine learning-inspired methods to correlate the experimental and calculated NMR data to facilitate the structural elucidation process. Here, we have selected some essential papers to review this research area and propose conclusions and future perspectives for the field.

1 Introduction

Determination of molecular structure is one of the most complex and important stages in the discovery of natural products. Traditionally, this has been done using various spectroscopic techniques, NMR being the most important and decisive one (Gil, 2011). However, even with the advent of increasingly powerful equipment and new multidimensional experiments (Liu et al., 2019), structural misassignment is far from being eradicated, and unfortunately persists in current literature (Nicolaou and Snyder, 2005; Chhetri et al., 2018). In this context, computational chemistry has been synergistically coupled with experimental NMR, giving rise to a huge variety of hybrid methods that greatly facilitate elucidation (Napolitano et al., 2011; Gutiérrez-Cepeda et al., 2014). From earlier contributions of Bifulco, Bagno, and Saielli (Barone et al., 2002a; Barone et al., 2002b; Bagno et al., 2006), to the explosion in the post-DP4 era (Smith and Goodman, 2010; Grimblat et al., 2015; Ermanis et al., 2017; Grimblat et al., 2019), the process has undergone continuous improvements (Lodewyk et al., 2012a; Grimblat and Sarotti, 2016; Lauro and Bifulco, 2020; Marcarino et al., 2020; Costa et al., 2021; Marcarino et al., 2022). Among them, perhaps one of the most disruptive has been the implementation of machine learning (ML), which has undoubtedly revolutionized molecular simulation (Noé et al., 2020). Application of machine learning to computational NMR in the context of structural elucidation can be roughly divided into 2 main categories, namely, prediction and correlation (Figure 1). In the first, ML is used to obtain quantum-quality NMR chemical shifts at a remarkably lower computational cost. In the second category, ML facilitates the correlation between experimental and calculated data in order to determine the most likely structures. In this minireview, the latest developments in both approaches are discussed, focusing on methods that combine ML with quantum NMR calculations. 
For other applications of ML to NMR, including molecular phenotyping and clustering (Peng et al., 2020a; Peng, 2021; dos Santos et al., 2022; Peng et al., 2020b), among others, we refer the interested reader to other recent reviews on the subject (Cobas, 2019; Jonas et al., 2022). We begin here with a description of the most important ML procedures for NMR prediction (Section 2) and then discuss exciting examples of ML in data correlation (Section 3). A final conclusion and future perspectives are also provided (Section 4).

FIGURE 1. Schematic representation of the use of ML in computational NMR-aided structural elucidation.

2 NMR prediction

There are many reasons to calculate NMR chemical shifts with high accuracy. For example, NMR simulations can be helpful during spectroscopic assignment; that is, to determine which NMR signal belongs to which nucleus in the molecule. They can also provide insightful information in mechanistic and biogenetic studies (Cen-Pacheco et al., 2021; Li et al., 2021; Simonetti et al., 2021), conformational analysis (Domínguez et al., 2014a; Nguyen et al., 2018; Li et al., 2020; Sosa-Rueda et al., 2021), structural revisions (Lodewyk et al., 2012b; Cen-Pacheco et al., 2012; Sarotti, 2020), and structural elucidation (perhaps one of the most widely used applications today) (Napolitano et al., 2009; Cen-Pacheco et al., 2013; Domínguez et al., 2014b; Cen-Pacheco et al., 2014; Marcarino et al., 2020; Wang et al., 2020; Domínguez et al., 2021; Zanardi and Sarotti, 2021; Marcarino et al., 2022). Empirical approaches, such as additive methods based on the cumulative effect of substituents (Fürst and Pretsch, 1990), are the fastest alternatives. However, the quality of those predictions can be insufficient for some applications, and more refined methods have been developed to enhance predictive performance. For instance, the hierarchically ordered spherical description of environments (HOSE) encodes the neighborhood of each atom from a 2D representation of the molecule (Bremser, 1978; Jonas et al., 2022), achieving good predictive accuracies (mean absolute errors, MAE, of 1.7 ppm for ¹³C and 0.2 ppm for ¹H) (Smurnyy et al., 2008). Using graph neural networks (GNN), Jonas and Kuhn developed an ML model based on 2D molecular connectivity with good results (1.43 ppm for ¹³C and 0.28 ppm for ¹H) (Jonas and Kuhn, 2019). In spite of these excellent results, most empirical approaches ignore stereochemical and conformational effects, which limits their ability to account for the impact of geometrical factors on chemical shifts. 
At this point, it is important to highlight that differentiating between similar structures requires highly accurate predictions, such as those provided by NMR calculations at the density functional theory (DFT) level. The main drawback of such approaches is their high computational cost, in terms of both resources and time. Computation time grows rapidly with the number of atoms, so the description of large systems often becomes prohibitive. One of the best ways to reduce computational cost while maintaining predictive capacity is to employ ML-based schemes. The following methods are discussed as typical examples.

2.1 ShiftML

Solid-state NMR is a powerful tool for analyzing powdered and amorphous solids at the atomic level, with great utility in pharmaceutical sciences. The discipline has been revolutionized by the advent of quantum methods to calculate chemical shifts with high accuracy. Chemical shift-based NMR crystallography has therefore become a popular strategy to identify polymorphs, and also for de novo determination of crystal structures from powders (Facelli and Grant, 1993). This has been enabled by plane wave DFT methods developed for periodic systems, based on the gauge-including projector augmented wave (GIPAW) approach, which provides good accuracy in describing the local atomic environments (Blöchl, 1994). However, computational cost is again prohibitive for large systems and/or for higher levels of theory. In this seminal work, Paruzzo and co-workers developed an ML method based on local environments to predict chemical shifts of molecular solids with an accuracy comparable to DFT (Paruzzo et al., 2018). Due to limitations in experimental databases for NMR of solids, the authors decided to train and validate their ML model using GIPAW DFT-calculated chemical shifts of a wide variety of structures taken from the Cambridge Structural Database (CSD) (Groom et al., 2016). Bypassing the experimental information has several advantages, which include avoiding biased or incorrectly reported data, as well as offering an unlimited number of virtual structures. In fact, this practical idea has been replicated by subsequent studies, as detailed below. From an initial database of 61,000 structures, the authors selected 2,000 diverse molecules for training, and 500 for validation. The first subset was selected using the farthest point sampling (FPS) algorithm, and the second one randomly. 
The NMR properties of the resulting structures were calculated at the GIPAW DFT level, and the local environments were described with the smooth overlap of atomic positions (SOAP) kernel (Bartók et al., 2013; De et al., 2016). This approach encodes the atomic environment as a 3D neighborhood density defined by a superposition of Gaussians, one centered on each atom located within a spherical boundary around the core atom. The ML model was trained using the Gaussian process regression (GPR) framework, which had previously performed well when coupled with SOAP (Bartók et al., 2022) (Figure 2). Once trained, ShiftML was highly accurate, particularly for ¹H (MAE 0.49 ppm). The other nuclei showed larger errors relative to DFT (4.3 ppm for ¹³C, 13.3 ppm for ¹⁵N, and 17.7 ppm for ¹⁷O). According to the authors, this was due to the significantly smaller number of training environments for heteronuclei than for ¹H. Nevertheless, the reduction in computational cost is remarkable (less than 1 min vs. 62–150 CPU hours), showing that useful predictions can be obtained in a small fraction of the time.
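
The farthest point sampling step used to pick a diverse training subset can be illustrated with a minimal sketch. It assumes that each structure has already been reduced to a numeric descriptor vector and that Euclidean distance is an acceptable measure of dissimilarity. Both are simplifications made here for illustration: the function name and the toy data are hypothetical, and the actual ShiftML selection is not reproduced.

```python
import math

def farthest_point_sampling(points, n_select, start=0):
    """Greedy farthest point sampling over descriptor vectors.

    points   : list of equal-length numeric tuples (hypothetical descriptors)
    n_select : number of points to pick
    start    : index of the first pick (an arbitrary choice)
    """
    if not 0 < n_select <= len(points):
        raise ValueError("n_select must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n_select:
        # The next pick is the point farthest from the current selection.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected

# Toy 2D descriptors: three points cluster near the origin, two lie far away.
toy = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 5.0), (0.0, 0.2)]
picked = farthest_point_sampling(toy, 3)
print(picked)  # [0, 3, 2]
```

In this toy case the two near-duplicates of the starting point are skipped in favor of the distant points, which is the behavior that makes the approach attractive for assembling a structurally diverse training set.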

FIGURE 2. Scheme of the ML model used in ShiftML. Reproduced from Ref (Paruzzo et al., 2018).

2.2 IMPRESSION

A few years after ShiftML, Craig Butts and co-workers developed their first-generation solution-state NMR prediction machine, named IMPRESSION (Intelligent Machine PREdiction of Shift and Scalar Information Of Nuclei) (Gerrard et al., 2020). Inspired by ShiftML, training and validation were done using DFT-predicted values rather than scarce and potentially misassigned experimental data. The key step of selecting the example molecules followed an interesting adaptive sampling procedure starting from a superset of 75,382 structures taken from the CSD, restricted to structures comprising only C, H, N, O, and F atoms. The active learning sampling started from 100 randomly selected structures, and the trained ML model then predicted the NMR shifts of the remaining structures in the superset. The 100 structures with the highest prediction variance (after a 5-fold cross validation) were added to the training set, and the procedure was iterated, leading to 882 final structures. Each structure was submitted to geometry optimization at the mPW1PW91/6-311G** level, followed by NMR calculations at the ωB97XD/6-311G** level. The training procedure applied kernel ridge regression (KRR) (Vu et al., 2015), with three different methods to encode similarity between atomic environments: CM (Coulomb matrices) (Rupp et al., 2015), aSLATM, and FCHL (Faber et al., 2018). As expected, better results were obtained with aSLATM and FCHL (which involve 3-body interactions) than with CM (2-body interactions). The FCHL method was selected for its minimal computational cost. After optimization, IMPRESSION achieved MAEs of 0.23 ppm/2.45 ppm/0.87 Hz for ¹H/¹³C/¹J_CH predictions and root mean squared errors (RMSE) of 0.35 ppm/3.88 ppm/1.39 Hz against the validation set. These results were considerably better than with ShiftML (Paruzzo et al., 2018), confirming that selection of environments and training models are fundamental elements in the ML process. 
Nevertheless, the authors detected some chemical environments in the test set (around 2.5% of the total) that were not successfully emulated by IMPRESSION, with errors up to 11 ppm (¹H), 63 ppm (¹³C) and 25 Hz (J). To overcome this problem, a “prediction variance filter” was applied to improve the quality of the results by removing poorly described environments that show high variance across a 5-fold cross validation. With this modification, the accuracy of IMPRESSION relative to DFT became comparable with that of DFT relative to experiment. However, note that this version of IMPRESSION only accelerated NMR prediction; it still required DFT-optimized structures, which take from hours to days to obtain, depending on the system. The authors recognized that this could be improved by using 3D structures derived directly from molecular mechanics, with a resultant time saving. When IMPRESSION was re-trained under this modification, the average errors increased ∼30%–50% for ¹H and J data, whereas ¹³C data remained almost insensitive. The method was successfully tested in the prediction of experimental NMR data, and in structural discrimination.
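
The idea behind variance-driven sampling, and behind the prediction variance filter, can be sketched with a toy kernel ridge regression written with the standard library only. This is a sketch under stated assumptions rather than the IMPRESSION implementation: the one-dimensional descriptors, the Gaussian kernel, the hyperparameter values and all function names are hypothetical stand-ins for the FCHL representation and the real training data.

```python
import math
import statistics

def gaussian_kernel(a, b, sigma):
    return math.exp(-math.dist(a, b) ** 2 / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def krr_fit(X, y, lam, sigma):
    """Kernel ridge regression: solve (K + lam*I) alpha = y."""
    K = [[gaussian_kernel(a, b, sigma) + (lam if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    return solve(K, y)

def krr_predict(X_train, alpha, x, sigma):
    return sum(a * gaussian_kernel(xt, x, sigma) for a, xt in zip(alpha, X_train))

def cv_variance(X, y, pool, k, lam, sigma):
    """Variance of pool predictions over k models, each trained with one fold left out."""
    folds = [list(range(i, len(X), k)) for i in range(k)]
    preds = []
    for held_out in folds:
        keep = [i for i in range(len(X)) if i not in held_out]
        Xk = [X[i] for i in keep]
        alpha = krr_fit(Xk, [y[i] for i in keep], lam, sigma)
        preds.append([krr_predict(Xk, alpha, p, sigma) for p in pool])
    return [statistics.pvariance(col) for col in zip(*preds)]

def select_most_uncertain(variances, n_add):
    """Indices of the n_add pool entries with the highest variance."""
    order = sorted(range(len(variances)), key=lambda i: variances[i], reverse=True)
    return order[:n_add]

# Hypothetical 1D descriptors with a smooth target standing in for DFT shifts.
X_train = [(float(i),) for i in range(10)]
y_train = [math.sin(x[0]) for x in X_train]
pool = [(2.5,), (4.5,), (10.5,), (11.0,)]
variances = cv_variance(X_train, y_train, pool, k=5, lam=1e-6, sigma=1.0)
to_add = select_most_uncertain(variances, n_add=2)
print(variances, to_add)
```

In an adaptive sampling loop the selected pool entries would be labeled, appended to the training set and the procedure repeated. The same per-environment variance can instead be compared with a threshold to discard predictions that the model is unlikely to describe well.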

2.3 CASCADE

Paton’s group developed an ML model (Guan et al., 2021) to tackle the usual difficulties in predicting ¹H and ¹³C chemical shifts, namely computation time and reaching the accuracy required to select the correct structure from among several candidates. For this purpose, a huge amount of experimental data is necessary, but such data are not always easily accessible, complete, well assigned or parseable. Therefore, to overcome these problems the authors decided to use their own dataset obtained from DFT calculations to train a neural network (NN). However, such an approach is restricted by the DFT methodology used (basis set and solvation model, among others). To address these limitations, they used a transfer learning (TL) approach (Taylor and Stone, 2009). To that end, they developed three GNN models (St. John et al., 2019), namely DFTNN, ExpNN-dft and ExpNN-ff. The architectures and hyperparameters are identical for all of these, but they were developed using different input structures (Figures 3A,B). DFTNN used a vast array of structurally diverse organic molecules from the DFT8K dataset. Their NMR chemical shifts were calculated at the mPW1PW91/6-311+G(d,p) level of theory on geometries optimized at the M06-2X/def2-TZVP level, and were subsequently used to train the GNN (¹H and ¹³C separately). However, a weak point of this first approach was that the neural network was only trained against DFT-calculated data. To address this problem, the authors used TL to retrain the DFTNN model against experimental NMR data from the Exp5K dataset, creating a new model named ExpNN-dft. However, this model still needed structure optimization, which caused a performance bottleneck. The authors’ final answer was the ExpNN-ff model, in which ExpNN-dft was retrained after replacing the starting geometries with those obtained directly from molecular mechanics conformational searches using the MMFF94 force field. This replacement drastically reduced CPU time. 
The ExpNN-ff model was tested with good results in three different applications: i) structure elucidation by comparison between predicted and experimental NMR data, ii) NMR data reassignment and iii) prediction of the regioselectivity of electrophilic aromatic substitution using simulated NMR data as descriptors. Moreover, the model differentiated between stereoisomers and even showed distinct predictions for different conformations of the same molecule. Differences between GNN-predicted NMR chemical shifts and those obtained from DFT calculations resulted in mean absolute errors (MAE) of 1.26 ppm for ¹³C and 0.16 ppm for ¹H. Importantly, ExpNN-ff showed an accuracy comparable to standard DFT calculations but with a 6000-fold reduction in CPU time. Therefore, this model can perform NMR data predictions for large flexible structures that are unfeasible for DFT calculations (Daranas and Sarotti, 2021). According to the authors, there is still room for improvement. Their main concern is the dependency of the outcome on the quality of the input candidate 3D structures obtained in the molecular mechanics conformational search step. Recent work on this issue confirms the importance of this stage (Cuadrado et al., 2022). Thus, the authors suggest the use of semi-empirical structures as an alternative. It is worth noting that they make this tool easily available through a web page (http://nova.chem.colostate.edu/cascade/) to perform chemical shift predictions.
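
Because the predictions are conformer-specific, they have to be combined into a single set of shifts before comparison with experiment, which is why the quality of the conformational search matters so much. A common way to do this is Boltzmann weighting by relative energy. The sketch below assumes energies in kcal/mol and a temperature of 298.15 K; the function names and numbers are hypothetical, and the exact averaging scheme used by CASCADE is not specified in the text above.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)

def boltzmann_weights(energies_kcal, temperature=298.15):
    """Normalized Boltzmann populations from relative conformer energies."""
    e_min = min(energies_kcal)
    factors = [math.exp(-(e - e_min) / (R_KCAL * temperature)) for e in energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]

def averaged_shifts(conformer_shifts, energies_kcal, temperature=298.15):
    """Population-weighted chemical shift for each atom.

    conformer_shifts : one list of per-atom shifts (ppm) per conformer
    energies_kcal    : one energy per conformer, in the same order
    """
    weights = boltzmann_weights(energies_kcal, temperature)
    n_atoms = len(conformer_shifts[0])
    return [sum(w * shifts[a] for w, shifts in zip(weights, conformer_shifts))
            for a in range(n_atoms)]

# Two hypothetical conformers, the second lying 1 kcal/mol above the first.
shifts = [[10.0, 20.0], [12.0, 22.0]]
weights = boltzmann_weights([0.0, 1.0])
print(weights)
print(averaged_shifts(shifts, [0.0, 1.0]))
```

An error of 1 kcal/mol in the relative energies is enough to shift the populations substantially at room temperature, so inaccurate force field energies or missing conformers propagate directly into the averaged shifts.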

FIGURE 3. Scheme of the ML model used in CASCADE. Reproduced from Ref (Guan et al., 2021). With permission from the Royal Society of Chemistry.

2.4 ML-J-DP4

In 2019, Hernández Daranas, Sarotti, and co-workers reported J-DP4 (Grimblat et al., 2019), an updated version of DP4 (Smith and Goodman, 2010) that incorporates J values into the method’s architecture in two different ways. First, the J values are used to restrict conformational sampling, keeping only those structures with dihedral angles in agreement with the experimental data. This not only reduces computational cost considerably, but also improves the description of the conformational landscape by discarding spurious conformations that might otherwise make high Boltzmann contributions. Next, the remaining conformers are submitted to chemical shift and J calculations at the DFT level. The J calculations include the Fermi contact (FC) term, and the resulting values are correlated with the experimental ones through an additional Bayesian component that accounts for the probability term given by ³J_HH. Despite the excellent results obtained, J-DP4 was computationally costly. We accelerated it with a new workflow in 2022 (Tsai et al., 2022), involving fast Karplus-type J calculations (Navarro-Vázquez et al., 2018). These were coupled with NMR chemical shift predictions at the inexpensive HF/STO-3G level, enhanced by machine learning (ML). The decision to use a hybrid representation of the molecular environments was inspired by the work of Beran and co-workers (Unzueta et al., 2021). This representation included the isotropic shielding constants computed at the very fast HF/STO-3G//MMFF level coupled with local descriptors. That work demonstrated that a Δ-ML approach can be highly accurate. In Δ-ML, the chemical shifts calculated at a lower level (PBE0/6-31G//ωB97XD) are corrected to the quality of a higher level (PBE0/6-311+G(2d,p)//ωB97XD) by artificial neural networks that use atomic environment vectors (AEV) to encode the geometric data of the atoms. 
Based on this background, we hypothesized that the negligible additional cost involved in running NMR calculations at a fast quantum level would be justified by the quality of the NMR predictions, suitable for stereochemical discrimination. The workflow (Figure 4) involved selecting 27,000 diverse structures by computing the Morgan fingerprint (Rogers and Hahn, 2010) of the 170,000 original structures taken from COCONUT (Sorokina et al., 2021) and then using the MinMax algorithm to pick the most diverse of them. The data were randomly divided into 17,000 molecules for training (232,560 ¹³C and 280,446 ¹H values, T17k set) and 10,000 molecules for validation (150,760 ¹³C and 183,612 ¹H values, V10k set). In this hybrid approach, the GIAO NMR shielding constants computed at the HF/STO-3G//MMFF level were complemented with different geometric and electronic descriptors that capture the local environments, including charges, hybridizations, distances, angles, long-range interactions (Coulomb and tensorial matrices, etc). As in IMPRESSION (Gerrard et al., 2020), KRR was the data correlation strategy that afforded the best results. However, the main difference was that the adaptive training used the individual environments (rather than individual molecules) that maximized the predictive capacity of the ML. This was supported by the fact that NMR properties are local in nature, so it is not considered mandatory to use all environments from a test molecule, but rather the most important ones. After selecting the most influential descriptors, adaptive learning was conducted to select the best set of environments based on a 25-step iterative incorporation of the 1,000 worst-predicted environments provided by the previous training set. 
The hyperparameters of the resulting machines (composed of 25,000 selected ¹³C and ¹H environments) were further optimized using a 5-fold cross-validation approach, and the optimal machines were tested against the independent test set (V10k, 150,760 ¹³C and 183,612 ¹H values). The predictions were highly satisfactory, with MAE of 1.21/0.14 ppm, RMS of 1.63/0.19 ppm, and MaxErr of 20.74/1.89 ppm for ¹³C and ¹H data, respectively. These results compare very favorably with those obtained with the other recent ML approaches discussed above. It is true that this approach requires quantum computation of isotropic shielding values, but we consider that the quality of the results justifies that extremely low additional cost. The entire process was automated in the form of a Python script and released under the open-source MIT license at https://github.com/Sarotti-Lab/ML_J_DP4.
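
The use of coupling constants to prune conformers can be illustrated with a generic Karplus-type relation, ³J(θ) = A cos²θ + B cos θ + C. The following sketch is only an illustration under stated assumptions: the coefficients are placeholder values, the tolerance is arbitrary, and the function and key names are hypothetical. The parameterization actually used in J-DP4 and ML-J-DP4 is not reproduced here.

```python
import math

def karplus_j(theta_deg, a=7.76, b=-1.10, c=1.40):
    """Generic Karplus-type 3J_HH (Hz) for an H-C-C-H dihedral angle in degrees.

    The default coefficients are placeholders chosen for illustration only.
    """
    t = math.radians(theta_deg)
    return a * math.cos(t) ** 2 + b * math.cos(t) + c

def filter_conformers(conformer_dihedrals, j_exp, tol=2.0):
    """Keep conformers whose estimated couplings all lie within tol Hz of experiment.

    conformer_dihedrals : list of dicts mapping a coupling label to a dihedral (degrees)
    j_exp               : dict mapping the same labels to experimental 3J values (Hz)
    """
    kept = []
    for idx, dihedrals in enumerate(conformer_dihedrals):
        if all(abs(karplus_j(dihedrals[label]) - j) <= tol for label, j in j_exp.items()):
            kept.append(idx)
    return kept

# Three hypothetical conformers differing in one H-C-C-H dihedral angle.
conformers = [{"H2-H3": 180.0}, {"H2-H3": 60.0}, {"H2-H3": 90.0}]
experimental = {"H2-H3": 10.0}
print(filter_conformers(conformers, experimental))  # [0]
```

With a large experimental coupling, only the anti arrangement survives the filter, so the gauche and orthogonal conformers never reach the more expensive shielding calculations.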

FIGURE 4. Scheme of the ML model used in ML-J-DP4. Reprinted with permission from (Tsai et al., 2022). Copyright (2022) American Chemical Society.

2.5 DU8ML

Novitskiy and Kutateladze (2022a) started from their previously developed DU8+ hybrid DFT-parametric method (Kutateladze and Reddy, 2017). DU8+ incorporates binomial correction functions to improve the calculation of NMR parameters of carbons attached to heavy atoms. According to the authors, their approach was the seed of what was later called ML-augmented DFT (Gao et al., 2020). By adding ML methods, they developed an augmented version of DU8+ called DU8ML, which calculates NMR chemical shifts and spin-spin coupling constants (SSCC) of large natural products with high accuracy and in short computational times. The root mean square deviations (RMSD) calculated for correct structures were 0.95 ppm for ¹³C (11,000 values were used as the training set) and 0.28 Hz for SSCC (from 4,000 experimental values). To enhance accuracy, molecular fragments from these datasets showing the highest deviations were selected as the ML training set (using experimental chemical shifts and SSCC) to identify and correct any inconsistencies. Specifically, the first step consisted of geometry optimization followed by Gaussian calculations of nuclear magnetic shieldings at the ωB97XD/6-31G(d) PCM level and Fermi contacts at the B3LYP/DU8 level. This step was followed by the necessary ML-derived corrections for both NMR parameters. The authors present several examples to demonstrate the applicability of DU8ML. In most cases, ¹³C chemical shift RMSD values were used to detect misassignments. The selected examples illustrate problems related to incorrect atom connectivities, the type of substituents selected or flipped fused rings. However, not only incorrect 2D assignments were addressed, but also stereochemical ones. The latter are much more challenging and continue to be the most common source of errors in structural elucidation. Thus, inversion of stereoconfigurations (including an N-oxide), fused rings, and the tricky epoxide rings were tackled. 
Moreover, the authors also show the usefulness of the method in detecting wrong assignments due to molecule protonation by NMR solvents, SSCC, and disagreements between NMR and X-ray or mass spectrometry data. They also introduce a novel application of DU8ML that amends a previously proposed reaction mechanism by correcting the assignment of structures involved in the process (Novitskiy and Kutateladze, 2022b). Some aspects that could improve the workflow, as discussed in the paper, are the design of a fully automated program for every kind of molecule and the addition of a probability calculation for each candidate structure.

3 Data correlation

The accuracy of quantum NMR calculations using affordable levels of theory can be more than enough to differentiate very different structures, as in the case of constitutional isomers. However, for stereoisomers the situation becomes more challenging because of their spectroscopic resemblance. For that reason, in addition to the advances made in improving and accelerating NMR predictions, the development of robust data correlation methods is essential (Grimblat and Sarotti, 2016). To date, a wide variety of methods have been reported (Grimblat and Sarotti, 2016; Lauro and Bifulco, 2020; Marcarino et al., 2020; Costa et al., 2021; Marcarino et al., 2022). In this section, state-of-the-art ML-based methods will be discussed.

3.1 ANN-PRA

These methods (Sarotti, 2013; Zanardi and Sarotti, 2015) were developed by Sarotti’s group to tackle the structural validation problem; that is, to decide the correctness of a structural proposal based on the correlation between the experimental NMR data collected for a given molecule and the theoretical chemical shifts calculated for it. At that time, the leading strategy in DFT-based structural elucidation was a direct comparison between potential candidates (for example, CP3 or DP4) (Smith and Goodman, 2009; Smith and Goodman, 2010). Regardless of the performance of each method, the underlying hypothesis assumes that the correct structure is included within the candidate set. The approach followed to detect a potential structural misassignment using a single set of experimental and calculated data was based on pattern recognition analysis (PRA), with the aid of artificial neural networks (ANNs). The latter are mathematical models in which interconnected artificial neurons emulate the function of a biological brain and are able to learn from data. The proof-of-concept was based on one-dimensional ¹³C NMR data correlation with the aid of two-layer feed-forward ANNs, using a test set of 200 structures. Different descriptors were assessed as reference standards, including R², MAE and maximum error (MaxErr), each computed using TMS and MSTD (multi-standard approach) (Sarotti and Pellegrinet, 2009; Sarotti and Pellegrinet, 2012). A large number of ANNs featuring different numbers of input and hidden layers were built and trained, and those with optimal classification ability were then kept for validation using a set of 26 originally misassigned natural products, together with their respective revised structures. This first generation of ANNs performed excellently in identifying connectivity mistakes (such as constitutional isomerism), though they were not conceived to tackle subtler differences like stereoisomerism (Sarotti, 2013). 
This motivated development of a new generation of ANNs by merging mono-dimensional ¹H and ¹³C NMR data with 2D HSQC correlations (Figures 5A,B). Hundreds of different ANNs were trained using the standard correlation parameters described above, as well as 18 new descriptors accounting for the global correlation between experimental and simulated HSQC data. The training set was composed of 200 structures (100 correct and 100 artificially-made incorrect ones). The most efficient ANNs were validated using a set of 32 originally misassigned natural products, along with their revised structures. The performance achieved was noteworthy, identifying subtle structural errors in an efficient and simple manner (Zanardi and Sarotti, 2015).
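
The correlation descriptors that feed such networks are simple statistics of the agreement between experimental and calculated shifts. As a minimal sketch, the function below computes R², MAE and MaxErr for one pair of shift lists. The function name and the toy values are hypothetical, the shifts are assumed to be already referenced and matched atom by atom, and the networks themselves and the HSQC-based descriptors are not reproduced.

```python
def correlation_descriptors(exp, calc):
    """R2, MAE and MaxErr between matched experimental and calculated shifts (ppm)."""
    if len(exp) != len(calc) or len(exp) < 2:
        raise ValueError("need two matched lists with at least two shifts")
    n = len(exp)
    abs_errors = [abs(c - e) for e, c in zip(exp, calc)]
    mean_e = sum(exp) / n
    mean_c = sum(calc) / n
    cov = sum((e - mean_e) * (c - mean_c) for e, c in zip(exp, calc))
    var_e = sum((e - mean_e) ** 2 for e in exp)
    var_c = sum((c - mean_c) ** 2 for c in calc)
    return {
        "R2": cov ** 2 / (var_e * var_c),  # squared Pearson correlation
        "MAE": sum(abs_errors) / n,
        "MaxErr": max(abs_errors),
    }

# Hypothetical 13C shifts for a three-carbon fragment.
experimental = [10.0, 20.0, 30.0]
calculated = [11.0, 19.0, 32.0]
descriptors = correlation_descriptors(experimental, calculated)
print(descriptors)
```

A vector of such values, computed with each referencing scheme, would then serve as the input layer of a classifier trained on correct and incorrect structures.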

FIGURE 5. Schematic representation of the ANN-PRA method. Adapted with permission from Marcarino et al. (2020). Copyright (2020) American Chemical Society.

3.2 DP4-AI

A common feature of the computational methods that assist the structure elucidation of small molecules is that they all need candidate structures, as well as user-assigned ¹H and/or ¹³C NMR experimental data, as input (Smith and Goodman, 2010; Grimblat et al., 2015; Lauro and Bifulco, 2020; Marcarino et al., 2020; Zanardi and Sarotti, 2021). These data are then used in different ways to find the best match between measured and computed values. Currently, the most human-time-consuming stage within this workflow is NMR data assignment. DP4-AI (Howarth et al., 2020) is an attempt to solve this by means of an automatic interpretation of NMR spectra, coupled to a standard DP4 calculation (Smith and Goodman, 2010). This is a complex task that involves several stages, where the result of each affects subsequent steps (Cobas, 2019). First, after Fourier transformation of the NMR data, a hybrid method is used to phase the resulting spectra. Baseline distortions are then corrected, and peaks are picked using first and second derivative methods. Next, the detected peaks are grouped into multiplets and integrated (Chen et al., 2002; Wang et al., 2013; Zorin et al., 2017), following similar procedures for both ¹H and ¹³C spectra. However, the core of DP4-AI is the assignment algorithm (AA), which is responsible for assigning the atoms in each candidate structure to the previously detected experimental peaks. The system also predicts chemical shift values by means of DFT GIAO methods. Using these values, the AA calculates the assignment probability matrix M to find the most likely identity of each experimental peak (Kuhn, 1955). M derives from a statistical model that considers the error distribution of DFT-predicted values at the selected computational settings. DP4-AI was evaluated using 47 molecules with an average of 3.49 asymmetric centers each, and a diversity of carbon backbones. 
Their NMR spectra were recorded in different solvents, and the set included further analytical difficulties such as spectra with low signal-to-noise ratio or even mixtures of compounds. Four different statistical models were tested. The best results were found for a single-region, three-Gaussian model, fitted to an empirical prediction error distribution obtained from the same test set. Importantly, the efficacy of this tool depends greatly on the level of theory, since the accuracy of the chemical shift calculation underpins both the assignments and the subsequent DP4 calculation. As expected, the best overall results were obtained with the most accurate chemical shift predictions tested: geometry optimization with the B3LYP functional, followed by chemical shift prediction on those structures at the PCM/mPW1PW91/6-311G(d) level and single point energies calculated at the M06-2X/def2-TZVP level of theory (Ermanis et al., 2019). At this level, the probability of obtaining the results by chance was about 3 × 10^–8, indicating high performance. The authors provide DP4-AI as open-source software with a GUI. The capability of this system to greatly increase processing speed with minimal human intervention enables high-throughput data analysis. It was estimated that one molecule per minute can be processed. Therefore, DP4-AI facilitates exploration of large data sets and the discovery of new structural information via machine learning techniques. This tool could also potentially be used to support computer-assisted structure elucidation (CASE) software.
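
The final ranking step of a DP4-type analysis can be sketched as follows: calculated shifts are linearly scaled against experiment, each residual is converted into a tail probability under an assumed error distribution, and the products are normalized over the candidate set. This is a simplified illustration, not the published method. It assumes a single zero-mean Gaussian error model with a placeholder standard deviation, whereas DP4 and DP4-AI use their own fitted distributions and parameters; all names and numbers below are hypothetical.

```python
from statistics import NormalDist

def linear_scale(calc, exp):
    """Least-squares fit exp ~ slope * calc + intercept; returns the scaled shifts."""
    n = len(calc)
    mean_c = sum(calc) / n
    mean_e = sum(exp) / n
    sxx = sum((c - mean_c) ** 2 for c in calc)
    sxy = sum((c - mean_c) * (e - mean_e) for c, e in zip(calc, exp))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_c
    return [slope * c + intercept for c in calc]

def candidate_likelihood(calc, exp, sigma):
    """Product over atoms of the probability of an error at least this large."""
    error_model = NormalDist(0.0, sigma)
    likelihood = 1.0
    for s, e in zip(linear_scale(calc, exp), exp):
        likelihood *= 1.0 - error_model.cdf(abs(s - e))
    return likelihood

def dp4_like_probabilities(candidates, exp, sigma=2.3):
    """Relative probabilities of the candidates, assuming one of them is correct.

    sigma is a placeholder error width in ppm, not a fitted parameter.
    """
    likelihoods = [candidate_likelihood(c, exp, sigma) for c in candidates]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]

# Hypothetical assigned 13C data and two candidate structures.
experimental = [20.0, 40.0, 60.0, 80.0]
candidate_a = [21.0, 39.0, 61.0, 79.0]
candidate_b = [30.0, 35.0, 70.0, 60.0]
probabilities = dp4_like_probabilities([candidate_a, candidate_b], experimental)
print(probabilities)
```

The normalization in the last step is what makes the result relative: the probabilities always sum to one over the candidates supplied, even when none of them is the true structure.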

3.3 DP5 probability

The DP5 probability (Howarth and Goodman, 2022) is a new methodology complemented by a software package that draws on DP4-AI (Howarth et al., 2020). DP5 goes conceptually further than other methods such as CP3 or DP4, since it addresses the important challenge of assessing the probability that a single structure is correct (Figure 6). This is a very important step forward, because previous methods must assume that the correct structure is included within the panel of candidate structures. In other words, earlier approaches cannot be meaningfully applied when all the proposed structures are erroneous, because they are designed to necessarily select one of them. Whereas ANN-PRA (Sarotti, 2013; Zanardi and Sarotti, 2015) categorizes candidate structures in a binary fashion as either correct or incorrect, DP5 estimates normalized stand-alone probabilities without assuming that one of the possibilities must necessarily be correct. To do this, DP5 considers the spatial geometry around each atom to calculate the probability of a DFT prediction error individually. This advance addresses the problem that the associated errors vary in complex ways depending on the atomic environment. DFT calculations were undertaken using the same levels of theory employed in DP4-AI. It must be noted that the CASCADE training data were the source of the optimized geometries and NMR data predictions (Guan et al., 2021). Interestingly, a single conformation was selected for each molecule. At the core of this method is a prediction error distribution for each atom, found empirically by kernel density estimation using a test set of 5,140 molecules obtained from NMRShiftDB. Importantly, DP5 was developed using only ¹³C NMR data. The global efficacy of DP5 was evaluated using all molecules in a leave-one-out cross-validation experimental design.
The system works well even when tested using incorrect proposals with errors comparable to those obtained for DFT predictions derived from the correct structures. This is because the applied statistical model takes the proposed structure into account, something not possible in previous error analyses. A very interesting feature of this study is the existence of a maximum possible DP5 probability: the highest confidence found that a proposed structure is correct was 72%. On the other hand, the user can sometimes be 100% sure that a certain proposal is erroneous if further data are available. The DP5 workflow was further tested with 13 examples of reassigned molecular structures obtained from the literature. Notably, across these examples the methodology showed on average 41% more confidence in the correct structures than in the rejected ones. Moreover, 42 examples of stereochemical problems were examined, and the results were almost equal to those obtained with DP4.
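The idea of scoring a single proposal against empirical, atom-specific error distributions can be illustrated with a short sketch. It is only a simplified analogue of DP5, not the published algorithm: the Gaussian kernel, the fixed bandwidth, the geometric-mean combination and all names and numbers below are assumptions made for illustration.

```python
import math


def gaussian_kde_cdf(x, samples, bandwidth):
    """Cumulative distribution of a Gaussian kernel density estimate at x."""
    return sum(
        0.5 * (1.0 + math.erf((x - s) / (bandwidth * math.sqrt(2.0))))
        for s in samples
    ) / len(samples)


def atom_probability(error, reference_errors, bandwidth=0.5):
    """Estimated probability mass of errors at least as large in magnitude as this one."""
    inside = gaussian_kde_cdf(abs(error), reference_errors, bandwidth) - gaussian_kde_cdf(
        -abs(error), reference_errors, bandwidth
    )
    return 1.0 - inside


def structure_score(errors, reference_errors_per_atom, bandwidth=0.5):
    """Geometric mean of the per-atom probabilities (an assumed combination rule)."""
    logs = [
        # Clamp to avoid log(0) when an error lies far outside the reference data.
        math.log(max(atom_probability(e, ref, bandwidth), 1e-12))
        for e, ref in zip(errors, reference_errors_per_atom)
    ]
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference errors (ppm) observed for atoms in two similar environments.
reference = [
    [-1.2, -0.4, 0.1, 0.6, 1.0],
    [-2.5, -1.0, 0.3, 1.8, 2.9],
]
plausible = structure_score([0.3, -0.8], reference)
implausible = structure_score([4.5, 7.0], reference)
```

Unlike a DP4-type calculation, each score here is computed for one proposal in isolation, so every candidate in a panel can receive a low value when none of them is compatible with the reference error distributions.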


FIGURE 6. Schematic representation of DP5. Reproduced from Howarth and Goodman (2022) with permission from the Royal Society of Chemistry.

4 Conclusion and future perspectives

Assessing the evolution of ML applied to the field of NMR, one can be optimistic about the results that are likely to appear in the coming years. The development of new ML procedures, together with more powerful computers, should improve the capabilities of current methods. However, as stated by Cobas (2019), one of the most important challenges to overcome is the immense diversity of molecular environments, coupled with the lack of massive and reliable experimental NMR data sets. This is the main reason why most ML-NMR methods use DFT NMR chemical shifts as the output layer, which might be good for some applications but will not provide the ultimate solution to the problem. After all, it has been well documented that DFT predictions can be poor for certain systems (Zanardi et al., 2020). Based on the above, perhaps a next stage in this discipline would be merging the two so far disconnected approaches discussed in this article (prediction and correlation). That is, a fully ML-based and automated method that predicts, in real time, the NMR data with high accuracy (relative to the experimental NMR data) and simultaneously correlates them with the experimental information to facilitate the assignment. To achieve that goal, it is critically important to improve the predictive levels of current ML approaches, as well as to solve the calculation of the right Boltzmann amplitudes of flexible molecules. If we ever achieve that, many problems in structural elucidation will be solved. Perhaps this may sound utopian, but as the Uruguayan writer Eduardo Galeano said: “Utopia is on the horizon. I move two steps closer; she moves two steps further away. I walk ten steps and the horizon runs ten steps further away. No matter how much I walk, I’ll never reach her. So, what’s the point of utopia? The point is this: to keep walking”.

Author contributions

IC, CC, AD, and AS contributed to the study concept and design. IC, CC, AD, and AS wrote the sections of the manuscript. All authors contributed to manuscript revision and review, and approved the submitted version.

Funding

Our research was funded by the UNR (BIO 500 and 567), ANPCyT (PICT-2016-0116, PICT-2017-1524, and PICT-2019-4052), CONICET (PIP 11220200102205CO), MICINN (PID 2019-109476RB-C21), and ACIISI-FEDER (ProID2021010118).

Acknowledgments

IC thanks CONICET for a postdoctoral fellowship, and CC thanks ACIISI and FSE (Programa Operativo Integrado de Canarias 2014–2020, Eje 3, Tema Prioritario 74%–85%) for a predoctoral fellowship.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Bagno, A., Rastrelli, F., and Saielli, G. (2006). Toward the complete prediction of the 1H and 13C NMR spectra of complex organic molecules by DFT methods: Application to natural substances. Chem. – A Eur. J. 12 (21), 5514–5525. doi:10.1002/chem.200501583


Barone, G., Duca, D., Silvestri, A., Gomez-Paloma, L., Riccio, R., and Bifulco, G. (2002). Determination of the relative stereochemistry of flexible organic compounds by ab initio methods: Conformational analysis and Boltzmann-averaged GIAO 13C NMR chemical shifts. Chem. – A Eur. J. 8 (14), 3240–3245. doi:10.1002/1521-3765(20020715)8:14<3240::AID-CHEM3240>3.0.CO;2-G


Barone, G., Gomez-Paloma, L., Duca, D., Silvestri, A., Riccio, R., and Bifulco, G. (2002). Structure validation of natural products by quantum-mechanical GIAO calculations of 13C NMR chemical shifts. Chem. – A Eur. J. 8 (14), 3233–3239. doi:10.1002/1521-3765(20020715)8:14<3233::AID-CHEM3233>3.0.CO;2-0


Bartók, A. P., De, S., Poelking, C., Bernstein, N., Kermode, J. R., Csányi, G., et al. (2022). Machine learning unifies the modeling of materials and molecules. Sci. Adv. 3 (12), e1701816. doi:10.1126/sciadv.1701816
