Machine learning (ML) has become an unavoidable tool in contemporary science and technology. Its ability to offer solutions to hard, multidimensional problems makes it attractive to practitioners in diverse fields, from social engineering to structural biology and molecular dynamics, and to complete-information games such as chess or Go. The disruptive power of recent advances in deep learning (DL), such as attention networks or denoising diffusion probabilistic models (DDPM), has changed the field of molecular and structural biology and offered a solution to the long-standing protein-folding problem. However, ML poses a serious problem: the dissociation of the practical results achieved with ML in a given field from theory. ML is based on solid mathematical work, and new improvements rigorously follow thorough research protocols, but a key problem stays largely unsolved: how the rules that trained ML algorithms (deep neural networks in particular) use to make classifications are organized. We understand the algorithms used to optimize the decision rules, and we can analyze the weight distributions in hidden layers, but we cannot clearly decipher, in human-readable terms, how those algorithms emphasize the features important for classification (deterministic chaos in latent space). We are, therefore, in possession of a powerful system with non-transferable knowledge (in the sense of non-transferable artistic ability: an artist can be trained, but the ability cannot simply be copied). In some instances this is not an issue, for example when the underlying theoretical foundations are well known (e.g., in computational physics/mechanics) and the major task is to produce a more efficient algorithm than the one based on classical theory. For such tasks, ML algorithms have proven to be meaningful, efficient, and useful.
The inability of ML to offer simple, human-readable interpretations of the rules for converting a multidimensional input data stream into output classes makes ML a closed system, an inaccessible black box, a sort of deus ex machina whose movements in latent space stay largely hidden from human observers (the latent space can be perceived as a space of “hidden variables,” as in quantum mechanics). One can argue that this is just an issue of linguistics or epistemology, but we find it crucial if we aim to fully understand biological or physical processes. At the same time, the ability of ML protocols to efficiently extract key features from a given data set makes them vulnerable to sampling bias in the input data. In scientific terms, ML algorithms easily find local minima given the input stream, but the global minimum may be out of reach due to limited sampling. This issue is especially pronounced in the natural sciences, where sampling bias is almost inevitable. All this suggests that we may end up where we started: with observable phenomena and no clear explanation. That may lead to a saturation of the explanatory ability of science, with weak feedback for improving knowledge.
ML also lacks overall knowledge of the world, i.e., artificial general intelligence (AGI) still does not exist, yet access to general knowledge has been instrumental in allowing researchers to cross the sampling gap and has guided them toward correct explanations (the Copernican revolution in astronomy, given the observational data of the 16th century, would not have been possible had Nicolaus Copernicus not possessed a wider understanding and knowledge of the world).
In computer-aided drug design, ML has been used to recognize drug binding sites, binding modes, and conformations, to speed up costly MD calculations (QM/MM in particular), and to optimize potential hits. The aim of ML practitioners has been to reduce the exorbitant costs of drug development, shorten development time, and, correspondingly, lengthen the effective term of patent rights. ML can help in that regard, but the above comments still apply. Without a full compendium of cell signaling processes, and with the principles of molecular interactions and dynamics still not fully resolved, machine learning’s ability to filter out unnecessary details from the input stream (experimentally obtained molecular structures, interactions, and clinical data) seems like a short-term success. What is necessary is to understand cause and effect at both the micro and macro levels (cells, tissues, organs, individuals), together with the effects of timescales and time relativity inside cells and tissues.
To address this, we invite practitioners of ML and drug hunters to submit manuscripts to this Research Topic presenting research that utilizes ML protocols/architectures but also offers a detailed and comprehensive interpretation of the observed phenomena. The topics that can be addressed with ML and that we are interested in include molecular dynamics (MD) acceleration techniques, implicit solvent improvements, small-molecule force field optimization, cryptic pocket discovery and its physical interpretation, and the discovery of allosteric effects and their application to undruggable targets. Toxicity in drug discovery is also a topic we would like to see addressed. We also encourage the application of ML to the analysis of interaction frequencies between genome loci in the nucleus as statistical averages over cell populations, and their relation (and potential clinical use) to single-cell analyses via 3C and fluorescence in situ hybridization techniques. The already established fractal and continuous polymer models of chromatin are ripe for deeper interpretation with the help of ML tools, together with experimental data on covalent modifications.
As a tool of choice for data analysis in the experimental and observational sciences, ML has helped produce a deluge of research papers, many of limited or questionable scientific merit. We will, therefore, avoid simple ML analyses of input data sets, or comparisons of different ML algorithms (e.g., neural networks vs. random forests). Manuscripts that use machine learning but at the same time offer solid theoretical interpretations of the results are, on the other hand, welcome. That may not only help drug discovery and molecular biology but also benefit the machine learning field itself, as it may shed light on the underlying processes in the latent space of variables.
Dr. Gunady is currently an employee of Illumina Inc.; Dr. Perišic is an employee (Research Scientist) of Redesign Science. All other Topic Editors declare no competing interests.