AI and scientific publications

Posted on Jan 24, 2024

The apparent “intelligence” of large language models, such as ChatGPT, is impressive. Probably equally impressive is the speed at which such tools have been adopted to help write reports, summaries or scientific articles. It is thus not surprising that “ChatGPT” was, according to Google Trends, the most searched-for term in the news in France in 2023. Less well known by the general public, but equally impressive, have been the results of AlphaFold in predicting protein structures, surpassing all competing algorithms at its initial launch in 2018. The second version of AlphaFold remained unchallenged in 2020, and AlphaFold 2 remains the basis for new algorithms seeking to predict the structure of protein complexes or the structural variation expected from point mutations.

Are ChatGPT, for text, and AlphaFold, for protein structures, success stories likely to be replicated in other domains of human knowledge? Both examples rely on a large corpus of training data: written text for ChatGPT, and existing protein structures and multiple sequence alignments for AlphaFold. It is not clear whether the quality and quantity of available results are sufficient in most domains, and particularly in biology. Flawed experimental results would inevitably lead to a biased predictive model. Thus, before the new tools can be used for model training, scientific research most likely needs results of the best quality, large amounts of data and easy ways to tell whether a predictive model is good. It will thus be interesting to see to what extent the availability of tools for modeling various biological data sets will lead to predictive models, and how good the predictions of these models will be.

For the moment, most modeling in biology is descriptive, in that it tries to find the factors that best explain the variability in an experimental data set. A deep understanding of the underlying biology is likely to need the simplest possible model that gives accurate predictions. We are still far from the robot biologist described by the Stephen Oliver lab back in 2004. Until then, ChatGPT might help non-native English speakers lose less time writing papers and projects. Thus, a disadvantage faced by scientists who are not native English speakers might be partially corrected by tools similar to ChatGPT. It is to be hoped that, in the near future, equivalent tools built by academic laboratories under a collaborative licence will become available.

As impressive as ChatGPT may be, it is by no means an artificial general intelligence (AGI) with cognitive functions similar to those of humans. It is unclear whether and when an AGI will be developed. Suffice it to say that some scientific reasoning can be modeled by self-correcting algorithms, such as the one described 20 years ago as “the robot scientist”. That proof of concept was not truly followed by an explosion of such robot scientists in research laboratories, probably because most of the time the original questions that a scientific inquiry answers are not known beforehand. What did explode, however, are the amounts of biological results obtained by large-scale approaches and the number of publications describing them. It has become increasingly difficult to navigate these results. Active long-term funding of curated biology databases is crucial to ensure that the results persist and remain usable. Moreover, this increase in the amount of data has also opened an avenue for confusion and false conclusions.

Peer-reviewing manuscripts that describe large-scale results and their analysis is difficult. This explains why even the most famous scientific journals publish a fraction of papers in which improper use of statistical tools on effects of low magnitude leads to conclusions that are factually wrong. The authors themselves may be unaware of the errors that were made and take the results in good faith as the basis for further research, which leads to important losses of money and work time. A better understanding of the limits of the statistical methods widely employed in biology would be required at all levels to minimize such losses.
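One common way such errors arise in large-scale data sets is uncorrected multiple testing. It is given here only as an illustration, not as the specific problem found in any particular paper. The minimal Python sketch below simulates many comparisons in which there is no true effect at all and counts how many still pass a 5% significance threshold. The function names, the number of tests, the group size and the use of a simple z-test with known unit variance are all assumptions chosen for the example.

```python
import math
import random


def two_sided_p_value(group_a, group_b):
    """Two-sided z-test p-value for a difference in means.

    Assumes both groups have the same size and a known variance of 1,
    which holds here because we generate the data ourselves.
    """
    n = len(group_a)
    diff = sum(group_a) / n - sum(group_b) / n
    z = diff / math.sqrt(2.0 / n)
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def count_false_positives(n_tests=2000, n_per_group=10, alpha=0.05, seed=0):
    """Simulate n_tests comparisons with NO real effect (hypothetical setup).

    Returns the number of "significant" results without correction and
    with a Bonferroni correction (threshold alpha / n_tests).
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_tests):
        # Both groups come from the same distribution: any "hit" is spurious.
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        p_values.append(two_sided_p_value(a, b))
    naive = sum(p < alpha for p in p_values)
    bonferroni = sum(p < alpha / n_tests for p in p_values)
    return naive, bonferroni


if __name__ == "__main__":
    naive, bonferroni = count_false_positives()
    print("significant without correction:", naive)
    print("significant with Bonferroni correction:", bonferroni)
```

With 2000 tests and a 5% threshold, roughly 100 spurious "discoveries" are expected without correction, even though no effect exists. The Bonferroni correction removes nearly all of them, at the cost of reduced power to detect real effects of low magnitude.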