From Chatbots to Antibiotics
Large language models, or LLMs, like those that power ChatGPT, were originally designed to generate and explore sequences of text. But scientists are finding creative ways to apply these models to entirely different domains. Just as sentences are made up of sequences of words, proteins are made up of sequences of amino acids. So Claus Wilke, a UT professor of integrative biology and of statistics and data sciences, and Bryan Davies, an associate professor of molecular biosciences, are applying protein LLMs to the hunt for new and improved antibiotics.
In LLMs, words that share common attributes – think: dog, cat and hamster – tend to cluster together when plotted in what’s known as an “embedding space” with thousands of dimensions. Similarly, proteins with similar functions, like the ability to fight off dangerous bacteria without harming the people they infect, may cluster together in a protein LLM’s embedding space.
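The article does not name the researchers’ tools, but the general idea can be sketched with off-the-shelf software. In the hypothetical snippet below, the publicly available ESM-2 protein language model and scikit-learn’s k-means clustering are illustrative assumptions, not the UT team’s actual pipeline; the point is only to show how protein sequences become points in a high-dimensional embedding space where similar proteins can group together.

```python
# Sketch: embed protein sequences with a protein language model, then cluster them.
# ESM-2 (via Hugging Face transformers) and k-means are illustrative choices only.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # small, publicly available protein LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

# Toy amino-acid sequences standing in for antimicrobial peptide candidates.
sequences = [
    "RGGRLCYCRRRFCVCVGR",
    "GIGKFLHSAKKFGKAFVGEIMNS",
    "KWKLFKKIEKVGQNIRDGIIKAGPAVAVVGQATQIAK",
]

with torch.no_grad():
    batch = tokenizer(sequences, return_tensors="pt", padding=True)
    hidden = batch_out = model(**batch).last_hidden_state     # (batch, length, dim)
    mask = batch["attention_mask"].unsqueeze(-1)               # ignore padding tokens
    embeddings = (hidden * mask).sum(1) / mask.sum(1)          # mean-pool to one vector per protein

# Proteins with similar properties tend to land near each other in this space,
# so a simple clustering step can group candidate peptides by predicted behavior.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings.numpy())
print(labels)
```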
For one project, the researchers are using this technology to identify ways to reengineer an existing antibiotic called Protegrin-1, which is great at killing bacteria but is also toxic to people. They created 100,000 variations of Protegrin-1 and tested how well each killed bacteria and whether it harmed human red blood cells. They then trained a protein LLM on these results so that the model could evaluate millions more possible variations for three features – selective targeting of bacteria, potency in killing bacteria and low toxicity to humans – and find the variants that hit the sweet spot on all three. This helped guide them to a safer, more effective version of Protegrin-1 that is already showing promising results in animal trials.
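The article does not specify how the trained model was structured, so the sketch below shows just one plausible setup: given per-variant embeddings (for example, from a protein LLM as above) and the three measured outcomes, a simple multi-label classifier can score a large pool of unseen variants and keep those predicted to satisfy all three criteria. The data, dimensions and threshold here are placeholders, scaled down for illustration.

```python
# Sketch: train on assayed Protegrin-1 variants, then screen many more in silico.
# The matrices below are random placeholders; in practice they would be protein-LLM
# embeddings and the measured selectivity / killing / toxicity results.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)

# Assayed variants: one embedding vector per variant, plus three binary outcomes
# (selectively targets bacteria, kills bacteria, does not harm human red blood cells).
X_train = rng.normal(size=(10_000, 320))          # placeholder embeddings
y_train = rng.integers(0, 2, size=(10_000, 3))    # placeholder assay labels

model = MultiOutputClassifier(LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# A much larger pool of unassayed candidate variants, scored in silico.
X_candidates = rng.normal(size=(50_000, 320))
probs = np.stack(
    [clf.predict_proba(X_candidates)[:, 1] for clf in model.estimators_], axis=1
)

# Keep the "sweet spot": variants predicted to meet all three criteria.
# The 0.5 cutoff is arbitrary for this sketch.
sweet_spot = np.where(probs.min(axis=1) > 0.5)[0]
print(f"{len(sweet_spot)} candidates pass all three filters")
```

In a real screen, the surviving candidates would then go back to the lab for synthesis and testing, which is how the computational filter narrows millions of possibilities down to a manageable experimental list.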
“Machine learning’s impact is twofold,” Davies said. “It’s going to point out new molecules that could have potential to help people and it’s going to show us how we can take those existing antibiotic molecules, make them better and focus our work to more quickly get those to clinical practice.”