Protein AI

Sequence to Structure

3D structure prediction in AI drug development using Protein Folding and Protein Pathway

Structure-based

Sequence-based

Protein AIs

EvoDiff

RoseTTAFold

Protein AI Usages

Protein Longformer Algorithm

OpenProteinSet

Protein surface GNN

AlphaFold Protein Structure Database

https://alphafold.ebi.ac.uk/

GPT Language Model Spells Out New Proteins

https://spectrum.ieee.org/gpt-2-language-model-proteins

Generalists vs. Specialists
Control Barrier Function

LLMs have shown promise across many domains, but they still struggle with black-box optimization (BBO) problems that must satisfy precise biophysical constraints such as protein stability or solubility. To address these challenges, specialist solutions such as LaMBO-2 have been developed, but applying them to new domains requires substantial domain expertise and engineering effort. This paper investigates whether general-purpose models like LLMs can, with an appropriate training framework, achieve performance comparable to specialist solutions on these highly constrained optimization tasks.

The authors propose LLOME (Language Model Optimization with Margin Expectation), a bilevel optimization routine that leverages an LLM. They also derive a new loss function, MargE (Margin-Aligned Expectation), to overcome limitations of SFT and DPO.

The MargE objective is defined in terms of the current model policy being trained, the reference policy from the previous step, a generated sequence $y$ for an input $x$, the oracle reward $r(x, y)$ for that sequence, the sequence length, and a regularization coefficient. It is structured to minimize the gap between the length-normalized log-likelihood and the reward while preventing the model from drifting too far from the reference policy.
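To make the two ingredients of this objective concrete, here is a minimal Python sketch. It is not the paper's MargE formula: the squared-gap term, the single-sample drift penalty, the names `Sample` and `marge_style_loss`, and the default `lam=0.1` are all assumptions chosen only to illustrate "match length-normalized log-likelihood to reward, but stay close to the reference policy".

```python
from dataclasses import dataclass


@dataclass
class Sample:
    logp_theta: float  # log-likelihood of y under the policy being trained
    logp_ref: float    # log-likelihood of y under the reference policy
    reward: float      # oracle reward r(x, y); assumed here to be pre-scaled
                       # so it is comparable to a per-token log-likelihood
    length: int        # number of tokens in y


def marge_style_loss(samples, lam=0.1):
    """Illustrative objective only (assumed form, not the paper's equation)."""
    total = 0.0
    for s in samples:
        # Length-normalized log-likelihood under the current policy.
        norm_logp = s.logp_theta / s.length
        # Squared gap between normalized log-likelihood and reward.
        gap = (norm_logp - s.reward) ** 2
        # Single-sample estimate of drift from the reference policy,
        # assuming y was drawn from the reference policy.
        drift = (s.logp_ref - s.logp_theta) / s.length
        total += gap + lam * drift
    return total / len(samples)
```

With this assumed form, a sample whose normalized log-likelihood already equals its reward and whose likelihood matches the reference contributes zero loss. Increasing `lam` weights the drift term more heavily relative to the reward gap.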

The LLOME framework consists of an outer loop that trains the LLM on oracle-labeled data and an inner loop in which the LLM iteratively refines candidate sequences without direct oracle access. The paper also introduces a synthetic benchmark, Ehrlich functions, that mimics the complex geometric constraints of protein design and enables fair evaluation of existing LLMs without training-data leakage. Finally, the algorithm includes automatic temperature tuning to keep the model's outputs from collapsing in diversity, and it is designed to search efficiently for high-reward sequences.
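The outer/inner loop structure can be sketched with a toy stand-in for each component. Everything here is hypothetical: a unigram token distribution plays the role of the LLM, `oracle` is a trivial scoring function rather than an Ehrlich function, and temperature tuning is omitted. The only point is to show where oracle calls happen (outer loop) and where they do not (inner loop).

```python
import math
import random

ALPHABET = "ACDE"  # toy token vocabulary (assumption)


def oracle(seq):
    """Toy stand-in for an expensive oracle: fraction of 'A' tokens."""
    return seq.count("A") / len(seq)


def fit_policy(labeled, top_k=4, smoothing=1.0):
    """Outer-loop 'training': fit unigram probabilities to the best labeled sequences."""
    best = sorted(labeled, key=lambda pair: pair[1], reverse=True)[:top_k]
    counts = {t: smoothing for t in ALPHABET}
    for seq, _ in best:
        for t in seq:
            counts[t] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def log_likelihood(policy, seq):
    return sum(math.log(policy[t]) for t in seq)


def refine(policy, seq, steps, rng):
    """Inner loop: single-site edits scored by the policy itself, with no oracle calls."""
    weights = [policy[t] for t in ALPHABET]
    for _ in range(steps):
        i = rng.randrange(len(seq))
        tok = rng.choices(ALPHABET, weights=weights)[0]
        cand = seq[:i] + tok + seq[i + 1:]
        if log_likelihood(policy, cand) >= log_likelihood(policy, seq):
            seq = cand
    return seq


def bilevel_search(rounds=5, pool_size=8, length=10, inner_steps=20, seed=0):
    rng = random.Random(seed)
    pool = ["".join(rng.choice(ALPHABET) for _ in range(length))
            for _ in range(pool_size)]
    labeled, oracle_calls = [], 0
    for _ in range(rounds):
        # Outer loop: the oracle is queried only here.
        labeled += [(s, oracle(s)) for s in pool]
        oracle_calls += len(pool)
        policy = fit_policy(labeled)
        # Inner loop: refine every candidate using the policy alone.
        pool = [refine(policy, s, inner_steps, rng) for s in pool]
    best_seq, best_score = max(labeled, key=lambda pair: pair[1])
    return best_seq, best_score, oracle_calls
```

Calling `bilevel_search()` returns the best labeled sequence, its oracle score, and the number of oracle queries, which is `rounds * pool_size` by construction. That makes the oracle budget explicit and separate from the amount of inner-loop refinement.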

Generalists vs. Specialists: Evaluating LLMs on Highly-Constrained...

Although large language models (LLMs) have shown promise in biomolecule optimization problems, they incur heavy computational costs and struggle to satisfy precise constraints. On the other hand,...

https://arxiv.org/abs/2410.22296
