Protein AI

Creator
Seonglae Cho
Created
2022 Aug 15 14:33
Edited
2026 Jun 12 18:12

Sequence to Structure

3D structure prediction in AI drug development using Protein Folding and Protein Pathway
  • Structure-based
  • Sequence-based
Protein AIs
 
 
Protein AI Usages
 
 
 
AlphaFold Protein Structure Database
GPT Language Model Spells Out New Proteins

Generalists vs. Specialists
cortex (prescient-design)
poli-baselines (MachineLearningLifeScience)
Control Barrier Function

LLMs have shown promise across many domains, but they still struggle with black-box optimization (BBO) problems that must satisfy precise biophysical constraints such as protein stability or solubility. To address these challenges, specialist solutions such as LaMBO-2 have been developed, but applying them to new domains requires substantial domain expertise and engineering effort. This paper investigates whether general-purpose models like LLMs can, with an appropriate training framework, achieve performance comparable to specialist solutions on these highly constrained optimization tasks.
The authors propose LLOME (Language Model Optimization with Margin Expectation), a bilevel optimization routine that leverages an LLM. They also derive a new loss function, MargE (Margin-Aligned Expectation), to overcome limitations of SFT and DPO.
In the MargE objective, $\pi_\theta$ is the current model policy being trained, $\pi_{\text{ref}}$ is the reference policy from the previous step, $y$ is a generated sequence, $r(x, y)$ is an oracle reward for the sequence, $|y|$ is the sequence length, and $\lambda$ is a regularization coefficient. The objective is structured to minimize the gap between the length-normalized log-likelihood and the reward while preventing the model from drifting too far from the reference policy.
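A schematic form consistent with that description can be written down, with the caveat that it is a reconstruction and the exact expression in the paper may differ (for example in how the reward is transformed or which divergence is used). Assuming a squared gap and a KL penalty estimated from samples drawn from the reference policy:

$$\mathcal{L}(\pi_\theta) \approx \mathbb{E}_{y \sim \pi_{\text{ref}}(\cdot \mid x)}\left[\left(\frac{\log \pi_\theta(y \mid x)}{|y|} - r(x, y)\right)^2\right] + \lambda\, \mathrm{KL}\left(\pi_{\text{ref}} \,\|\, \pi_\theta\right)$$

The following minimal sketch evaluates this assumed form on a batch of precomputed sequence log-probabilities. The function name `marge_loss_sketch` and its arguments are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def marge_loss_sketch(logp_theta, logp_ref, rewards, lengths, lam=0.1):
    """Schematic MargE-style loss (an assumed form, not the paper's exact loss).

    logp_theta: log pi_theta(y|x) for each sampled sequence y
    logp_ref:   log pi_ref(y|x) for the same sequences (sampled from pi_ref)
    rewards:    oracle rewards r(x, y)
    lengths:    sequence lengths |y|
    lam:        regularization coefficient lambda
    """
    logp_theta = np.asarray(logp_theta, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    # Gap between length-normalized log-likelihood and reward.
    gap = logp_theta / lengths - rewards
    fit_term = np.mean(gap ** 2)

    # Monte Carlo estimate of KL(pi_ref || pi_theta) from pi_ref samples.
    drift_term = np.mean(logp_ref - logp_theta)

    return float(fit_term + lam * drift_term)

# Example call with made-up numbers for three sampled sequences.
loss = marge_loss_sketch(
    logp_theta=[-12.0, -9.5, -20.0],
    logp_ref=[-11.0, -10.0, -19.0],
    rewards=[0.8, 0.2, 0.0],
    lengths=[10, 8, 15],
    lam=0.1,
)
```

In a real training setup the log-probabilities would come from the model and be differentiated through; the NumPy version only illustrates how the two terms trade off.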
The LLOME framework consists of an outer loop that trains the LLM using oracle-labeled data, and an inner loop in which the LLM iteratively refines candidate sequences without direct oracle access. In particular, the paper introduces a new synthetic benchmark called Ehrlich functions to mimic the complex geometric constraints of protein design, enabling fair evaluation of existing LLMs without training-data leakage. The algorithm also includes automatic temperature tuning to prevent the model’s outputs from collapsing in diversity, and is designed to efficiently search for high-reward sequences.
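The outer/inner loop split described above can be illustrated with a toy sketch. Everything here is a stand-in chosen for brevity: a position-wise frequency table plays the role of the LLM, `toy_oracle` is a hypothetical constrained reward (it is not an Ehrlich function), and the "training" step is simple counting rather than MargE. The only point is the control flow: the oracle is called once per outer round, while the inner refinement uses only the model's own scores.

```python
import numpy as np

ALPHABET = "ACDE"
MOTIF = "ACE"  # hypothetical target motif for the toy oracle

def toy_oracle(seq):
    """Hypothetical reward: 0 if infeasible, else best motif match fraction."""
    if "DD" in seq:  # toy feasibility constraint
        return 0.0
    best = 0
    for i in range(len(seq) - len(MOTIF) + 1):
        window = seq[i:i + len(MOTIF)]
        best = max(best, sum(a == b for a, b in zip(window, MOTIF)))
    return best / len(MOTIF)

def fit_model(seqs, rewards, top_k=8, pseudo=1.0):
    """Outer-loop 'training': per-position frequencies of the top-k sequences."""
    order = np.argsort(rewards)[::-1][:top_k]
    counts = np.full((len(seqs[0]), len(ALPHABET)), pseudo)
    for idx in order:
        for pos, ch in enumerate(seqs[idx]):
            counts[pos, ALPHABET.index(ch)] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def model_score(probs, seq):
    """Log-likelihood of a sequence under the position-wise model."""
    return sum(np.log(probs[p, ALPHABET.index(c)]) for p, c in enumerate(seq))

def refine(probs, seq, rng, steps=5):
    """Inner loop: model-proposed point edits, accepted by model score only."""
    seq = list(seq)
    for _ in range(steps):
        pos = rng.integers(len(seq))
        new_char = ALPHABET[rng.choice(len(ALPHABET), p=probs[pos])]
        cand = seq.copy()
        cand[pos] = new_char
        if model_score(probs, cand) >= model_score(probs, seq):
            seq = cand  # no oracle call here
    return "".join(seq)

def bilevel_loop_sketch(n_rounds=5, pool_size=16, seq_len=6, seed=0):
    rng = np.random.default_rng(seed)
    pool = ["".join(rng.choice(list(ALPHABET), size=seq_len))
            for _ in range(pool_size)]
    history = []
    for _ in range(n_rounds):
        # Outer loop: label the pool with the oracle, then refit the model.
        rewards = np.array([toy_oracle(s) for s in pool])
        history.append(float(rewards.max()))
        probs = fit_model(pool, rewards)
        # Keep the better half and let the model refine copies of them.
        order = np.argsort(rewards)[::-1]
        elites = [pool[i] for i in order[: pool_size // 2]]
        pool = elites + [refine(probs, s, rng) for s in elites]
    return pool, history

pool, history = bilevel_loop_sketch()
```

Because the best-scoring sequences are carried over unchanged each round, the best oracle reward recorded in `history` never decreases; whether it actually improves depends on how well the stand-in model captures the constraint.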
Generalists vs. Specialists: Evaluating LLMs on Highly-Constrained...
Although large language models (LLMs) have shown promise in biomolecule optimization problems, they incur heavy computational costs and struggle to satisfy precise constraints. On the other hand,...
 
 
