Cell2Sentence

Creator: Seonglae Cho
Created: 2025 Oct 20 9:37
Edited: 2025 Oct 20 9:41
Refs
A model that understands cell expression data and predicts drug candidate mechanisms, uncovering existing compounds that can help the immune system defend against cancer.
  1. Train on single-cell data to develop a model that understands cell states and signaling patterns.
  2. Simulate 4,000 drugs on this model, comparing an "immune-context-positive" environment (with some immune response present) to an "immune-context-neutral" environment (no immune signaling); see the sketch after this list.
  3. The model predicts that silmitasertib, when combined with low-dose interferon, significantly increases antigen presentation (MHC-I), turning "cold" tumors "hot".
  4. This prediction was confirmed in actual in vitro experiments.
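A minimal sketch of the dual-context screening logic described above. The scoring function is a hypothetical stand-in for querying the trained model for a predicted antigen-presentation (MHC-I) readout; the drug list, context labels, and ranking rule are illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch: score each drug in two simulated contexts and rank by the context split.
from typing import Callable, Dict, List


def dual_context_screen(
    drugs: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Dict[str, float]]:
    """score_fn(drug, context) returns a predicted MHC-I / antigen-presentation readout."""
    results = []
    for drug in drugs:
        positive = score_fn(drug, "immune-context-positive")  # some immune signaling present
        neutral = score_fn(drug, "immune-context-neutral")    # no immune signaling
        results.append({
            "drug": drug,
            "positive": positive,
            "neutral": neutral,
            "context_split": positive - neutral,  # large split => context-conditional effect
        })
    # Compounds like silmitasertib stand out when they act only in the immune-positive context.
    return sorted(results, key=lambda r: r["context_split"], reverse=True)


if __name__ == "__main__":
    # Dummy scorer for illustration only; a real run would call the C2S model.
    import random

    random.seed(0)
    demo = dual_context_screen(
        ["silmitasertib", "drug_B", "drug_C"],
        lambda drug, ctx: random.random() + (0.5 if ctx == "immune-context-positive" else 0.0),
    )
    for row in demo:
        print(row)
```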
 
 
Scaling Large Language Models for Next-Generation Single-Cell Analysis
Single-cell RNA sequencing has transformed our understanding of cellular diversity, yet current single-cell foundation models (scFMs) remain limited in their scalability, flexibility across diverse tasks, and ability to natively integrate textual information. In this work, we build upon the Cell2Sentence (C2S) framework, which represents scRNA-seq profiles as textual “cell sentences,” to train Large Language Models (LLMs) on a corpus comprising over one billion tokens of transcriptomic data, biological text, and metadata. Scaling the model to 27 billion parameters yields consistent improvements in predictive and generative capabilities and supports advanced downstream tasks that require synthesis of information across multi-cellular contexts. Targeted fine-tuning with modern reinforcement learning techniques produces strong performance in perturbation response prediction, natural language interpretation, and complex biological reasoning. This predictive strength directly enabled a dual-context virtual screen that uncovered a striking context split for the kinase inhibitor silmitasertib (CX-4945), suggesting its potential as a synergistic, interferon-conditional amplifier of antigen presentation. Experimental validation in human cell models unseen during training confirmed this hypothesis, demonstrating that C2S-Scale can generate biologically grounded, testable discoveries of context-conditioned biology. C2S-Scale unifies transcriptomic and textual data at unprecedented scales, surpassing both specialized single-cell models and general-purpose LLMs to provide a platform for next-generation single-cell analysis and the development of “virtual cells.”
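The core C2S idea of representing an expression profile as text can be sketched as a rank-order transformation: list gene names from highest to lowest expression so a standard LLM tokenizer can read the cell. This is a simplified illustration under stated assumptions; the gene names, counts, and `top_k` cutoff below are made up, and the real pipeline includes preprocessing not shown here.

```python
# Sketch: turn one cell's expression vector into a "cell sentence".
import numpy as np


def cell_to_sentence(expression: np.ndarray, gene_names: list[str], top_k: int = 100) -> str:
    """Rank genes by expression and join the top_k expressed gene names into text."""
    order = np.argsort(expression)[::-1]  # highest-expressed genes first
    ranked = [gene_names[i] for i in order if expression[i] > 0][:top_k]
    return " ".join(ranked)


# Toy example with three illustrative genes.
counts = np.array([5.0, 0.0, 12.0])
genes = ["CD74", "ACTB", "B2M"]
print(cell_to_sentence(counts, genes))  # -> "B2M CD74"
```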
 
 
