All RLHF-style language model RL methods include this to prevent AI Reward Hacking.
Language Model RL uses sequence-level rewards (a reward for the complete answer), but the actual training is done at the token level. It is essentially a Contextual Bandit Model evaluating generated tokens in a given context, even though the per-token KL divergence to the reference distribution makes it slightly different. The two directions for resolving this problem in RL on LLMs are using a Reward model at a more granular level or, more fundamentally, a World Model.
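A minimal sketch of this mismatch, assuming a REINFORCE-style update (function name, shapes, and the sampled-token KL estimate are my own illustration, not any specific library's API): one scalar sequence reward is broadcast as the same credit to every generated token, shaped by a per-token KL penalty against a frozen reference policy.

```python
# Sketch: sequence-level reward applied at the token level with a per-token KL penalty.
import torch
import torch.nn.functional as F

def rlhf_token_loss(logits, ref_logits, response_ids, sequence_reward, kl_coef=0.1):
    """logits, ref_logits: (T, vocab) policy / frozen reference logits for the response
    response_ids:       (T,) sampled token ids
    sequence_reward:    scalar reward for the whole answer (e.g. from a reward model)
    """
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)

    token_logp = logp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)          # (T,)
    ref_token_logp = ref_logp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)  # (T,)

    # Per-token KL estimate on the sampled tokens: keeps each token close to the reference.
    kl = token_logp - ref_token_logp
    token_reward = sequence_reward - kl_coef * kl  # same scalar credit for every token

    # Contextual-bandit / REINFORCE update: every token shares the sequence-level return.
    return -(token_logp * token_reward.detach()).mean()

# Toy usage with random tensors standing in for model outputs.
T, V = 8, 100
loss = rlhf_token_loss(torch.randn(T, V, requires_grad=True), torch.randn(T, V),
                       torch.randint(V, (T,)), sequence_reward=1.0)
loss.backward()
```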
Humans do not consider all paths in parallel, nor do they give bonus points to every step of the process just because the final answer was correct. Instead, they regret intermediate steps that went astray and give extra credit to those that were helpful. In other words, a reflection process that synthesizes multiple trials is also necessary.
- ORM (Outcome Reward Model): A method that gives rewards based only on the final output.
- PRM (Process Reward Model): A method that evaluates and rewards each step.
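A toy sketch of the two reward schemes above (the step scores are made up for illustration): ORM credits only the final output, while PRM penalizes a flawed intermediate step even when the final answer happens to be right.

```python
# Hypothetical 4-step chain of thought whose final answer is correct despite a flawed step.
steps = ["define variables", "set up equation", "algebra slip", "lucky correct answer"]

# ORM: one reward for the final output only; intermediate steps get no individual signal.
orm_rewards = [0.0, 0.0, 0.0, 1.0]

# PRM: each step is scored on its own merit, so the flawed step is penalized.
prm_rewards = [0.9, 0.8, 0.1, 0.9]

for step, orm_r, prm_r in zip(steps, orm_rewards, prm_rewards):
    print(f"{step:>22}  ORM={orm_r:.1f}  PRM={prm_r:.1f}")
```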
Language Model RL Methods
Language Model Reward Benchmarks
Reinforcement Learning Transformers
Language Model RL Frameworks
Pre-training, mid-training, and RL post-training each contribute to "reasoning capability" in different ways, and these contributions can be causally decomposed in a fully controlled (synthetic) setting. The approach uses DAGs to create reasoning structures and templates (e.g., animal-zoo, teacher-school) to vary only the surface context. Evaluation is process-verified: the model's output CoT is parsed and the predicted graph is compared against the ground-truth graph.
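A minimal sketch of this setup (my own simplification; edge format and template wording are assumptions): the problem is a small DAG, templates only change the surface nouns, and the score compares the graph recovered from the CoT with the ground-truth graph instead of checking the final answer alone.

```python
# Sketch: DAG-structured synthetic reasoning with process-verified evaluation.
import re

ground_truth_edges = {("zoo", "keeper"), ("keeper", "lion"), ("lion", "diet")}

templates = {
    "animal-zoo":     "The {a} manages the {b}.",
    "teacher-school": "The {a} supervises the {b}.",  # same structure, different surface context
}
print(templates["teacher-school"].format(a="teacher", b="student"))

def parse_cot_edges(cot: str) -> set[tuple[str, str]]:
    """Recover claimed dependencies of the form 'X -> Y' from a chain of thought."""
    return {(a.strip(), b.strip()) for a, b in re.findall(r"(\w+)\s*->\s*(\w+)", cot)}

def process_verified_score(cot: str, truth: set[tuple[str, str]]) -> float:
    """Fraction of ground-truth edges the model's reasoning actually used."""
    predicted = parse_cot_edges(cot)
    return len(predicted & truth) / len(truth)

model_cot = "zoo -> keeper, keeper -> lion, so lion -> diet"
print(process_verified_score(model_cot, ground_truth_edges))  # 1.0: structure matches
```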
Generalization types
- Extrapolative (Depth): Can the model solve problems when the number of operations (reasoning steps) goes deeper than the training range?
- Contextual (Breadth): Does the same reasoning structure transfer when surface context changes?
The conditions under which RL produces 'true capability gain' are narrow. For in-distribution (ID) problems already well covered by pre-training, RL increases pass@1 but barely improves pass@128 → this is closer to "sharpening" existing capabilities. Conversely, when pre-training leaves headroom and the RL data targets the edge of competence (where pass@1 fails but pass@k shows some success), substantial expansion occurs, with pass@128 improvements even on OOD. RL on problems that are too easy (ID) or too hard (complete OOD-hard) doesn't train well. In other words, easy problems get sharpened, while appropriately difficult (edge-of-competence) problems yield genuine capability gains.
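For reference, the pass@1 vs pass@128 comparison above uses the standard unbiased pass@k estimator (HumanEval-style); the sample counts below are illustrative only.

```python
# pass@k = 1 - C(n-c, k) / C(n, k): probability that at least one of k samples
# (drawn from the n generated ones, of which c are correct) solves the problem.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=256, c=40, k=1))    # ~0.156
print(pass_at_k(n=256, c=40, k=128))  # ~1.0 — coverage can saturate even when pass@1 is low
```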
Contextual generalization requires a 'minimal seed'. For new long-tail context (B), if pre-training exposure is 0%~0.1%, RL also struggles to transfer. However, if context B is included in pre-training even sparsely (e.g., ≥1%) at the atomic primitive level, RL amplifies that seed to create strong transfer (large pass@128 gains). RL struggles to create something from nothing but excels at scaling up small foundations (seeds).
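A sketch of what the "minimal seed" condition means operationally (corpus names and counts are hypothetical): context B's atomic primitives are mixed into pre-training at a small but non-zero fraction.

```python
# Sketch: pre-training mixture with a tunable seed fraction for context B primitives.
import random

def build_pretraining_mix(corpus_a, corpus_b_primitives, seed_fraction, total):
    """Sample a stream where `seed_fraction` of documents come from context B's
    primitives and the rest from the well-covered context A."""
    n_b = int(seed_fraction * total)
    docs = random.choices(corpus_b_primitives, k=n_b) + random.choices(corpus_a, k=total - n_b)
    random.shuffle(docs)
    return docs

corpus_a = ["animal-zoo doc"]
corpus_b = ["teacher-school primitive"]
for frac in (0.0, 0.001, 0.01):  # 0%, 0.1%, 1% seed exposure
    mix = build_pretraining_mix(corpus_a, corpus_b, frac, total=10_000)
    print(f"{frac:.1%} seed -> {mix.count('teacher-school primitive')} B docs in 10k")
```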
Mid-training is the 'bridge' that significantly affects RL efficiency. When varying the ratio (β) between mid-training and RL with the same compute budget, characteristics diverge: OOD-edge (moderately difficult domain) performance (especially pass@1) is better when mid-training proportion is high and RL is light. OOD-hard (much harder domain) generalization improves when RL proportion is increased. Mid-training lays down priors/representations; RL expands exploration and composition on top of that.
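A small sketch of the compute-matched comparison (token budget is an assumed, illustrative number): the same total budget is split between mid-training and RL by β, and each setting is then evaluated on OOD-edge vs OOD-hard.

```python
# Sketch: fixed compute budget split between mid-training and RL by ratio beta.
TOTAL_TOKENS = 1_000_000_000  # illustrative fixed budget

def budget_split(beta: float) -> dict:
    """beta = fraction of the budget spent on mid-training; the rest goes to RL."""
    return {"mid_training_tokens": int(beta * TOTAL_TOKENS),
            "rl_tokens": int((1 - beta) * TOTAL_TOKENS)}

for beta in (0.9, 0.5, 0.1):
    print(beta, budget_split(beta))
```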
PRMs reduce AI Reward Hacking and improve performance. Using only outcome rewards (correctness) easily leads to "shortcuts/dishonest reasoning that just gets the answer right." Mixing in dense rewards based on process verification (α mix), or only granting the outcome reward when the process is correct, reduces structural errors (dependency mismatches, etc.) and improves pass@1 and some pass@128 on OOD.
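A sketch of the two reward-shaping variants just described (the α value, step scores, and function names are illustrative assumptions): the α mix blends dense process scores with the outcome reward, while the gated variant withholds the outcome reward unless the process checks out.

```python
# Sketch: alpha-mixed vs process-gated reward for a verified chain of thought.

def mixed_reward(outcome_correct: bool, step_scores: list[float], alpha: float = 0.5) -> float:
    """alpha-mix: dense process reward blended with the sparse outcome reward."""
    process = sum(step_scores) / len(step_scores) if step_scores else 0.0
    outcome = 1.0 if outcome_correct else 0.0
    return alpha * process + (1 - alpha) * outcome

def gated_reward(outcome_correct: bool, process_verified: bool) -> float:
    """Outcome reward is granted only when the process itself is verified,
    removing the incentive for answer-only shortcuts."""
    return 1.0 if (outcome_correct and process_verified) else 0.0

# A lucky guess with a broken derivation scores low under both schemes.
print(mixed_reward(outcome_correct=True, step_scores=[0.2, 0.1, 0.3]))  # 0.6
print(gated_reward(outcome_correct=True, process_verified=False))       # 0.0
```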
Era of Experience
However, it has low sample efficiency at larger sample sizes and only finds better reasoning paths within its existing capacity, which makes its total problem-solving coverage smaller.
SFT Memorizes, RL Generalizes (Model Generalization)
While the provocative title is not exactly correct, it provides insight even for Multimodality.
Agent RL vulnerability
Search LLMs trained with agentic RL may appear safe, but can be easily jailbroken by manipulating the timing of the search step. The RL objective itself fails to suppress harmful queries, making "search first" behavior a critical vulnerability.
Information density (bits/sample) is very low in early training. Sample efficiency: supervised learning provides the correct answer for every token, so every sample yields a lot of information, but an RL sample only carries high information when the probability of a correct answer is around 50%. In the early stages of RL, there are almost no correct samples, leading to extreme gradient variance.
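One way to see the 50% claim is the entropy of a binary correct/incorrect signal: a Bernoulli outcome carries at most H(p) = −p·log₂(p) − (1−p)·log₂(1−p) bits per sample, which peaks at p = 0.5 and collapses toward zero when almost every sample fails, as in early RL.

```python
# Bits of information per binary correct/incorrect sample as a function of success rate.
from math import log2

def bits_per_sample(p: float) -> float:
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.001, 0.01, 0.1, 0.5, 0.9):
    print(f"P(correct)={p:<5} -> {bits_per_sample(p):.3f} bits/sample")
```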

Seonglae Cho
