SEAL
SEAL is a framework that trains a model to generate 'self-edits': instructions and synthetic training data used to update its own weights. Given an input context, the model produces synthetic training data or suggests fine-tuning hyperparameters, applies a lightweight LoRA fine-tune using that self-edit, and then improves its self-edit generation policy via ReSTEM, using downstream performance as the reward.
Specifically, instead of fine-tuning directly on SQuAD passages, the model fine-tunes on implication sentences it generates from them, raising QA accuracy from 33.5% to 47.0% even without the passage in context. Additionally, on the ARC reasoning task, it automatically selects appropriate data augmentations and optimization settings, raising the success rate from 0% to 72.5%.
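The knowledge-incorporation flow above can be sketched as a toy Python loop. This is a minimal sketch, not SEAL's implementation: the "model" is just a set of known facts, `generate_self_edit` is a stand-in for the LM prompting itself for implications, and `finetune` stands in for the LoRA update.

```python
def generate_self_edit(passage):
    """Stand-in for the LM generating implication sentences from a passage."""
    facts = [s.strip() for s in passage.split(".") if s.strip()]
    # Real SEAL prompts the model itself; here we simply restate each fact.
    return [f"Implication: {fact}" for fact in facts]

def finetune(model_facts, self_edit):
    """Stand-in for a lightweight (LoRA-style) SFT update on the self-edit."""
    return model_facts | set(self_edit)

def qa_accuracy(model_facts, questions):
    """Fraction of questions answerable WITHOUT the passage in context."""
    hits = sum(any(q in fact for fact in model_facts) for q in questions)
    return hits / len(questions)

passage = "Saturn has rings. Titan orbits Saturn."
questions = ["Saturn has rings", "Titan orbits Saturn"]

model = set()  # knows nothing about the passage initially
before = qa_accuracy(model, questions)          # 0.0: no context, no knowledge
model = finetune(model, generate_self_edit(passage))
after = qa_accuracy(model, questions)           # 1.0: knowledge is now in weights
print(before, after)
```

The point of the sketch is the ordering: the self-edit (implications) is generated first, and only that generated data, not the raw passage, is used for the weight update.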

Outer RL loop
Optimizes the self-edit generation policy with reinforcement learning.
Inner loop
Updates the actual model parameters via SFT on the self-edit.
Long-term adaptation: weights are permanently modified through actual fine-tuning (SFT).
Self-edit quality is scored by downstream performance and optimized in the outer loop with ReSTEM. Relatively slow: each evaluation performs an actual fine-tune, taking tens of seconds.
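The outer ReSTEM loop can be sketched as rejection sampling plus imitation: sample self-edits from the current policy, keep only those whose inner-loop fine-tune earns a positive reward, and reinforce the kept samples. Everything below is a hedged toy stand-in: the "policy" is a weight per edit strategy rather than a real LM, the strategy names are hypothetical, and `reward` replaces the expensive fine-tune-then-evaluate step.

```python
import random

random.seed(0)  # make the sketch deterministic

def sample_edit(policy):
    """Sample one self-edit strategy proportionally to its policy weight."""
    strategies, weights = zip(*policy.items())
    return random.choices(strategies, weights=weights, k=1)[0]

def reward(edit):
    """Stand-in for: apply the self-edit via SFT, then evaluate downstream.
    Binary reward, as in ReSTEM-style rejection sampling."""
    return 1 if edit == "augment-and-tune" else 0

# Hypothetical strategy names; start from a uniform policy.
policy = {"augment-and-tune": 1.0, "no-op": 1.0}

for step in range(20):
    edit = sample_edit(policy)
    if reward(edit) > 0:      # rejection sampling: discard failed self-edits
        policy[edit] += 1.0   # SFT stand-in: imitate the successful sample

total = sum(policy.values())
print({k: round(v / total, 2) for k, v in policy.items()})
```

After the loop, probability mass has shifted toward the strategy that actually improves downstream performance, which is the mechanism by which the (slow) inner-loop reward shapes the (outer) self-edit generation policy.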