SEAL
SEAL is a framework that trains a model to generate 'self-edits': instructions and synthetic training data used to update its own weights. Given an input context, the model produces synthetic training data or suggests fine-tuning hyperparameters, applies a lightweight LoRA fine-tune using that self-edit, and then improves its self-edit generation policy via ReSTEM, using downstream performance as the reward.
Specifically, instead of fine-tuning directly on SQuAD passages, the model fine-tunes on implication sentences it generates from them, raising QA accuracy from 33.5% to 47.0% even without the passage in context. Additionally, on the ARC reasoning task, it automatically selects appropriate data augmentations and optimization settings, raising the success rate from 0% to 72.5%.
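The knowledge-incorporation flow above can be sketched as a toy Python loop. This is a minimal sketch, not SEAL's implementation: the "model" is just a set of known facts, `generate_self_edit` is a stand-in for the LM prompting itself for implications, and `finetune` stands in for the LoRA update.

```python
def generate_self_edit(passage):
    """Stand-in for the LM generating implication sentences from a passage."""
    facts = [s.strip() for s in passage.split(".") if s.strip()]
    # Real SEAL prompts the model itself; here we simply restate each fact.
    return [f"Implication: {fact}" for fact in facts]

def finetune(model_facts, self_edit):
    """Stand-in for a lightweight (LoRA-style) SFT update on the self-edit."""
    return model_facts | set(self_edit)

def qa_accuracy(model_facts, questions):
    """Fraction of questions answerable WITHOUT the passage in context."""
    hits = sum(any(q in fact for fact in model_facts) for q in questions)
    return hits / len(questions)

passage = "Saturn has rings. Titan orbits Saturn."
questions = ["Saturn has rings", "Titan orbits Saturn"]

model = set()  # knows nothing about the passage initially
before = qa_accuracy(model, questions)          # 0.0: no context, no knowledge
model = finetune(model, generate_self_edit(passage))
after = qa_accuracy(model, questions)           # 1.0: knowledge is now in weights
print(before, after)
```

The point of the sketch is the ordering: the self-edit (implications) is generated first, and only that generated data, not the raw passage, is used for the weight update.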

Outer RL loop
Optimizes the self-edit generation policy with reinforcement learning.
Inner loop
Updates the actual model parameters via SFT on the self-edit.
Long-term adaptation: weights are permanently modified through actual fine-tuning (SFT).
Self-edit quality is scored by downstream performance and optimized in the outer loop with ReSTEM. Relatively slow: each evaluation performs an actual fine-tune, taking tens of seconds.
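The outer ReSTEM loop can be sketched as rejection sampling plus imitation: sample self-edits from the current policy, keep only those whose inner-loop fine-tune earns a positive reward, and reinforce the kept samples. Everything below is a hedged toy stand-in: the "policy" is a weight per edit strategy rather than a real LM, the strategy names are hypothetical, and `reward` replaces the expensive fine-tune-then-evaluate step.

```python
import random

random.seed(0)  # make the sketch deterministic

def sample_edit(policy):
    """Sample one self-edit strategy proportionally to its policy weight."""
    strategies, weights = zip(*policy.items())
    return random.choices(strategies, weights=weights, k=1)[0]

def reward(edit):
    """Stand-in for: apply the self-edit via SFT, then evaluate downstream.
    Binary reward, as in ReSTEM-style rejection sampling."""
    return 1 if edit == "augment-and-tune" else 0

# Hypothetical strategy names; start from a uniform policy.
policy = {"augment-and-tune": 1.0, "no-op": 1.0}

for step in range(20):
    edit = sample_edit(policy)
    if reward(edit) > 0:      # rejection sampling: discard failed self-edits
        policy[edit] += 1.0   # SFT stand-in: imitate the successful sample

total = sum(policy.values())
print({k: round(v / total, 2) for k, v in policy.items()})
```

After the loop, probability mass has shifted toward the strategy that actually improves downstream performance, which is the mechanism by which the (slow) inner-loop reward shapes the (outer) self-edit generation policy.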