RLFR

Creator: Seonglae Cho
Created: 2026 Feb 13 12:58
Edited: 2026 Feb 18 14:29

Reinforcement Learning from Feature Rewards

RLFR detects suspected factual claims (entities) in model outputs and uses internal model features (interpretable representations) as rewards for RL. → Enables learning on open-ended tasks without human scoring or LLM judges: the model's own internal belief serves as the supervision signal.
This is a "self-supervised alignment" approach, suggesting that alignment may be possible without humans or LLM judges.
  • Detect suspected factual claims (entities) in model outputs
  • Model chooses to maintain / correct / retract
  • Evaluate actions using feature-based rewards
  • Update policy with RL (a sketch of this loop follows the list)
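A minimal sketch of the loop under loudly labeled assumptions: the capitalized-token claim detector, the length-parity "belief" heuristic, and the greedy score update below are all toy stand-ins, not the paper's implementation. A real system would detect entities with a tagger, read belief from internal features, and update the policy with an RL algorithm such as PPO.

```python
import random

# Hypothetical stand-ins for RLFR's components (names assumed, not from the source).
ACTIONS = ["maintain", "correct", "retract"]

def detect_entity_claims(output: str) -> list[str]:
    """Toy claim detector: treat capitalized tokens as suspected factual entities."""
    return [tok for tok in output.split() if tok[:1].isupper()]

def feature_reward(claim: str, action: str) -> float:
    """Toy feature-based reward. A real system would score the claim span with a
    probe over the model's internal activations instead of this parity heuristic."""
    believed_true = len(claim) % 2 == 0       # placeholder for the internal belief signal
    if believed_true:
        return 1.0 if action == "maintain" else -0.5
    # Maintaining a suspected hallucination is penalized; correcting/retracting is rewarded.
    return 1.0 if action in ("correct", "retract") else -1.0

def rlfr_step(output: str, policy: dict[str, float]) -> float:
    """One RLFR-style update: act on each suspected claim, collect feature rewards,
    and nudge action preferences (a crude stand-in for a PPO-style policy update)."""
    total = 0.0
    for claim in detect_entity_claims(output):
        action = max(ACTIONS, key=lambda a: policy[a] + random.gauss(0, 0.1))  # noisy greedy
        r = feature_reward(claim, action)
        policy[action] += 0.1 * r             # move preference toward rewarded actions
        total += r
    return total

policy = {a: 0.0 for a in ACTIONS}
print(rlfr_step("Paris is the capital of France", policy), policy)
```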
Rewards also come from features, at two granularities: span-level and intervention-level (probe sketch after this list)
  • Good corrections → high reward
  • Maintaining hallucinations → low reward
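A minimal sketch of a span-level feature reward, assuming a linear truthfulness probe read off the model's hidden states; `probe`, the activation layout, and the span indices are illustrative assumptions, not the paper's API.

```python
import torch

torch.manual_seed(0)

d_model = 16                                  # toy hidden size
probe = torch.nn.Linear(d_model, 1)           # assumed pre-trained truthfulness probe

def span_reward(hidden_states: torch.Tensor, span: tuple[int, int]) -> float:
    """Span-level reward: mean-pool activations over the claim's token span,
    then read the probe's truthfulness score, squashed into [-1, 1]."""
    start, end = span
    pooled = hidden_states[start:end].mean(dim=0)    # (d_model,)
    return torch.tanh(probe(pooled)).item()

# Toy usage: 10 tokens of activations; the suspected claim spans tokens 3..7.
acts = torch.randn(10, d_model)
print(span_reward(acts, (3, 7)))
# An intervention-level reward would instead compare span_reward before and
# after steering the feature direction in the activations.
```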
Results with Gemma-3-12B:
  • Hallucinations reduced by 58%
  • ~90x cheaper than an LLM judge
  • Maintains general benchmark performance
Can also be used for test-time scaling methods like Best-of-N sampling (sketch below).
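Because the feature reward is just a scalar scorer, it can rerank candidates at test time with no training. A minimal Best-of-N sketch; the sampler and scorer below are toy stand-ins for the policy model and the feature reward.

```python
import random

def best_of_n(prompt: str, sample, score, n: int = 8) -> str:
    """Test-time scaling: draw n candidates, rerank by feature reward, keep the best.
    No gradient updates; the feature reward is used purely as a selector."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins (assumed, not the paper's interface): a random sampler and a
# score table that a real system would replace with the span-level feature reward.
scores = {"hedged answer": 0.2, "confident hallucination": -0.8, "grounded answer": 0.9}
pick = best_of_n("Q?", lambda p: random.choice(list(scores)), scores.get, n=8)
print(pick)   # usually "grounded answer"
```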
Open-ended tasks (where correct answers are hard to define):
  • factuality
  • helpfulness
  • honesty
  • alignment
→ Difficult to train with traditional RL
This paper's claim:
Interpretability → can be used as a supervision signal
teacher model = frozen reference model + feature extractor + reward computation (composition sketched below)
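A sketch of how such a teacher could be composed, reusing the probe-over-activations idea above; `FeatureRewardTeacher` and its interface are hypothetical, not the paper's code.

```python
import torch

class FeatureRewardTeacher:
    """Assumed composition of the teacher: a frozen reference model supplies
    activations, a feature extractor (here a linear probe) reads them, and the
    probe's output is turned into a scalar reward. Names are illustrative."""

    def __init__(self, reference_model: torch.nn.Module, d_model: int):
        self.reference_model = reference_model.eval()   # frozen: never updated
        for p in self.reference_model.parameters():
            p.requires_grad_(False)
        self.probe = torch.nn.Linear(d_model, 1)        # feature extractor

    @torch.no_grad()
    def reward(self, token_ids: torch.Tensor) -> float:
        feats = self.reference_model(token_ids)         # (seq, d_model) activations
        return torch.tanh(self.probe(feats.mean(dim=0))).item()  # scalar reward

# Toy usage with a stand-in "model" that maps token ids to activations.
d = 16
toy_model = torch.nn.Sequential(torch.nn.Embedding(100, d))
teacher = FeatureRewardTeacher(toy_model, d)
print(teacher.reward(torch.tensor([1, 2, 3])))
```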
 
 

Browser

Longfact++ Rollout Viewer - Features as Rewards (RLFR) (arxiv.org)
Selected rollouts of Gemma 3 12B on LongFact++ before and after RLFR to detect and reduce hallucinations.
 
 
