- implicit memory (learning/RL): generalizes well but is a black box, suffers from catastrophic forgetting, and is hard to interpret
- explicit memory (prompts/external memory): transparent but static, weak at adaptation/generalization
This paper bridges the gap between the two. Instead of just storing experiences, it standardizes trajectories with a finite-state machine (FSM), extracts meta-strategies (meta-cognition) from them, and builds a trainable graph memory that learns the "usefulness" of these strategies from RL reward signals.
- Graph Memory Construction (hierarchical): Query nodes → FSM canonical path nodes → Meta-cognition nodes (human-readable strategy sentences).
- Weight Learning (REINFORCE): based on ΔR = R_with − R_w/o, the reward with a meta-strategy minus the reward without it, strengthen or weaken the graph edge weights leading to that strategy (utility-based selection).
- Integration into RL Training: for each training query, retrieve the top-k meta-strategies and prepend them to the prompt as a policy prior, then train with GRPO (a minimal sketch follows this list).
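A minimal Python sketch of how these pieces could fit together, assuming a simple edge-weighted graph; class and function names (`GraphMemory`, `select_strategies`, `update_weights`, `build_prompt`), the learning rate, and the reward values are illustrative assumptions, not the paper's implementation, and the GRPO training loop itself is omitted.

```python
from collections import defaultdict

class GraphMemory:
    """Hypothetical sketch: query -> FSM canonical path -> meta-cognition strategy,
    with learnable edge weights acting as utility scores."""

    def __init__(self, lr: float = 0.1):
        self.lr = lr
        self.weights = defaultdict(float)   # edge (src, dst) -> scalar utility
        self.edges = defaultdict(list)      # adjacency: node -> children

    def add_path(self, query: str, fsm_path: str, strategy: str):
        # Hierarchical construction: query node -> FSM canonical path node -> strategy node
        for src, dst in ((query, fsm_path), (fsm_path, strategy)):
            if dst not in self.edges[src]:
                self.edges[src].append(dst)

    def select_strategies(self, query: str, top_k: int = 3):
        # Utility-based selection: score each reachable strategy by summed edge weights
        scored = []
        for path in self.edges.get(query, []):
            for strategy in self.edges.get(path, []):
                score = self.weights[(query, path)] + self.weights[(path, strategy)]
                scored.append((score, path, strategy))
        scored.sort(reverse=True)
        return scored[:top_k]

    def update_weights(self, query: str, path: str, strategy: str,
                       r_with: float, r_without: float):
        # REINFORCE-style update: ΔR = R_with - R_w/o pushes the edges
        # leading to the selected strategy up (helpful) or down (harmful)
        delta_r = r_with - r_without
        for edge in ((query, path), (path, strategy)):
            self.weights[edge] += self.lr * delta_r


def build_prompt(memory: GraphMemory, query: str, top_k: int = 3) -> str:
    # Prepend retrieved meta-strategies to the query as a policy prior
    # before GRPO rollouts (GRPO itself not shown here)
    strategies = [s for _, _, s in memory.select_strategies(query, top_k)]
    prior = "\n".join(f"- {s}" for s in strategies)
    return f"Useful strategies:\n{prior}\n\nQuestion: {query}"


if __name__ == "__main__":
    mem = GraphMemory()
    mem.add_path("multi-hop QA", "decompose->retrieve->verify",
                 "Break the question into sub-questions and verify each hop.")
    # Suppose a rollout with the strategy scored 1.0 and without it 0.0:
    mem.update_weights("multi-hop QA", "decompose->retrieve->verify",
                       "Break the question into sub-questions and verify each hop.",
                       r_with=1.0, r_without=0.0)
    print(build_prompt(mem, "multi-hop QA"))
```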
Reports improved inference performance and RL training convergence across 7 QA benchmarks.
https://arxiv.org/pdf/2511.07800

Seonglae Cho