- implicit memory (learning/RL): generalizes well but is a black box, suffers from catastrophic forgetting, and is hard to interpret
- explicit memory (prompts/external memory): transparent but static, weak at adaptation/generalization
This paper bridges the gap between the two. Instead of just storing experiences, it standardizes trajectories with a finite-state machine (FSM), extracts meta-strategies (meta-cognition) from them, and builds a trainable graph memory that learns the "usefulness" of these strategies from RL reward signals.
- Graph Memory Construction (hierarchical): Query nodes → FSM canonical path nodes → Meta-cognition nodes (human-readable strategy sentences).
- Weight Learning (REINFORCE): based on ΔR = R_with − R_w/o, the reward with a meta-strategy minus the reward without it, strengthen or weaken the graph edge weights leading to that strategy (utility-based selection).
- Integration into RL Training: for each training query, retrieve the top-k meta-strategies and prepend them to the prompt as a policy prior, then train with GRPO (a minimal sketch follows this list).
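A minimal Python sketch of how these pieces could fit together, assuming a simple edge-weighted graph; class and function names (`GraphMemory`, `select_strategies`, `update_weights`, `build_prompt`), the learning rate, and the reward values are illustrative assumptions, not the paper's implementation, and the GRPO training loop itself is omitted.

```python
from collections import defaultdict

class GraphMemory:
    """Hypothetical sketch: query -> FSM canonical path -> meta-cognition strategy,
    with learnable edge weights acting as utility scores."""

    def __init__(self, lr: float = 0.1):
        self.lr = lr
        self.weights = defaultdict(float)   # edge (src, dst) -> scalar utility
        self.edges = defaultdict(list)      # adjacency: node -> children

    def add_path(self, query: str, fsm_path: str, strategy: str):
        # Hierarchical construction: query node -> FSM canonical path node -> strategy node
        for src, dst in ((query, fsm_path), (fsm_path, strategy)):
            if dst not in self.edges[src]:
                self.edges[src].append(dst)

    def select_strategies(self, query: str, top_k: int = 3):
        # Utility-based selection: score each reachable strategy by summed edge weights
        scored = []
        for path in self.edges.get(query, []):
            for strategy in self.edges.get(path, []):
                score = self.weights[(query, path)] + self.weights[(path, strategy)]
                scored.append((score, path, strategy))
        scored.sort(reverse=True)
        return scored[:top_k]

    def update_weights(self, query: str, path: str, strategy: str,
                       r_with: float, r_without: float):
        # REINFORCE-style update: ΔR = R_with - R_w/o pushes the edges
        # leading to the selected strategy up (helpful) or down (harmful)
        delta_r = r_with - r_without
        for edge in ((query, path), (path, strategy)):
            self.weights[edge] += self.lr * delta_r


def build_prompt(memory: GraphMemory, query: str, top_k: int = 3) -> str:
    # Prepend retrieved meta-strategies to the query as a policy prior
    # before GRPO rollouts (GRPO itself not shown here)
    strategies = [s for _, _, s in memory.select_strategies(query, top_k)]
    prior = "\n".join(f"- {s}" for s in strategies)
    return f"Useful strategies:\n{prior}\n\nQuestion: {query}"


if __name__ == "__main__":
    mem = GraphMemory()
    mem.add_path("multi-hop QA", "decompose->retrieve->verify",
                 "Break the question into sub-questions and verify each hop.")
    # Suppose a rollout with the strategy scored 1.0 and without it 0.0:
    mem.update_weights("multi-hop QA", "decompose->retrieve->verify",
                       "Break the question into sub-questions and verify each hop.",
                       r_with=1.0, r_without=0.0)
    print(build_prompt(mem, "multi-hop QA"))
```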
Reports improved inference performance and RL training convergence across 7 QA benchmarks.
https://arxiv.org/pdf/2511.07800

Seonglae Cho