
- A1: Explicit RL / RLHF / RLAIF using tool execution results as the reward signal (see the first sketch after this list)
- T2: Indirect RL (teacher–student, distillation, preference learning) using agent output as the reward or critic signal
- A2: Heavy use of supervised / preference-based fine-tuning
- T1: Focus on learning the retrieval / tool-selection model itself (contrastive or supervised objectives; see the second sketch below)
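
A minimal sketch of the A1 idea: the reward comes from actually executing the model's tool call and checking the outcome. The function name `execution_reward`, the reward values, and the use of a Python subprocess as the "tool" are assumptions for illustration; the scalar it returns would feed a policy-gradient / RLHF update.

```python
import subprocess


def execution_reward(code: str, expected_stdout: str, timeout_s: float = 5.0) -> float:
    """Score a model-generated tool call (here: a Python snippet) by running it.

    Hypothetical reward shaping:
      1.0 -> runs and produces the expected output
      0.2 -> runs without error but output differs
      0.0 -> crashes or times out
    """
    try:
        proc = subprocess.run(
            ["python", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    if proc.returncode != 0:
        return 0.0
    return 1.0 if proc.stdout.strip() == expected_stdout.strip() else 0.2


# The returned scalar would be logged per rollout and used as the RL reward.
print(execution_reward("print(2 + 2)", expected_stdout="4"))  # -> 1.0
```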

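And a minimal sketch of the T1 idea: training the retriever itself with an in-batch-negative contrastive (InfoNCE-style) objective. The linear encoders, random feature tensors, and temperature value are placeholders; in practice the encoders would be transformer query/document towers over real (query, positive document) pairs.

```python
import torch
import torch.nn.functional as F

dim, batch = 32, 8
query_enc = torch.nn.Linear(dim, dim)  # stand-in for a query encoder
doc_enc = torch.nn.Linear(dim, dim)    # stand-in for a document/tool encoder
opt = torch.optim.Adam(
    list(query_enc.parameters()) + list(doc_enc.parameters()), lr=1e-3
)

# Random stand-ins for aligned (query, positive document) feature pairs.
queries = torch.randn(batch, dim)
docs = torch.randn(batch, dim)

for _ in range(10):
    q = F.normalize(query_enc(queries), dim=-1)
    d = F.normalize(doc_enc(docs), dim=-1)
    # In-batch negatives: similarity matrix whose diagonal holds the positives.
    logits = q @ d.T / 0.07
    labels = torch.arange(batch)
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```
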
Seonglae Cho