Perform SFT while keeping only samples with reward 1: a synthetic data + SFT + reward filtering loop (https://arxiv.org/pdf/2312.06585).
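
A minimal sketch of the loop described above, assuming a binary reward in {0, 1}. The names `generate`, `reward_fn`, and `sft` are hypothetical callables standing in for a sampler, a correctness checker, and a fine-tuning step; they are not taken from the linked paper, and details such as whether each round continues from the previous model or restarts from the base model are left to the caller's `sft`.

```python
from typing import Any, Callable, List, Tuple

Sample = Tuple[str, str]  # (prompt, completion)


def filter_by_reward(
    samples: List[Sample],
    reward_fn: Callable[[str, str], int],
    keep: int = 1,
) -> List[Sample]:
    """Keep only the samples whose reward equals `keep` (reward 1 by default)."""
    return [(p, c) for p, c in samples if reward_fn(p, c) == keep]


def self_training_loop(
    model: Any,
    prompts: List[str],
    generate: Callable[[Any, str, int], List[str]],  # hypothetical sampler
    reward_fn: Callable[[str, str], int],            # hypothetical 0/1 reward
    sft: Callable[[Any, List[Sample]], Any],         # hypothetical SFT step
    iterations: int = 3,
    samples_per_prompt: int = 4,
) -> Any:
    """Alternate between sampling synthetic data, filtering it, and SFT."""
    for _ in range(iterations):
        # 1. Generate synthetic completions from the current model.
        candidates = [
            (p, c)
            for p in prompts
            for c in generate(model, p, samples_per_prompt)
        ]
        # 2. Reward filtering: discard everything that did not score 1.
        kept = filter_by_reward(candidates, reward_fn)
        # 3. Fine-tune on the surviving samples (skip if nothing survived).
        if kept:
            model = sft(model, kept)
    return model
```

In this sketch the filter is a hard threshold: a sample either enters the SFT set unchanged or is dropped, so no reward-weighted loss is involved.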