Jailbreaking with PPO: an explicitly designed action space, a dense reward function with no value network, and genetic-algorithm-based refinement. https://arxiv.org/pdf/2406.08725