Critique is easier than generation.
Rank
RLHF: reward modeling → generate outputs that maximize the reward, e.g. with PPO
RLAIF: AI critic
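
A minimal sketch of the reward-modeling idea in the notes above, under stated assumptions. The notes do not say how the reward model is trained; the pairwise Bradley-Terry style loss used here is one common choice and is an assumption. `toy_reward` is a hypothetical stand-in for a learned reward model, and `best_of_n` (pick the highest-reward candidate) is a much simpler substitute for PPO, used only to show "generate, then maximize reward".

```python
import math


def pairwise_ranking_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Assumption: the reward model is trained on ranked pairs, where the
    'chosen' output was preferred over the 'rejected' one.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == softplus(-m), computed in a numerically stable way.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def toy_reward(text: str) -> float:
    """Hypothetical reward: counts distinct words (stand-in for a learned model)."""
    return float(len(set(text.split())))


def best_of_n(candidates, reward_fn):
    """Return the candidate with the highest reward (simple stand-in for PPO)."""
    return max(candidates, key=reward_fn)


if __name__ == "__main__":
    # The loss is small when the preferred output already scores higher.
    print(pairwise_ranking_loss(2.0, 0.0))
    print(pairwise_ranking_loss(0.0, 2.0))
    print(best_of_n(["a a a", "a b c"], toy_reward))
```

For RLAIF, the same structure would apply with the preference labels (which output is "chosen") coming from an AI critic instead of human raters.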