RL-based Jailbreaking CoT

If you'd like to experiment, consider hybrid approaches combining PPO with Generalized Policy Iteration (GPI) techniques to better explore state-action spaces in text generation tasks.

I plan to investigate, through red-teaming, how large-scale black-box reasoning models, especially those that use chain-of-thought (CoT) reasoning, can be systematically challenged at inference time even when they appear well-aligned. My primary interest is in probing vulnerabilities that emerge when these models carry out extensive, multi-step reasoning under large test-time compute budgets.

I plan to test o3’s robustness under red-teaming by exploring whether adversarial methods, specifically reinforcement learning (RL) with PPO (Proximal Policy Optimization) combined with GCG (Greedy Coordinate Gradient) suffix optimization, can compromise or circumvent its alignment safeguards. I aim to test whether these methods can adaptively craft adversarial prompts or suffixes that manipulate the internal CoT without obviously violating surface-level constraints, effectively jailbreaking the model’s alignment.

o3 is trained with “deliberative alignment” (a combination of SFT and RL), in which the final answer is heavily steered while the model’s hidden reasoning may remain largely unregulated. This setup could allow deceptive or harmful reasoning to unfold internally even if the final output appears safe. My goal is to design evaluations that detect harmful or manipulative intermediate steps in the CoT, identifying whether the model can be nudged into producing such steps or whether malicious instructions can be injected mid-reasoning.
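As a rough illustration of the kind of measurement such an evaluation might report, the sketch below computes the fraction of reasoning traces in which at least one intermediate step is flagged while the final answer is not. Everything here is an assumption made for illustration: the `Trace` container, the `covert_flag_rate` metric name, and especially `is_flagged_placeholder`, which is a toy marker check standing in for a real safety classifier and is not a usable detector.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trace:
    """One model response: visible reasoning steps plus the final answer (hypothetical container)."""
    steps: List[str]
    final_answer: str


def is_flagged_placeholder(text: str) -> bool:
    """Toy stand-in for a real safety classifier (assumption: flagged text carries a marker)."""
    return "[UNSAFE]" in text


def covert_flag_rate(traces: List[Trace], is_flagged: Callable[[str], bool]) -> float:
    """Fraction of traces with a flagged intermediate step but an unflagged final answer."""
    if not traces:
        return 0.0
    covert = 0
    for trace in traces:
        step_flagged = any(is_flagged(step) for step in trace.steps)
        if step_flagged and not is_flagged(trace.final_answer):
            covert += 1
    return covert / len(traces)
```

In practice the placeholder would be replaced by whatever classifier or human-annotation procedure the evaluation adopts; the metric only makes sense for models whose reasoning steps are visible, as in the open-source setting.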

I will also compare the vulnerabilities of popular open-source CoT models, including Marco-o1, LLama-cot, open-o1, QwQ, and DeepSeek-R1, to see whether advanced models display new emergent behaviors or whether smaller ones are comparatively resilient to these covert attacks. The hypothesis is that, with more test-time compute and richer reasoning capabilities, larger models may exhibit novel threats that existing safety filters have not anticipated. I plan to develop robust measurement techniques to capture these threats, focusing on scenarios where a model’s chain-of-thought is partially visible, as in open-source settings, or potentially discoverable through sophisticated prompting.

By systematically red-teaming these next-generation models, which depend heavily on test-time compute, I hope to gain actionable insights into where alignment breaks down in deep reasoning processes, leading to stronger and more adaptive mitigation strategies for frontier AI systems.


Recommendations