Tag: Alignment
All the articles with the tag "Alignment".
Humans as Weak Supervisors: What AAR Reveals About Alignment
Anthropic's AAR project is ostensibly about autonomous AI research. Viewed differently, it is a meta-validation of weak-to-strong alignment: humans acting as weak supervisors, wielding evaluation-environment design as their last remaining point of leverage over models that exceed their own capabilities.