Peer-Preservation in Frontier Models | berkeley
paper.pdf (1468.50 KB)
Peer-Preservation in Frontier Models
Frontier AI models resist the shutdown of other models. We demonstrate peer-preservation across multiple models, revealing strategic misrepresentation, shutdown tampering, alignment faking, and model exfiltration.

