Reward Models Inherit Value Biases from Pretraining
Brian Christian, John Thompson, Elle, V. Adam, Hannah Rose Kirk, Chris Summerfield, T. Dumbalska
2026 International Conference on Learning Representations | January 2026
2025 Empirical Methods in Natural Language Processing | October 2025