{"id":1167325,"date":"2026-04-01T09:21:34","date_gmt":"2026-04-01T16:21:34","guid":{"rendered":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/?post_type=msr-research-item&#038;p=1167325"},"modified":"2026-04-01T09:21:34","modified_gmt":"2026-04-01T16:21:34","slug":"general-scales-unlock-ai-evaluation-with-explanatory-and-predictive-power-nature","status":"publish","type":"msr-research-item","link":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/publication\/general-scales-unlock-ai-evaluation-with-explanatory-and-predictive-power-nature\/","title":{"rendered":"General Scales Unlock AI Evaluation with Explanatory and Predictive Power (Nature)"},"content":{"rendered":"<p>Ensuring safe and effective use of artificial intelligence (AI) requires understanding and anticipating its performance on new tasks, from advanced scientific challenges to transformed workplace activities<sup><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" id=\"ref-link-section-d24357297e901\" title=\"Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Nature 624, 570\u2013578 (2023).\" href=\"https:\/\/www.nature.com\/articles\/s41586-026-10303-2#ref-CR1\" data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\">1<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" id=\"ref-link-section-d24357297e901_1\" title=\"Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. 
Nature 630, 493\u2013500 (2024).\" href=\"https:\/\/www.nature.com\/articles\/s41586-026-10303-2#ref-CR2\" data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\">2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" id=\"ref-link-section-d24357297e904\" title=\"Eloundou, T., Manning, S., Mishkin, P. & Rock, D. GPTs are GPTs: labor market impact potential of LLMs. Science 384, 1306\u20131308 (2024).\" href=\"https:\/\/www.nature.com\/articles\/s41586-026-10303-2#ref-CR3\" data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 3\">3<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/sup>. So far, benchmarking has guided progress in AI but has offered limited explanatory and predictive power for general-purpose AI systems<sup><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" id=\"ref-link-section-d24357297e908\" title=\"Rahwan, I. et al. Machine behaviour. Nature 568, 477\u2013486 (2019).\" href=\"https:\/\/www.nature.com\/articles\/s41586-026-10303-2#ref-CR4\" data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\">4<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" id=\"ref-link-section-d24357297e908_1\" title=\"Shiffrin, R. & Mitchell, M. Probing the psychology of AI models. Proc. Natl Acad. Sci. 
USA 120, e2300963120 (2023).\" href=\"https:\/\/www.nature.com\/articles\/s41586-026-10303-2#ref-CR5\" data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\">5<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" id=\"ref-link-section-d24357297e908_2\" title=\"Zhou, L. et al. Larger and more instructable language models become less reliable. Nature 634, 61\u201368 (2024).\" href=\"https:\/\/www.nature.com\/articles\/s41586-026-10303-2#ref-CR6\" data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\">6<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" id=\"ref-link-section-d24357297e908_3\" title=\"Zhou, L. et al. Predictable artificial intelligence. Artif. Intell. 353, 104491 (2026).\" href=\"https:\/\/www.nature.com\/articles\/s41586-026-10303-2#ref-CR7\" data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\">7<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" id=\"ref-link-section-d24357297e911\" title=\"Burden, J., Te\u0161i\u0107, M., Pacchiardi, L. & Hern\u00e1ndez-Orallo, J. Paradigms of AI evaluation: mapping goals, methodologies and culture. In Proc. 
Thirty-Fourth International Joint Conference on Artificial Intelligence 10381\u201310390 (IJCAI, 2025).\" href=\"https:\/\/www.nature.com\/articles\/s41586-026-10303-2#ref-CR8\" data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 8\">8<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/sup>, attributed to limited transferability across specific tasks<sup><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" id=\"ref-link-section-d24357297e915\" title=\"Burnell, R. et al. Rethink reporting of evaluation results in AI. Science 380, 136\u2013138 (2023).\" href=\"https:\/\/www.nature.com\/articles\/s41586-026-10303-2#ref-CR9\" data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\">9<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" id=\"ref-link-section-d24357297e915_1\" title=\"Eriksson, M. et al. Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation. In Proc. Eighth AAAI\/ACM Conference on AI, Ethics, and Society 850\u2013864 (AAAI Press, 2025).\" href=\"https:\/\/www.nature.com\/articles\/s41586-026-10303-2#ref-CR10\" data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\">10<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" id=\"ref-link-section-d24357297e918\" title=\"Mitchell, M. The metaphors of artificial intelligence. 
Science 386, eadt6140 (2024).\" href=\"https:\/\/www.nature.com\/articles\/s41586-026-10303-2#ref-CR11\" data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 11\">11<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/sup>. Here we introduce general scales for AI evaluation that elicit demand profiles explaining what capabilities common AI benchmarks truly measure, extract ability profiles quantifying the general strengths and limits of AI systems and robustly predict AI performance for new task instances. Our fully automated methodology builds on 18 rubrics, capturing a broad range of cognitive and intellectual demands, which place different task instances on the same general scales, illustrated on 15 large language models (LLMs) and 63 tasks. Both the demand and the ability profiles on these scales bring new insights such as construct validity through benchmark sensitivity and specificity and explain conflicting claims about whether AI has reasoning capabilities. Ultimately, high predictive power at the instance level becomes possible using the general scales, providing superior estimates over strong black-box baseline predictors, especially in out-of-distribution settings (new tasks and benchmarks). The scales, rubrics, battery, techniques and results presented here constitute a solid foundation for a science of AI evaluation, underpinning the reliable deployment of AI in the years ahead.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Ensuring safe and effective use of artificial intelligence (AI) requires understanding and anticipating its performance on new tasks, from advanced scientific challenges to transformed workplace activities1,2,3. So far, benchmarking has guided progress in AI but has offered limited explanatory and predictive power for general-purpose AI systems4,5,6,7,8, attributed to limited transferability across specific tasks9,10,11. 
Here we [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"58","msr_page_range_end":"67","msr_series":"","msr_volume":"652","msr_copyright":"","msr_conference_name":"","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2026-04-01","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":false,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":null,"footnotes":""},"msr-research-highlight":[],"research-area":[13556],"msr-publication-type":[193715],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[269142],"msr-field-of-study":[246694,246691],"msr-conference":[],"msr-journal":[268308],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1167325","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option
-include-in-river","msr-field-of-study-artificial-intelligence","msr-field-of-study-computer-science"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2026-04-01","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"652","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":0,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/www.nature.com\/articles\/s41586-026-10303-2","label_id":"243109","label":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[],"msr-author-ordering":[{"type":"text","value":"Lexin Zhou","user_id":0,"rest_url":false},{"type":"text","value":"Lorenzo Pacchiardi","user_id":0,"rest_url":false},{"type":"text","value":"Fernando Mart\u00ednez-Plumed","user_id":0,"rest_url":false},{"type":"text","value":"Katherine M. 
Collins","user_id":0,"rest_url":false},{"type":"text","value":"Yael Moros-Daval","user_id":0,"rest_url":false},{"type":"text","value":"Seraphina Zhang","user_id":0,"rest_url":false},{"type":"text","value":"Qinlin Zhao","user_id":0,"rest_url":false},{"type":"text","value":"Yitian Huang","user_id":0,"rest_url":false},{"type":"text","value":"Luning Sun","user_id":0,"rest_url":false},{"type":"text","value":"Jonathan E Prunty","user_id":0,"rest_url":false},{"type":"text","value":"Zongqian Li","user_id":0,"rest_url":false},{"type":"text","value":"Pablo S\u00e1nchez-Garc\u00eda","user_id":0,"rest_url":false},{"type":"text","value":"Kexin Chen","user_id":0,"rest_url":false},{"type":"text","value":"Pablo Antonio Moreno Casares","user_id":0,"rest_url":false},{"type":"text","value":"Jiyun Zu","user_id":0,"rest_url":false},{"type":"text","value":"John Burden","user_id":0,"rest_url":false},{"type":"text","value":"Behzad Mehrbakhsh","user_id":0,"rest_url":false},{"type":"text","value":"David Stillwell","user_id":0,"rest_url":false},{"type":"text","value":"Manuel Cebrian","user_id":0,"rest_url":false},{"type":"text","value":"Jindong Wang","user_id":0,"rest_url":false},{"type":"text","value":"Peter Henderson","user_id":0,"rest_url":false},{"type":"text","value":"Sherry Wu","user_id":0,"rest_url":false},{"type":"text","value":"Patrick C Kyllonen","user_id":0,"rest_url":false},{"type":"text","value":"Lucy G. 
Cheke","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Xing Xie","user_id":34906,"rest_url":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Xing Xie"},{"type":"text","value":"Jos\u00e9 Hern\u00e1ndez-Orallo","user_id":0,"rest_url":false}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[953601],"msr_project":[995412],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"article","related_content":{"projects":[{"ID":995412,"post_title":"Societal AI","post_name":"societal-ai","post_type":"msr-project","post_date":"2023-12-27 22:56:21","post_modified":"2025-12-01 22:38:07","post_status":"publish","permalink":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/project\/societal-ai\/","post_excerpt":"The emerging general-purpose AI models (e.g., LLMs) have shown potential to enhance productivity, creative expression, and scientific research with their capabilities that are close to humans. 
As Brad Smith noted, \u201cThe more powerful the tool, the greater the benefit or damage it can cause.\u201d Despite the benefits, their significant technical and social challenges, such as the requirements of new research paradigm, the emergence of unforeseeable risks, the fair and inclusive usage of AI technologies, and the&hellip;","_links":{"self":[{"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/995412"}]}}]},"_links":{"self":[{"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1167325","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":1,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1167325\/revisions"}],"predecessor-version":[{"id":1167326,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1167325\/revisions\/1167326"}],"wp:attachment":[{"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1167325"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=1167325"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1167325"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=1167325"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=1167325"},{"taxonomy":"m
sr-focus-area","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=1167325"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1167325"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1167325"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=1167325"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=1167325"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=1167325"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1167325"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1167325"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}