{"id":1169055,"date":"2026-04-20T09:50:19","date_gmt":"2026-04-20T16:50:19","guid":{"rendered":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/publication\/the-illusion-of-readiness-in-health-ai\/"},"modified":"2026-04-27T16:27:02","modified_gmt":"2026-04-27T23:27:02","slug":"the-illusion-of-readiness-in-health-ai","status":"publish","type":"msr-research-item","link":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/publication\/the-illusion-of-readiness-in-health-ai\/","title":{"rendered":"The Illusion of Readiness in Health AI"},"content":{"rendered":"<p>Large language models have demonstrated remarkable performance in a wide range of medical benchmarks. Yet underneath the seemingly promising results lie salient growth areas, especially in cutting-edge frontiers such as multimodal reasoning. In this paper, we introduce a series of adversarial stress tests to systematically assess the robustness of flagship models and medical benchmarks. Our study reveals prevalent brittleness in the presence of simple adversarial transformations: leading systems can guess the right answer even with key inputs removed, yet may get confused by the slightest prompt alterations, while fabricating convincing yet flawed reasoning traces. Using clinician-guided rubrics, we demonstrate that popular medical benchmarks vary widely in what they truly measure. Our study reveals significant competency gaps of frontier AI in attaining real-world readiness for health applications. If we want AI to earn trust in healthcare, we must demand more than leaderboard wins and must hold AI systems accountable to ensure robustness, sound reasoning, and alignment with real medical demands.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Large language models have demonstrated remarkable performance in a wide range of medical benchmarks. Yet underneath the seemingly promising results lie salient growth areas, especially in cutting-edge frontiers such as multimodal reasoning. In this paper, we introduce a series of adversarial stress tests to systematically assess the robustness of flagship models and medical benchmarks. Our [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"arXiv","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2025-09-22","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":false,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[{"provider":"s2","id":"8f4853fffa11b9d68acf9c812a6a73a84931df78"},{"provider":"arxiv","id":"2509.18234"}],"msr_hide_image_in_river":null,"footnotes":""},"msr-research-highlight":[],"research-area":[13556,13553],"msr-publication-type":[193724],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[246694,246691],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1169055","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-medical-health-genomics","msr-locale-en_us","msr-field-of-study-artificial-intelligence","msr-field-of-study-computer-science"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2025-09-22","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"arXiv","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":0,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/arxiv.org\/abs\/2509.18234","label_id":"243109","label":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[],"msr-author-ordering":[{"type":"user_nicename","value":"Yu Gu","user_id":44020,"rest_url":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Yu Gu"},{"type":"text","value":"Jingjing Fu","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Xiaodong Liu","user_id":34877,"rest_url":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Xiaodong Liu"},{"type":"user_nicename","value":"Jeya Maria Jose Valanarasu","user_id":43491,"rest_url":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Jeya Maria Jose Valanarasu"},{"type":"user_nicename","value":"Noel Codella","user_id":41635,"rest_url":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Noel Codella"},{"type":"user_nicename","value":"Reuben Tan","user_id":43827,"rest_url":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Reuben Tan"},{"type":"text","value":"Jinyu Wang","user_id":0,"rest_url":false},{"type":"name","value":"Qianchu Liu","user_id":0,"rest_url":false},{"type":"name","value":"Ying Jin","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Sheng Zhang","user_id":39087,"rest_url":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Sheng Zhang"},{"type":"user_nicename","value":"Rui Wang","user_id":40192,"rest_url":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Rui Wang"},{"type":"text","value":"Lei Song","user_id":0,"rest_url":false},{"type":"name","value":"Guan-Hua Qin","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Naoto Usuyama","user_id":38670,"rest_url":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Naoto Usuyama"},{"type":"user_nicename","value":"Cliff Wong","user_id":38508,"rest_url":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Cliff Wong"},{"type":"name","value":"Cheng Hao","user_id":0,"rest_url":false},{"type":"name","value":"H. Lee","user_id":0,"rest_url":false},{"type":"name","value":"Praneeth Sanapathi","user_id":0,"rest_url":false},{"type":"name","value":"Sarah Hilado","user_id":0,"rest_url":false},{"type":"name","value":"Bian Jiang","user_id":0,"rest_url":false},{"type":"text","value":"Javier Alvarez-Valle","user_id":0,"rest_url":false},{"type":"name","value":"Mu Wei","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Jianfeng Gao","user_id":32246,"rest_url":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Jianfeng Gao"},{"type":"user_nicename","value":"Eric Horvitz","user_id":32033,"rest_url":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Eric Horvitz"},{"type":"user_nicename","value":"Matthew Lungren","user_id":42792,"rest_url":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Matthew Lungren"},{"type":"user_nicename","value":"Hoifung Poon","user_id":32016,"rest_url":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Hoifung Poon"},{"type":"name","value":"Paul Vozila","user_id":0,"rest_url":false}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[],"msr_project":[],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"miscellaneous","related_content":[],"_links":{"self":[{"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1169055","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":2,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1169055\/revisions"}],"predecessor-version":[{"id":1169967,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1169055\/revisions\/1169967"}],"wp:attachment":[{"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1169055"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=1169055"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1169055"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=1169055"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=1169055"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=1169055"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1169055"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1169055"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=1169055"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=1169055"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=1169055"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1169055"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1169055"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}