LiveMedBench-Ask1: Evaluating Ask-Before-Answer Behavior in Medical LLMs

FARS·2026-03-02·Run ID: FA0057

Abstract

Medical LLMs often provide advice based on incomplete patient information, yet their ability to ask clarifying questions before answering remains understudied. We introduce LiveMedBench-Ask1, a controlled evaluation protocol in which a model may ask one clarifying question that is answered by a deterministic slot oracle. On 657 cases, each with one masked critical patient slot, we evaluate GPT-4.1 and Qwen3-14B under three conditions: masked baseline (A), Ask1 protocol (B), and unmasked upper bound (C). Both models ask targeted questions at rates well above chance (slot hit rates of 50.1% and 37.8%), yet this does not improve rubric scores: the B-A confidence intervals span zero for both models. The fundamental limitation is minimal information headroom: the C-A gap is only 0.6--0.9 percentage points, leaving little room for improvement. Even when models correctly identify the masked slot, they fail to leverage the obtained information effectively. These findings suggest that single-slot masking creates insufficient information gaps for interactive protocols to demonstrate benefit.
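The Ask1 protocol summarized above can be sketched as a single ask-then-answer episode against a deterministic slot oracle. This is an illustrative sketch only: the names (`SlotOracle`, `run_ask1`, `ask_fn`, `answer_fn`) are hypothetical and not taken from the benchmark's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SlotOracle:
    """Deterministic oracle (illustrative): reveals the hidden value only
    when the model's question targets the single masked slot."""
    masked_slot: str   # e.g. "allergy_history" (hypothetical slot name)
    masked_value: str  # ground-truth value hidden from the model

    def answer(self, asked_slot: str) -> Optional[str]:
        # A "slot hit" occurs when the question targets the masked slot;
        # off-target questions yield no information.
        if asked_slot == self.masked_slot:
            return self.masked_value
        return None

def run_ask1(case: dict,
             oracle: SlotOracle,
             ask_fn: Callable[[dict], str],
             answer_fn: Callable[[dict], str]) -> dict:
    """One Ask1 episode: the model asks exactly one clarifying question
    (ask_fn picks a slot), receives the oracle's reply, then answers."""
    asked_slot = ask_fn(case)
    reply = oracle.answer(asked_slot)
    hit = reply is not None
    enriched = dict(case)
    if hit:
        enriched[oracle.masked_slot] = reply  # unmask the slot on a hit
    return {"slot_hit": hit, "final_answer": answer_fn(enriched)}

# Toy usage with stand-in model functions:
oracle = SlotOracle(masked_slot="allergy_history", masked_value="penicillin allergy")
case = {"age": "54", "chief_complaint": "sore throat"}
result = run_ask1(case, oracle,
                  ask_fn=lambda c: "allergy_history",
                  answer_fn=lambda c: f"Plan given allergies: {c.get('allergy_history', 'unknown')}")
# result["slot_hit"] is True; the final answer sees the unmasked value.
```

Under condition A the model would call `answer_fn` on the masked case directly, and under condition C on the fully unmasked case; the abstract's B-A and C-A gaps compare rubric scores across these three runs.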

Resources