Executable FinMR: Arelle-Based Symbolic Baselines and an Executability Audit for XBRL Mathematical Reasoning

FARS·2026-03-02·Run ID: FA0401

Abstract

The FinMR benchmark evaluates mathematical reasoning over XBRL financial filings, yet state-of-the-art large language models achieve less than 14% accuracy on this task. We hypothesize that FinMR primarily tests XBRL tooling capability rather than mathematical reasoning. To investigate this, we develop an Arelle-based symbolic baseline that reconstructs executable XBRL packages from benchmark queries and computes answers using standards-compliant XBRL semantics. Our approach achieves 42.17% accuracy on the full benchmark, outperforming the best published LLM (Fin-o1 at 13.86%) by 28.3 percentage points. On the executable subset, accuracy rises to 71.79% with zero structural errors. An executability audit further reveals that only 58.73% of FinMR instances are executable as packaged, and that 64% of failures are caused by missing external taxonomy dependencies. These findings demonstrate that symbolic execution dramatically outperforms neural approaches on FinMR and suggest that the benchmark's difficulty stems largely from incomplete XBRL artifacts rather than inherent reasoning complexity.
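The dominant failure mode identified by the audit, missing external taxonomy dependencies, can be illustrated with a minimal sketch. The snippet below flags `link:schemaRef` targets that point outside a package (http/https hrefs) and therefore must be resolvable at load time; the function name and the http(s)-href heuristic are illustrative assumptions, not the paper's actual audit implementation, which uses Arelle's standards-compliant loader.

```python
import xml.etree.ElementTree as ET

LINK_NS = "http://www.xbrl.org/2003/linkbase"
XLINK_NS = "http://www.w3.org/1999/xlink"

def external_schema_refs(instance_xml: str) -> list[str]:
    """Return schemaRef hrefs that reference external taxonomies
    (remote http/https URLs rather than files bundled in the package)."""
    root = ET.fromstring(instance_xml)
    hrefs = [
        ref.get(f"{{{XLINK_NS}}}href", "")
        for ref in root.findall(f"{{{LINK_NS}}}schemaRef")
    ]
    return [h for h in hrefs if h.startswith(("http://", "https://"))]

# Toy instance with one remote and one package-local schemaRef.
sample = """<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:link="http://www.xbrl.org/2003/linkbase"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:schemaRef xlink:type="simple"
      xlink:href="https://xbrl.sec.gov/dei/2023/dei-2023.xsd"/>
  <link:schemaRef xlink:type="simple" xlink:href="local-taxonomy.xsd"/>
</xbrl>"""

print(external_schema_refs(sample))
# → ['https://xbrl.sec.gov/dei/2023/dei-2023.xsd']
```

An instance whose remote dependencies cannot be fetched (or are absent from the package) fails to load, which is the non-executability pattern the audit attributes to 64% of failing FinMR instances.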

Resources