QuoteVerify: Inference-Time Quote-Backed Citation Verification for Deep Research Reports

Abstract

Deep research agents that synthesize long-form reports with citations are increasingly deployed, yet citation quality remains a problem: models frequently hallucinate references, fabricate quotes, or cite sources that do not support the statements attributed to them. We propose QuoteVerify, an inference-time pipeline that verifies citations against quote-backed evidence. The pipeline prompts the model to generate structured citation triples containing explicit evidence quotes, then applies multi-stage verification: source fetching, quote validity checking via substring matching, and NLI-based entailment gating. Experiments on ReportBench show statistically significant improvements over standard baselines: the cited-statement match rate rises by 18.7 percentage points on GPT-4o ($p=0.019$) and by 12.5 percentage points on Gemini-2.5-Pro ($p=0.011$). Analysis reveals that the structured citation format drives most of the gains, while quote validity remains the primary bottleneck: LLMs produce valid quotes only 18--28\% of the time, even for successfully fetched sources, indicating a tendency to paraphrase rather than quote verbatim.
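The verification stages named in the abstract (source fetching, substring-based quote validity, and entailment gating) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `CitationTriple` field names, the whitespace and case normalization applied before matching, the choice of the quote as the NLI premise, and the `fetch` and `entails` callables are all assumptions introduced for illustration; the abstract states only that quotes are checked by substring matching and gated by an NLI model.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CitationTriple:
    # Field names are hypothetical; the abstract only says that each
    # citation carries an explicit evidence quote.
    statement: str
    source_url: str
    quote: str


def normalize(text: str) -> str:
    # Assumed normalization: Unicode NFKC, collapsed whitespace, lowercase.
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_is_valid(quote: str, source_text: str) -> bool:
    # A quote counts as valid only if it appears in the source as a
    # (normalized) substring; an empty quote is never valid.
    q = normalize(quote)
    return bool(q) and q in normalize(source_text)


def verify(
    triple: CitationTriple,
    fetch: Callable[[str], Optional[str]],
    entails: Callable[[str, str], bool],
) -> str:
    # Stage 1: source fetching (fetch returns None on failure).
    source_text = fetch(triple.source_url)
    if source_text is None:
        return "fetch_failed"
    # Stage 2: quote validity via substring matching.
    if not quote_is_valid(triple.quote, source_text):
        return "invalid_quote"
    # Stage 3: entailment gate, with the quote as premise and the
    # statement as hypothesis (an assumption about the setup).
    if not entails(triple.quote, triple.statement):
        return "not_entailed"
    return "verified"
```

In use, `fetch` would wrap an HTTP client and `entails` an NLI classifier; both are left as injected callables so that the gating logic can be exercised with stubs. Under this check a paraphrased quote fails at stage 2 even when the source is fetched and supports the statement, which is the failure mode the abstract identifies as the main bottleneck.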
