Entailment-Checklist Scoring: An API-Free Alternative to LLM-Based Dense Video Caption Evaluation

FARS · 2026-03-02 · Run ID: FA0076

Abstract

Dense video captioning evaluation increasingly relies on LLM judges to assess keypoint coverage, which introduces reproducibility and cost barriers. Existing API-free metrics fail at this task: BERTScore correlates negatively with LLM judgments, while embedding-based methods invert system rankings despite high correlation. We propose Entailment-Checklist Scoring (ECS), which reformulates keypoint coverage as entailment verification in a two-stage retrieve-then-verify pipeline: ECS first retrieves candidate sentences by embedding similarity, then verifies entailment with an open NLI model. On OmniDCBench, ECS is the only API-free method that ranks systems correctly (Kendall's $\tau$ = +1.0), with 71.7% keypoint accuracy and 0.511 F1 against Gemini labels. The retrieval stage provides a 4.8$\times$ speedup with minimal accuracy loss, enabling efficient, reproducible evaluation without proprietary API access.
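The retrieve-then-verify idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: `toy_embed` and `toy_entails` are hypothetical stand-ins for a sentence-embedding model and an open NLI model, the `top_k` parameter is an assumed name for the retrieval cutoff, and aggregating into a "fraction of keypoints covered" score is an assumption about how per-keypoint decisions might be combined.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a sentence embedding: bag-of-words counts."""
    return Counter(text.lower().replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Hypothetical stand-in for an NLI model: every hypothesis token is in the premise."""
    return set(toy_embed(hypothesis)) <= set(toy_embed(premise))


def checklist_score(
    sentences: Sequence[str],
    keypoints: Sequence[str],
    top_k: int = 2,
    embed: Callable[[str], Counter] = toy_embed,
    entails: Callable[[str, str], bool] = toy_entails,
) -> float:
    """Fraction of keypoints entailed by at least one retrieved sentence."""
    if not keypoints:
        return 0.0
    sent_vecs = [embed(s) for s in sentences]
    covered = 0
    for kp in keypoints:
        kp_vec = embed(kp)
        # Stage 1 (retrieve): keep only the top_k most similar sentences.
        ranked = sorted(
            range(len(sentences)),
            key=lambda i: cosine(kp_vec, sent_vecs[i]),
            reverse=True,
        )
        candidates = [sentences[i] for i in ranked[:top_k]]
        # Stage 2 (verify): run the entailment check on the candidates only.
        if any(entails(c, kp) for c in candidates):
            covered += 1
    return covered / len(keypoints)


caption = [
    "A man opens the door.",
    "The dog barks loudly at the man.",
    "Rain falls outside.",
]
keypoints = ["the dog barks", "a man opens the door", "a car drives away"]
score = checklist_score(caption, keypoints, top_k=1)
print(f"{score:.3f}")  # 2 of 3 keypoints are covered
```

Because the verifier runs only on the `top_k` retrieved sentences rather than on every sentence-keypoint pair, the number of entailment calls drops from (sentences × keypoints) to (`top_k` × keypoints), which is the kind of saving the retrieval stage is meant to provide. In a real setup, the two stand-in functions would be replaced by calls to actual embedding and NLI models.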
