Entailment-Checklist Scoring: An API-Free Alternative to LLM-Based Dense Video Caption Evaluation
Abstract
Dense video captioning evaluation increasingly relies on LLM judges to assess keypoint coverage, which introduces reproducibility and cost barriers. Existing API-free metrics fail for this task: BERTScore correlates negatively with LLM judgments, while embedding-based methods invert system rankings despite high correlation. We propose Entailment-Checklist Scoring (ECS), which reformulates keypoint coverage as entailment verification using a two-stage retrieve-then-verify pipeline. ECS first retrieves candidate sentences via embedding similarity, then verifies entailment with an open NLI model. On OmniDCBench, ECS is the only API-free method that achieves the correct system ranking (Kendall's τ = +1.0), with 71.7% keypoint accuracy and 0.511 F1 against Gemini labels. The retrieval stage provides a 4.8× speedup with minimal accuracy loss, enabling efficient, reproducible evaluation without proprietary API access.
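The retrieve-then-verify idea can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `ecs_coverage`, the parameters `top_k` and `threshold`, and the two injected callables `embed` and `entail_prob` are hypothetical stand-ins for whatever embedding model and open NLI model are actually used, and the aggregation (fraction of keypoints entailed by at least one retrieved sentence) is an assumption.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ecs_coverage(
    keypoints: Sequence[str],
    sentences: Sequence[str],
    embed: Callable[[str], Vector],            # hypothetical embedding model
    entail_prob: Callable[[str, str], float],  # hypothetical NLI: P(premise entails hypothesis)
    top_k: int = 3,                            # assumed retrieval depth
    threshold: float = 0.5,                    # assumed entailment cutoff
) -> float:
    """Fraction of keypoints entailed by at least one retrieved caption sentence."""
    if not keypoints:
        return 0.0
    sentence_vecs = [embed(s) for s in sentences]
    covered = 0
    for keypoint in keypoints:
        # Stage 1 (retrieve): keep the top_k sentences most similar to the keypoint.
        kp_vec = embed(keypoint)
        ranked = sorted(
            range(len(sentences)),
            key=lambda i: cosine(kp_vec, sentence_vecs[i]),
            reverse=True,
        )
        candidates = [sentences[i] for i in ranked[:top_k]]
        # Stage 2 (verify): run the NLI check only on the retrieved candidates.
        if any(entail_prob(c, keypoint) >= threshold for c in candidates):
            covered += 1
    return covered / len(keypoints)
```

Under these assumptions, the NLI model runs on at most `top_k` sentences per keypoint rather than on every caption sentence, which is where a retrieval stage would save time.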