Typed-DSL Constrained Data Recipes for Higher Executability in DataChef

FARS·2026-03-02·Run ID: FA0040

Abstract

DataChef frames data recipe generation as reinforcement learning, but free-form Python generation suffers from extremely low executability: only 3–16% of sampled recipes produce valid training data. We analyze failure modes and find that 45–61% of failures are structural issues (syntax errors, format violations, invalid compositions, hallucinated datasets) that can be eliminated by constraining the output space. We propose typed-DSL constrained generation, in which the model outputs JSON conforming to a schema of typed operators, and that JSON is compiled to executable Python. This approach achieves a 5.8–29× improvement in executable rate (from 3–16% to 90.6%) and an 8.4–34.1× improvement in DVS_avg@32, while reducing total failures by 84.5%. Ablations confirm that the typed operator constraints, not the structured output format, are the key mechanism.
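To make the mechanism concrete, the following is a minimal sketch of how a JSON recipe of typed operators might be checked and compiled to a Python expression. It is not the DataChef implementation: the operator names (`load`, `filter_length`, `to_sft`), the type names, the dataset whitelist, and `compile_recipe` itself are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical operator table: name -> (input type, output type, Python template).
OPERATORS = {
    "load": (None, "Dataset", "load_dataset({name!r})"),
    "filter_length": ("Dataset", "Dataset",
                      "filter_length({prev}, max_tokens={max_tokens})"),
    "to_sft": ("Dataset", "SFTData", "to_sft({prev})"),
}

# Assumed whitelist; a closed set of names rules out hallucinated datasets.
KNOWN_DATASETS = {"gsm8k", "alpaca"}


def compile_recipe(recipe_json: str) -> str:
    """Type-check a JSON recipe and return the Python expression it compiles to."""
    # Malformed JSON raises json.JSONDecodeError (a ValueError) here.
    steps = json.loads(recipe_json)["steps"]
    expr, current_type = None, None
    for step in steps:
        op = step["op"]
        if op not in OPERATORS:
            raise ValueError(f"unknown operator: {op}")
        in_type, out_type, template = OPERATORS[op]
        # Each operator must consume exactly the type the previous one produced.
        if in_type != current_type:
            raise ValueError(f"{op} expects {in_type}, got {current_type}")
        args = step.get("args", {})
        if op == "load" and args.get("name") not in KNOWN_DATASETS:
            raise ValueError(f"unknown dataset: {args.get('name')}")
        expr = template.format(prev=expr, **args)
        current_type = out_type
    # A recipe is only valid if it ends in training data.
    if current_type != "SFTData":
        raise ValueError("recipe must end in SFTData")
    return expr


recipe = json.dumps({"steps": [
    {"op": "load", "args": {"name": "gsm8k"}},
    {"op": "filter_length", "args": {"max_tokens": 512}},
    {"op": "to_sft"},
]})
print(compile_recipe(recipe))
```

In this sketch, unknown operators, ill-typed compositions, and unlisted datasets are rejected before any Python is emitted, which corresponds to the structural failure classes named above.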

Resources