Deflated-RankICIR: Multiple-Testing-Aware Factor Selection for LLM-Driven Alpha Mining
Abstract
LLM-driven alpha mining systems generate large pools of candidate factors, but selecting factors on uncorrected validation metrics exposes practitioners to multiple-testing bias: the tendency to over-select factors that performed well by chance. We adapt the Deflated Sharpe Ratio (DSR) framework to factor-level RankIC time series, yielding Deflated-RankICIR, a multiple-testing-aware ranking criterion. A key technical contribution is the use of the stationary bootstrap to estimate per-factor standard errors, which differentiates factor ranks meaningfully, whereas analytical formulas produce near-constant estimates. On CSI300 with 70 LLM-mined factors, Deflated-RankICIR achieves the highest Information Ratio (1.717) and the best Calmar Ratio (1.456) among all selection methods, outperforming the RankICIR baseline by 3.3%. Ablation studies confirm that the interpolation formula for effective trials provides the best-performing correction strength, while both over-correction and no correction degrade performance.
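As an illustration of the per-factor standard-error step, the following is a minimal sketch of a stationary bootstrap (geometric block lengths with circular wrap-around) applied to a single factor's RankIC series. It is not the paper's implementation: the function names, the mean block length of 10, the 500 resamples, and the definition of ICIR as mean over sample standard deviation are all assumptions made for the example.

```python
import numpy as np

def stationary_bootstrap_indices(n, mean_block, rng):
    """Resampled time indices: blocks of geometric length, wrapping circularly."""
    p = 1.0 / mean_block                      # probability of starting a new block
    restart = rng.random(n) < p
    fresh = rng.integers(n, size=n)           # candidate start points for new blocks
    idx = np.empty(n, dtype=np.int64)
    idx[0] = fresh[0]
    for t in range(1, n):
        # Either jump to a fresh random position or continue the current block.
        idx[t] = fresh[t] if restart[t] else (idx[t - 1] + 1) % n
    return idx

def icir(rank_ic):
    """Assumed definition: mean RankIC divided by its sample standard deviation."""
    sd = rank_ic.std(ddof=1)
    return rank_ic.mean() / sd if sd > 0 else 0.0

def bootstrap_icir_se(rank_ic, n_boot=500, mean_block=10, seed=0):
    """Standard error of ICIR as the std. dev. of its bootstrap replicates."""
    rng = np.random.default_rng(seed)
    x = np.asarray(rank_ic, dtype=float)
    reps = np.array([
        icir(x[stationary_bootstrap_indices(len(x), mean_block, rng)])
        for _ in range(n_boot)
    ])
    return reps.std(ddof=1)

if __name__ == "__main__":
    # Synthetic RankIC series for demonstration only (not data from the paper).
    demo = np.random.default_rng(42).normal(0.02, 0.05, size=250)
    print("ICIR:", icir(demo), "bootstrap SE:", bootstrap_icir_se(demo))
```

Because each factor's replicates reflect its own autocorrelation and tail behaviour, the resulting standard errors can differ across factors, which is the property the abstract contrasts with analytical formulas.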