Learning manipulation skills from human demonstration videos offers a promising path toward generalizable
and interpretable robotic intelligence—particularly through the lens of actionable affordances. However,
transferring such knowledge remains challenging for two reasons: (1) the lack of large-scale datasets with precise
affordance annotations, and (2) insufficient exploration of affordances in diverse manipulation contexts.
To address these gaps, we introduce HOVA-500K, a large-scale, affordance-annotated dataset comprising
500,000 images across 1,726 object categories and 675 actions. We also release a standardized benchmarking
suite for multi-modal affordance reasoning. Built upon HOVA-500K, we present GLOVER++, a global-to-local
affordance training framework that effectively transfers actionable affordance knowledge from human
demonstrations to downstream open-vocabulary reasoning tasks. GLOVER++ achieves state-of-the-art results
on the HOVA-500K benchmark and demonstrates strong generalization across diverse downstream robotic
manipulation tasks. By explicitly modeling actionable affordances, GLOVER++ facilitates robust transfer
across scenes, modalities, and tasks. We hope that HOVA-500K and the GLOVER++ framework will serve as
valuable resources for bridging the gap between human demonstrations and robotic manipulation
capabilities.