TY - JOUR
T1 - Text-Guided Diverse Scene Interaction Synthesis by Disentangling Actions From Scenes
AU - Teshima, Hitoshi
AU - Wake, Naoki
AU - Thomas, Diego
AU - Nakashima, Yuta
AU - Kawasaki, Hiroshi
AU - Ikeuchi, Katsushi
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2025
Y1 - 2025
N2 - Generating human motion within 3D scenes from textual descriptions remains challenging because of the scarcity of hybrid datasets that encompass text, 3D scenes, and motion. Existing approaches suffer from fundamental limitations: they depend on such integrated datasets and rely on end-to-end methods, which constrain the diversity and realism of the generated human-scene interactions. In this paper, we propose a novel method for generating motions of humans interacting with objects in a 3D scene given a textual prompt. Our key innovation is to decompose the motion generation task into distinct steps: 1) generating key poses from textual and scene contexts, and 2) synthesizing full motion trajectories guided by these key poses and path planning. This approach eliminates the need for hybrid datasets by leveraging independent text-motion and pose datasets, significantly expanding action diversity and overcoming the constraints of prior work. Unlike previous methods, which focus on limited action types or rely on scarce datasets, our approach enables scalable and adaptable motion generation. Through extensive experiments, we demonstrate that our framework achieves unparalleled diversity and contextually accurate motions, advancing the state of the art in human-scene interaction synthesis.
AB - Generating human motion within 3D scenes from textual descriptions remains challenging because of the scarcity of hybrid datasets that encompass text, 3D scenes, and motion. Existing approaches suffer from fundamental limitations: they depend on such integrated datasets and rely on end-to-end methods, which constrain the diversity and realism of the generated human-scene interactions. In this paper, we propose a novel method for generating motions of humans interacting with objects in a 3D scene given a textual prompt. Our key innovation is to decompose the motion generation task into distinct steps: 1) generating key poses from textual and scene contexts, and 2) synthesizing full motion trajectories guided by these key poses and path planning. This approach eliminates the need for hybrid datasets by leveraging independent text-motion and pose datasets, significantly expanding action diversity and overcoming the constraints of prior work. Unlike previous methods, which focus on limited action types or rely on scarce datasets, our approach enables scalable and adaptable motion generation. Through extensive experiments, we demonstrate that our framework achieves unparalleled diversity and contextually accurate motions, advancing the state of the art in human-scene interaction synthesis.
KW - 3D scene understanding
KW - Text-to-motion generation
KW - affordance-based interaction
KW - human-object interaction
KW - motion diffusion models
UR - http://www.scopus.com/inward/record.url?scp=105002865843&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105002865843&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2025.3562086
DO - 10.1109/ACCESS.2025.3562086
M3 - Article
AN - SCOPUS:105002865843
SN - 2169-3536
VL - 13
SP - 73818
EP - 73830
JO - IEEE Access
JF - IEEE Access
ER -