Exploring Disentangled and Controllable Human Image Synthesis:
From End-to-End to Stage-by-Stage

¹FNii, CUHKSZ     ²SSE, CUHKSZ
*Indicates Corresponding Author

Abstract

Fine-grained controllability in human image synthesis remains challenging, especially when jointly controlling viewpoint, pose, clothing, and identity. We formulate a disentangled human synthesis task for these factors and first study an end-to-end model on MVHumanNet, which struggles to generalize to in-the-wild data due to domain and data-format gaps. To better leverage both MVHumanNet and VTON data, we propose a stage-by-stage pipeline with three steps: clothed A-pose generation, back-view synthesis, and pose/view control. Experiments show this design improves both visual quality and disentanglement, with stronger generalization in real-world scenarios.

Disentanglement and control of faces, clothes, shoes, views, and poses.

We introduce a new task that explicitly disentangles key human attributes within a unified framework. This enables fine-grained and controllable human synthesis.

More challenging poses.

Method

Overview of the proposed pipelines. (a) The end-to-end pipeline directly synthesizes the final image from disentangled inputs, including a face image, clothing images, and a pose map. (b) The stage-by-stage pipeline decomposes the process into three steps: front-view synthesis with identity and clothing control, back-view synthesis, and free-view synthesis under the target pose and viewpoint. Both pipelines are implemented with DiscoHuman, with details provided below.
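
To make the three-stage decomposition concrete, the following is a minimal sketch of how the stages compose, assuming each stage is a callable wrapping a DiscoHuman model. The function and argument names are hypothetical; only the stage ordering follows the pipeline above.

def run_stage_by_stage(face_img, clothing_imgs, pose_map, viewpoint,
                       stage1, stage2, stage3):
    """Compose the three stages into a single synthesis call."""
    # Stage 1: clothed A-pose (front-view) generation, controlled by the
    # identity (face image) and the clothing images.
    front_view = stage1(face=face_img, clothing=clothing_imgs)

    # Stage 2: back-view synthesis conditioned on the generated front view.
    back_view = stage2(front_view=front_view)

    # Stage 3: free-view synthesis under the target pose and viewpoint,
    # conditioned on both generated views.
    return stage3(front_view=front_view, back_view=back_view,
                  pose_map=pose_map, viewpoint=viewpoint)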

The DiscoHuman model \( \varepsilon \) consists of a VisualDiT \( \varepsilon_V \) and a HumanDiT \( \varepsilon_H \). The VisualDiT encodes the visual conditions, with different input settings depending on the pipeline or stage in which DiscoHuman is applied. The upper-left blocks illustrate the three possible input configurations. In this figure, the active configuration corresponds to Stage 3, while the inactive settings are indicated by grey dashed lines. For simplicity, the denoising timestep \( t \) is not shown.
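
For illustration, here is a minimal PyTorch sketch of the dual-DiT layout, assuming the VisualDiT encodes condition tokens that the HumanDiT attends to while denoising latent tokens. The layer counts, dimensions, and conditioning mechanism (token concatenation) are assumptions made for the sketch, not the paper's implementation; the timestep embedding is omitted, as in the figure.

import torch
import torch.nn as nn

class DiscoHumanSketch(nn.Module):
    def __init__(self, dim=512, heads=8, depth=4):
        super().__init__()
        # VisualDiT: encodes the visual conditions (face / clothing /
        # reference views, depending on the stage).
        self.visual_dit = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       batch_first=True),
            num_layers=depth)
        # HumanDiT: denoises the latent tokens given the condition features.
        self.human_dit = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       batch_first=True),
            num_layers=depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_latents, cond_tokens):
        # Encode the visual conditions with the VisualDiT.
        cond_feats = self.visual_dit(cond_tokens)
        # Let the HumanDiT attend jointly over latent and condition tokens
        # (one simple conditioning choice among several possible ones).
        x = torch.cat([noisy_latents, cond_feats], dim=1)
        x = self.human_dit(x)
        # Keep only the denoised latent tokens.
        return self.out(x[:, : noisy_latents.shape[1]])

model = DiscoHumanSketch()
latents = torch.randn(1, 64, 512)   # noisy latent tokens
conds = torch.randn(1, 32, 512)     # tokens from the active input config
denoised = model(latents, conds)    # -> shape (1, 64, 512)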

Comparison

We compare our method against AnimateAnyone (AA), MagicAnimate (MA), and CFLD. Our method achieves the best performance across multiple datasets.
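
As a reference point, the snippet below shows how paired comparisons of this kind are commonly scored, assuming SSIM and LPIPS as example metrics (via torchmetrics); the paper's exact metric suite is not listed here.

import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

def evaluate_pairs(generated, reference):
    """generated, reference: float tensors in [0, 1], shape (N, 3, H, W)."""
    ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
    lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg",
                                                  normalize=True)
    return {
        "SSIM": ssim(generated, reference).item(),    # higher is better
        "LPIPS": lpips(generated, reference).item(),  # lower is better
    }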

BibTeX


@article{sun2025exploring,
  title={Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage},
  author={Sun, Zhengwentai and Li, Heyuan and Yang, Xihe and Zheng, Keru and Ning, Shuliang and Zhi, Yihao and Liao, Hongjie and Li, Chenghong and Cui, Shuguang and Han, Xiaoguang},
  journal={arXiv preprint arXiv:2503.19486},
  year={2025}
}