SAT-HMR: Real-Time Multi-Person 3D Mesh Estimation via Scale-Adaptive Tokens

Chi Su¹     Xiaoxuan Ma¹     Jiajun Su²     Yizhou Wang¹
¹Peking University     ²International Digital Economy Academy (IDEA)

TL;DR: We propose scale-adaptive tokens to efficiently encode image features, which makes our SAT-HMR the best real-time model for multi-person 3D mesh estimation.

Abstract

We propose a one-stage framework for real-time multi-person 3D human mesh estimation from a single RGB image. Current one-stage methods, which follow a DETR-style pipeline, achieve state-of-the-art (SOTA) performance with high-resolution inputs. We observe that the higher resolution particularly benefits the estimation of individuals who appear at smaller scales in the image (e.g., those far from the camera), but at the cost of significantly increased computational overhead. To address this, we introduce scale-adaptive tokens within the DETR framework that are dynamically adjusted based on the relative scale of each individual in the image. Specifically, individuals at smaller scales are processed at higher resolutions, larger ones at lower resolutions, and background regions are further distilled. These scale-adaptive tokens encode the image features more efficiently, facilitating subsequent decoding to regress the human mesh, while allowing the model to allocate computational resources more effectively and focus on more challenging cases. Experiments show that our method preserves the accuracy benefits of high-resolution processing while substantially reducing computational cost, achieving real-time inference with performance comparable to SOTA methods.

Video

Pipeline

We begin with a low-resolution image and encode it into low-resolution tokens with a shallow transformer encoder. A scale head network then predicts a patch-level scale map from these tokens, classifying each token into one of three categories: background, small-scale, or large-scale. Small-scale tokens are replaced by their corresponding high-resolution tokens, while background tokens are distilled by spatially pooling every four neighboring tokens. The remaining low-resolution tokens, which represent large-scale instances, are left unchanged. The three groups are then combined into a single set of scale-adaptive tokens. Compared to uniform tokens, this approach allocates feature detail more efficiently, preserving different levels of detail for different individuals and regions. The scale-adaptive tokens are then processed by another encoder and decoded in subsequent stages.
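The token assembly step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions that the text does not state: the label ids (`BACKGROUND`, `LARGE`, `SMALL`) and the function name `build_scale_adaptive_tokens` are hypothetical, the high-resolution grid is taken to be exactly twice the low-resolution grid in each dimension, "pooling every four neighboring tokens" is read as mean pooling over 2x2 blocks, and background tokens in a block that is not entirely background are simply kept as they are.

```python
import numpy as np

# Hypothetical label ids for the patch-level scale map (not from the paper).
BACKGROUND, LARGE, SMALL = 0, 1, 2


def build_scale_adaptive_tokens(low_tokens, high_tokens, scale_map):
    """Assemble a variable-length token set from two uniform token grids.

    low_tokens:  (H, W, C) tokens from the low-resolution image.
    high_tokens: (2H, 2W, C) tokens from the high-resolution image (assumed 2x).
    scale_map:   (H, W) integer labels, one per low-resolution token.
    Returns an (N, C) array of scale-adaptive tokens.
    """
    H, W, C = low_tokens.shape
    assert H % 2 == 0 and W % 2 == 0, "grid must split into 2x2 blocks"
    assert high_tokens.shape == (2 * H, 2 * W, C)
    assert scale_map.shape == (H, W)

    out = []
    # Walk over the low-resolution grid in 2x2 blocks.
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            if np.all(scale_map[i:i + 2, j:j + 2] == BACKGROUND):
                # Pure background block: pool four tokens into one.
                block = low_tokens[i:i + 2, j:j + 2].reshape(4, C)
                out.append(block.mean(axis=0))
                continue
            for y in range(i, i + 2):
                for x in range(j, j + 2):
                    if scale_map[y, x] == SMALL:
                        # Small-scale patch: use its four high-resolution tokens.
                        fine = high_tokens[2 * y:2 * y + 2, 2 * x:2 * x + 2]
                        out.extend(fine.reshape(4, C))
                    else:
                        # Large-scale (or isolated background) patch: keep as is.
                        out.append(low_tokens[y, x])
    return np.stack(out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    low = rng.normal(size=(4, 4, 8))
    high = rng.normal(size=(8, 8, 8))
    labels = np.array([[0, 0, 2, 1],
                       [0, 0, 1, 1],
                       [1, 1, 0, 0],
                       [1, 1, 0, 0]])
    print(build_scale_adaptive_tokens(low, high, labels).shape)
```

In this toy 4x4 example the result has 13 tokens, compared with 16 for the uniform low-resolution grid and 64 for the uniform high-resolution grid, which shows how detail is spent only on the small-scale patch. A real model would also need positional information for each token, which this sketch omits.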

Results on Internet Images

BibTeX

@article{su2024sathmr,
  title={SAT-HMR: Real-Time Multi-Person 3D Mesh Estimation via Scale-Adaptive Tokens},
  author={Su, Chi and Ma, Xiaoxuan and Su, Jiajun and Wang, Yizhou},
  journal={arXiv preprint arXiv:2411.19824},
  year={2024}
}