A powerful and flexible framework for 3D human pose estimation that leverages the strengths of LSTM and transformer-based networks, inspired by state-of-the-art research in the field.
A significant portion of the code related to data processing and visualization is derived from the following outstanding projects:
Big shoutout to the contributors of these projects for their exceptional work!
This project has been developed and tested with the following environment:
- Python: 3.9
- PyTorch: 1.13.0
- CUDA: 11.7
To set up your environment, follow these steps:
conda create -n 3dposenet python=3.9
conda activate 3dposenet
pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
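After installation, you can optionally verify that the expected PyTorch build and CUDA runtime are active. This is just a minimal sanity check, not something the project requires:

```python
import torch

# Confirm the installed build matches the tested environment.
print(torch.__version__)          # expected: 1.13.0+cu117
print(torch.cuda.is_available())  # should be True on a machine with CUDA 11.7
```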
Please refer to VideoPose3D for instructions on setting up the Human3.6M dataset, then place the files as follows:
code_root/
└── data/
    ├── data_2d_h36m_gt.npz
    ├── data_2d_h36m_cpn_ft_h36m_dbb.npz
    └── data_3d_h36m.npz
You can train our model on a single GPU with the following command:
python train.py
The training script includes several configurable parameters, allowing you to experiment with different setups. The current configuration is as follows:
batch_size = 512
num_input_frames = 81
num_epoch = 15
lr = 0.0001
model_pos = LSTM_PoseNet(num_joints, num_frames=receptive_field, input_dim=2, output_dim=3)
checkpoint = 'checkpoint'
The model processes 81 frames at a time, dividing each video into overlapping 81-frame windows, but feel free to experiment with this parameter.
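The following is a minimal sketch of how such overlapping windows can be built; the `sliding_windows` helper is hypothetical and not taken from `train.py`:

```python
import numpy as np

def sliding_windows(keypoints_2d, num_input_frames=81):
    """keypoints_2d: (num_frames, num_joints, 2) array of 2D detections."""
    pad = num_input_frames // 2
    # Replicate the first/last frame so border frames still get a full window.
    padded = np.concatenate([
        np.repeat(keypoints_2d[:1], pad, axis=0),
        keypoints_2d,
        np.repeat(keypoints_2d[-1:], pad, axis=0),
    ], axis=0)
    # One window per original frame, each centered on that frame.
    return np.stack([padded[i:i + num_input_frames]
                     for i in range(len(keypoints_2d))])

# Example: 300 frames, 17 Human3.6M joints -> (300, 81, 17, 2)
windows = sliding_windows(np.zeros((300, 17, 2)))
print(windows.shape)
```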
First, download the pretrained weights for YOLOv3 (here) and HRNet (here) and put them in the ./demo/lib/checkpoint
directory. Then place your in-the-wild videos in the ./demo/video
directory.
Set the correct checkpoint path (from your trained model) in vis.py,
then run the command below:
python demo/vis.py --video sample_video.mp4
Our models achieved the following performance on the Human3.6M benchmark, measured by Mean Per Joint Position Error (MPJPE):
- LSTM_PoseNet: 55 mm
- Transformer-based models: 64 mm
The current state-of-the-art (SOTA) performance on this benchmark is around 30 mm, as reported in this paper (here).
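For reference, MPJPE is simply the mean Euclidean distance between predicted and ground-truth 3D joint positions, reported in millimetres. The snippet below is a minimal sketch of the metric, not the evaluation code from this repository:

```python
import numpy as np

def mpjpe(predicted, target):
    """predicted, target: (num_frames, num_joints, 3) arrays in mm."""
    # Per-joint Euclidean error, averaged over all joints and frames.
    return np.mean(np.linalg.norm(predicted - target, axis=-1))

# Example with random data: 100 frames, 17 Human3.6M joints.
pred = np.random.randn(100, 17, 3)
gt = np.random.randn(100, 17, 3)
print(f"MPJPE: {mpjpe(pred, gt):.2f} mm")
```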