GPU resources and time needed for training DiT

Hello, thanks for the great work. The paper trains a better T2I model to demonstrate the quality of the captions. I would like to know how many GPU resources and how much time were needed to train DiT-B/4 on Recap-DataComp-1B. How did you train the model: did you fine-tune a pre-trained DiT model or train it from scratch?
Also, is there a cheaper way to measure the quality of the captions? Training a CLIP or DiT model is unaffordable for most people.