How to use semantic segmentation tokenizer for precomputing tokens for this modality? #24

HITESH2002-JAIN commented Aug 12, 2024

I am working on precomputing tokens for each modality in my 4M training pipeline. I’m using grayscale semantic segmentation masks as input, but I’m encountering an issue where the regenerated output does not match the original mask.

[Screenshot (2024-08-12): original segmentation mask vs. regenerated output]

This is the code I am using to precompute the tokens:

import torch
from PIL import Image
from torchvision import transforms

from fourm.vq.vqvae import VQVAE

# Resize to the tokenizer's training resolution and convert the masks to tensors
transform = transforms.ToTensor()
resize = transforms.Resize((224, 224))

tok = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_semseg_4k_224-448').cuda()

tensors_b3hw = []
for image_path in selected_images:
    image = Image.open(image_path)
    rgb_b3hw = transform(resize(image)).unsqueeze(0)
    tensors_b3hw.append(rgb_b3hw)

# Stack into a batch and cast to int (intended to be the segmentation class indices)
stacked_tensors_b3hw = torch.cat(tensors_b3hw, dim=0).int()
# Drop the channel dimension so the tensor is B x H x W
squeezed_tensor = torch.squeeze(stacked_tensors_b3hw, dim=1)
print(squeezed_tensor.shape)

# Encode to discrete tokens, then decode them back as a sanity check
_, _, tokens = tok.encode(squeezed_tensor.cuda())
image_size = rgb_b3hw.shape[-1]
output_rgb_b3hw = tok.decode_tokens(tokens, image_size=image_size)
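
For the actual precomputation step, I then save the token indices to disk so I can load them later in my 4M training pipeline. The directory layout and file naming below are just my own convention, not anything from the 4M repo:

import os
import numpy as np

# My own (ad hoc) output location for the precomputed semseg tokens
token_dir = 'precomputed_tokens/semseg'
os.makedirs(token_dir, exist_ok=True)

# One token sequence per input image, saved under the image's base name
tokens_np = tokens.cpu().numpy()
for image_path, token_seq in zip(selected_images, tokens_np):
    name = os.path.splitext(os.path.basename(image_path))[0]
    np.save(os.path.join(token_dir, name + '.npy'), token_seq)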

As a sanity check, I decode the tokens back. The resulting output_rgb_b3hw tensor has 134 channels, which does not match the original mask I passed to the tokenizer: I expected the regenerated output to have the same number of channels (and the same values) as the input mask.
Am I missing something in the preprocessing or tokenization steps? Is there a step I need to adjust so that the regenerated output matches the original mask?
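
In case it matters, my current guess is that the 134 channels are per-class logits, so I have been recovering a class map with an argmax over the channel dimension before comparing against the original mask. This is only my assumption about how the decoder output should be interpreted, so please correct me if that is not the intended usage:

# Assuming output_rgb_b3hw holds per-class logits of shape (B, 134, H, W),
# take the argmax over the class dimension to get a (B, H, W) class map
pred_classes = output_rgb_b3hw.argmax(dim=1)

# Rough comparison against the integer masks that went into the encoder
matches = (pred_classes.cpu() == squeezed_tensor.cpu())
print('pixel agreement:', matches.float().mean().item())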

Any guidance or suggestions would be appreciated. Thank you!

@garjania Please help me out. Am I doing something wrong here?
