Hi, thanks for your excellent work!
I have a minor clarification question regarding the Cross-View Reconstruction setting.
Given a scene with multiple captured views, when one view is masked during training and the model is required to reconstruct it, how does the model identify which specific viewpoint should be reconstructed?
In particular, does the model rely on any additional information (e.g., camera pose, intrinsic parameters, view indices, or positional embeddings) to disambiguate the target view, or is this implicitly inferred from the input representation?
Thanks.