Hi, I'm curious about the choice of subspace sizes mentioned in the paper, which are set to 64 (2^6) and 4096 (2^12). What was the reasoning behind this specific configuration? Why not use two subspaces of the same size, such as both being 512 (2^9)?
Thank you for your insights!
Hi @Lqf-HFNJU. With two subspaces of different sizes (first 2^6, then 2^12), we can first make a coarse classification that narrows down the search space, and then make a precise classification within that reduced space. Moreover, although both configurations cover the same combined space (2^6 × 2^12 = 2^9 × 2^9 = 2^18), the asymmetric token factorization introduces more learnable embeddings (64 + 4096 vs. 512 + 512), which increases the model capacity.
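To make the coarse-then-fine idea concrete, here is a minimal sketch (not the authors' actual implementation) of how a token id in a 2^18 vocabulary can be factorized into a 64-way coarse index and a 4096-way fine index, with the fine classification conditioned on the coarse one. All class names, hidden sizes, and the conditioning scheme are illustrative assumptions:

```python
# Minimal sketch of asymmetric token factorization (illustrative, not the paper's code):
# a token id in a 2^18 vocabulary is split into a coarse index (2^6 = 64)
# and a fine index (2^12 = 4096).
import torch
import torch.nn as nn

COARSE, FINE = 64, 4096  # asymmetric subspace sizes (2^6 and 2^12)

class FactorizedTokenHead(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # Separate learnable embeddings per subspace: 64 + 4096 = 4160 vectors,
        # versus 512 + 512 = 1024 for a symmetric 2^9 / 2^9 split.
        self.coarse_emb = nn.Embedding(COARSE, hidden_dim)
        self.fine_emb = nn.Embedding(FINE, hidden_dim)
        self.coarse_head = nn.Linear(hidden_dim, COARSE)
        # The fine prediction is conditioned on the coarse choice, so the
        # precise classification happens inside the narrowed-down space.
        self.fine_head = nn.Linear(2 * hidden_dim, FINE)

    def forward(self, h: torch.Tensor):
        # h: (batch, hidden_dim) hidden state at the position to predict
        coarse_logits = self.coarse_head(h)                       # (B, 64) coarse classification
        coarse_idx = coarse_logits.argmax(dim=-1)                 # (B,)
        cond = torch.cat([h, self.coarse_emb(coarse_idx)], dim=-1)
        fine_logits = self.fine_head(cond)                        # (B, 4096) precise classification
        return coarse_logits, fine_logits

# Usage: recover the full token id from the two factor predictions.
model = FactorizedTokenHead()
h = torch.randn(2, 768)
c_logits, f_logits = model(h)
token_id = c_logits.argmax(-1) * FINE + f_logits.argmax(-1)      # id in [0, 2^18)
```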