My code implementation of the I-JEPA model from the paper "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture".
-
Implemented a Vision Transformer from scratch, which is used as the backbone in all three blocks of the model: the Target Encoder, the Context Encoder, and the Predictor. A minimal sketch of the backbone follows below.
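A minimal, self-contained PyTorch sketch of such a backbone. The layer sizes (ViT-B/16 defaults: 768-dim, 12 blocks, 12 heads) and the learned positional embeddings are illustrative assumptions, not necessarily this repo's exact configuration; the key point is that there is no [CLS] token, so the output is one embedding per patch.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly embed them."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # (B, C, H, W) -> (B, N, D)
        return self.proj(x).flatten(2).transpose(1, 2)


class Block(nn.Module):
    """Pre-norm transformer block: multi-head self-attention + MLP."""

    def __init__(self, dim, num_heads, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


class VisionTransformer(nn.Module):
    """ViT backbone without a [CLS] token: I-JEPA operates on patch tokens only."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.patch_embed = PatchEmbed(img_size, patch_size, in_chans, embed_dim)
        # Positional embeddings cover patch tokens only -- no [CLS] slot.
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.patch_embed.num_patches, embed_dim))
        self.blocks = nn.ModuleList(
            [Block(embed_dim, num_heads) for _ in range(depth)])
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.patch_embed(x) + self.pos_embed
        for blk in self.blocks:
            x = blk(x)
        return self.norm(x)  # (B, N, D): one embedding per patch
```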
-
The paper mentions that no [CLS] token is used in any of the blocks. For the Predictor block, the paper also specifies keeping the number of self-attention heads equal to that of the backbone context encoder, while the predictor's depth is varied independently.
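A sketch of how that constraint might look in code, reusing the `Block` class from the backbone sketch above. The depth of 6 is a placeholder assumption, since the paper tunes predictor depth separately from the backbone:

```python
import torch.nn as nn  # Block is reused from the backbone sketch above


class Predictor(nn.Module):
    """Predictor ViT: heads match the backbone, depth is its own hyperparameter."""

    def __init__(self, embed_dim=768, num_heads=12, depth=6):
        # num_heads mirrors the context encoder; depth=6 is an assumed value.
        super().__init__()
        self.blocks = nn.ModuleList(
            [Block(embed_dim, num_heads) for _ in range(depth)])
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return self.norm(x)
```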
-
NOTE: Currently working on preparing the image input (masking) for the context encoder and on combining the two encoders with the Predictor model.
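For reference, a rough, hypothetical sketch of how the three blocks could eventually be wired together. The masking here is simplified to a random index set (the paper samples block-wise target masks and feeds the context encoder only the visible context patches), and in training the target encoder would be updated as an EMA of the context encoder rather than by gradients:

```python
import copy
import torch
import torch.nn.functional as F

# Reuses VisionTransformer and Predictor from the sketches above.
context_encoder = VisionTransformer()
target_encoder = copy.deepcopy(context_encoder)  # updated via EMA, not gradients
for p in target_encoder.parameters():
    p.requires_grad = False
predictor = Predictor(embed_dim=768, num_heads=12, depth=6)

images = torch.randn(2, 3, 224, 224)
num_patches = context_encoder.patch_embed.num_patches
target_idx = torch.randperm(num_patches)[:49]  # stand-in for block-wise targets

with torch.no_grad():
    # Target embeddings come from the EMA encoder at the masked positions.
    targets = target_encoder(images)[:, target_idx]  # (B, M, D)

# Simplified: full image as context; the paper masks out the target blocks.
context = context_encoder(images)
predictions = predictor(context)[:, target_idx]  # predictions at target positions

loss = F.mse_loss(predictions, targets)  # stand-in for the paper's average L2 loss
```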