
Commit 884b478

Author: Kye
Commit message: [CLEANUP]
1 parent: b25b62e

File tree

3 files changed: +32 -48 lines changed


README.md

+26 -45
@@ -23,7 +23,7 @@
 ---


-Robotic Transformer 2 (RT-2) leverages both web and robotics data to generate actionable instructions for robotic control.
+This is my implementation of the model behind RT-2. RT-2 leverages PaLM-E as the backbone, with a vision encoder and a language backbone: images are embedded and concatenated into the same space as the language embeddings. This architecture is easy to build, but it suffers from a lack of deep understanding of both the unified multimodal representation and the individual modality representations.

 [CLICK HERE FOR THE PAPER](https://robotics-transformer2.github.io/assets/rt2.pdf)

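To make the fusion step in the new intro concrete, here is a minimal sketch of projecting image features into the language embedding space and concatenating the two token streams. All names, shapes, and the projection layer are illustrative assumptions, not the repository's actual API.

```python
import torch
from torch import nn

# Illustrative shapes: a vision encoder yields image tokens, the language
# model yields text token embeddings, both with the same hidden size.
batch, num_img_tokens, num_txt_tokens, dim = 1, 256, 1024, 512

image_feats = torch.randn(batch, num_img_tokens, dim)  # stand-in for vision encoder output
text_embeds = torch.randn(batch, num_txt_tokens, dim)  # stand-in for token embeddings

# Project image features into the language embedding space, then concatenate
# so the language backbone attends over one unified sequence.
project = nn.Linear(dim, dim)
fused = torch.cat([project(image_feats), text_embeds], dim=1)
print(fused.shape)  # torch.Size([1, 1280, 512])
```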
@@ -35,13 +35,6 @@ RT-2 can be easily installed using pip:
 ```bash
 pip install rt2
 ```
-
-Additionally, you can manually install the dependencies:
-
-```bash
-pip install -r requirements.txt
-```
-

 # Usage

@@ -53,26 +46,21 @@ First, you need to initialize the `RT2` class. You can do this by providing the
 ```python
-import torch
+import torch
 from rt2.model import RT2

-model = RT2()
-
-video = torch.randn(2, 3, 6, 224, 224)
+# img: (batch_size, 3, 256, 256)
+# caption: (batch_size, 1024)
+img = torch.randn(1, 3, 256, 256)
+caption = torch.randint(0, 20000, (1, 1024))

-instructions = [
-    'bring me that apple sitting on the table',
-    'please pass the butter'
-]
-
-# compute the train logits
-train_logits = model.train(video, instructions)
+# model: RT2
+model = RT2()

-# set the model to evaluation mode
-model.model.eval()
+# Run the model on img and caption
+output = model(img, caption)
+print(output)  # logits of shape (1, 1024, 20000)

-# compute the eval logits with a conditional scale of 3
-eval_logits = model.eval(video, instructions, cond_scale=3.)


 ```

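The updated usage snippet stops at printing logits of shape (batch, seq, vocab). As a hedged follow-on, this is one standard way such logits could feed a next-token cross-entropy loss; the stand-in tensors and the flattening step are assumptions for illustration, not part of the documented RT2 API.

```python
import torch
import torch.nn.functional as F

# Stand-ins for the snippet's `model(img, caption)` output and its caption
# targets; requires_grad=True mimics logits coming from a trainable model.
logits = torch.randn(1, 1024, 20000, requires_grad=True)
targets = torch.randint(0, 20000, (1, 1024))

# F.cross_entropy expects (N, num_classes), so flatten batch and sequence dims.
loss = F.cross_entropy(logits.view(-1, 20000), targets.view(-1))
loss.backward()  # in real training, follow with an optimizer step
print(loss.item())
```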
@@ -101,24 +89,19 @@ RT-2 is fine-tuned using both web and robotics data. The resultant model interpr
 | Language-Table | Used for training on several prediction tasks. | Lynch et al. (2022) | N/A | N/A |


+## Datasets
+Datasets used in the paper

-# Appreciation

-* Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski,
-* Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu,
-* Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog,
-* Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang,
-* Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch,
-* Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi,
-* Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong,
-* Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu,
-* and Brianna Zitkovich
+| Dataset | Description | Source | Percentage in Training Mixture (RT-2-PaLI-X) | Percentage in Training Mixture (RT-2-PaLM-E) |
+|---------|-------------|--------|----------------------------------------------|----------------------------------------------|
+| WebLI | Around 10B image-text pairs across 109 languages, filtered to the top 10% scoring cross-modal similarity examples to give 1B training examples. | Chen et al. (2023b), Driess et al. (2023) | N/A | N/A |
+| Episodic WebLI | Not used in co-fine-tuning RT-2-PaLI-X. | Chen et al. (2023a) | N/A | N/A |
+| Robotics Dataset | Demonstration episodes collected with a mobile manipulation robot. Each demonstration is annotated with a natural language instruction from one of seven skills. | Brohan et al. (2022) | 50% | 66% |
+| Language-Table | Used for training on several prediction tasks. | Lynch et al. (2022) | N/A | N/A |

-for writing this amazing paper and advancing Humanity

-* LucidRains for providing the base repositories for [PALM](https://github.com/lucidrains/PaLM-rlhf-pytorch) and [RT-1](https://github.com/kyegomez/RT-2)

-* Any you yes the Human looking at this right now, I appreciate you and love you.

 ## Commercial Use Cases

@@ -128,25 +111,18 @@ The unique capabilities of RT-2 open up numerous commercial applications:
 - **Healthcare**: In robotic surgeries or patient care, RT-2 can assist in understanding and performing tasks based on both visual and verbal instructions.
 - **Smart Homes**: Integration of RT-2 in smart home systems can lead to improved automation, understanding homeowner instructions in a much more nuanced manner.

-## Examples and Documentation
-
-Detailed examples and comprehensive documentation for using RT-2 can be found in the [examples](https://github.com/kyegomez/RT-2/tree/master/examples) directory and the [documentation](https://github.com/kyegomez/RT-2/tree/master/docs) directory, respectively.

 ## Contributing

 Contributions to RT-2 are always welcome! Feel free to open an issue or pull request on the GitHub repository.

-## License
-
-RT-2 is provided under the MIT License. See the LICENSE file for details.
-
 ## Contact

 For any queries or issues, kindly open a GitHub issue or get in touch with [kyegomez](https://github.com/kyegomez).

 ## Citation

-```
+```bibtex
 @inproceedings{RT-2,2023,
 title={},
 author={Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski,
@@ -160,4 +136,9 @@ Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng X
 and Brianna Zitkovich},
 year={2024}
 }
-```
+```
+
+
+## License
+
+RT-2 is provided under the MIT License. See the LICENSE file for details.

example.py

+5 -2
@@ -1,11 +1,14 @@
 import torch
 from rt2.model import RT2

-# usage
+# img: (batch_size, 3, 256, 256)
+# caption: (batch_size, 1024)
 img = torch.randn(1, 3, 256, 256)
 caption = torch.randint(0, 20000, (1, 1024))

+# model: RT2
 model = RT2()

+# Run the model on img and caption
 output = model(img, caption)
-print(output.shape) # (1, 1024, 20000)
+print(output)  # logits of shape (1, 1024, 20000)

rt2/model.py

+1 -1
@@ -1,6 +1,6 @@
 import torch
 from torch import nn
-from zeta import (
+from zeta.structs import (
     AutoregressiveWrapper,
     Decoder,
     Encoder,
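The one-line change above switches the import to `zeta.structs`. If downstream code must run against both layouts, a guarded import is one option; this shim is a suggestion, not something the commit adds.

```python
# Hypothetical compatibility shim (not in the commit): prefer the new
# zeta.structs location and fall back to the old top-level import.
try:
    from zeta.structs import AutoregressiveWrapper, Decoder, Encoder
except ImportError:
    from zeta import AutoregressiveWrapper, Decoder, Encoder
```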
