
Commit 884b478

Author: Kye
Commit message: [CLEANUP]
1 parent: b25b62e

File tree

3 files changed: +32 -48 lines changed


README.md

+26 -45
@@ -23,7 +23,7 @@
 ---


-Robotic Transformer 2 (RT-2) leverages both web and robotics data to generate actionable instructions for robotic control.
+This is my implementation of the model behind RT-2. RT-2 leverages PaLM-E as the backbone, with a vision encoder and a language backbone: images are embedded and concatenated into the same space as the language embeddings. This architecture is easy to build, but it suffers from a lack of deep understanding of both the unified multimodal representation and the individual modality representations.

 [CLICK HERE FOR THE PAPER](https://robotics-transformer2.github.io/assets/rt2.pdf)

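To make the fusion step in the new intro concrete, here is a minimal sketch of projecting image features into the language embedding space and concatenating the two token streams. All names, shapes, and the projection layer are illustrative assumptions, not the repository's actual API.

```python
import torch
from torch import nn

# Illustrative shapes: a vision encoder yields image tokens, the language
# model yields text token embeddings, both with the same hidden size.
batch, num_img_tokens, num_txt_tokens, dim = 1, 256, 1024, 512

image_feats = torch.randn(batch, num_img_tokens, dim)  # stand-in for vision encoder output
text_embeds = torch.randn(batch, num_txt_tokens, dim)  # stand-in for token embeddings

# Project image features into the language embedding space, then concatenate
# so the language backbone attends over one unified sequence.
project = nn.Linear(dim, dim)
fused = torch.cat([project(image_feats), text_embeds], dim=1)
print(fused.shape)  # torch.Size([1, 1280, 512])
```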
@@ -35,13 +35,6 @@ RT-2 can be easily installed using pip:
 ```bash
 pip install rt2
 ```
-
-Additionally, you can manually install the dependencies:
-
-```bash
-pip install -r requirements.txt
-```
-

 # Usage

@@ -53,26 +46,21 @@ First, you need to initialize the `RT2` class. You can do this by providing the
 ```python
-import torch
+import torch
 from rt2.model import RT2

-model = RT2()
-
-video = torch.randn(2, 3, 6, 224, 224)
+# img: (batch_size, 3, 256, 256)
+# caption: (batch_size, 1024)
+img = torch.randn(1, 3, 256, 256)
+caption = torch.randint(0, 20000, (1, 1024))

-instructions = [
-    'bring me that apple sitting on the table',
-    'please pass the butter'
-]
-
-# compute the train logits
-train_logits = model.train(video, instructions)
+# model: RT2
+model = RT2()

-# set the model to evaluation mode
-model.model.eval()
+# Run the model on img and caption
+output = model(img, caption)
+print(output)  # logits of shape (1, 1024, 20000)

-# compute the eval logits with a conditional scale of 3
-eval_logits = model.eval(video, instructions, cond_scale=3.)


 ```

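The updated usage snippet stops at printing logits of shape (batch, seq, vocab). As a hedged follow-on, this is one standard way such logits could feed a next-token cross-entropy loss; the stand-in tensors and the flattening step are assumptions for illustration, not part of the documented RT2 API.

```python
import torch
import torch.nn.functional as F

# Stand-ins for the snippet's `model(img, caption)` output and its caption
# targets; requires_grad=True mimics logits coming from a trainable model.
logits = torch.randn(1, 1024, 20000, requires_grad=True)
targets = torch.randint(0, 20000, (1, 1024))

# F.cross_entropy expects (N, num_classes), so flatten batch and sequence dims.
loss = F.cross_entropy(logits.view(-1, 20000), targets.view(-1))
loss.backward()  # in real training, follow with an optimizer step
print(loss.item())
```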
@@ -101,24 +89,19 @@ RT-2 is fine-tuned using both web and robotics data. The resultant model interpr
 | Language-Table | Used for training on several prediction tasks. | Lynch et al. (2022) | N/A | N/A |


+## Datasets
+Datasets used in the paper

-# Appreciation

-* Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski,
-* Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu,
-* Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog,
-* Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang,
-* Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch,
-* Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi,
-* Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong,
-* Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu,
-* and Brianna Zitkovich
+| Dataset | Description | Source | Percentage in Training Mixture (RT-2-PaLI-X) | Percentage in Training Mixture (RT-2-PaLM-E) |
+|---------|-------------|--------|----------------------------------------------|----------------------------------------------|
+| WebLI | Around 10B image-text pairs across 109 languages, filtered to the top 10% scoring cross-modal similarity examples to give 1B training examples. | Chen et al. (2023b), Driess et al. (2023) | N/A | N/A |
+| Episodic WebLI | Not used in co-fine-tuning RT-2-PaLI-X. | Chen et al. (2023a) | N/A | N/A |
+| Robotics Dataset | Demonstration episodes collected with a mobile manipulation robot. Each demonstration is annotated with a natural language instruction from one of seven skills. | Brohan et al. (2022) | 50% | 66% |
+| Language-Table | Used for training on several prediction tasks. | Lynch et al. (2022) | N/A | N/A |

-for writing this amazing paper and advancing Humanity

-* LucidRains for providing the base repositories for [PALM](https://github.com/lucidrains/PaLM-rlhf-pytorch) and [RT-1](https://github.com/kyegomez/RT-2)

-* Any you yes the Human looking at this right now, I appreciate you and love you.

 ## Commercial Use Cases

@@ -128,25 +111,18 @@ The unique capabilities of RT-2 open up numerous commercial applications:
 - **Healthcare**: In robotic surgeries or patient care, RT-2 can assist in understanding and performing tasks based on both visual and verbal instructions.
 - **Smart Homes**: Integration of RT-2 in smart home systems can lead to improved automation, understanding homeowner instructions in a much more nuanced manner.

-## Examples and Documentation
-
-Detailed examples and comprehensive documentation for using RT-2 can be found in the [examples](https://github.com/kyegomez/RT-2/tree/master/examples) directory and the [documentation](https://github.com/kyegomez/RT-2/tree/master/docs) directory, respectively.

 ## Contributing

 Contributions to RT-2 are always welcome! Feel free to open an issue or pull request on the GitHub repository.

-## License
-
-RT-2 is provided under the MIT License. See the LICENSE file for details.
-
 ## Contact

 For any queries or issues, kindly open a GitHub issue or get in touch with [kyegomez](https://github.com/kyegomez).

 ## Citation

-```
+```bibtex
 @inproceedings{RT-2,2023,
 title={},
 author={Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski,
@@ -160,4 +136,9 @@ Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng X
 and Brianna Zitkovich},
 year={2024}
 }
-```
+```
+
+
+## License
+
+RT-2 is provided under the MIT License. See the LICENSE file for details.

example.py

+5 -2
@@ -1,11 +1,14 @@
 import torch
 from rt2.model import RT2

-# usage
+# img: (batch_size, 3, 256, 256)
+# caption: (batch_size, 1024)
 img = torch.randn(1, 3, 256, 256)
 caption = torch.randint(0, 20000, (1, 1024))

+# model: RT2
 model = RT2()

+# Run the model on img and caption
 output = model(img, caption)
-print(output.shape) # (1, 1024, 20000)
+print(output)  # logits of shape (1, 1024, 20000)

rt2/model.py

+1 -1
@@ -1,6 +1,6 @@
 import torch
 from torch import nn
-from zeta import (
+from zeta.structs import (
     AutoregressiveWrapper,
     Decoder,
     Encoder,
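The one-line change above switches the import to `zeta.structs`. If downstream code must run against both layouts, a guarded import is one option; this shim is a suggestion, not something the commit adds.

```python
# Hypothetical compatibility shim (not in the commit): prefer the new
# zeta.structs location and fall back to the old top-level import.
try:
    from zeta.structs import AutoregressiveWrapper, Decoder, Encoder
except ImportError:
    from zeta import AutoregressiveWrapper, Decoder, Encoder
```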
