increase N for experiments on Wu et al 2023 baseline, 3 or 4 layer variants reported in results (report additional runs that had not completed by original draft) #3

Merged
Binary file modified rasp-for-recogs_pos-wbruns-2024-draft.pdf
172 changes: 166 additions & 6 deletions rasp-for-recogs_pos-wbruns-2024-draft.tex
@@ -127,6 +127,7 @@ \section{Model}
\end{tiny}
}.


See Appendix 8 - Model Detail.
\section{Methods}
We use the RASP \cite{Weiss2021} interpreter\footnote{
@@ -316,12 +317,11 @@ \section{Results}

\textbf{Wu et al 2023's baseline 2-layer Encoder-Decoder Transformer does not improve on the obj\_pp\_to\_subj\_pp split when 1 or 2 additional layers are added}\footnote{This is consistent with the flat, non-tree solution we argue for in this paper, which e.g. cannot learn to combine `np\_det pp np -> np\_pp -> np`, makes the predicted mistakes in the subject pp when the agent is left of the verb, and does poorly on our new v\_dat\_p2\_pp\_moved\_to\_recipient split (see discussion).}\textbf{ (even allowing the parameter count to increase)}\footnote{Since no improvement was observed, we did not run the costly follow-up experiments that increase the layer count while controlling the parameter count (which would only be needed to distinguish whether an improvement came from the added layers or from the added parameters).}

3-layer Wu et al 2023 Encoder-Decoder on ReCOGS\_pos obj\_pp\_to\_subj\_pp split: 16.1\% +/- 1.7\% Semantic Exact Match (sample mean +/- std) with 95\% confidence interval for sample mean (n=8) of 14.9\% to 17.3\%.

4-layer Wu et al 2023 Encoder-Decoder on ReCOGS\_pos obj\_pp\_to\_subj\_pp split: 19.2\% +/- 4.4\% Semantic Exact Match (sample mean +/- std) with 95\% confidence interval for sample mean (n=5) of 15.4\% to 23.1\%.
%3-layer Wu et al 2023 Encoder-Decoder on ReCOGS\_pos obj\_pp\_to\_subj\_pp split: 16.1\% +/- 1.7\% Semantic Exact Match (sample mean +/- std) with 95\% confidence interval for sample mean (n=8) of 14.9\% to 17.3\%. (n=8; updated when 2 additional runs became available below; the same notebook had been queued originally but completed after the initial reporting deadline)
3-layer Wu et al 2023 Encoder-Decoder on ReCOGS\_pos obj\_pp\_to\_subj\_pp split: 16.2\% +/- 2.7\% Semantic Exact Match (sample mean +/- std, n=10) with 95\% confidence interval for sample mean (n=10) of 14.6\% to 17.9\%.\footnote{https://colab.research.google.com/drive/12mXX5L1I4rpwl1Jk8hCm-xyAkqiKJEo7}

% Note: I should increase N on these two experiments for consistency.
% However, I am confident in the result because I had already run many layer experiments on the ReCOGS dataset before re-running on ReCOGS_pos above, for consistency of input with what the RASP model ended up being tested on. I moved the RASP model to ReCOGS_pos instead of ReCOGS because I wanted to show it could achieve String Exact Match, not just Semantic Exact Match (and, as in the original COGS dataset, the order of the output in the ground truth data for ReCOGS_pos is deterministic). Since I had already done a large number of runs of the 3- and 4-layer variants of Wu et al 2023's Transformer on the base ReCOGS dataset, once I had a reasonably tight 95% confidence interval as above showing no improvement from adding 1 or 2 layers (3- and 4-layer variants) when training Wu et al 2023's baseline Transformer on ReCOGS_pos, matching what I saw on ReCOGS, I reported it, as I had a deadline for initial communication. But for publication I will increase the N here as well, though training Wu et al 2023's Transformer from scratch is unfortunately not free.
%4-layer Wu et al 2023 Encoder-Decoder on ReCOGS\_pos obj\_pp\_to\_subj\_pp split: 19.2\% +/- 4.4\% Semantic Exact Match (sample mean +/- std) with 95\% confidence interval for sample mean (n=5) of 15.4\% to 23.1\%. (n=5; updated when 5 additional runs became available below; the same notebook had been queued originally but completed after the initial reporting deadline)
4-layer Wu et al 2023 Encoder-Decoder on ReCOGS\_pos obj\_pp\_to\_subj\_pp split: 19.3\% +/- 4.1\% Semantic Exact Match (sample mean +/- std, n=10) with 95\% confidence interval for sample mean (n=10) of 16.8\% to 21.8\%.\footnote{https://colab.research.google.com/drive/12mXX5L1I4rpwl1Jk8hCm-xyAkqiKJEo7 and https://colab.research.google.com/drive/13FRQeAjyPOhBtTdrpW8caL25rNryLn5-}
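
(The 95\% confidence intervals for the sample mean reported above are consistent with a standard normal-approximation interval computed from the sample mean $\bar{x}$, sample standard deviation $s$, and number of runs $n$; shown here only as a sketch of the calculation, worked for the 4-layer runs:)
\[
\bar{x} \pm 1.96\,\frac{s}{\sqrt{n}} \;=\; 19.3\% \pm 1.96 \times \frac{4.1\%}{\sqrt{10}} \;\approx\; 19.3\% \pm 2.5\% \;=\; (16.8\%,\ 21.8\%).
\]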

\textbf{Error Analysis for Wu et al 2023 baseline Encoder-Decoder Transformer on obj\_pp\_to\_subj\_pp split}

@@ -349,7 +349,7 @@ \section{Results}
\begin{figure}
\includegraphics[scale=0.38]{new_difficult_generalization_v_dat_p2_recipient_pp_modification_predicted_and_confirmed_for_transformers_trained_from_scratch.png}
\caption[Wu et al 2023 Encoder-Decoder Transformer trained from scratch generalizing to the new v\_dat\_p2 pp moved to recipient (from theme) split is as hard as the previously reported hardest generalization split, as predicted by the flat/non-recursive/non-tree hypothesis from RASP modeling]{\begin{small}Wu et al 2023 Encoder-Decoder Transformer trained from scratch generalizing to the new v\_dat\_p2 pp moved to recipient (from theme) split is as hard as the previously reported hardest generalization split, as predicted by the flat/non-recursive/non-tree hypothesis from RASP modeling. \textit{An actual COGS training example and its modified version are shown; they have different meanings, but in a compositional solution they are identical at a medium level of abstraction, as both are `np v\_dat\_p2 np np`\protect\footnotemark, so the model should be able to translate both to their appropriate logical forms.} The RASP program predicted that this `np v\_dat\_p2 np pp np np` prepositional phrase modification would also require learning to ignore the in-between distractor ``pp np'' when inferring related words, the same as is required for the obj\_pp\_to\_subj\_pp split, and so would be equally difficult for a Transformer trained from scratch. To test this, Wu et al 2023's baseline Encoder-Decoder Transformer was trained with the default data (ReCOGS\_pos train.tsv) and tested on modified v\_dat\_p2 pp training examples where only the word order was changed to move the pp from the theme to the recipient (logical form properly updated; see Appendix 4 for all examples). \textit{The baseline Wu et al 2023 Encoder-Decoder Transformer was only able to achieve a Semantic Exact Match of 13\% (n=10 Transformers trained from scratch with randomly initialized weights and data shuffling) with a 95\% confidence interval for the sample mean (n=10) of 4\% to 23\%}. Thus, this new split, which we introduce here as v\_dat\_p2\_pp\_moved\_to\_recipient, is as difficult as the previously reported ``hardest split'' obj\_pp\_to\_subj\_pp. \\
\textit{(Note, the RASP model which includes the rule to ignore "pp np" when searching for noun indexes to report in relationships (agent, theme, recipient, etc) derived to parsimoniously cover all pp modification cases in training data, also performs well on all pp generalization splits, e.g. obj\_pp\_to\_subj\_pp, WITHOUT needing a recursive/tree representation, so it may be possible to fix these errors everywhere via data augmentation to get the Transformer trained from scratch to learn that rule.)}\end{small} \\
\textit{(Note, the RASP model, which includes the rule to ignore ``pp np'' when searching for noun indexes to report in relationships (agent, theme, recipient, etc.), a rule derived to parsimoniously cover all pp modification cases in the training data, also performs well on all pp generalization splits, e.g. obj\_pp\_to\_subj\_pp, WITHOUT needing a recursive/tree representation, so it may be possible to fix these errors everywhere via data augmentation, getting the Transformer trained from scratch to learn that more general rule (while still not exposing the model to the held-out grammar forms).)}\end{small} \\
}
% Note, need to regenerate the image in vector graphics for higher quality
% Note, may add the example of the correct LF and semantic graph vs the misunderstood one for the second form to make it clear and then remove a lot of explainer text from caption
@@ -1525,6 +1525,166 @@ \section{Appendix 8 - Model Detail}
\end{tiny}
}.

For ease of reference, the model architecture generated by the Wu et al 2023 baseline Encoder-Decoder Transformer script (trained from scratch, not pretrained) is as follows, with the number of BertLayers N set to 2 per \cite{Wu2023} for all baseline experiments except the layer variation experiments:
\begin{tiny}
\begin{verbatim}
# For Wu et al 2023 Encoder-Decoder Transformer baselines
# (we predict and analyze errors made by these
# in the paper using what we learned about how Transformers
# can perform the task from the
# Restricted Access Sequence Processing model),
# we use the official scripts provided at
# https://github.com/frankaging/ReCOGS/blob/
# 1b6eca8ff4dca5fd2fb284a7d470998af5083beb/run_cogs.py
# and
# https://github.com/frankaging/ReCOGS/blob/
# 1b6eca8ff4dca5fd2fb284a7d470998af5083beb/
# model/encoder_decoder_hf.py
# where the architecture generated is as follows:
EncoderDecoderModel(
(encoder): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(762, 300, padding_idx=0)
(position_embeddings): Embedding(512, 300)
(token_type_embeddings): Embedding(2, 300)
(LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
# substitute N=2 for all baseline experiments
# per Wu et al 2023 paper;
# N can be 3 or 4 in our layer variation experiments only.
(0-(N-1)): N x BertLayer(
(attention): BertAttention(
(self): BertSdpaSelfAttention(
(query):
Linear(in_features=300, out_features=300, bias=True)
(key):
Linear(in_features=300, out_features=300, bias=True)
(value):
Linear(in_features=300, out_features=300, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense):
Linear(in_features=300, out_features=300, bias=True)
(LayerNorm):
LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense):
Linear(in_features=300, out_features=512, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): BertOutput(
(dense):
Linear(in_features=512, out_features=300, bias=True)
(LayerNorm):
LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): BertPooler(
(dense):
Linear(in_features=300, out_features=300, bias=True)
(activation): Tanh()
)
)
(decoder): BertLMHeadModel(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(729, 300, padding_idx=0)
(position_embeddings): Embedding(512, 300)
(token_type_embeddings): Embedding(2, 300)
(LayerNorm):
LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
# substitute N=2 for all baseline experiments
# per Wu et al 2023 paper;
# N can be 3 or 4 in our layer variation experiments only.
(0-(N-1)): N x BertLayer(
(attention): BertAttention(
(self): BertSdpaSelfAttention(
(query):
Linear(in_features=300, out_features=300, bias=True)
(key):
Linear(in_features=300, out_features=300, bias=True)
(value):
Linear(in_features=300, out_features=300, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense):
Linear(in_features=300, out_features=300, bias=True)
(LayerNorm):
LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(crossattention): BertAttention(
(self): BertSdpaSelfAttention(
(query):
Linear(in_features=300, out_features=300, bias=True)
(key):
Linear(in_features=300, out_features=300, bias=True)
(value):
Linear(in_features=300, out_features=300, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense):
Linear(in_features=300, out_features=300, bias=True)
(LayerNorm):
LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense):
Linear(in_features=300, out_features=512, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): BertOutput(
(dense):
Linear(in_features=512, out_features=300, bias=True)
(LayerNorm):
LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
)
(cls): BertOnlyMLMHead(
(predictions): BertLMPredictionHead(
(transform): BertPredictionHeadTransform(
(dense):
Linear(in_features=300, out_features=300, bias=True)
(transform_act_fn): GELUActivation()
(LayerNorm):
LayerNorm((300,), eps=1e-12, elementwise_affine=True)
)
(decoder): Linear(in_features=300, out_features=729, bias=True)
)
)
)
)
\end{verbatim}
\end{tiny}

For the Wu et al 2023 baseline Encoder-Decoder Transformer layer variation experiments,
when we say e.g. 3 or 4 layers, we refer to setting 3 or 4 x BertLayer in both the Encoder and the Decoder (i.e., 3 or 4 Transformer blocks).
(This is intentional because information is exchanged between sequence positions only once per block, during self/cross-attention, and \cite{Csordas2022} hypothesize that the number of such blocks must be at least the depth of the parse tree in a compositional solution, since in a grammar parse tree symbols are combined at each level, which requires transferring information between sequence positions.)
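
For illustration only, here is a minimal sketch (not the official run\_cogs.py training script; hyperparameters not visible in the printed module structure above, such as the number of attention heads, are placeholder assumptions here) of how an Encoder-Decoder of this shape, with a configurable number of BertLayers N per stack, could be instantiated with the HuggingFace Transformers library, and how parameter counts across layer settings could be compared:
\begin{tiny}
\begin{verbatim}
# Sketch only; see the official Wu et al 2023 scripts referenced above for the
# actual training setup. The attention head count below is a placeholder
# assumption (it is not shown in the printed module structure).
from transformers import (BertConfig, BertModel,
                          BertLMHeadModel, EncoderDecoderModel)

N = 3  # BertLayers per stack: 2 for the baseline, 3 or 4 in the layer variation experiments

enc_cfg = BertConfig(vocab_size=762, hidden_size=300, intermediate_size=512,
                     num_hidden_layers=N, num_attention_heads=4)
dec_cfg = BertConfig(vocab_size=729, hidden_size=300, intermediate_size=512,
                     num_hidden_layers=N, num_attention_heads=4,
                     is_decoder=True, add_cross_attention=True)

# Encoder is a plain BertModel; decoder is a BertLMHeadModel with cross-attention,
# matching the module structure printed above.
model = EncoderDecoderModel(encoder=BertModel(enc_cfg),
                            decoder=BertLMHeadModel(dec_cfg))

# Compare total parameter counts as N is varied (note that layer count and
# parameter count are confounded here, as discussed in the Results section).
print(N, sum(p.numel() for p in model.parameters()))
\end{verbatim}
\end{tiny}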
\clearpage

\section{Appendix 9 - Methods Detail}

We use the RASP \cite{Weiss2021} interpreter\footnote{