increase N for experiments on Wu et al 2023 baseline, 3 or 4 layer variants reported in results (report additional runs that had not completed by original draft) #3

Merged
Binary file modified rasp-for-recogs_pos-wbruns-2024-draft.pdf
172 changes: 166 additions & 6 deletions rasp-for-recogs_pos-wbruns-2024-draft.tex
@@ -127,6 +127,7 @@ \section{Model}
\end{tiny}
}.


See Appendix 8 - Model Detail.
\section{Methods}
We use the RASP \cite{Weiss2021} interpreter\footnote{
@@ -316,12 +317,11 @@ \section{Results}

\textbf{Wu et al 2023's baseline 2-layer Encoder-Decoder Transformer does not improve on the obj\_pp\_to\_subj\_pp split when 1 or 2 additional layers are added}\footnote{This is consistent with the flat, non-tree solution we argue for in this paper, which e.g. cannot learn to combine `np\_det pp np -> np\_pp -> np`, makes the predicted mistakes in the subject pp when the agent is left of the verb, and does poorly on our new v\_dat\_p2\_pp\_moved\_to\_recipient split (see discussion).}\textbf{ (even allowing the parameter count to increase)}\footnote{Since no improvement was observed, we did not run the costly follow-up experiments that increase the layer count while controlling the parameter count (which would only be needed to distinguish whether an improvement came from the added layers or from the added parameters).}

3-layer Wu et al 2023 Encoder-Decoder on ReCOGS\_pos obj\_pp\_to\_subj\_pp split: 16.1\% +/- 1.7\% Semantic Exact Match (sample mean +/- std) with 95\% confidence interval for sample mean (n=8) of 14.9\% to 17.3\%.

4-layer Wu et al 2023 Encoder-Decoder on ReCOGS\_pos obj\_pp\_to\_subj\_pp split: 19.2\% +/- 4.4\% Semantic Exact Match (sample mean +/- std) with 95\% confidence interval for sample mean (n=5) of 15.4\% to 23.1\%.
%3-layer Wu et al 2023 Encoder-Decoder on ReCOGS\_pos obj\_pp\_to\_subj\_pp split: 16.1\% +/- 1.7\% Semantic Exact Match (sample mean +/- std) with 95\% confidence interval for sample mean (n=8) of 14.9\% to 17.3\%. (n=8; updated when 2 additional runs became available below; the same notebook had been queued originally but completed after the initial reporting deadline)
3-layer Wu et al 2023 Encoder-Decoder on ReCOGS\_pos obj\_pp\_to\_subj\_pp split: 16.2\% +/- 2.7\% Semantic Exact Match (sample mean +/- std, n=10) with 95\% confidence interval for sample mean (n=10) of 14.6\% to 17.9\%.\footnote{https://colab.research.google.com/drive/12mXX5L1I4rpwl1Jk8hCm-xyAkqiKJEo7}

% Note: I should increase N on these two experiments for consistency.
% However, I am confident in the result because I had already run many layer experiments on the ReCOGS dataset before re-running on ReCOGS_pos above, for consistency of input with what the RASP model ended up being tested on. I moved the RASP model to ReCOGS_pos instead of ReCOGS because I wanted to show it could achieve String Exact Match, not just Semantic Exact Match (and, as in the original COGS dataset, the order of the output in the ground truth data for ReCOGS_pos is deterministic). Since I had already done a large number of runs of the 3- and 4-layer variants of Wu et al 2023's Transformer on the base ReCOGS dataset, once I had a reasonably tight 95% confidence interval as above showing no improvement from adding 1 or 2 layers (3- and 4-layer variants) when training Wu et al 2023's baseline Transformer on ReCOGS_pos, matching what I saw on ReCOGS, I reported it, as I had a deadline for initial communication. But for publication I will increase the N here as well, though training Wu et al 2023's Transformer from scratch is unfortunately not free.
%4-layer Wu et al 2023 Encoder-Decoder on ReCOGS\_pos obj\_pp\_to\_subj\_pp split: 19.2\% +/- 4.4\% Semantic Exact Match (sample mean +/- std) with 95\% confidence interval for sample mean (n=5) of 15.4\% to 23.1\%. (n=5; updated when 5 additional runs became available below; the same notebook had been queued originally but completed after the initial reporting deadline)
4-layer Wu et al 2023 Encoder-Decoder on ReCOGS\_pos obj\_pp\_to\_subj\_pp split: 19.3\% +/- 4.1\% Semantic Exact Match (sample mean +/- std, n=10) with 95\% confidence interval for sample mean (n=10) of 16.8\% to 21.8\%.\footnote{https://colab.research.google.com/drive/12mXX5L1I4rpwl1Jk8hCm-xyAkqiKJEo7 and https://colab.research.google.com/drive/13FRQeAjyPOhBtTdrpW8caL25rNryLn5-}
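
(The 95\% confidence intervals for the sample mean reported above are consistent with a standard normal-approximation interval computed from the sample mean $\bar{x}$, sample standard deviation $s$, and number of runs $n$; shown here only as a sketch of the calculation, worked for the 4-layer runs:)
\[
\bar{x} \pm 1.96\,\frac{s}{\sqrt{n}} \;=\; 19.3\% \pm 1.96 \times \frac{4.1\%}{\sqrt{10}} \;\approx\; 19.3\% \pm 2.5\% \;=\; (16.8\%,\ 21.8\%).
\]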

\textbf{Error Analysis for Wu et al 2023 baseline Encoder-Decoder Transformer on obj\_pp\_to\_subj\_pp split}

@@ -349,7 +349,7 @@ \section{Results}
\begin{figure}
\includegraphics[scale=0.38]{new_difficult_generalization_v_dat_p2_recipient_pp_modification_predicted_and_confirmed_for_transformers_trained_from_scratch.png}
\caption[Wu et al 2023 Encoder-Decoder Transformer trained from scratch generalizing to the new v\_dat\_p2 pp moved to recipient (from theme) split is as hard as the previously reported hardest generalization split, as predicted by the flat/non-recursive/non-tree hypothesis from RASP modeling]{\begin{small}Wu et al 2023 Encoder-Decoder Transformer trained from scratch generalizing to the new v\_dat\_p2 pp moved to recipient (from theme) split is as hard as the previously reported hardest generalization split, as predicted by the flat/non-recursive/non-tree hypothesis from RASP modeling. \textit{An actual COGS training example and its modified version are shown; they have different meanings, but in a compositional solution they are identical at a medium level of abstraction, as both are `np v\_dat\_p2 np np`\protect\footnotemark, so the model should be able to translate both to their appropriate logical forms.} The RASP program predicted that this `np v\_dat\_p2 np pp np np` prepositional phrase modification would also require learning to ignore the in-between distractor ``pp np'' when inferring related words, the same as is required for the obj\_pp\_to\_subj\_pp split, and so would be equally difficult for a Transformer trained from scratch. To test this, Wu et al 2023's baseline Encoder-Decoder Transformer was trained with the default data (ReCOGS\_pos train.tsv) and tested on modified v\_dat\_p2 pp training examples where only the word order was changed to move the pp from the theme to the recipient (logical form properly updated; see Appendix 4 for all examples). \textit{The baseline Wu et al 2023 Encoder-Decoder Transformer was only able to achieve a Semantic Exact Match of 13\% (n=10 Transformers trained from scratch with randomly initialized weights and data shuffling) with a 95\% confidence interval for the sample mean (n=10) of 4\% to 23\%}. Thus, this new split, which we introduce here as v\_dat\_p2\_pp\_moved\_to\_recipient, is as difficult as the previously reported ``hardest split'' obj\_pp\_to\_subj\_pp. \\
\textit{(Note, the RASP model which includes the rule to ignore "pp np" when searching for noun indexes to report in relationships (agent, theme, recipient, etc) derived to parsimoniously cover all pp modification cases in training data, also performs well on all pp generalization splits, e.g. obj\_pp\_to\_subj\_pp, WITHOUT needing a recursive/tree representation, so it may be possible to fix these errors everywhere via data augmentation to get the Transformer trained from scratch to learn that rule.)}\end{small} \\
\textit{(Note, the RASP model, which includes the rule to ignore ``pp np'' when searching for noun indexes to report in relationships (agent, theme, recipient, etc.), a rule derived to parsimoniously cover all pp modification cases in the training data, also performs well on all pp generalization splits, e.g. obj\_pp\_to\_subj\_pp, WITHOUT needing a recursive/tree representation, so it may be possible to fix these errors everywhere via data augmentation, getting the Transformer trained from scratch to learn that more general rule (while still not exposing the model to the held-out grammar forms).)}\end{small} \\
}
% Note, need to regenerate the image in vector graphics for higher quality
% Note, may add the example of the correct LF and semantic graph vs the misunderstood one for the second form to make it clear and then remove a lot of explainer text from caption
@@ -1525,6 +1525,166 @@ \section{Appendix 8 - Model Detail}
\end{tiny}
}.

For ease of reference, the model architecture generated by the Wu et al 2023 baseline Encoder-Decoder Transformer script (trained from scratch, not pretrained) is as follows, with the number of BertLayers N set to 2 per \cite{Wu2023} for all baseline experiments except the layer variation experiments:
\begin{tiny}
\begin{verbatim}
# For Wu et al 2023 Encoder-Decoder Transformer baselines
# (we predict and analyze errors made by these
# in the paper using what we learned about how Transformers
# can perform the task from the
# Restricted Access Sequence Processing model),
# we use the official scripts provided at
# https://github.com/frankaging/ReCOGS/blob/
# 1b6eca8ff4dca5fd2fb284a7d470998af5083beb/run_cogs.py
# and
# https://github.com/frankaging/ReCOGS/blob/
# 1b6eca8ff4dca5fd2fb284a7d470998af5083beb/
# model/encoder_decoder_hf.py
# where the architecture generated is as follows:
EncoderDecoderModel(
(encoder): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(762, 300, padding_idx=0)
(position_embeddings): Embedding(512, 300)
(token_type_embeddings): Embedding(2, 300)
(LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
# substitute N=2 for all baseline experiments
# per Wu et al 2023 paper;
# N can be 3 or 4 in our layer variation experiments only.
(0-(N-1)): N x BertLayer(
(attention): BertAttention(
(self): BertSdpaSelfAttention(
(query):
Linear(in_features=300, out_features=300, bias=True)
(key):
Linear(in_features=300, out_features=300, bias=True)
(value):
Linear(in_features=300, out_features=300, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense):
Linear(in_features=300, out_features=300, bias=True)
(LayerNorm):
LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense):
Linear(in_features=300, out_features=512, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): BertOutput(
(dense):
Linear(in_features=512, out_features=300, bias=True)
(LayerNorm):
LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): BertPooler(
(dense):
Linear(in_features=300, out_features=300, bias=True)
(activation): Tanh()
)
)
(decoder): BertLMHeadModel(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(729, 300, padding_idx=0)
(position_embeddings): Embedding(512, 300)
(token_type_embeddings): Embedding(2, 300)
(LayerNorm):
LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
# substitute N=2 for all baseline experiments
# per Wu et al 2023 paper;
# N can be 3 or 4 in our layer variation experiments only.
(0-(N-1)): N x BertLayer(
(attention): BertAttention(
(self): BertSdpaSelfAttention(
(query):
Linear(in_features=300, out_features=300, bias=True)
(key):
Linear(in_features=300, out_features=300, bias=True)
(value):
Linear(in_features=300, out_features=300, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense):
Linear(in_features=300, out_features=300, bias=True)
(LayerNorm):
LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(crossattention): BertAttention(
(self): BertSdpaSelfAttention(
(query):
Linear(in_features=300, out_features=300, bias=True)
(key):
Linear(in_features=300, out_features=300, bias=True)
(value):
Linear(in_features=300, out_features=300, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense):
Linear(in_features=300, out_features=300, bias=True)
(LayerNorm):
LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense):
Linear(in_features=300, out_features=512, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): BertOutput(
(dense):
Linear(in_features=512, out_features=300, bias=True)
(LayerNorm):
LayerNorm((300,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
)
(cls): BertOnlyMLMHead(
(predictions): BertLMPredictionHead(
(transform): BertPredictionHeadTransform(
(dense):
Linear(in_features=300, out_features=300, bias=True)
(transform_act_fn): GELUActivation()
(LayerNorm):
LayerNorm((300,), eps=1e-12, elementwise_affine=True)
)
(decoder): Linear(in_features=300, out_features=729, bias=True)
)
)
)
)
\end{verbatim}
\end{tiny}

For the Wu et al 2023 baseline Encoder-Decoder Transformer layer variation experiments,
when we say e.g. 3 or 4 layers, we refer to setting 3 or 4 x BertLayer in both the Encoder and the Decoder (i.e., 3 or 4 Transformer blocks).
(This is intentional because information is exchanged between sequence positions only once per block, during self/cross-attention, and \cite{Csordas2022} hypothesize that the number of such blocks must be at least the depth of the parse tree in a compositional solution, since in a grammar parse tree symbols are combined at each level, which requires transferring information between sequence positions.)
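
For illustration only, here is a minimal sketch (not the official run\_cogs.py training script; hyperparameters not visible in the printed module structure above, such as the number of attention heads, are placeholder assumptions here) of how an Encoder-Decoder of this shape, with a configurable number of BertLayers N per stack, could be instantiated with the HuggingFace Transformers library, and how parameter counts across layer settings could be compared:
\begin{tiny}
\begin{verbatim}
# Sketch only; see the official Wu et al 2023 scripts referenced above for the
# actual training setup. The attention head count below is a placeholder
# assumption (it is not shown in the printed module structure).
from transformers import (BertConfig, BertModel,
                          BertLMHeadModel, EncoderDecoderModel)

N = 3  # BertLayers per stack: 2 for the baseline, 3 or 4 in the layer variation experiments

enc_cfg = BertConfig(vocab_size=762, hidden_size=300, intermediate_size=512,
                     num_hidden_layers=N, num_attention_heads=4)
dec_cfg = BertConfig(vocab_size=729, hidden_size=300, intermediate_size=512,
                     num_hidden_layers=N, num_attention_heads=4,
                     is_decoder=True, add_cross_attention=True)

# Encoder is a plain BertModel; decoder is a BertLMHeadModel with cross-attention,
# matching the module structure printed above.
model = EncoderDecoderModel(encoder=BertModel(enc_cfg),
                            decoder=BertLMHeadModel(dec_cfg))

# Compare total parameter counts as N is varied (note that layer count and
# parameter count are confounded here, as discussed in the Results section).
print(N, sum(p.numel() for p in model.parameters()))
\end{verbatim}
\end{tiny}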
\clearpage

\section{Appendix 9 - Methods Detail}

We use the RASP \cite{Weiss2021} interpreter\footnote{