Bert-like implementation #4

Closed
lhallee opened this issue Oct 25, 2023 · 4 comments

@lhallee

lhallee commented Oct 25, 2023

Hello,

Amazing work!!!

I have a couple of questions regarding the bidirectional implementation of the model.

  1. Does MonarchMixerSequenceMixing have, by default, all the recommended settings used to train the BERT-like models (obviously with bidirectional = True)? If not, is it possible to share the settings used for M2 large?
  2. It seems like the input u comes after a token embedding layer. Do you add positional embeddings?
  3. Is any sort of attention mask required?
  4. Is it really fair to say M2 outperforms BERT when the two are trained on different data? If I remember correctly, C4 improves BERT base considerably.

Best,
Logan

@DanFu09
Collaborator

DanFu09 commented Oct 25, 2023

Thanks for your questions and your interest!

Does MonarchMixerSequenceMixing have, by default, all the recommended settings used to train the BERT-like models (obviously with bidirectional = True)? If not, is it possible to share the settings used for M2 large?

You can find all the settings for M2-BERT-large here (260m) and here (341m).

There's also detailed instructions to run everything in this README.

It seems like the input u comes after a token embedding layer. Do you add positional embeddings?

We add positional embeddings at the very beginning of the architecture, see here.
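(For readers skimming the thread: a minimal sketch of what "positional embeddings at the very beginning" typically looks like in a BERT-style stack. The class and parameter names below are illustrative, not the actual M2 module; see the link above for the real code.)

```python
import torch
import torch.nn as nn

class TokenAndPositionEmbedding(nn.Module):
    """Illustrative BERT-style embedding step (not the actual M2 code):
    sum learned token and absolute position embeddings, then normalize.
    The output is the `u` that the sequence mixer consumes."""
    def __init__(self, vocab_size: int, max_len: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        u = self.tok(input_ids) + self.pos(positions)[None, :, :]
        return self.norm(u)

# Example: (batch=2, seq_len=8) token ids -> (2, 8, 64) embeddings
emb = TokenAndPositionEmbedding(vocab_size=30522, max_len=128, d_model=64)
u = emb(torch.randint(0, 30522, (2, 8)))
print(u.shape)  # torch.Size([2, 8, 64])
```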

Is any sort of attention mask required?

Nope!

M2 comparisons on the same data

We found that C4 was a more reliable training source - we have head-to-head comparisons trained from scratch on Wiki and Books with an older version of the architecture (no gating, and minus the residual conv) in Appendix B.9 of the paper. We match BERT pretrained with that recipe, but we found it wasn't as good for downstream fine-tuning.

Really interested in diving more into these questions - and I also suspect that the optimal training recipe for an M2-style model will be pretty different from Transformers (where the recipe has been fine-tuned for ~6 years now).

@lhallee
Author

lhallee commented Oct 25, 2023

Thanks so much for the response. I am planning on building the model and a custom PyTorch training loop for my own data; I work on biological sequences, and we are always length-limited with traditional attention. If I understand correctly, the create_bert_mlm function returns a fully functional HuggingFace-wrapped M2 mixer BERT?

Additionally, I have been messing around on a machine without a GPU and got the import error below telling me about requirements-cpu.txt, but I cannot find that file anywhere.
ImportError: Please make sure to pip install -r requirements-cpu.txt to get the requirements for the BERT benchmark.

Thanks for pointing my attention towards Appendix B.9! That is some really compelling data!

May I ask why the choice of 30% MLM? This is so interesting.

@DanFu09
Collaborator

DanFu09 commented Oct 25, 2023

I work on biological sequences and we are always length-limited with traditional attention.

Biological sequences are super interesting to us! Feel free to reach out privately if you want to discuss more, my email's on my website :)

If I understand correctly, the create_bert_mlm function returns a fully functional HuggingFace-wrapped M2 mixer BERT?

I would call it HuggingFace-esque... it has a similar interface to the HuggingFace BERT MLM models, but we haven't implemented all the HuggingFace interfaces (just the equivalent of BertForSequenceClassification for GLUE).
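(As an illustration of that interface, here's a hypothetical sketch using HuggingFace's stock BertForMaskedLM as a stand-in. That the model returned by create_bert_mlm accepts the same keyword inputs and exposes .logits the same way is an assumption based on this reply, not something taken from the repo.)

```python
import torch
from transformers import AutoTokenizer, BertForMaskedLM

# Stand-in for the M2 model: the reply above says the MLM interface is
# similar, so the same forward-pass pattern is assumed to apply to the
# object returned by create_bert_mlm (built with the repo's config).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

batch = tokenizer("Monarch Mixer is a [MASK] model.", return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits  # (batch, seq_len, vocab_size)

# Decode the top prediction at the masked position
mask_pos = (batch["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```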

I have been messing around on a machine without a GPU

For CPU, I've had some success using these Docker images and point-installing individual packages (basically, try running something, and if it complains, install that individual package).

30% MLM

We found that this just makes the model learn a bit faster in terms of the steps you need (Mosaic found something similar).
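(For anyone wiring this into a custom PyTorch loop, the 30% rate is just a masking-probability knob; a sketch with HuggingFace's data collator is below. This isn't the repo's training code - it only shows where the equivalent setting lives in a hand-rolled loop.)

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Sketch: set a 30% masking rate for MLM pretraining in a custom loop.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,  # 30% of tokens selected for masking (vs. the usual 15%)
)

# Toy batch: the collator pads, masks, and builds `labels`
# (-100 on positions that were not selected for masking).
examples = [tokenizer(text) for text in ["MKTAYIAKQR", "a longer toy sequence here"]]
batch = collator(examples)
print(batch["input_ids"].shape, batch["labels"].shape)
```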

@lhallee
Author

lhallee commented Oct 25, 2023

Awesome, I will reach out separately to chat more :) Thanks again!

lhallee closed this as completed Oct 25, 2023