Bert-like implementation #4

Closed
lhallee opened this issue Oct 25, 2023 · 4 comments

@lhallee

lhallee commented Oct 25, 2023

Hello,

Amazing work!!!

I have a couple of questions regarding the bidirectional implementation of the model.

  1. Does MonarchMixerSequenceMixing have, by default, all the recommended settings used to train the BERT-like models (obviously with bidirectional = True)? If not, is it possible to share the settings used for M2 large?
  2. It seems like the input u comes after a token embedding layer. Do you add positional embeddings?
  3. Is any sort of attention mask required?
  4. Is it really fair to say M2 outperforms BERT when the two are trained on different data? If I remember correctly, C4 improves BERT base considerably.

Best,
Logan

@DanFu09
Collaborator

DanFu09 commented Oct 25, 2023

Thanks for your questions and your interest!

Does MonarchMixerSequenceMixing have, by default, all the recommended settings used to train the BERT-like models (obviously with bidirectional = True)? If not, is it possible to share the settings used for M2 large?

You can find all the settings for M2-BERT-large here (260m) and here (341m).

There's also detailed instructions to run everything in this README.

It seems like the input u comes after a token embedding layer. Do you add positional embeddings?

We add positional embeddings at the very beginning of the architecture, see here.
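(For readers skimming the thread: a minimal sketch of what "positional embeddings at the very beginning" typically looks like in a BERT-style stack. The class and parameter names below are illustrative, not the actual M2 module; see the link above for the real code.)

```python
import torch
import torch.nn as nn

class TokenAndPositionEmbedding(nn.Module):
    """Illustrative BERT-style embedding step (not the actual M2 code):
    sum learned token and absolute position embeddings, then normalize.
    The output is the `u` that the sequence mixer consumes."""
    def __init__(self, vocab_size: int, max_len: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        u = self.tok(input_ids) + self.pos(positions)[None, :, :]
        return self.norm(u)

# Example: (batch=2, seq_len=8) token ids -> (2, 8, 64) embeddings
emb = TokenAndPositionEmbedding(vocab_size=30522, max_len=128, d_model=64)
u = emb(torch.randint(0, 30522, (2, 8)))
print(u.shape)  # torch.Size([2, 8, 64])
```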

Is any sort of attention mask required?

Nope!

M2 comparisons on the same data

We found that C4 was a more reliable training source - we have head-to-head comparisons trained from scratch on Wiki and Books with an older version of the architecture (no gating, and minus the residual conv) in Appendix B.9 of the paper. We match BERT pretrained with that recipe, but we found it wasn't as good for downstream fine-tuning.

Really interested in diving more into these questions - and I also suspect that the optimal training recipe for an M2-style model will be pretty different from Transformers (where the recipe has been fine-tuned for ~6 years now).

@lhallee
Author

lhallee commented Oct 25, 2023

Thanks so much for the response. I am planning on building the model and a custom PyTorch training loop for my own data; I work on biological sequences, and we are always length-limited with traditional attention. If I understand correctly, the create_bert_mlm function returns a fully functional HuggingFace-wrapped M2 mixer BERT?

Additionally, I have been messing around on a machine without a GPU and got the import error below telling me about requirements-cpu.txt, but I cannot find that file anywhere.
ImportError: Please make sure to pip install -r requirements-cpu.txt to get the requirements for the BERT benchmark.

Thanks for pointing my attention towards Appendix B.9! That is some really compelling data!

May I ask why the choice of 30% MLM? This is so interesting.

@DanFu09
Collaborator

DanFu09 commented Oct 25, 2023

I work on biological sequences and we are always length-limited with traditional attention.

Biological sequences are super interesting to us! Feel free to reach out privately if you want to discuss more, my email's on my website :)

If I understand correctly, the create_bert_mlm function returns a fully functional HuggingFace-wrapped M2 mixer BERT?

I would call it HuggingFace-esque... it has a similar interface to the HuggingFace BERT MLM models, but we haven't implemented all the HuggingFace interfaces (just the equivalent of BertForSequenceClassification for GLUE).
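(As an illustration of that interface, here's a hypothetical sketch using HuggingFace's stock BertForMaskedLM as a stand-in. That the model returned by create_bert_mlm accepts the same keyword inputs and exposes .logits the same way is an assumption based on this reply, not something taken from the repo.)

```python
import torch
from transformers import AutoTokenizer, BertForMaskedLM

# Stand-in for the M2 model: the reply above says the MLM interface is
# similar, so the same forward-pass pattern is assumed to apply to the
# object returned by create_bert_mlm (built with the repo's config).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

batch = tokenizer("Monarch Mixer is a [MASK] model.", return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits  # (batch, seq_len, vocab_size)

# Decode the top prediction at the masked position
mask_pos = (batch["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```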

I have been messing around on a machine without a GPU

For CPU, I've had some success using these Docker images and point-installing individual packages (basically, try running something, and if it complains, install that individual package).

30% MLM

We found that this just makes the model learn a bit faster in terms of the steps you need (Mosaic found something similar).
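(For anyone wiring this into a custom PyTorch loop, the 30% rate is just a masking-probability knob; a sketch with HuggingFace's data collator is below. This isn't the repo's training code - it only shows where the equivalent setting lives in a hand-rolled loop.)

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Sketch: set a 30% masking rate for MLM pretraining in a custom loop.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,  # 30% of tokens selected for masking (vs. the usual 15%)
)

# Toy batch: the collator pads, masks, and builds `labels`
# (-100 on positions that were not selected for masking).
examples = [tokenizer(text) for text in ["MKTAYIAKQR", "a longer toy sequence here"]]
batch = collator(examples)
print(batch["input_ids"].shape, batch["labels"].shape)
```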

@lhallee
Author

lhallee commented Oct 25, 2023

Awesome, I will reach out separately to chat more :) Thanks again!

lhallee closed this as completed Oct 25, 2023