Bert-like implementation #4
Hello,

Amazing work!!!

I have a couple of questions regarding the bidirectional implementation of the model.

Best,
Logan
Thanks for your questions and your interest!
You can find all the settings for M2-BERT-large here (260m) and here (341m). There are also detailed instructions for running everything in this README.
We add positional embeddings at the very beginning of the architecture; see here.
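Concretely, this is the usual BERT-style input embedding step: token embeddings plus learned absolute position embeddings, applied once before the mixer blocks. A minimal sketch is below; the class name, default sizes, and dropout value are illustrative assumptions, not the repo's exact module.

```python
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    """Token + learned absolute position embeddings, added once at the input.
    Names and default sizes are illustrative, not the repo's exact code."""
    def __init__(self, vocab_size=30528, hidden_size=768, max_seq_len=128, dropout=0.1):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_seq_len, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids):
        # input_ids: (batch, seq_len); positions are 0..seq_len-1 for every sequence
        seq_len = input_ids.size(1)
        positions = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
        x = self.word_embeddings(input_ids) + self.position_embeddings(positions)
        return self.dropout(self.layer_norm(x))  # output then feeds the mixer layers
```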
Nope!
We found that C4 was a more reliable training source: we have head-to-head comparisons, trained from scratch on Wiki and Books with an older version of the architecture (no gating, and minus the residual conv), in Appendix B.9 of the paper. We match BERT pretrained with that recipe, but we found it wasn't as good for downstream fine-tuning. Really interested in diving more into these questions, and I also suspect that the optimal training recipe for an M2-style model will be pretty different from Transformers (where the recipe has been fine-tuned for ~6 years now).
Thanks so much for the response. I am planning on building the model and a custom PyTorch training loop for my own data; I work on biological sequences, and we are always length-limited with traditional attention. If I understand it correctly, the create_bert_mlm function returns a fully functional HuggingFace-wrapped M2 mixer BERT? Additionally, I have been messing around on a machine without a GPU and got the import error telling me about requirements-cpu.txt, but I cannot find that file anywhere. Thanks for pointing my attention towards Appendix B.9! That is some really compelling data! May I ask why the choice of 30% MLM? This is so interesting.
Biological sequences are super interesting to us! Feel free to reach out privately if you want to discuss more, my email's on my website :)
I would call it HuggingFace-esque... it has a similar interface to the HuggingFace BERT MLM models, but we haven't implemented all the HuggingFace interfaces (just the equivalent of BertForSequenceClassification for GLUE).
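Because the interface mirrors the HuggingFace MaskedLM models, a custom training loop can treat the model the same way. Here is a minimal sketch, assuming the forward pass accepts input_ids / attention_mask / labels and returns an object with a .loss attribute; that convention, and the optimizer settings, are assumptions here rather than the repo's documented API.

```python
import torch

def pretrain(model, dataloader, num_steps=1000, lr=8e-4, device="cuda"):
    """Minimal MLM pretraining loop, assuming a HuggingFace-style forward pass."""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    step = 0
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)   # expects input_ids, attention_mask, labels (assumed)
        loss = outputs.loss        # assumed HF-style output object with a .loss field
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step >= num_steps:
            break
    return model
```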
For CPU, I've had some success using these Docker images and point-installing individual packages (basically try running something, and then if it complains install that individual package).
We found that this just makes it learn a bit faster in terms of the number of steps you need (Mosaic found something similar).
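If you want to try the higher masking rate in your own loop, the standard HuggingFace collator exposes it directly. A small sketch below; the tokenizer checkpoint is just an example, not the one used in the repo.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,  # mask 30% of tokens, vs. BERT's usual 15%
)
# Pass `collator` as the collate_fn of your DataLoader to get masked input_ids + labels.
```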
Awesome, I will reach out separately to chat more :) Thanks again!