arxiv:2407.21783

The Llama 3 Herd of Models

Published on Jul 31, 2024
· Submitted by akhaliq on Aug 1, 2024
#1 Paper of the day

Abstract

Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.

Community

Paper submitter

533 authors ...

What a shame(!), they wrote the name Santosh Janardhan wrongly


533 authors is insane 💀


Hi @meta-llama we're reviewing the paper and would like to get a bit of clarification on a fragment about curriculum learning.
We're trying to really understand the numbers and proportions of these different stages.
Specifically:


We pre-train Llama 3 405B using AdamW with a peak learning rate of 8 × 10⁻⁵, a linear warm up of 8,000 steps, and a cosine learning rate schedule decaying to 8 × 10⁻⁷ over 1,200,000 steps. We use a lower batch size early in training to improve training stability, and increase it subsequently to improve efficiency. Specifically, we use an initial batch size of 4M tokens and sequences of length 4,096, and double these values to a batch size of 8M sequences of 8,192 tokens after pre-training 252M tokens. We double the batch size again to 16M after pre-training on 2.87T tokens. We found this training recipe to be very stable: we observed few loss spikes and did not require interventions to correct for model training divergence.
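
For reference, the learning rate schedule described in the quoted passage can be sketched as below. This is a minimal sketch, not the authors' code. It assumes the warm up starts from zero and that the cosine decay covers the steps between the end of warm up and step 1,200,000; the passage does not state either detail. The function name `learning_rate` is hypothetical.

```python
import math

# Values taken from the quoted passage.
PEAK_LR = 8e-5
FINAL_LR = 8e-7
WARMUP_STEPS = 8_000
TOTAL_STEPS = 1_200_000


def learning_rate(step: int) -> float:
    """Linear warm up followed by cosine decay (illustrative only)."""
    if step < WARMUP_STEPS:
        # Assumption: warm up ramps linearly from 0 to the peak value.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: decay runs from the end of warm up to TOTAL_STEPS.
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


if __name__ == "__main__":
    for step in (0, 4_000, 8_000, 600_000, 1_200_000):
        print(step, learning_rate(step))
```

Under these assumptions the rate reaches the peak at step 8,000 and ends at the final value at step 1,200,000.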

Does that mean

  1. Pretraining first stage
    252M tokens pretraining = 63 batches of 4M tokens (1000 samples x 4k sequence length) ~ 63 000 pretraining samples
  2. Pretraining second stage
    2.87T tokens pretraining = 360k batches of 8M tokens (1000 samples x 8k sequence length) ~ 360 000 000 main training samples
  3. Last pretraining stage
    ??? tokens pretraining = ??? batches of 16M tokens (2000 samples x 8k sequence length) ~ ???? how many samples did the model see in this stage?

Or should it be understood in some other way? It appears that either some information is missing or (more likely) I'm misunderstanding the statements. Two pretraining thresholds are mentioned (252M and 2.87T tokens), but three batch sizes. Also, the batch size is stated in tokens, whereas batch sizes are normally given as the number of samples per batch, so that's a bit confusing too.
By 'training tokens', do you mean tokens from unique samples, or did you train for more than one epoch?
Would really appreciate clarification, thank you! :)
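
The arithmetic in the question above can be checked with a short script. This is a rough sketch under one possible reading of the quoted passage: the 252M and 2.87T figures are treated as cumulative token counts at which the batch size changes, "M" is taken as 10^6, and the last stage is assumed to keep the 8,192-token sequence length. None of these readings is confirmed by the passage, and the names `STAGES` and `summarize` are hypothetical. The length of the last stage is left unknown because the passage does not give a total token count.

```python
M = 10**6  # Assumption: "M" means 10^6 tokens, not 2^20.

# (name, tokens per batch, sequence length, cumulative tokens at stage end)
# Assumption: thresholds are cumulative; None marks a value the passage omits.
STAGES = [
    ("stage 1", 4 * M, 4_096, 252 * M),
    ("stage 2", 8 * M, 8_192, 2_870_000 * M),  # 2.87T tokens
    ("stage 3", 16 * M, 8_192, None),          # sequence length assumed
]


def summarize(stages):
    """Return (name, sequences per batch, steps, sequences seen) per stage."""
    rows = []
    start = 0
    for name, batch_tokens, seq_len, end in stages:
        seqs_per_batch = batch_tokens / seq_len
        if end is None:
            # Stage length cannot be derived from the quoted passage.
            rows.append((name, seqs_per_batch, None, None))
            continue
        steps = (end - start) / batch_tokens
        rows.append((name, seqs_per_batch, steps, steps * seqs_per_batch))
        start = end
    return rows


if __name__ == "__main__":
    for row in summarize(STAGES):
        print(row)
```

Under this reading, the first stage is 63 optimizer steps of roughly 977 sequences each, and the second stage is about 358.7k steps, close to the 360k estimate above. The third stage would use roughly 1,953 sequences per batch.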
