Notes on "Mistral 7B’s secrets" talk by Lample

PUBLISHED ON DEC 17, 2023 — DEEP-LEARNING, LLM, NLU

Note: The talk turned out to actually be about LLaMA

Link: https://www.youtube.com/watch?v=cWUNlw-5TPs

Introduction

  • Basically impossible to do anything without LLMs in NLP these days.
  • They were trying to solve mathematical problems (GSM8k?)
  • Codex was available at the time, but the API was a limiting factor. (Also, like any other competent ML team, they had ideas about what could be improved.)

Deciding on the Size of the LLM

  • Chinchilla was out
  • Chinchilla: 70B parameters on 1.4T tokens, per its scaling laws.
  • Decide based on compute budget (see the rough FLOPs sketch after this list).
  • Trained roughly 5x longer than GPT-3 (which used 300B tokens).
  • Faster inference and more stable training are obvious advantages of small LMs.
  • LLaMA was born
    • 7B (1T tokens)
    • 13B (1T tokens)
    • 33B (1.4T tokens)
    • 65B (1.4T tokens)
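
A back-of-the-envelope sketch of the "decide based on compute budget" step, using the common C ≈ 6·N·D approximation for training FLOPs (N = parameters, D = tokens). The approximation and the code are my own illustration, not something stated in the talk; the (N, D) pairs are just the numbers from these notes.

```python
# Rough training-compute estimate with the common C ~ 6 * N * D approximation
# (N = parameter count, D = training tokens). Illustrative only.

def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

configs = [
    ("GPT-3 175B", 175e9, 300e9),
    ("Chinchilla 70B", 70e9, 1.4e12),
    ("LLaMA 7B", 7e9, 1.0e12),
    ("LLaMA 65B", 65e9, 1.4e12),
]

for name, n, d in configs:
    print(f"{name:>14}: ~{train_flops(n, d):.2e} training FLOPs")
```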

Tokenisation

  • basic well known things
  • References the original paper on subword tokenisation: "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2015); a minimal BPE sketch follows below.
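
A minimal sketch of the byte-pair-encoding merge loop from the Sennrich et al. paper, just to make the "subword units" idea concrete. LLaMA's actual tokenizer is a BPE model trained on far more data, so the toy corpus here is purely illustrative.

```python
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, "</w>" marks the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(10):  # learn 10 merges
    pairs = pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(f"merge {step}: {best}")
```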

Finding so many tokens

  • Wikipedia - 6B tokens in English, 3B in French; the top 20 languages give ~27B total (they needed ~50x more than that)
  • arXiv - 35B tokens (~2M publications)
  • StackExchange - 26B tokens
  • GitHub - 500B to 1T tokens
  • CommonCrawl - 240B pages (16 years of crawls)
    • very noisy
    • each webpage comes as raw HTML plus a pre-extracted text version (the text extraction is not well done)
    • so they reprocessed it from the raw HTML
    • deduplication
    • english only
    • line repetition
    • noise removal
    • hateful content
    • quality filter (drop pages with no interesting content) - a classifier; see the toy filtering sketch after this list
    • C4 by Google
    • CC by Meta
  • Books3
  • Gutenberg
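
A toy illustration of the kind of CommonCrawl cleaning listed above (exact deduplication and dropping boilerplate-heavy pages). The function and its thresholds are hypothetical; the real pipelines (language ID, fuzzy dedup, hateful-content removal, the quality classifier) are far more involved.

```python
import hashlib

def dedup_and_filter(docs, min_words=50, max_repeated_line_frac=0.3):
    """Hypothetical mini-pipeline: drop exact duplicates and pages dominated by
    repeated lines (menus, footers). Real CommonCrawl processing does much more."""
    seen_hashes = set()
    kept = []
    for text in docs:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of a page we already kept
        seen_hashes.add(digest)

        lines = [l.strip() for l in text.splitlines() if l.strip()]
        if not lines or len(text.split()) < min_words:
            continue  # too short to be interesting content
        repeated_frac = 1 - len(set(lines)) / len(lines)
        if repeated_frac > max_repeated_line_frac:
            continue  # mostly repeated lines -> likely boilerplate
        kept.append(text)
    return kept
```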

Architecture

  • Rotary embeddings
  • SwiGLU instead of ReLU in the feed-forward blocks (sketch after this list)
  • Initialise like PaLM?
  • Roughly the Chinchilla paper’s architecture, but Chinchilla had extra tokens so it could go to 70B; these guys could do 65B
  • LLaMA 7B - 500 GPUs for 1 week
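
A small PyTorch sketch of the SwiGLU feed-forward block mentioned above (a silu-gated MLP used in place of a plain ReLU MLP). The dimensions and the ~8/3 hidden-size ratio are illustrative, not the exact LLaMA configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with a SwiGLU gate instead of a plain ReLU MLP."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)  # gate branch
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)    # value branch
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)  # project back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x W_gate) * (x W_up), then project down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 512)                        # (batch, seq, dim) -- toy sizes
ffn = SwiGLUFeedForward(dim=512, hidden_dim=1376)  # hidden ~ 8/3 * dim
print(ffn(x).shape)                                # torch.Size([2, 16, 512])
```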

Training

  • Hardware failures
  • Some failures silently corrupt matrix multiplication outputs
  • Loss spikes occur: the loss starts to diverge and then blows up very quickly
  • Manually re-implemented gradient computations, because at their scale data locality makes a big difference (the model is sharded across multiple GPUs, and the forward pass needs to gather a bunch of data onto one GPU)
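
To make "manually re-implement gradient computations" concrete, here is a toy hand-written backward pass for a linear layer via torch.autograd.Function. The real motivation in the talk is controlling data locality in a sharded model, which this sketch does not attempt to show.

```python
import torch

class ManualLinear(torch.autograd.Function):
    """Toy hand-written forward/backward for y = x @ W^T. The point is only that
    you can bypass autograd and control the computation yourself; the sharding
    and data-locality logic from the talk is not modelled here."""

    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        grad_x = grad_out @ weight  # dL/dx : (B, out) @ (out, in) -> (B, in)
        grad_w = grad_out.t() @ x   # dL/dW : (out, B) @ (B, in)  -> (out, in)
        return grad_x, grad_w

x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(16, 8, requires_grad=True)
ManualLinear.apply(x, w).sum().backward()
print(x.grad.shape, w.grad.shape)  # torch.Size([4, 8]) torch.Size([16, 8])
```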

Evaluation

  • Up until BERT, you could fine-tune for the standard benchmarks.
  • Now it’s zero-shot and few-shot tasks.
  • MMLU
  • Noise of even 1-5% is a lot because that’s the difference between 7B and 13B LLaMA.
  • People don’t believe your results because evaluation implementations differ.
  • For stability, use few-shot evaluation.
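
A sketch of how a few-shot multiple-choice prompt (MMLU-style) might be assembled. The formatting choices here are hypothetical, and small differences exactly like these are why independent evaluation implementations can disagree by a few percent.

```python
def format_question(question: str, choices: list[str]) -> str:
    """Render one multiple-choice question with lettered options."""
    lines = [question]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines) + "\nAnswer:"

def build_few_shot_prompt(demos: list[tuple[str, list[str], str]],
                          question: str, choices: list[str]) -> str:
    """Concatenate k solved demonstrations followed by the unsolved question.
    The separators, letters and 'Answer:' marker are arbitrary conventions."""
    parts = [format_question(q, c) + f" {answer}" for q, c, answer in demos]
    parts.append(format_question(question, choices))
    return "\n\n".join(parts)

demos = [("2 + 2 = ?", ["3", "4", "5", "6"], "B")]
print(build_few_shot_prompt(demos, "The capital of France is?",
                            ["Berlin", "Madrid", "Paris", "Rome"]))
```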

Their takeaways

  • Train much more than what’s suggested in Chinchilla
  • “Chinchilla trap”
  • Overlooked inference cost in Chinchilla
  • SOTA models only need publicly available data
  • Conceptually simple, but the engineering is challenging

LLaMA2

  • Now with 2T tokens.
  • At 1.4T tokens the learning capacity was not yet saturated

Training small and efficient models for businesses (7B)

  • 8k context
  • The 7B is better than the 13B LLaMA2 and the 30B LLaMA1 on certain benchmarks
  • sliding-window attention kernels
  • 128k-token theoretical maximum attention span (the window compounds across layers)
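
A sketch of the causal sliding-window attention mask implied by the bullets above. Real inference uses fused kernels rather than a dense boolean mask, and the specific numbers in the final comment (window 4096, 32 layers) come from the Mistral 7B paper rather than these notes.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where query i may attend to key j iff 0 <= i - j < window.
    Illustrative only: real implementations use fused kernels, not dense masks."""
    idx = torch.arange(seq_len)
    rel = idx[:, None] - idx[None, :]   # distance from query i back to key j
    return (rel >= 0) & (rel < window)  # causal AND within the sliding window

print(sliding_window_mask(seq_len=8, window=3).int())

# Information still propagates beyond the window: stacking L layers lets a token
# indirectly see roughly window * L positions back, e.g. 4096 * 32 = 131072 ≈ 128k.
```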