On August 24th, META released Code Llama, an AI model built on top of Llama 2 for generating and discussing code. This model is available under the same community license as Llama 2, making it free for both commercial and research use.

While many tech giants have jealously guarded the “secret sauce” behind their Large Language Models (LLMs), Meta took a different approach. They chose not only to share the code behind their groundbreaking Llama 2 model but also followed up with the release of Code Llama. Mark Zuckerberg, CEO of Meta, emphasized the importance of open sourcing in a Facebook post, stating, “Open source drives innovation because it enables many more developers to build with new technology. It also improves safety and security because when software is open, more people can scrutinize it to identify and fix potential issues.”

Conversely, there’s a growing concern among technologists about the potential risks of highly intelligent AI. There are fears of hypothetical doomsday scenarios where an AI could surpass human intelligence, leading to catastrophic outcomes such as the release of a biological super weapon or other unimaginable chaos. Addressing this, OpenAI’s co-founder, Ilya Sutskever, commented to The Verge earlier this year, stating that his company had been “flat-out wrong” in their previous open approach. He warned of the dangers of sharing details about highly advanced models, particularly if we ever reach the stage of Artificial General Intelligence (AGI).

Now, let’s dive into the secret sauce…

Overview of the training pipeline:

Training pipeline for Code Llama (Source)

The initial set-up and base training — Step 1

Starting Point:

The approach follows Codex (Chen et al., 2021): start from a foundation model (Llama 2) and continue training it on code.

Dataset:

Overview of initial training data for Code Llama (source)

In the initial phase, Code Llama is predominantly trained on a near-deduplicated dataset of publicly available code comprising 500 billion tokens. The dataset also includes 8% of samples from natural language datasets related to code, as well as 7% general natural language data; the latter is included so that the model retains its natural language understanding capabilities.
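To make the mixture concrete, here is a minimal Python sketch of sampling training examples from data sources in roughly these proportions (about 85% code, 8% code-related natural language, 7% general natural language). The source names and weights are illustrative placeholders, not Meta's actual data-loading code.

import random

# Approximate proportions from the mix described above; names are placeholders.
mixture = [
    ("code", 0.85),
    ("natural_language_related_to_code", 0.08),
    ("general_natural_language", 0.07),
]

def sample_source(rng):
    """Pick a data source with probability proportional to its mixture weight."""
    names, weights = zip(*mixture)
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name, _ in mixture}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 8,500 / 800 / 700 samples respectively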

Tokenizer:

The data is tokenized using Byte Pair Encoding (BPE, Sennrich et al., 2016), as in Llama and Llama 2.

Quick overview of BPE (Intuition):

Concepts related to BPE:

  • Vocabulary: A set of subword units that can be used to represent a text corpus.
  • Byte: A unit of digital information that typically consists of eight bits.
  • Character: A symbol that represents a written or printed letter or numeral.
  • Frequency: The number of times a byte or character occurs in a text corpus.
  • Merge: The process of combining two consecutive bytes or characters to create a new subword unit.

Suppose we have a string

aaabdaaabac

The byte pair ‘aa’ occurs most frequently, so it is replaced by a byte that does not appear in the data, such as ‘Z’:

ZabdZabac
Z = aa

The data can be compressed further, for example by replacing ‘ab’ with ‘Y’:

ZYdZYac
Z = aa
Y = ab
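To make the intuition concrete, here is a minimal Python sketch of the merge step on the toy string above. Real BPE tokenizers (such as the one used for Llama) learn thousands of merges from a large corpus and operate at the byte level, so this is purely illustrative.

from collections import Counter

def pair_frequencies(tokens):
    """Count how often each adjacent pair occurs in the token sequence."""
    return Counter(zip(tokens, tokens[1:]))

def merge_pair(tokens, pair, new_symbol):
    """Replace every non-overlapping occurrence of `pair` with `new_symbol`."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_symbol)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("aaabdaaabac")
print(pair_frequencies(tokens).most_common(2))  # [(('a', 'a'), 4), (('a', 'b'), 2)]

# Apply the two merges from the example above.
tokens = merge_pair(tokens, ("a", "a"), "Z")    # -> ZabdZabac
tokens = merge_pair(tokens, ("a", "b"), "Y")    # -> ZYdZYac
print("".join(tokens))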

Hyperparameters:

  • Optimizer Function:
    AdamW (Loshchilov & Hutter, 2019)
  • β Values:
    β1: 0.9
    β2: 0.95
  • Batch Size:
    4 million tokens
    Presented as sequences of 4,096 tokens
  • Learning Rate:

A. Initial Base Training:
Cosine schedule with 1000 warm-up steps
The final learning rate is set to be 1/30th of the peak learning rate

B. Peak learning rates for fine-tuning (kept the same as the corresponding Llama 2 base models):
13 billion parameter model: 3e−4
34 billion parameter model: 1.5e−4
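As a rough sketch (not Meta's actual training code), the optimizer and schedule described above could be wired up in PyTorch as follows. The tiny stand-in model, the step counts, and the choice of the 3e−4 peak learning rate are illustrative placeholders.

import math
import torch

model = torch.nn.Linear(512, 512)           # stand-in for the actual transformer
peak_lr = 3e-4                               # peak LR listed above for the 13B model
total_steps, warmup_steps = 10_000, 1_000    # illustrative step counts

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, betas=(0.9, 0.95))

def lr_lambda(step):
    """Linear warm-up for 1,000 steps, then cosine decay to 1/30th of the peak LR."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_ratio = 1 / 30
    return min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)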

Addressing Infill Challenges in Language Models:

An example of Infilling

Traditionally, language models are trained to predict the next token in a sequence, continuing text from left to right. This raises an intriguing question: how can a model fill in a missing span when it is given both the text that comes before it and the text that comes after?

Enter Code Llama. Within its framework, the solution is framed as Code Infilling — a strategic task aimed at forecasting the absent segments of a program when provided with its contextual surroundings. This isn’t just a theoretical concept; it finds real-world applications such as assisting in code completion right where the cursor blinks in code IDEs, aiding in type inference, and even facilitating the generation of in-code documentation.

Diving deeper into the training mechanics, documents are split at the character level into three parts: a prefix, a middle part, and a suffix. The splitting points are not arbitrary; they are sampled independently from a uniform distribution over the length of the document. To add diversity, each example is then laid out in one of two formats: the prefix-suffix-middle (PSM) format and its counterpart, the suffix-prefix-middle (SPM) format.
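Here is a minimal sketch of how one such infilling example might be built. The <PRE>, <SUF>, and <MID> markers and their exact placement are illustrative placeholders; the paper referenced below describes the precise format and special tokens.

import random

def make_infilling_example(document, spm=False):
    """Split a document at two uniformly sampled character positions into
    prefix / middle / suffix, then lay it out for infilling training.
    The sentinel markers here are illustrative placeholders."""
    i, j = sorted(random.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    if spm:  # suffix-prefix-middle (SPM) ordering
        return f"<SUF>{suffix}<PRE>{prefix}<MID>{middle}"
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"  # prefix-suffix-middle (PSM)

doc = "def add(a, b):\n    return a + b\n"
print(make_infilling_example(doc, spm=random.random() < 0.5))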

Craving a deeper dive? I’d recommend delving into this awesome paper for a comprehensive understanding.

Python code finetuning — Step 2

Pipeline progress

The Code Llama — Python model goes through an additional fine-tuning stage on Python-heavy data, using the training mix shown below.

Long Context-finetuning — Step 3

The Imperative of Long Context Fine-tuning in Code Llama:

Long context fine-tuning isn’t just a fancy term; it’s a quintessential component for any advanced language model like Code Llama. But what prompts its necessity?

  1. Extrapolation: Essentially, it’s about predicting sequence lengths that extend beyond what’s seen during training. It’s akin to preparing a student not just for the questions in the textbook but also for unseen questions that might appear in an examination. To delve deeper into this facet, I’d recommend this insightful paper from the University of Toronto titled — “Exploring Length Generalization in Large Language Models”.
  2. Complexity Concerns with Attention: Quadratic complexity in attention mechanisms is not just a theoretical challenge; it practically biases the model towards training on short-to-medium length inputs. If you have an appetite for rigorous mathematical proofs that underscore this point, there’s an authoritative paper from NYU that you might find intriguing: “On the Computational Complexity of Self-Attention”.

Code Llama uses a dedicated long context fine-tuning stage in which models are presented with sequences of 16,384 tokens, up from the 4,096 tokens used for Llama 2 and the initial code-training stages. The approach is similar to fine-tuning with positional interpolation.

Positional Interpolation — Intuition :

At its core, the transformer architecture lacks a notion of the position of tokens in a sequence. This means that if the order of words in a sentence is changed, the transformer would interpret the two sentences as identical. To address this, we introduce “positional encodings” to give the model information about the position of each word.

Now, “positional interpolation” builds on this idea. Instead of asking the model to extrapolate to position indices it has never seen, the position indices of a long sequence are rescaled (interpolated) so that they fall within the range of positions the model was trained on, and the model is then briefly fine-tuned to adapt to this denser spacing.

For example, if a model was trained on sequences of 4,096 tokens and is now fed 16,384 tokens, every position index can be scaled down by a factor of four so that even the last token still lands inside the familiar position range. Code Llama follows a related strategy: rather than interpolating positions directly, it adjusts the rotation frequencies of its rotary position embeddings before the long-context fine-tuning stage.

In essence, positional interpolation helps the model adapt to sequence lengths well beyond those seen during its initial training. This is especially useful when fine-tuning models on longer contexts or when trying to generalize across different sequence lengths.

This method enhances the flexibility of the model, making it more adept at handling varied text structures, which is especially important for applications like Code Llama where understanding the position and structure of code sequences is crucial.
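As a toy illustration of the core idea (not Code Llama's actual implementation, which adjusts its rotary position embeddings), positional interpolation can be pictured as rescaling the position indices of a long sequence so they land inside the range seen during training.

def interpolated_positions(seq_len, trained_len):
    """Rescale position indices so a sequence longer than the training
    context still maps into the position range seen during training."""
    scale = min(1.0, trained_len / seq_len)
    return [i * scale for i in range(seq_len)]

# A 16,384-token sequence is squeezed into the 0..4,096 range used in training.
positions = interpolated_positions(seq_len=16_384, trained_len=4_096)
print(positions[:4], positions[-1])  # [0.0, 0.25, 0.5, 0.75] 4095.75

These fractional positions would then feed into the model's positional encoding (for rotary embeddings, they determine the rotation angles), and a short fine-tuning run lets the model adjust to the denser spacing.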

Instruction finetuning — Step 4

Instruction Finetuning step

The model is trained on different datasets:

  1. Proprietary dataset: Meta uses the instruction tuning dataset collected for Llama 2 and described in detail by Touvron et al. (2023b), specifically the version referred to in their paper as “RLHF V5”, collected through several stages of reinforcement learning from human feedback and human feedback annotation (see their Section 3 for more details). It combines thousands of Supervised Fine-Tuning examples and millions of Rejection Sampling examples. Each example consists of a multi-turn dialogue between a user and an assistant. For Rejection Sampling, the output was selected among several generations using a reward model. The final dataset contains both Helpfulness and Safety data, which enables Code Llama to inherit Llama 2's instruction-following and safety properties.

  2. Self-instruct: The proprietary dataset contains few examples of code-related tasks, and collecting supervised data from human annotators or training from human feedback (Ouyang et al., 2022) is expensive for coding tasks because it requires input from professional developers. Instead of human feedback, Meta uses execution feedback to select the data used to train the instruct model. The self-instruct dataset is built with the following recipe, resulting in roughly 14,000 question-tests-solution triplets (a minimal sketch of the execution-feedback selection step appears after this list):

    (a) Generate 62,000 interview-style programming questions by prompting Llama 2 70B (Figure 9 of the paper).
    (b) De-duplicate the set of questions by removing exact duplicates, leaving roughly 52,000 questions.
    (c) For each remaining question:
        • Generate unit tests by prompting Code Llama 7B (Figure 10).
        • Generate ten Python solutions by prompting Code Llama 7B (Figure 11).
        • Run the unit tests on the ten solutions and add the first solution that passes (along with its corresponding question and tests) to the self-instruct dataset.

    Code Llama 7B is used to generate the tests and Python solutions because, for the same compute budget, this proved more efficient than generating fewer solutions per question with the 34B model.

  3. Rehearsal: To prevent the model from regressing on general coding and language understanding capabilities, Code Llama — Instruct is also trained with a small proportion of data from the code dataset (6%) and the natural language dataset (2%).
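Here is a minimal sketch of the execution-feedback selection step from the self-instruct recipe above. The helper is hypothetical and, unlike a real pipeline, runs the generated code directly rather than in a sandbox.

def first_passing_solution(solutions, unit_tests):
    """Return the first candidate solution that passes the generated unit tests,
    or None if none pass. Illustrative only: real pipelines should sandbox
    untrusted generated code before executing it."""
    for candidate in solutions:
        scope = {}
        try:
            exec(candidate, scope)    # define the candidate function(s)
            exec(unit_tests, scope)   # assert-style tests raise on failure
        except Exception:
            continue
        return candidate
    return None

# Toy example with two generated candidates, one buggy and one correct.
solutions = [
    "def add(a, b):\n    return a - b\n",  # buggy candidate
    "def add(a, b):\n    return a + b\n",  # correct candidate
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(first_passing_solution(solutions, tests))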

The training pipeline is complete, and we are left with:

1. Code Llama (7, 13, 34 billion parameters)

2. Code Llama — Instruct (7, 13, 34 billion parameters)

3. Code Llama — Python (7, 13, 34 billion parameters)

Bonus: What is META hiding in the paper?

Unnatural model: Code Llama — Python 34B fine-tuned on 15,000 unnatural instructions, similar to Honovich et al. (2023), using the same prompts as for the self-instruct dataset. This model is not released, but it shows clear improvements on HumanEval and MBPP, which is indicative of the gains that can be reached with a small set of high-quality coding data.

This is the only mention of the model in the paper.

Results:

The Significance of Specialization: It’s evident that honing the focus yields dividends. Progressing from Llama2 to Code Llama and further refining to Code Llama Python, we observe a marked enhancement in code generation capabilities.

The Emergence of the Unnatural: Enter the Unnatural Code Llama, a model fine-tuned on the “Unnatural Instructions” dataset starting from the base of Code Llama Python. Although this new entrant outshines its predecessors from the Llama lineage, it hasn’t quite caught up with the prowess of GPT-4.

The Power in Scaling Specialized Models: As the specialization deepens and the model size (measured in parameters) balloons, there’s a notable upswing in performance. The larger, the better.

Multilingual Mastery: Code Llama flexes its muscles and outpaces Llama2 across a spectrum of languages: Python, Java, C++, C#, TS, and PHP. An interesting observation: Code Llama — Python falls a tad short when pitted against the generic Code Llama.

Embracing the “Fill in the Middle” Objective: While the introduction of this objective leads to a minor dip in performance, it ushers in a plethora of new use cases, making the trade-off worthwhile.

Probing into Long Context Evaluations: The META team has put Code Llama under the microscope with a couple of in-depth tests:

  1. Perplexity During Extrapolation: From the data provided, there's a consistent drop in perplexity up to the 16K-token mark and beyond, signaling the model's adeptness at long-context extrapolation. Beyond the 100K-token threshold, however, a surge in perplexity is observed.
  2. Key Retrieval Analysis: In this experiment, the model is handed a voluminous chunk of synthetically generated Python code, embedded with a specific function. The model’s task? To accurately discern and declare the value returned by this function.
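To make that setup concrete, here is a toy version of how such a key-retrieval prompt could be assembled. The filler functions, the get_key function, and the closing question are synthetic placeholders rather than the exact code or prompt used by the Meta team.

import random

def build_key_retrieval_prompt(key, n_filler=200):
    """Bury a single key-returning function inside a long block of synthetic
    Python code, then ask for the value that function returns."""
    filler = [
        f"def helper_{i}(x):\n    return x * {random.randint(2, 9)}\n"
        for i in range(n_filler)
    ]
    needle = f"def get_key():\n    return {key}\n"
    filler.insert(random.randrange(len(filler) + 1), needle)
    return "\n".join(filler) + "\n# Question: what value does get_key() return?\n"

prompt = build_key_retrieval_prompt(key=42)
print(len(prompt), "characters of synthetic code in the prompt")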