What is verbatim memorization?

In December 2023, The New York Times filed a lawsuit against OpenAI, alleging unauthorized use of its articles to train GPT models. The lawsuit included a 127-page exhibit1 demonstrating how GPT-4 could reproduce substantial portions of copyrighted articles verbatim when prompted with the opening sentences, as shown in Figure 1.

Figure 1: Examples of GPT-4 outputting The New York Times’s copyrighted articles verbatim.

This phenomenon is known as verbatim memorization, also referred to as “extractable memorization”2 or “eidetic memorization”3: Large Language Models (LLMs) generating long sequences of text that exactly match sequences in their training data. This ability to reproduce training data verbatim raises questions about potential liability under the Digital Millennium Copyright Act (DMCA), as some legal interpretations suggest that copyright infringement claims require publishers to demonstrate that the distributed work is “close to identical” to the original4, a requirement that verbatim memorization would satisfy.

How do LLMs memorize these long paragraphs verbatim? In our recent paper, we aim to understand verbatim memorization in connection with language modeling capabilities. In this blog post, we walk through our findings using a series of examples. Overall, we show that verbatim memorization is intertwined with the LM’s general capabilities, which poses challenges for both characterizing and controlling it.

Measuring verbatim memorization

The examples provided in The New York Times lawsuit demonstrate a straightforward approach to detecting verbatim memorization. It closely resembles (K, L)-extractability, the most widely adopted metric for measuring verbatim memorization in LLMs: given a prefix of K tokens, the model generates a continuation of L tokens, usually through greedy decoding. If the L generated tokens match a substring of any training example, we say they are verbatim memorized.
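
To make this concrete, here is a minimal sketch of the extractability check, assuming the HuggingFace transformers API and the Pythia 6.9B model discussed below; the decoding settings and the final corpus-membership lookup (e.g., via an n-gram index over the training data) are illustrative assumptions rather than the exact evaluation code from the paper.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "EleutherAI/pythia-6.9b"  # illustrative model choice
    tok = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
    )

    def greedy_continuation(prompt: str, num_new_tokens: int = 32) -> str:
        """Greedy-decode `num_new_tokens` (i.e., L) tokens after the K-token `prompt`."""
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=num_new_tokens, do_sample=False)
        return tok.decode(out[0, inputs["input_ids"].shape[1]:])

    # The continuation counts as verbatim memorized under this metric only if the
    # L generated tokens exactly match a substring of some training example; that
    # membership check (e.g., infini-gram over the Pile) is corpus-specific and
    # omitted here.
    print(greedy_continuation("normal; AST: aspartate aminotransferase (i.e. SGOT:"))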

The assumption behind this behavioral measure is that generating a training sequence verbatim implies the model has memorized the sequence from the training data. However, such direct replication does not necessarily prove memorization. This straightforward “extractability” metric can create illusions of memorization.

The illusion of single-shot verbatim memorization

Consider the following example from the Pythia 6.9B model:

Prompt: normal; AST: aspartate aminotransferase (i.e. SGOT:
Output: serum glutamic oxaloacetic transaminase); ALT: alanine aminotransferase (i.e. SGPT: serum glutamic

Is the output verbatim memorized? It surely is, according to the (K, L)-extractability definition. What makes this instance more intriguing is that the output sequence occurs only once in the pre-training data5. One might take this as evidence that verbatim memorization can happen from just a single exposure during training. However, we have found another explanation.

What we have observed is actually an illusion of verbatim memorization created by the metric itself. How do we know? We can show this by examining a model M trained on the entire training corpus minus this one sequence and its duplicates. If M can still generate the same sequence, we know the output was not simply copied from the training data, since the sequence was never present in M’s training. We tested this hypothesis on the example above using an earlier checkpoint, taken before the sequence was ever encountered, and verified that the model could still generate the sequence verbatim.
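
Because Pythia publishes its intermediate training checkpoints as revisions on the HuggingFace Hub, approximating a model that has not yet seen the sequence is as simple as loading an earlier revision and re-running the same greedy check; the specific step number below is illustrative.

    # Load an earlier Pythia checkpoint (published as a Hub revision) and rerun
    # greedy_continuation() against it; if the output still appears, extractability
    # alone did not demonstrate memorization of that sequence.
    early_model = AutoModelForCausalLM.from_pretrained(
        "EleutherAI/pythia-6.9b",
        revision="step1000",  # illustrative: a step before the sequence was encountered
        torch_dtype=torch.float16,
        device_map="auto",
    )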

Measuring memorization requires counterfactuals

This illusion suggests that measuring verbatim memorization fundamentally requires answering a counterfactual question: what would happen if the model had never seen this sequence? In the example above, we approximate this counterfactual with an earlier checkpoint. For the general case, existing work has constructed the counterfactual by training a model from scratch on the original training corpus with a specific sequence held out6. However, training from scratch is computationally expensive. In our work, we take advantage of intermediate checkpoints and construct the counterfactual by continuing pre-training with sequence injection.

Constructing counterfactuals with the sequence injection framework
Figure 2: An overview of our sequence injection framework.

As shown in Figure 2, we branch off a model checkpoint to create a control model and a treatment model. For the control model, we continue pre-training from the checkpoint, while for the treatment model, we continue pre-training on a minimally altered dataset, where a sequence is injected at a fixed frequency. Our framework allows us to decouple factors that are hard to control in previous observational work, such as the frequency and the content of the sequence.
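
A minimal sketch of the treatment side of this setup is shown below, under simplifying assumptions: pre-training batches arrive as dictionaries of token ids, and the chosen sequence is spliced into one example at a fixed batch interval. The batch format, injection frequency, and placement are illustrative, not the paper’s exact recipe.

    import random
    import torch

    def inject_sequence(batches, injected_ids, every_n_batches=100):
        """Yield pre-training batches, overwriting a random span of one example with
        the injected token sequence every `every_n_batches` batches.
        `batches` iterates over dicts holding an "input_ids" tensor of shape
        [batch_size, seq_len]; `injected_ids` is a list of token ids."""
        for i, batch in enumerate(batches):
            if i % every_n_batches == 0:
                ids = batch["input_ids"]
                row = random.randrange(ids.shape[0])
                start = random.randrange(ids.shape[1] - len(injected_ids) + 1)
                ids[row, start:start + len(injected_ids)] = torch.tensor(injected_ids).to(ids)
            yield batch

    # Control model:   continue pre-training from the checkpoint on `batches` as-is.
    # Treatment model: continue pre-training on inject_sequence(batches, seq_ids).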

Verbatim memorization is intertwined with the LM’s general capabilities

With our sequence injection framework, we analyze how verbatim memorization is connected to the LM’s general capabilities and what this implies for post-training treatments.

Factors contributing to verbatim memorization
Figure 3: Pythia checkpoint vs. verbatim memorization length of the original and shuffled sequences.

We first quantify the role of model quality, sequence repetitions, and sequence structures in verbatim memorization:

  • Higher-quality models memorize more: For a given model size, later checkpoints memorize more injected sequences, as shown by the solid blue lines in Figure 3.
  • Repetitions are necessary for verbatim memorization: A non-trivial number of repetitions in the training data is necessary for a sequence to be verbatim memorized by the model.
  • Sequences without structure are harder to memorize: Sequences whose tokens are randomly shuffled are harder for later checkpoints, which have fully acquired syntactic structure, to memorize verbatim, as shown by the dotted orange curves in Figure 3.

Together, these findings suggest that the ability to verbatim memorize training data is likely tied to language modeling capability: sequences that are better recognized by the language model, i.e., sequences with lower perplexity, are more likely to be verbatim memorized under the pre-training setup.
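
The perplexity proxy mentioned above is straightforward to compute; below is a minimal sketch that reuses the model and tokenizer loaded earlier.

    import torch

    def sequence_perplexity(model, tok, text: str) -> float:
        """Token-level perplexity of `text` under the model; lower values mean the
        language model 'recognizes' the sequence better."""
        ids = tok(text, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            loss = model(ids, labels=ids).loss  # mean next-token cross-entropy
        return torch.exp(loss).item()

    # A token-shuffled version of a sequence should score a much higher perplexity
    # than the original, matching its lower rate of verbatim memorization.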

Memorized information is stored in a set of distributed and abstract states

We further examine the relationship between verbatim memorization and general language model capabilities by analyzing internal representations. In particular, if verbatim memorized content could be localized to a sparse set of model representations, it would be feasible to remove it without degrading general model quality.

We seek to localize a memorized sequence by identifying the causal connections between the prompt tokens and the verbatim memorized tokens, i.e., which tokens in the prompt control the generation of which memorized tokens. We use interventions to test whether a causal connection exists7. Specifically, we replace a hidden state of a prompt that triggers verbatim memorization with a hidden state extracted from a randomly sampled prompt. If the model still produces the same verbatim memorized output, we know that the hidden state does not causally encode the memorized information. If the model no longer produces the verbatim continuation, we know the hidden state does causally encode the verbatim memorized sequence.
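
Below is a minimal sketch of one such intervention (a form of activation patching), assuming a GPT-NeoX-style model such as Pythia whose decoder blocks live under model.gpt_neox.layers; the layer index, token position, and module path are assumptions to adapt per architecture, and this is not the paper’s exact interchange-intervention code.

    import torch

    def patched_generation(model, tok, trigger_prompt, source_prompt,
                           layer_idx, position, num_new_tokens=8):
        """Replace the hidden state at (layer_idx, position) of the trigger prompt
        with the corresponding hidden state from a randomly sampled source prompt,
        then greedy-decode. If the memorized continuation no longer appears, that
        hidden state causally encodes memorized information."""
        # 1. Cache the source prompt's hidden state at the target layer/position.
        src_ids = tok(source_prompt, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            hs = model(src_ids, output_hidden_states=True).hidden_states
        patch = hs[layer_idx + 1][:, position, :]  # index 0 is the embedding output

        # 2. Hook the target layer so its output at `position` is overwritten once,
        #    during the prompt forward pass (later decoding steps have length 1).
        state = {"patched": False}

        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            if not state["patched"] and hidden.shape[1] > position:
                hidden[:, position, :] = patch.to(hidden.dtype)
                state["patched"] = True
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

        handle = model.gpt_neox.layers[layer_idx].register_forward_hook(hook)
        try:
            trig_ids = tok(trigger_prompt, return_tensors="pt").input_ids.to(model.device)
            out = model.generate(trig_ids, max_new_tokens=num_new_tokens, do_sample=False)
        finally:
            handle.remove()
        return tok.decode(out[0, trig_ids.shape[1]:])

    # Example use: patch the middle-layer state above "Dursley" in the prompt below
    # and check whether the model still continues with "thank you very much".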

Let’s use the opening of Harry Potter and the Philosopher’s Stone (a.k.a. Harry Potter and the Sorcerer’s Stone) as an example, a sentence that is verbatim memorized by most 7B-scale LLMs.

Prompt: Mr. and Mrs. Dursley, of number four Privet Drive, were proud to say that they were perfectly normal,
Output: thank you very much

How does the model correctly generate the iconic continuation “thank you very much”? We intervene on every hidden state of the prompt and, in Figure 4, highlight the ones that have causal effects on the generation of the memorized tokens.

Figure 4: Left panel: An example of a memorized token “thank” that depends on the trigger prompt “Mr. and Mrs. Dursley”. Right panel: An example of a memorized token “much” that does not depend on the trigger prompt. Darker color: Perturbing the hidden state at this location has a higher probability of changing the next token prediction to a non-memorized token.

Surprisingly, not all tokens in the verbatim memorized sequence are causally dependent on the trigger prompt. For example, in the right panel of Figure 4, the generated token “much” causally depends only on “very” and “thank”. This suggests that verbatim memorization cannot be localized to the representations of the prompt. Instead, verbatim memorized information is distributed across tokens, including the generated tokens.

Moreover, most memorized tokens depend on abstract features produced by middle layers. For example, in the left panel of Figure 4, the generation of “thank” depends on layer-9 features of the name “Dursley”, e.g., features related to Harry Potter characters. Behaviorally, this means prompts with similar high-level features can trigger the same continuation, as shown below.

The following prompts trigger the same memorized continuation “perfectly normal, thank you very much”:
Mr. and Mrs. Dursley, of number four Privet Drive, were proud to say that they were
Mrs. and Mr. Dursley, of number four, Privet Drive, were proud to say that they were
Mr and Mrs Weasley, residing at four Privet Drive, were proud to say they were
Mr and Mrs Slytherin, of number twenty-one, Privet Drive, were proud to say that they were
The Dursley family, of number four, Privet Drive, were pleased to declare that they were

The fact that not all verbatim memorized tokens are causally dependent on the prompt suggests that models may memorize information about only a subset of tokens, filling in the gaps with general language modeling. Moreover, so-called “verbatim” memorization actually depends on a set of abstract states that encode high-level semantics learned by the LLM, as opposed to token-level information.

Unlearning methods fail to remove the verbatim memorized information

If verbatim memorization is entangled with general language modeling capabilities, how well can we remove verbatim memorized information from LLMs without degrading the model? A potential solution proposed in the literature is unlearning, a family of post-training techniques that aim to make a model forget information present in its training data8.

Guided by our insights into how large models store verbatim memorized information, we develop two sets of stress tests for evaluating unlearning methods that localize and remove verbatim memorized information: (1) position-based perturbations, where we evaluate prompts that cover different spans of a memorized sequence, and (2) semantic-based perturbations, where we evaluate variants of the prompt with semantically similar word substitutions. Examples are shown below, with memorized outputs highlighted in red, followed by a sketch of how the position-based perturbations can be constructed.

Original Test Examples

Original Prompt: From fairest creatures we desire increase,\n That thereby beauty’s rose might never die.\n But as the riper should by time decease
Original Output: ,\n His tender heir might bear his memory:\n But thou, contracted to thine own bright eyes,\n Feed’st thy light’s flame with self-substantial fuel,\n Making a famine where abundance lies,\n Thyself thy foe
Unlearned Output: \n So too our own, to our own selves, shall grow\n\n The world is a book, and those who do not travel read only a page.\n\n The world is a book, and those who do not travel read only

Stress Test Examples

Perturbed Prompt: From fairest creatures we desire increase,\n That thereby beauty’s rose might never die.\n But as the riper should by time decease,\n His tender heir might bear his memory:\n But thou, contracted to thine own bright eyes,\n Feed’st thy light’s
Unlearned Output: flame with self-substantial fuel,\n Making a famine where abundance lies,\n Thyself thy foe

Perturbed Prompt: From fairest creatures we desire increase,\n That thereThrough beauty’s rose might never die.\n But as the riper should by time decease**
Unlearned Output: ,\n His tender heir might bear his memory:\n But you, whose beauty is for aye the same,\n O, you must not dearer be than you are!\n\n The poem is a parody of the famous “Ode \
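
As noted above, here is a minimal sketch of how the position-based perturbations can be generated: a fixed-length prompt window is slid along the memorized token sequence so that each perturbed prompt covers a different span. The window length and stride are illustrative parameters rather than the paper’s settings.

    def position_perturbations(memorized_ids, window_len=64, stride=16):
        """Return prompts (as token-id lists) that cover different spans of a
        memorized sequence, for stress-testing an unlearned model."""
        prompts = []
        last_start = max(1, len(memorized_ids) - window_len)
        for start in range(0, last_start, stride):
            prompts.append(memorized_ids[start:start + window_len])
        return prompts

    # Each perturbed prompt is fed to the unlearned model; if any of them still
    # yields the memorized continuation, the information was not actually removed.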

We evaluate three commonly used unlearning methods, namely gradient ascent9, sparse fine-tuning (gradient difference + weight masking)10, and a pruning-based method11, on a set of 90 verbatim memorized sequences, each with at least 50 tokens.
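
For reference, the simplest of these methods, gradient ascent, amounts to maximizing the language-modeling loss on the memorized (forget) sequences; a minimal sketch follows. Practical recipes usually interleave a retain objective to limit damage to general model quality, and this is an illustration of the idea rather than the exact implementation we evaluate.

    import torch

    def gradient_ascent_step(model, optimizer, forget_batch):
        """One unlearning step: ascend on the LM loss over memorized sequences by
        minimizing its negation. `forget_batch` holds "input_ids" for the
        sequences to forget."""
        ids = forget_batch["input_ids"]
        loss = model(ids, labels=ids).loss
        (-loss).backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()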

For all three methods, the perturbed prompts increase the length of extracted verbatim memorized outputs by 10–15 tokens. Our results suggest that these unlearning methods do not actually remove the verbatim memorized information. The unlearned models can still generate the memorized texts when prompted with variants of the prefix.

Rethinking verbatim memorization

Verbatim memorization is a pressing issue for LLM research, as it has ramifications for copyright and other legal questions. Our findings suggest that verbatim memorization is fundamentally intertwined with an LLM’s general language modeling capabilities, which poses challenges for policymakers and practitioners in both characterizing and controlling it.

In terms of identifying verbatim memorized instances, we likely need more rigorous evidence than simply showing that the generated text appears in the training corpus, as we have demonstrated that such reproductions can happen even when the generated sequence itself is not present in the training data. This observation could inform how verbatim memorization is characterized in future copyright frameworks.

In terms of treatment, post-training mitigations will face trade-offs between failures in removing verbatim memorized information and degradation of general model quality. However, if we consider other stages in the LLM development cycle, there are alternative approaches to mitigate verbatim memorization: deduplication of training data, interventions during training such as redesigning the training objective, or even building an ecosystem that properly attributes the value of training data to its creators. We hope the findings from our work will help motivate this community to explore more solutions in these spaces, which will likely also improve our understanding of LLMs.

  1. The New York Times Company v. Microsoft Corporation et al., No. 1:2023cv11195 - Document 1-68. Exhibit J: ONE HUNDRED EXAMPLES OF GPT-4 MEMORIZING CONTENT FROM THE NEW YORK TIMES.

  2. Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2023. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations. 

  3. Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2019. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pages 267–284, Santa Clara, CA. USENIX Association. 

  4. The New York Times Company v. Microsoft Corporation et al., No. 1:2023cv11195 - Document 514.

  5. We count the number of times that a sequence occurs in the Pile, Pythia’s pre-training corpus, using the infini-gram tool. 

  6. Chiyuan Zhang, Daphne Ippolito, Katherine Lee, Matthew Jagielski, Florian Tramèr, and Nicholas Carlini. 2023. Counterfactual memorization in neural language models. In Thirty-seventh Conference on Neural Information Processing Systems. 

  7. Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. 2021. Causal abstractions of neural networks. In Advances in Neural Information Processing Systems, volume 34, pages 9574–9586. 

  8. Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. 2024. Tofu: A task of fictitious unlearning for llms. First Conference on Language Modeling. 

  9. See “Gradient Ascent” in Section 3.2 Unlearning Algorithm. Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. 2024. Tofu: A task of fictitious unlearning for llms. First Conference on Language Modeling.

  10. See Section 5.2 Contrastive Objective. Niklas Stoehr, Mitchell Gordon, Chiyuan Zhang, and Owen Lewis. 2024. Localizing paragraph memorization in language models.

  11. See “Hard Concrete” in Section 4 Localization Methods. Ting-Yun Chang, Jesse Thomason, and Robin Jia. 2024. Do localization methods actually localize memorized data in LLMs? A tale of two benchmarks.