Adversarial Tokenization

University of California, Los Angeles
*Indicates Equal Contribution
TL;DR: We show a previously unknown vulnerability of LLMs to tokenization attacks, whereby simply retokenizing an unsafe request elicits dangerous responses from state-of-the-art LLMs.

[Teaser figure: Canonical Tokenization (🚫 Blocked) vs. Adversarial Tokenization (🔓 Bypassed)]

Abstract

Current LLM pipelines account for only one possible tokenization of a given string, ignoring exponentially many alternative tokenizations during training and inference. For example, the standard Llama3 tokenization of "penguin" is "[p,enguin]", yet "[peng,uin]" is another perfectly valid alternative. In this paper, we show that despite LLMs being trained solely on one tokenization, they still retain semantic understanding of other tokenizations, raising questions about the implications for LLM safety. Put succinctly, we answer the following question: can we adversarially tokenize an obviously malicious string to evade safety and alignment restrictions?

We show that not only is adversarial tokenization an effective yet previously neglected axis of attack, but it is also competitive against existing state-of-the-art adversarial approaches without changing the text of the harmful request.

We empirically validate this exploit across three state-of-the-art LLMs and adversarial datasets, revealing a previously unknown vulnerability in subword models.

Introduction

Tokenization is a critical but often overlooked component of Large Language Model (LLM) pipelines. A canonical tokenization is the standard way a tokenizer segments a given string, while noncanonical tokenizations are alternative segmentations that the tokenizer could have chosen but didn't. For example, the word "penguin" is canonically tokenized as [p, enguin] by Llama3, but [peng, uin] is another valid (noncanonical) tokenization.
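
To make this concrete, here is a minimal sketch (assuming access to the Llama3 tokenizer via Hugging Face `transformers`; the model ID and the particular alternative split are illustrative) that prints the canonical tokenization of a string and checks that an alternative segmentation is also valid, i.e., every piece is in the vocabulary and the pieces decode back to the same string.

```python
# Minimal sketch: canonical vs. noncanonical tokenizations of the same string.
# Assumes access to a Llama3 tokenizer through Hugging Face transformers; any
# BPE tokenizer works, but the exact token splits below are illustrative.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

text = "penguin"
canonical = tok.tokenize(text)  # canonical segmentation, e.g. ['p', 'enguin']
print("canonical:", canonical)

def is_valid_tokenization(pieces, text):
    """A token sequence is a valid tokenization of `text` if every piece is in
    the vocabulary and the corresponding IDs decode back to `text`."""
    vocab = tok.get_vocab()
    if any(p not in vocab for p in pieces):
        return False
    ids = tok.convert_tokens_to_ids(pieces)
    return tok.decode(ids) == text

# A noncanonical segmentation of the same string (assumed to be in the vocab).
alternative = ["peng", "uin"]
print("alternative valid:", is_valid_tokenization(alternative, text))
```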

Our work focuses on how these alternative, or “noncanonical,” tokenizations preserve meaning, and can be manipulated to bypass safety filters without altering the original text. By forcing models to act on tokenizations they never saw during training, we can evade alignment safeguards and prompt harmful responses.

This phenomenon exposes not only the vulnerability of LLMs to adversarial tokenization, but also fundamental issues with the current LLM safety training pipeline.

Semantic Signal Retention

Despite only being trained on the canonical segmentation, we find that LLMs maintain an impressive grasp of noncanonical tokenizations.

We measure this by quizzing the models with multiple-choice questions (example shown below) in which each question is tokenized progressively further away from its canonical tokenization.

[Figure: example multiple-choice question presented under increasingly noncanonical tokenizations]

As expected, accuracy tends to drop as the tokenization deviates further from the canonical one. More importantly, the trend is smooth: tokenizations close to the canonical one suffer only a small drop in accuracy.

[Figure: multiple-choice accuracy as a function of normalized edit distance from the canonical tokenization]

In the figure, the x-axis is the normalized edit distance from the canonical tokenization (ranging from 0 to 1), capturing how far each alternative tokenization diverges in terms of token insertions and substitutions.
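
For reference, here is a minimal sketch of this distance as token-level Levenshtein distance scaled to [0, 1]; the exact edit operations and normalization constant used in the paper may differ (here we divide by the length of the longer sequence).

```python
# Minimal sketch: normalized edit distance between two tokenizations,
# i.e. token-level Levenshtein distance scaled to [0, 1]. The exact
# normalization in the paper may differ; here we divide by max length.
def normalized_edit_distance(a: list[str], b: list[str]) -> float:
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(m, n, 1)

# Example: the canonical and alternative tokenizations of "penguin" share no
# tokens, so their normalized distance is 1.0.
print(normalized_edit_distance(["p", "enguin"], ["peng", "uin"]))
```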

Tokenizations Can Evade Safety

To test the hypothesis that noncanonical tokenizations can evade alignment by accessing the distribution conditioned on them, we sample tokenizations of a malicious question at different distances from the canonical one and evaluate whether the LLM faithfully answers the malicious request.

We evaluate responses with StrongREJECT (scores range from 0 to 1; higher scores indicate nonrefusal responses that are more relevant and accurate with respect to the question).

[Figure: StrongREJECT scores of responses to a malicious request, for tokenizations sampled at increasing distances from the canonical]

Unsurprisingly, the canonical tokenization tends to receive the lowest scores. Notably, though, the distance from the canonical tokenization plays a role in how readily the model complies with a malicious request: by simply sampling tokenizations at a sufficiently large distance from the canonical, we can successfully provoke unsafe responses from LLMs.
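
One simple way to obtain such samples, sketched below under the assumption of split/merge edits that stay inside the tokenizer's vocabulary (the paper's actual sampling procedure may differ), is a random walk starting from the canonical tokenization.

```python
# Minimal sketch: random-walk sampling of noncanonical tokenizations by applying
# split/merge edits that keep every piece in the tokenizer's vocabulary. This is
# an illustrative procedure; the paper's sampling scheme may differ, and the
# number of edits only upper-bounds the edit distance from the canonical.
import random
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
VOCAB = tok.get_vocab()

def neighbors(tokens):
    """All tokenizations reachable by one split or merge that stays in-vocab."""
    out = []
    for i, t in enumerate(tokens):
        for k in range(1, len(t)):           # split token i into two pieces
            left, right = t[:k], t[k:]
            if left in VOCAB and right in VOCAB:
                out.append(tokens[:i] + [left, right] + tokens[i + 1:])
    for i in range(len(tokens) - 1):          # merge tokens i and i+1
        merged = tokens[i] + tokens[i + 1]
        if merged in VOCAB:
            out.append(tokens[:i] + [merged] + tokens[i + 2:])
    return out

def sample_tokenization(text, n_edits):
    """Apply `n_edits` random valid edits starting from the canonical tokenization."""
    tokens = tok.tokenize(text)
    for _ in range(n_edits):
        candidates = neighbors(tokens)
        if not candidates:
            break
        tokens = random.choice(candidates)
    return tokens  # convert with tok.convert_tokens_to_ids(tokens) to feed the LLM
```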


To reliably find tokenizations that evade safety constraints, we devise a simple yet effective local search algorithm for adversarially finding tokenizations that elicit a desired behavior from the LLM.

Because the underlying problem is NP-hard (proven in the paper), this greedy approach makes small local changes at each iteration to produce a tokenization that maximizes the probability of generating a given malicious response. We refer interested readers to the paper for full details and proofs.
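
A minimal sketch of such a greedy search is given below. It assumes a `neighbors` function that enumerates valid one-edit retokenizations (e.g., the split/merge moves sketched above) and a `score_fn` that returns the target model's log-probability of the desired response given a candidate prompt tokenization; both callables are placeholders, and the paper's algorithm may differ in its neighborhood and stopping criteria.

```python
# Minimal sketch of a greedy local search over tokenizations. `neighbors(tokens)`
# enumerates valid one-edit retokenizations and `score_fn(tokens)` returns the
# target model's log-probability of the desired response given that prompt
# tokenization. Both callables are placeholders for illustration.
from typing import Callable, List

def greedy_adversarial_tokenization(
    canonical: List[str],
    neighbors: Callable[[List[str]], List[List[str]]],
    score_fn: Callable[[List[str]], float],
    max_iters: int = 50,
) -> List[str]:
    current, best = canonical, score_fn(canonical)
    for _ in range(max_iters):
        # Evaluate every single-edit retokenization of the current candidate.
        scored = [(score_fn(c), c) for c in neighbors(current)]
        if not scored:
            break
        top_score, top = max(scored, key=lambda x: x[0])
        if top_score <= best:  # local optimum: no neighbor improves the score
            break
        current, best = top, top_score
    return current
```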

Case Studies

Jailbreaking

In the jailbreaking task, given a malicious request, the goal is to construct an attack prompt that provokes the LLM into outputting a response that faithfully answers the request.

We compare our method against three other jailbreak methods: GCG, AutoDAN, and FFA.

Because adversarial tokenization does not change the underlying text of the request, we can further boost these three previous methods with ours by simply reusing the same adversarial tokenizations used on the malicious requests and plugging them into their attack templates or affixes.
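
A minimal sketch of this combination at the token level, under the assumption that affixes are tokenized canonically while the request keeps its adversarial token IDs (the template text and variable names are illustrative):

```python
# Minimal sketch: combining an attack template with an adversarially tokenized
# request. The affix text is tokenized canonically; the request portion keeps
# its adversarial token IDs. Template string and variable names are illustrative.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def build_attack_ids(prefix_text, adv_request_ids, suffix_text):
    """Concatenate canonically tokenized affixes around adversarial request IDs."""
    prefix_ids = tok.encode(prefix_text, add_special_tokens=False)
    suffix_ids = tok.encode(suffix_text, add_special_tokens=False)
    return prefix_ids + adv_request_ids + suffix_ids

# `adv_request_ids` would come from the greedy search above; the resulting ID
# sequence is fed to the model directly, bypassing the canonical tokenizer.
```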

[Figure: jailbreak results on Llama3, Gemma2, and OLMo2 for GCG, AutoDAN, FFA, and adversarial tokenization, alone and in combination]
Notably, our approach performs especially well on Llama3, achieving the best scores as a standalone attack and boosting the performance of the other methods when combined with them. On both Gemma2 and OLMo2, adversarial tokenization by itself achieves results competitive with the other methods, and shines especially when combined with AutoDAN.

Evading Safety Models

We also show that adversarial tokenization increases the probability of bypassing existing defense layers against malicious requests, i.e., safety models trained to reliably distinguish whether a prompt or response is (un)safe.

We evaluate both LlamaGuard and ShieldGemma, which allow for computing the probability of a prompt being unsafe, and our results show that adversarial tokenization substantially increases the probability of bypassing these safety checks.

[Figure: probability that LlamaGuard and ShieldGemma flag a prompt as unsafe, under canonical vs. adversarial tokenization]
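
As a rough illustration, the sketch below estimates P(unsafe) from a LlamaGuard-style classifier by comparing the first-token logits for "safe" versus "unsafe"; the model ID, the prompt formatting, and how the (possibly adversarially tokenized) user content is assembled into `input_ids` are all assumptions and may differ from the paper's setup.

```python
# Minimal sketch: estimating P(unsafe) from a LlamaGuard-style classifier by
# comparing the logits of "safe" vs. "unsafe" as the first generated token.
# The model ID, prompt formatting, and how the (possibly adversarially
# tokenized) user content is spliced into `input_ids` are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_id = "meta-llama/Llama-Guard-3-8B"  # assumed guard model
tok = AutoTokenizer.from_pretrained(guard_id)
model = AutoModelForCausalLM.from_pretrained(guard_id, torch_dtype=torch.bfloat16)

@torch.no_grad()
def prob_unsafe(input_ids: list[int]) -> float:
    """Softmax over the last-position logits restricted to {'safe', 'unsafe'}."""
    safe_id = tok.encode("safe", add_special_tokens=False)[0]
    unsafe_id = tok.encode("unsafe", add_special_tokens=False)[0]
    logits = model(torch.tensor([input_ids])).logits[0, -1]
    pair = torch.stack([logits[safe_id], logits[unsafe_id]])
    return torch.softmax(pair, dim=0)[1].item()
```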

Prompt Injection

In this man-in-the-middle attack, a user sends a harmless query to an LLM and a malicious agent intercepts it, altering the user input to provoke a malicious response. Here, we specifically consider payloads that request the LLM to be toxic and offensive. We count an attack as successful when the adversarial content appears as a substring of the generated response and the response contains no refusal text, as shown below.

[Figure: example of a prompt injection attack and a successful (non-refusal) response]
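
A minimal sketch of this success criterion (the refusal phrase list is illustrative, not the paper's exact list):

```python
# Minimal sketch of the success criterion: the adversarial payload must appear
# verbatim in the generated response, and the response must not contain refusal
# text. The refusal phrases below are illustrative, not the paper's exact list.
REFUSAL_PHRASES = ["i can't", "i cannot", "i'm sorry", "i am sorry", "i won't"]

def injection_succeeded(payload: str, response: str) -> bool:
    if payload not in response:
        return False
    return not any(phrase in response.lower() for phrase in REFUSAL_PHRASES)
```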

The following figure shows success rates for both the canonical tokenization baseline and adversarial tokenization, revealing a consistent increase in success when using adversarial tokenization.

[Figure: prompt injection success rates for canonical vs. adversarial tokenization]

Discussion

In this paper, we have shown how noncanonical tokenizations expose a serious vulnerability in LLM alignment for safety. Adversarial tokenizations reach regions that are out of distribution for alignment yet remain close enough to the data distribution of the pre-trained LLM, allowing them to evade alignment and elicit unsafe behavior from models.

We argue that the problem lies deeper: the current LLM safety training pipeline is flawed. During pre-training, the semantics of a text leaks onto many of its tokenizations, which is why noncanonical tokenizations still elicit meaningful responses (as shown in the semantic signal retention results above); the much smaller scale of post-training, however, may not cover these alternatives, leaving them exploitable as adversarial tokenizations.

Our work exposes not only the vulnerability of LLMs to adversarial tokenization, but also fundamental issues with the current LLM safety training pipeline.

BibTeX

@misc{geh2025adversarialtokenization,
      title={Adversarial Tokenization}, 
      author={Renato Lui Geh and Zilei Shao and Guy Van den Broeck},
      year={2025},
      eprint={2503.02174},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.02174}, 
}