A tokenizer optimized for the encoder you actually run

Vocabulary selection as a mixed-integer program that is optimal for greedy left-to-right inference, cutting tokenization cost below Byte Pair Encoding at the same vocabulary size.

The problem: the vocabulary and the encoder disagree

Byte Pair Encoding, still the default, builds a vocabulary greedily by merge frequency. But that training objective is not the thing we actually do at inference, which is greedy left-to-right tokenization with the finished vocabulary. The vocabulary is chosen under one objective and used under another, which leaves compression on the table.

That matters because tokenization cost, tokens per unit of text, sets your context budget and a floor on inference cost. Every token you do not spend is context you keep and money you save, on every single request.

The approach: optimize for greedy inference directly

I formulated vocabulary selection as a mixed-integer optimization in Gurobi that jointly optimizes which tokens are in the vocabulary (token-selection variables) and how each pretoken is segmented (pretoken-encoding variables).

The key is a set of greedy-consistency constraints: they force the learned vocabulary to be optimal for the greedy left-to-right encoder that is actually used at inference, rather than for an idealized optimal segmenter that no production system runs. The objective the solver optimizes is the cost you actually pay.

Result

Tokenization cost at matched vocabulary size (lower is better)

BPE (SOTA baseline)

1.00x

Joint MILP (ours)

-1.5%

The same string under BPE's merge order versus the greedy left-to-right encoder actually run at inference — fewer pieces is lower cost.

Joint vocabulary optimization reduces tokenization cost by 1.5% versus BPE at the same vocabulary size, a direct reduction in tokens and therefore in context use and inference cost, with no change to the inference procedure.

What I took away

Match the optimization objective to the inference procedure; a vocabulary that is optimal for a segmenter you never run is optimizing the wrong thing.
Small per-token wins compound: 1.5% off every request is a large absolute number at scale, for free.
Casting a fuzzy ML design choice as an explicit constrained optimization makes the tradeoffs legible and the result provable rather than empirical luck.