A look at Apple’s new Transformer-powered predictive text model

New York, NY — September 08, 2023
Shared on Hacker News and Reddit

At WWDC earlier this year, Apple announced that upcoming versions of iOS and macOS would ship with a new feature powered by “a Transformer language model” that will give users “predictive text recommendations inline as they type.”

Upon hearing this announcement, I was pretty curious about how this feature works. Apple hasn’t deployed many language models of its own, despite most of its competitors going all-in on large language models over the last couple of years. I see this as a result of Apple generally priding itself on polish and perfection, while language models are fairly unpolished and imperfect.

As a result, this may be one of the first Transformer-based models that Apple will ship in one of its operating systems, or at least one of the first that they’ve acknowledged publicly. This left me with some questions about the feature, notably:

What underlying model is powering this feature?
What is its architecture?
What data was used to train the model?

After spending some time with these questions, I was able to find some answers, but many of the details still remain unclear. If you’re able to get any further than I could, please get in touch!

How does the feature work?

After installing the macOS beta, I immediately opened the Notes app and started typing. Despite trying many different sentence structures, the feature generally appeared less often than I expected it to. It mostly completes individual words.

Predictive text completing one word at a time.

The feature will occasionally suggest more than one word at a time, but this is generally limited to instances where the upcoming words are extremely obvious, similar to the autocomplete in Gmail.

Predictive text completing two words at a time.

Can we dig deeper?

Finding the model itself was a little tough, but I eventually found the model being used by AppleSpell, an internal macOS application that checks for spelling and grammar mistakes as you type. With the help of xpcspy, I wrote a Python script that snoops on AppleSpell activity and streams the most probable suggestions from the predictive text model as you type in any application.

My “predictive spy” script in action.

Unfortunately, I wrote this script earlier in the summer, on the first macOS Sonoma beta. In one of the subsequent betas (I’m not sure which), Apple removed the unused completions from the XPC messages sent by AppleSpell. I wasn’t able to glean too much about the model’s behavior from these completions, but it was still a cool find.

Where is the model?

After some more digging, I’m pretty sure I found the predictive text model in /System/Library/LinguisticData/RequiredAssets_en.bundle/AssetData/en.lm/unilm.bundle. The bundle contains multiple Espresso model files that are used while typing (Espresso appears to be the internal name for the part of CoreML that runs inference on models). I wasn’t ultimately able to reverse-engineer the model, but I’m fairly confident this is where the predictive text model is kept. Here’s why:

Many of the files in unilm.bundle don’t exist on macOS Ventura (13.5), but they do exist on the macOS Sonoma beta (14.0). And the files that do exist in both versions have all been updated in Sonoma.
sp.dat, one of the files in unilm.bundle, exists on Ventura, but it’s been updated in the Sonoma beta. In the updated version of the file, I found what looks pretty clearly like a set of tokens for a tokenizer.
The number of tokens in sp.dat matches the shape of the output layer in both unilm_joint_cpu.espresso.shape and unilm_joint_ane.espresso.shape (ANE = Apple Neural Engine), two files in unilm.bundle that describe the shapes of layers in an Espresso/CoreML model. This is what we would expect to see for a model that is trained to predict the next token.

The predictive text model’s tokenizer

I found a set of 15,000 tokens in unilm.bundle/sp.dat that pretty clearly look like they form the vocabulary of a language model. I wrote a script that you can use to see this vocabulary file for yourself, which you can check out on GitHub.

The vocabulary starts with <pad>, <s>, </s>, and <unk> tokens, which are all fairly common special tokens. For comparison, here are the first few tokens of roberta-base and t5-base, two popular language models:

>>> from transformers import AutoTokenizer
>>>
>>> tokenizer = AutoTokenizer.from_pretrained("roberta-base")
>>> tokenizer.convert_ids_to_tokens([0, 1, 2, 3])
['<s>', '<pad>', '</s>', '<unk>']
>>>
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> tokenizer.convert_ids_to_tokens([0, 1, 2])
['<pad>', '</s>', '<unk>']

Next come the following sequences:

20 special tokens, named UniLMCTRL0 through UniLMCTRL19
79 contractions (I’d, couldn’t, you’ve…)
1 special _U_CAP_ token
20 special tokens, named _U_PRE0_ through _U_PRE19_
60 special tokens, named _U_NT00_ through _U_NT59_
100 emojis

And then comes a more normal-looking list of 14,716 tokens, most of which are followed by the special character ▁ (U+2581), which is commonly used in subword tokenizers built with SentencePiece, such as the T5 tokenizer, to denote a space. (GPT-2’s byte-level BPE tokenizer uses a different marker, Ġ, for the same purpose.)
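As a quick arithmetic check, the section sizes listed above can be added up to confirm that they account for all 15,000 tokens. This is only a sketch: it assumes the sections are contiguous, appear in the order listed, and that the first token has index 0, none of which I verified beyond the counts above. The helper name section_ranges is made up for illustration.

```python
# Section sizes as listed above, in the order they appear in sp.dat.
# Assumption: sections are contiguous and the first token has index 0.
SECTIONS = [
    ("common special tokens (<pad>, <s>, </s>, <unk>)", 4),
    ("UniLMCTRL0 .. UniLMCTRL19", 20),
    ("contractions", 79),
    ("_U_CAP_", 1),
    ("_U_PRE0_ .. _U_PRE19_", 20),
    ("_U_NT00_ .. _U_NT59_", 60),
    ("emojis", 100),
    ("ordinary tokens", 14_716),
]


def section_ranges(sections):
    """Return (name, first_index, last_index) for each section."""
    ranges = []
    start = 0
    for name, size in sections:
        ranges.append((name, start, start + size - 1))
        start += size
    return ranges


total = sum(size for _, size in SECTIONS)
for name, first, last in section_ranges(SECTIONS):
    print(f"{first:>6}-{last:<6} {name}")
print("total tokens:", total)
```

The sizes sum to exactly 15,000, and under these assumptions the ordinary tokens would begin at index 284.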

I have to say that this vocabulary file strikes me as pretty unusual, but it’s definitely not out of the question for a language model deployed in this setting. I’ve personally never seen emojis featured so prominently in a language model’s tokenizer, but existing research has shown that domain-specific models and tokenizers can drastically improve downstream model performance. So it makes sense that a model trained for use in things like text messages, in which emojis and contractions will be used a lot, would prioritize them.

Model architecture

Based on the contents of the unilm_joint_cpu model from earlier, we can make some assumptions about the predictive text network. Despite sharing the name of Microsoft’s UniLM from 2019, it looks more to me like a model based on GPT-2.

GPT-2 has four main parts: token embeddings, positional encodings, a series of 12-48 decoder blocks, and an output layer. The network described by unilm_joint_cpu appears to be the same, except with only 6 decoder blocks. Most of the layers within each decoder block have names like gpt2_transformer_layer_3d, which would also seem to suggest it’s based on a GPT-2 architecture.

From my calculations based on sizes of each layer, Apple’s predictive text model appears to have about 34 million parameters, and it has a hidden size of 512 units. This makes it much smaller than even the smallest version of GPT-2.

Model	Decoder Blocks	Parameters	Hidden Size
Apple’s predictive text model	6	34M	512
gpt2	12	117M	768
gpt2-medium	24	345M	1024
gpt2-large	36	762M	1280
gpt2-xl	48	1542M	1600
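As a rough cross-check of the 34M figure, here is a back-of-the-envelope parameter count for a GPT-2-style decoder. It is a sketch under assumptions that are mine rather than read from the model files: a feed-forward width of 4x the hidden size (as in GPT-2), an output layer that does not share weights with the token embeddings, and learned positional parameters left out because the context length isn’t stated here. The function name is made up for illustration.

```python
def gpt2_style_param_count(vocab_size, hidden, n_blocks,
                           n_positions=0, ffn_mult=4, tied_output=False):
    """Estimate the parameter count of a GPT-2-style decoder-only model."""
    token_embeddings = vocab_size * hidden
    position_embeddings = n_positions * hidden  # 0 if unknown or not learned

    # Attention: query, key, value and output projections, each with a bias.
    attention = 4 * hidden * hidden + 4 * hidden
    # Feed-forward: hidden -> ffn_mult*hidden -> hidden, with biases.
    feed_forward = 2 * ffn_mult * hidden * hidden + ffn_mult * hidden + hidden
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * 2 * hidden
    block = attention + feed_forward + layer_norms

    final_layer_norm = 2 * hidden
    # Assumption: an untied output layer has its own vocab x hidden matrix.
    output_layer = 0 if tied_output else vocab_size * hidden

    return (token_embeddings + position_embeddings
            + n_blocks * block + final_layer_norm + output_layer)


untied = gpt2_style_param_count(vocab_size=15_000, hidden=512, n_blocks=6)
tied = gpt2_style_param_count(vocab_size=15_000, hidden=512, n_blocks=6,
                              tied_output=True)
print(f"untied output layer: {untied / 1e6:.1f}M")
print(f"tied output layer:   {tied / 1e6:.1f}M")
```

With a 15,000-token vocabulary, a hidden size of 512, and 6 blocks, the untied variant comes to roughly 34.3 million parameters, which is in line with the estimate above. If the input and output embeddings were shared instead, the count would drop by 15,000 × 512 = 7.68 million.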

For the limited scope of the predictive text feature, this makes sense to me. Apple wants a model that can run very quickly and very frequently, without draining much of your device’s battery. When I was testing the predictive text feature, suggestions appeared almost instantly as I typed, making for a great user experience. While the model’s limited size means it wouldn’t be very good at writing full sentences or paragraphs, when it exhibits very high confidence in the next word or two, they’re likely to be good enough to suggest to the user.

However, with my script that snoops on activity from AppleSpell, we can get the model to write full sentences anyway. If I type “Today” as the first word of my sentence and take the model’s top suggestion each time, here’s what I get (video):

Today is the day of the day and the day of the week is going to be a good thing I have to do is get a new one for the next couple weeks and I think I have a lot of…

Not very inspiring. We can compare this with the output from the smallest GPT-2 model:

Today, the White House is continuing its efforts against Iran to help the new President, but it will also try to build new alliances with Iran to make more…

Or the largest GPT-2 model:

Today, the U.S. Department of Justice has filed a lawsuit against the city of Chicago, the Chicago Police Department, and the city’s Independent Police Review Authority, alleging that the police department and the Independent Police Review Authority engaged in a pattern or practice…

Pretty cool seeing the effects of all those extra parameters! It’ll be interesting to see how this feature grows and evolves in the future, and whether Apple decides to keep its scope fairly narrow or someday expand its abilities.
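Taking the model’s top suggestion at every step, as in the “Today” experiment above, is usually called greedy decoding. Here is a minimal sketch of that loop. The scoring table is an invented toy that only looks at the previous word; it stands in for a real model and is not Apple’s model or GPT-2, and the names greedy_complete, TOY_BIGRAMS and toy_scores are hypothetical.

```python
def greedy_complete(next_word_scores, prompt, max_words=12):
    """Repeatedly append the highest-scoring next word (greedy decoding)."""
    words = prompt.split()
    for _ in range(max_words):
        scores = next_word_scores(words)
        if not scores:
            break  # the scorer has no suggestion, so stop early
        words.append(max(scores, key=scores.get))
    return " ".join(words)


# Hypothetical toy scores keyed on the previous word only.
TOY_BIGRAMS = {
    "Today": {"is": 0.6, "was": 0.4},
    "is": {"the": 0.7, "a": 0.3},
    "the": {"day": 0.5, "week": 0.3, "year": 0.2},
    "day": {"of": 0.6, "and": 0.4},
    "of": {"the": 0.9, "a": 0.1},
}


def toy_scores(words):
    """Return next-word scores for the last word, or {} if there are none."""
    if not words:
        return {}
    return TOY_BIGRAMS.get(words[-1], {})


print(greedy_complete(toy_scores, "Today", max_words=8))
```

With this toy table the loop prints “Today is the day of the day of the”. Getting stuck in a cycle like this is a common failure mode of greedy decoding, and it may be part of why the small model’s output above circles around “the day of the day”.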

If you’re interested in trying any of this out for yourself, all of my code is on GitHub.