Application development using device-native LLMs.

Philip Grabenhorst

UX Collective

A pen-drawn rendition of the Oracle, from John Collier’s Priestess of Delphi, on a smartphone screen

“The best way to predict your future is to create it.” — Abraham Lincoln

To say that the year since ChatGPT’s release has been eventful would be an understatement. Beyond the tectonic shifts taking place in every sector of the IT industry, the pace of development, as teams build bigger, faster, and stronger models, only seems to quicken. Last week, the Google DeepMind team announced a leap-frog over their counterparts at OpenAI, who were themselves still reeling from the drama of Sam Altman’s near-ouster. In between all of this, Anthropic blew away industry standards with the release of Claude 2.1 and its 200,000-token context window. Our digital oracles are becoming faster and cheaper.

Google’s release was a project called Gemini. Like a McDonald’s meal, it comes in three sizes: Pro, Ultra, and Nano. Someone in their PR department had very high hopes, dubbing it the start of the “Gemini Era.” The demos featuring its most powerful version gave us an idea of what a fully multimodal model could do (although these appear to have been embellished or outright fabricated).

Regardless, I’m much more interested in the smallest version of project Gemini: Gemini Nano. It’s a text-only model intended to run on mobile devices. According to the Android Developer website, it’s already running on Pixel 8 Pro devices, powering offline summarization in the Recorder app and Smart Reply suggestions in Gboard.

A few months back, I wrote an essay about the need for Large Language Models as an operating-system-level application service. They’re finally coming. Gemini Nano and Android’s new AICore service are the first serious, official offerings. However, a head of steam is building on other platforms as well. Over the summer, the Hugging Face community made progress toward making openly available models easy to run on Apple’s Core ML. Furthermore, if the hiring data is anything to go by, Apple might not be far behind with an official set of LLM-powered services.

Generalized Artificial Intelligence is moving closer and closer to each of us. While the performance and viability of these early systems have yet to be verified, we can be sure they’ll iterate and improve quickly. AI now lives in our pockets. The question for us, as application developers and software engineers, is this: now that device-native LLMs are here, how do we build the most empowering software we can? What problems do we still have to overcome?

The first question is whether these models will even be useful. Will Gemini Nano be fast enough to be conversational? Will its context window be on par with current server-side offerings, such as GPT-4 and Claude? Whether to ditch a hosted API for a local service is something only individual teams, with a better sense of their projects’ context, can decide. However, in my last essay I demoed the comparative performance of some local LLMs. They weren’t half bad. I can’t imagine Gemini Nano will be any worse.

As these models become more powerful, we naturally want to delegate more work to them. As they grow closer to us, we expect them to be more understanding of the unspoken context surrounding our conversations. OpenAI’s answer to this problem was simple: “store more of your life on our servers.” The Assistants features they released in November rely heavily on the concepts of “threads” and uploaded files. The feeling of familiarity and contextual understanding comes from keeping a conversation going, regardless of the device you access it from, and from referencing files you have explicitly provided.
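To make that concrete, here is a minimal sketch of the server-side approach, using the OpenAI Python SDK as it shipped with the November Assistants beta. The file name, instructions, and question are placeholders of my own, and the exact parameters may have changed since release.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Context lives on OpenAI's servers: an uploaded file plus a persistent thread.
notes = client.files.create(file=open("meeting_notes.txt", "rb"), purpose="assistants")

assistant = client.beta.assistants.create(
    model="gpt-4-1106-preview",
    instructions="Summarize and answer questions about the user's notes.",
    tools=[{"type": "retrieval"}],
    file_ids=[notes.id],
)

# The thread persists across sessions and devices; the client only holds IDs.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What did we decide about the launch date?",
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
# In practice you would poll run.status until the run completes, then read the
# thread's messages. All of that state remains on OpenAI's side.
```

Everything interesting, the conversation history and the uploaded file, lives behind the API; the device is just a window onto it.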

Google’s answer to this problem was … surprisingly … different. In their documentation for Gemini Nano and AICore, they repeatedly emphasize the local, encrypted nature of the model’s inputs and outputs. Instead of storing context in the cloud, individual applications can keep track of it locally. Additional training is possible, using a per-application adapter that loads on top of the base model. However, this raises some questions. If an application is intended to support a uniform experience across several devices, how will this additional training be synchronized? Instead of duplicating information on the client and the server side (OpenAI’s approach), will we be duplicating effort and additional training across devices?
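Google hasn’t published the details of how these adapters work, but the standard technique for this kind of per-application specialization is a low-rank adapter (LoRA) layered on top of a frozen base model. The sketch below uses the Hugging Face transformers and peft libraries, an openly available stand-in model (Gemini Nano’s weights aren’t public), and a hypothetical adapter path; it illustrates the shape of the idea, not AICore’s actual mechanism.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, PeftModel

BASE = "mistralai/Mistral-7B-v0.1"  # stand-in; a device would ship its own base model

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE)

# Training time: wrap the frozen base model in a small, app-specific adapter.
# Only the adapter weights (a few megabytes) are updated during fine-tuning.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
trainable = get_peft_model(base_model, lora_config)
trainable.print_trainable_parameters()
# ... fine-tune `trainable` on the application's local data, then save the
# adapter with trainable.save_pretrained("adapters/my_messaging_app") ...

# Inference time: any device that ships the same base model can load the tiny
# adapter on top of it.
fresh_base = AutoModelForCausalLM.from_pretrained(BASE)
personalized = PeftModel.from_pretrained(fresh_base, "adapters/my_messaging_app")
```

If AICore’s adapters look anything like this, what needs to be synchronized across devices is only a few megabytes of adapter weights; whether the platform handles that synchronization or leaves it to each application is exactly the open question.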

Regardless of how this additional training is applied, there’s another, more important problem that remains unresolved. AICore’s additional training is application-specific. However, our problems are not. If we want our devices to help us be better problem solvers, they should demonstrate a cross-sectional understanding of our lives. OpenAI’s assistants miss out on this. However, Google’s services contain a plethora of personal material (contact information, calendar events, and so on) that could be applied automatically as training material for Gemini Nano, making its responses more personal and applicable. I believe this should exist at the operating-system level, instead of forcing developers to duplicate the effort across each of their applications.

This might, of course, pose a problem from a privacy perspective. Should each application with access to AICore be able to craft prompts that might regurgitate personal information about your friends, relatives, and near-term schedule? Of course not. However, a messaging app cannot genuinely help us craft useful responses without knowledge of the person we are writing to. (I find it of questionable use to have an AI assistant write my texts for me, anyway.) A project-management application will be more useful if it is informed by the emails we have received throughout our workday. In the same vein, a journaling application will be more useful if it is informed by the photos we took throughout the day. This is the original promise of projects like Siri and Google Assistant, and it is now within our grasp. If our goal is to provide assistance in traversing the web of information we encounter every day, making sense of it, and applying it to solve real problems, this personalized approach will be vital.
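Nothing like this exists in AICore today, so the sketch below is purely hypothetical: every class, method, and source name is invented to make the idea concrete. The intent is that applications never read the raw data stores themselves; they ask an OS-level service for model-ready snippets, and the operating system enforces which sources each application has been granted.

```python
from dataclasses import dataclass

@dataclass
class ContextItem:
    source: str   # e.g. "calendar", "contacts", "photos"
    content: str  # a short, model-ready summary of the underlying item

class PersonalContextService:
    """Hypothetical OS-level broker for personal context.

    The OS holds the data and the permission model; applications only
    receive summaries drawn from sources the user has granted them.
    """

    def __init__(self, granted_sources: set[str]):
        self.granted_sources = granted_sources

    def context_for(self, query: str, requested: set[str]) -> list[ContextItem]:
        allowed = requested & self.granted_sources  # the permission gate
        # A real implementation would retrieve and summarize matching items
        # on-device; placeholders stand in for that here.
        return [
            ContextItem(source=s, content=f"<summary of {s} items relevant to {query!r}>")
            for s in sorted(allowed)
        ]

# A journaling app asks for photos, calendar events, and mail, but the user has
# granted it only the first two; mail never reaches the prompt.
service = PersonalContextService(granted_sources={"photos", "calendar"})
snippets = service.context_for("today's journal entry", {"photos", "calendar", "mail"})
prompt = "Draft a journal entry using:\n" + "\n".join(item.content for item in snippets)
```

If the platform vendors provide something along these lines, every application gets the benefit of personal context without every application having to collect, store, and secure that context on its own.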