Keynote: Encrypted LLM Inference with Batching Across Users

Speaker: Craig Gentry

Venue: 9th HomomorphicEncryption.org Standards Meeting

Abstract

It is a challenge to make large-scale LLM inference practical even in the ordinary unencrypted setting. One essential optimization is batching. Instead of multiplying a weight matrix with each user's activation vector separately, in parallel, the LLM server packages the vectors of many unrelated users as a matrix and then invokes fast, optimized matrix multiplication (MM) to multiply the weight matrix with the packaged matrix. This optimization improves efficiency by orders of magnitude.
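The batching described above can be illustrated in the plain, unencrypted setting with a minimal NumPy sketch. The sizes and the names `per_user_products` and `batched_product` are assumptions made for illustration only; nothing here involves encryption.

```python
import numpy as np

def per_user_products(W, user_vectors):
    """Unbatched: one matrix-vector product per user."""
    return [W @ x for x in user_vectors]

def batched_product(W, user_vectors):
    """Batched: stack the user vectors as columns, then do one matrix-matrix product."""
    X = np.stack(user_vectors, axis=1)   # shape (d_in, n_users)
    return W @ X                         # shape (d_out, n_users)

# Hypothetical sizes, chosen only to keep the example small.
d_out, d_in, n_users = 8, 16, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))                      # shared weight matrix
user_vectors = [rng.standard_normal(d_in) for _ in range(n_users)]

separate = per_user_products(W, user_vectors)
Y = batched_product(W, user_vectors)

# Column j of the batched result is user j's individual result.
for j in range(n_users):
    assert np.allclose(Y[:, j], separate[j])
```

Both paths compute the same values. The gain from batching comes from handing the whole workload to one optimized MM call rather than many small matrix-vector calls.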

Encrypted LLM inference (using FHE) is, of course, even harder to make practical. Recently, amazing progress has been made in reducing encrypted MM to a constant number of ordinary (unencrypted) MMs of comparable size, plus some comparatively inexpensive cryptographic processing, when the encryptions are under the key of a single user. But these results do not apply to encrypted batching, which is inherently multi-user and multi-key. How can we package vectors encrypted under unrelated user keys into a batched encrypted MM whose complexity is dominated by a couple of unencrypted MMs?

We show how to solve this problem: we get encrypted multi-user batching where the dominant cost of the batched linear operations is simply two unencrypted MMs. The server can use any MM algorithm to compute the unencrypted MMs – for example, BLAS or hardware-accelerated implementations or asymptotically faster algorithms such as Strassen. Aside from the unencrypted MMs, all “cryptographic processing” is just quasi-linear in the size of the input and output matrices.
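To make the remark about interchangeable MM algorithms concrete, here is a minimal sketch of Strassen's algorithm on plaintext square matrices, checked against NumPy's built-in (BLAS-backed) product. It illustrates only the unencrypted MM step, not the cryptographic processing. The function name `strassen`, the `cutoff` parameter, and the fallback for odd sizes are assumptions of this sketch.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply two square matrices of equal size with Strassen's recursion.

    Falls back to the ordinary product for small or odd-sized blocks.
    """
    n = A.shape[0]
    if n <= cutoff or n % 2 == 1:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Seven recursive products instead of the naive eight.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)

    # Recombine the seven products into the four output blocks.
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

rng = np.random.default_rng(1)
A = rng.standard_normal((128, 128))
B = rng.standard_normal((128, 128))

# Either backend can serve as the "unencrypted MM" step.
assert np.allclose(strassen(A, B, cutoff=16), A @ B)
```

Because the two backends agree on the result, the choice between them is purely a performance decision for the server.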

Beyond efficiency, the solution requires minimal coordination. Users do not need to communicate with each other at all before, during or after inference. Users need to communicate only once with the server to send their ciphertexts and evaluation keys and receive inference results. The server can form batches dynamically, with different user sets for different linear operations. Non-linear operations (including FHE bootstrapping) remain under individual user keys and are completely unaffected.