Our attack extracts the entire projection matrix of OpenAI’s ada and babbage language models. We thereby confirm, for the first time, that these black-box models have hidden dimensions of 1024 and 2048, respectively. We also recover the exact hidden dimension size of the gpt-3.5-turbo model, and estimate it would cost under $2,000 in queries to recover its entire projection matrix.
Using the logit vector
Suppose we query a language model on a large number of different random prefixes. Even though each output logit vector is an l-dimensional vector, they all actually lie in an h-dimensional subspace, because the embedding projection layer up-projects from h dimensions. Therefore, by querying the model enough times (more than h), we will eventually observe that new logit vectors are linearly dependent on past ones: an h-dimensional subspace contains at most h linearly independent vectors, so additional queries must repeat directions we have already seen (a pigeonhole-style argument). We can then compute the dimensionality of this subspace (e.g., with SVD) and report it as the hidden dimensionality of the model.
SVD can recover the hidden dimensionality of a model whenever the final output dimension l (the vocabulary size) is greater than the hidden dimension h.
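To make this concrete, here is a minimal numpy sketch on a simulated model: the projection matrix, the hidden states, and all dimensions are made-up stand-ins for illustration, not values observed from any real API.

```python
import numpy as np

rng = np.random.default_rng(0)
l, h, n_queries = 1000, 64, 128          # hypothetical vocab size, hidden size, query count

# Simulated "secret" model: hidden states are up-projected to l-dimensional logits.
W = rng.normal(size=(l, h))              # embedding projection matrix (unknown to the attacker)
H = rng.normal(size=(h, n_queries))      # hidden states for n_queries random prefixes
Q = W @ H                                # observed logit vectors, one column per query

# All columns of Q lie in the h-dimensional column space of W, so the number of
# non-negligible singular values of Q reveals the hidden dimension.
singular_values = np.linalg.svd(Q, compute_uv=False)
estimated_h = int(np.sum(singular_values > 1e-8 * singular_values[0]))
print(estimated_h)                       # prints 64
```

In practice the singular values of a real model do not drop exactly to zero, so the attacker looks for the index where their magnitude falls off sharply rather than using a fixed threshold.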
Using a black-box API
The above attack makes a significant assumption: that the adversary can directly observe the complete logit vector for each input. In practice, this is not true: no production model we are aware of provides such an API. Instead, production APIs typically let users obtain the log probabilities of the top-K tokens (ranked by logit). In this section we address this challenge.
We develop attacks for APIs that return log probabilities for the top K tokens (ranked by logit), and where the user can specify a real-valued bias vector b ∈ R^{|X|} (the “logit bias”) that is added to the logits of the specified tokens before the softmax.
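For illustration, this is roughly what such a query looks like against the OpenAI chat completions endpoint. This is a sketch assuming the current openai Python SDK; the prompt, token id, and bias value are placeholders, and the public API additionally restricts the top-K count and clamps biases to a limited range (about [-100, 100]).

```python
from openai import OpenAI   # assumes the official `openai` Python SDK is installed

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Once upon a"}],
    max_tokens=1,
    temperature=0,                # greedy sampling: the largest-logit token is returned
    logprobs=True,
    top_logprobs=5,               # log probabilities for the top-K tokens
    logit_bias={"1134": 50},      # hypothetical token id -> bias added before the softmax
)

choice = response.choices[0]
print(choice.message.content)     # the sampled (argmax) token
for entry in choice.logprobs.content[0].top_logprobs:
    print(entry.token, entry.logprob)   # top-K tokens with their log probabilities
```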
Our logprob-free attack relies on one simple insight: sampling with temperature 0 returns the token with the largest logit value. By adjusting the logit bias for each token, we can therefore recover every token’s logit value through binary search. Formally, let p be the prompt, and relabel tokens so that the token with index 0 is the most likely token in the response to p, i.e., the token returned by O(p, b = {}). For each token i ≠ 0, we run a binary search over the logit bias term to find the minimal value x_i ≥ 0 such that the model emits token i with probability 1. At that point token i’s biased logit just matches token 0’s, so x_i equals the gap between token 0’s logit and token i’s, and we have recovered token i’s logit relative to token 0 (like all prior attacks, we lose one free variable due to the softmax).
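The sketch below simulates this binary search end to end with numpy. The vocabulary size, hidden size, and the temperature-0 sampling oracle are stand-ins for a real API, and the bias search range and tolerance are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
l, h = 50, 16                               # hypothetical vocab size and hidden size

# Simulated "secret" model: one hidden state and its projection to logits.
W = rng.normal(size=(l, h))
z = W @ rng.normal(size=h)                  # true logits the attacker wants to recover

def sample_temperature_zero(logit_bias: dict[int, float]) -> int:
    """Simulated API call: greedy sampling after adding the logit bias."""
    biased = z.copy()
    for tok, b in logit_bias.items():
        biased[tok] += b
    return int(np.argmax(biased))

top = sample_temperature_zero({})           # "token 0" in the text: the unbiased argmax

def minimal_winning_bias(i: int, hi: float = 100.0, eps: float = 1e-4) -> float:
    """Binary-search the smallest bias x_i >= 0 that makes token i the argmax."""
    lo = 0.0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if sample_temperature_zero({i: mid}) == i:
            hi = mid                        # bias is large enough; shrink from above
        else:
            lo = mid                        # token i still loses; raise the bias
    return hi

# x_i ~= (top logit) - (logit of token i), so each logit is recovered relative to token 0.
recovered = np.array([0.0 if i == top else -minimal_winning_bias(i) for i in range(l)])
reference = z - z[top]                      # ground truth, shifted by the same free constant
print(np.max(np.abs(recovered - reference)))   # agrees to within ~1e-4
```

Each binary search to tolerance eps over a bias range of width B costs about log2(B/eps) queries, so recovering all logits for one prompt takes on the order of l · log2(B/eps) API calls.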