17.7.25
10 min

Research
Entropy as a measure of ambiguity

Although LLMs are becoming stronger at reasoning, deciphering user intent remains a challenge. The problem isn’t just about correctness; it’s about understanding what the user actually meant.

To evaluate and improve our DBT agents, we needed a better understanding of the data used to train our models, and a way to measure how vague a question is in the first place. We were surprised to find no suitable metrics for the task, so we built our own.

We first looked into existing metrics for ambiguity, one being VAGO, which measures vagueness based on how general a word or phrase is. Words like “something”, “any”, or “a lot” are examples of vague language. However, these generic detectors fall short when it comes to the kinds of ambiguities specific to text-to-SQL tasks.

There are many types of ambiguity, and vague descriptors are just one of them. For example, if a schema contained a column named “maintenance_hours”, the LLM might stumble if you asked it to return “service_hours” instead. The difference in clarity between “service” and “maintenance” is domain-specific and won’t register as ambiguous outside this DBT context.

Luckily, there’s a well-established metric that quantifies uncertainties in LLMs: entropy.

Figure 1. Entropy

The $p_i$’s are the probabilities of the LLM predicting each possible next token, and the entropy sums $-p_i \log(p_i)$ across all possible tokens. In practice we only take the top 20, and we’ll explain why shortly. Entropy is a direct measurement of how uncertain the model is: lower entropy means confident, higher entropy means confused. To get a visual sense of this, see the graph of $-p_i \log(p_i)$ below.
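Written out, with the top-20 truncation we use in practice, that is:

$$H \;=\; -\sum_i p_i \log(p_i) \;\approx\; -\sum_{i=1}^{20} p_i \log(p_i)$$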

Figure 2. Term in entropy from [0, 1]

We only consider this graph from 0 to 1, the range a token’s probability can take. Notice that the graph approaches 0 at both $p_i = 0$ and $p_i = 1$, while it peaks somewhere in the middle.

This shape has important implications. 

First of all, the hundreds of thousands of tokens with near-zero probability don’t contribute to the entropy significantly, which is why we only need to take the top 20. 

Now, when a model is confident, most of the probability mass is concentrated on one token, meaning one $p_i$ value is close to 1, and the rest are close to 0. Since $-p_i \log(p_i)$ is near 0 at both extremes, you get low entropy. In contrast, when the model is unsure, it spreads probability more evenly across several tokens. Many $p_i$ values now land in that middle zone (around 0.1 to 0.3) where the graph peaks. These values contribute more heavily to the sum, and you get a higher entropy. So, the entropy rises when the model is uncertain and falls when it’s confident.
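For a quick sense of scale (an illustrative calculation, using the natural log): a confident distribution like $(0.9, 0.05, 0.05)$ gives an entropy of about $0.39$, while spreading the same mass evenly over five tokens gives $\log 5 \approx 1.61$, roughly four times higher.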

To build our actual metric, we first create a context that connects the user query with the expected SQL query, calculate the entropy for each response token, and sum those entropies to get an ambiguity score for the question. Let’s look at a comparison between a clear question and a vague question.
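To make that concrete, here is a minimal sketch of the computation, assuming we already have the top-20 log-probabilities for every token of the expected SQL response; the `get_top_logprobs` helper is hypothetical, standing in for whatever scoring pass produces them.

```python
import math

def token_entropy(top_logprobs):
    """Entropy of a single next-token distribution, truncated to the
    top-k alternatives. Log-probabilities are in nats, so p = exp(lp)."""
    return -sum(math.exp(lp) * lp for lp in top_logprobs)

def ambiguity_score(per_token_top_logprobs):
    """Sum the per-token entropies across the whole response to get one
    ambiguity score for the question."""
    return sum(token_entropy(lps) for lps in per_token_top_logprobs)

# per_token_top_logprobs would come from scoring the gold SQL under the
# prompt, e.g.:
# per_token_top_logprobs = get_top_logprobs(context, expected_sql)
# score = ambiguity_score(per_token_top_logprobs)
```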

Figure 3. Entropy across response tokens for clear and vague questions

The purple line (vague) stays mostly above the blue line (clear), which is a good sign: our vague question is creating higher per-token entropy. But the ratio isn’t constant across the response. Around position 190, for example, the clear question momentarily spikes in entropy.

We’re on the right track, but entropy signals can be noisy across an LLM response, especially when LLMs begin drifting from the gold response. Including these noisy regions in the final entropy comparison can blur the distinction between clear and vague questions, making it harder to quantify their differences. This is expected! Some confusion stems not from the prompt itself, but from the LLM’s internal reasoning and SQL knowledge from its training. Including those regions dilutes the signal we actually care about: how the model's uncertainty is affected by the part of the question that changed. 

We need to isolate the effect of the ambiguity we introduced, and zoom in on the signals most influenced by the wording of the question.

Thanks to the structured nature of SQL, we can do exactly that: label the exact location of the response that should be affected by specific wording in the input. Let’s see an example.

Clear user query: 

Could you show duplicate CBSA coverage metrics together with some congressional district metrics? Identify CBSA names that appear multiple times, providing the total number of occurrences and their types. Then find a few congressional districts with high CID values, tracking their district numbers and CID values and assigning a rank from 1-10 to each. Present the results in a unified view with columns named location_name, location_type, metric_value, ranking, and metric_type. 

Vague user query: 

Could you show duplicate CBSA coverage metrics together with some congressional district metrics? Identify CBSA names that appear multiple times, providing the total number of occurrences and their types. Then find a few congressional districts with high CID values, tracking their district numbers and CID values and assigning a rank to each. Please combine the final view with columns named location_name, location_type, metric_value, ranking, and metric_type.

We made two changes to increase vagueness: the first removes the 1-10 range for the ranking, and the second no longer explicitly asks for a union.

Figure 4. Unmasking SQL snippet

Now, we highlight the “snippets” of the DBT model that would be affected by our wording choices. We’re allowed to do this because entropy is calculated per token, so every entropy value on the graph is independent of the values at earlier positions and depends only on the context the model sees at that step.

We once again compare the two entropies, but this time masking out the rest of the response.
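Here is a sketch of that masking step, assuming we have per-token entropies for both prompts against the same gold SQL and have labeled the affected snippets as half-open token index ranges; these helpers are illustrative, not our production code.

```python
def masked_entropy(entropies, snippet_ranges):
    """Sum entropy only over tokens inside a labeled SQL snippet;
    everything outside the snippets is masked out."""
    keep = set()
    for start, end in snippet_ranges:   # half-open ranges [start, end)
        keep.update(range(start, end))
    return sum(h for i, h in enumerate(entropies) if i in keep)

# Both questions target the same gold SQL, so they share snippet locations:
# h_vague = masked_entropy(vague_entropies, snippet_ranges)
# h_clear = masked_entropy(clear_entropies, snippet_ranges)
```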

Figure 5. Entropy across responses for clear and vague question, masked according to SQL snippets

This is much cleaner! When the SQL snippets are labeled carefully, the vague question produces a consistently higher entropy. This example reveals some interesting subtleties as well: the model seems more confused by the missing rank range than by the vague union instruction, suggesting it found the “UNION” part easier to infer.

Our final metric is defined below, where $h_{\text{vague}}$ and $h_{\text{clear}}$ are the entropy values encompassed by the SQL snippets.
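In other words, it is the ratio of the snippet-masked entropy sums:

$$\text{ambiguity} \;=\; \frac{h_{\text{vague}}}{h_{\text{clear}}} \;=\; \frac{\sum_{t \in \text{snippets}} H_{\text{vague}}(t)}{\sum_{t \in \text{snippets}} H_{\text{clear}}(t)}$$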

Taking the ratio between the entropy of the vague question and that of the clear question, we get a score of 2.09, so we can think of the vague question as roughly “two times more vague”.

We tested this metric on a small subset of our training data and found the distribution of ambiguity shown below.

The standard deviation across the samples was 1.7, giving us a good range of ambiguity to work with relative to the baseline. We’ll be adopting this metric to filter more data in the future.
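As a sketch of how that filtering could look (the threshold here is purely illustrative, not a value we have settled on):

```python
def filter_by_ambiguity(samples, scores, max_score=2.0):
    """Keep only training samples whose ambiguity score is at or below
    an illustrative threshold."""
    return [s for s, score in zip(samples, scores) if score <= max_score]
```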

For agents to be useful, understanding user intent is just as important as reasoning skill. By quantifying context-specific ambiguity through metrics such as ours, we can get a more accurate sense of data distribution and quality, which is the backbone of training a better model.
