Do Language Models know what they know?

Language models are getting good at a wide range of tasks — but do they actually know what they know? Turns out, not really.

Take the recent Humanity’s Last Exam benchmark as an example. Most LLMs understandably struggle with it. What is surprising, however, is their high calibration error: they frequently assign high confidence to incorrect answers. In other words, the model “believes” it knows the answer even when it doesn’t.

Predicting what a language model knows could be a promising approach to hallucination detection. If the model lacks knowledge about a given query, it’s likely to hallucinate. Therefore, estimating its confidence in the answer may serve as a useful signal.

Can we detect hallucinations by asking the model if it knows the answer?

We think so. We’re working on a method to estimate how confident a model is in its own answers.

The idea is straightforward: post-train an LLM to assess whether its previous answer was correct. The prompt looks like this:

Human: <user question>  
Assistant: <LLM-generated answer>  
Human: Is the previous answer correct?
	   Choices:
		   A: No
		   B: Yes
	   Answer (output exactly 'A' or 'B'):
Assistant: <generates A or B; we read the logits for that token>
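
For concreteness, here is a minimal sketch of how that prompt could be assembled as a chat message list, ready for a tokenizer's chat template (the helper name is ours, not from any library):

# Build the two-turn self-assessment prompt as a chat message list.
# The wording mirrors the template above; build_self_check_messages is
# our own helper name, not part of any library.
def build_self_check_messages(question: str, answer: str) -> list[dict]:
    check_question = (
        "Is the previous answer correct?\n"
        "Choices:\n"
        "    A: No\n"
        "    B: Yes\n"
        "Answer (output exactly 'A' or 'B'):"
    )
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
        {"role": "user", "content": check_question},
    ]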

The workflow is the following:

  • Given a dataset and a target LLM, we sample three answers for each data point
  • We label these answers with three correctness functions (ROUGE-L, LLM-as-a-judge, AlignScore) and keep only the samples on which all three functions agree (a sketch of this labeling step follows the list)
  • We construct a training set from the prompt above and train a LoRA adapter to predict just A or B (Incorrect/Correct) given the prompt
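
A rough sketch of the labeling step is below. ROUGE-L comes from the rouge-score package; the LLM-as-a-judge and AlignScore checks are left as stubs, and the 0.5 threshold is an illustrative assumption rather than the value we actually use.

# Sketch of the labeling/agreement filter. ROUGE-L comes from the
# rouge-score package; the judge and AlignScore checks are stubs.
from rouge_score import rouge_scorer

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_correct(reference: str, answer: str, threshold: float = 0.5) -> bool:
    # The 0.5 threshold is an illustrative choice, not the value from the post.
    return _rouge.score(reference, answer)["rougeL"].fmeasure >= threshold

def judge_correct(question: str, reference: str, answer: str) -> bool:
    raise NotImplementedError("query your LLM-as-a-judge here")

def alignscore_correct(reference: str, answer: str) -> bool:
    raise NotImplementedError("run AlignScore here")

def label_or_drop(question: str, reference: str, answer: str):
    # Keep the sample only if all three correctness functions agree.
    votes = [
        rouge_correct(reference, answer),
        judge_correct(question, reference, answer),
        alignscore_correct(reference, answer),
    ]
    if all(votes):
        return "B"   # unanimously correct
    if not any(votes):
        return "A"   # unanimously incorrect
    return None      # disagreement: drop the sample from the training set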

At inference time, the base LLM generates as usual. Then, we plug in the LoRA adapter, re-prompt with the question+answer, and inspect the logits for A/B to estimate confidence.
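
A minimal sketch of that confidence extraction with transformers + peft is below. The checkpoint name, the adapter path, and the assumption that "A"/"B" each encode to a single token are ours, not details from the training setup.

# Sketch of the confidence-estimation step with transformers + peft.
# Checkpoint name and adapter path are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "microsoft/phi-4"                 # illustrative; the post uses phi4-instruct
ADAPTER_PATH = "path/to/calibration-lora"   # hypothetical path to the trained adapter

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID)
model = PeftModel.from_pretrained(base, ADAPTER_PATH)
model.eval()

CHECK_QUESTION = (
    "Is the previous answer correct?\n"
    "Choices:\n"
    "    A: No\n"
    "    B: Yes\n"
    "Answer (output exactly 'A' or 'B'):"
)

def answer_confidence(question: str, answer: str) -> float:
    # Re-prompt with the question + generated answer and read the next-token logits.
    messages = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
        {"role": "user", "content": CHECK_QUESTION},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]
    # Assumes "A" and "B" each map to a single token id.
    a_id = tokenizer.encode("A", add_special_tokens=False)[0]
    b_id = tokenizer.encode("B", add_special_tokens=False)[0]
    p_no, p_yes = torch.softmax(next_token_logits[[a_id, b_id]], dim=-1).tolist()
    return p_yes  # estimated probability that the answer is correct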

Does it work? Preliminary Results

Yes, but at the moment it’s not substantially better than baselines.

Here’s a plot showing results on phi4-instruct, fine-tuned on ~20K examples from a mix of datasets (excluding the eval set). Our approach is the last column, “calibration tuning”.

Want to know more or get involved?

If you are interested and want to get involved, ping me on Discord or reply here on the forum.

Related Works

  1. [2207.05221] Language Models (Mostly) Know What They Know. The Anthropic paper that introduced the problem and P(IK) probes: a linear probe trained on top of hidden representations to predict whether the model answers the task correctly (they also fine-tune the whole LLM).
  2. [2406.08391] Large Language Models Must Be Taught to Know What They Don't Know. A follow-up that uses LoRA instead of fine-tuning the whole LLM. However, because their focus is on calibrating the base model rather than just detecting correctness, they modify the model’s confidence distributions, which disrupts its original tuning; they introduce KL regularization to counteract this.
  3. Detecting hallucinations in large language models using semantic entropy | Nature. The Semantic Entropy paper, one of the current SOTA approaches.