

Poster

Lightweight Language Model Calibration for Open-ended Question Answering with Varied Answer Lengths

Xin Liu · Muhammad Khalifa · Lu Wang

Halle B
Wed 8 May 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

A model is considered well-calibrated when its probability estimates align with the true likelihood of its outputs being correct. Calibrating large language models (LLMs) is crucial, as it plays a vital role in detecting and mitigating hallucinations, a common issue of LLMs, as well as in building more trustworthy models. Yet popular neural model calibration techniques are not well-suited for LLMs due to their lack of flexibility in discerning answer correctness and their high computational costs. For instance, post-processing methods, e.g., temperature scaling, are often unable to reorder the candidate generations. Moreover, training-based methods require fine-tuning the entire model, which becomes impractical given the increasing sizes of modern LLMs. In this paper, we present LitCab, a lightweight calibration mechanism consisting of a single linear layer that takes the sentence representation as input and predicts a bias term, which is then added to the LM output logits. LitCab yields better-calibrated models while adding and training less than 2% of the original model parameters. For evaluation, we construct CaT, a benchmark consisting of six open-ended question-answering (QA) tasks, covering responses ranging from short phrases to paragraphs. We test LitCab with Llama2-7B, where it improves calibration across all tasks. We further conduct a comprehensive evaluation with multiple popular open-source LLMs from the GPT and LLaMA families, yielding the following key findings: (i) Larger models within the same family exhibit better calibration. (ii) GPT-family models show superior calibration compared to LLaMA, Llama2, and Vicuna models, despite having far fewer parameters. (iii) Fine-tuning a pretrained model (e.g., LLaMA) on samples with a narrow focus (e.g., conversations) may lead to worse calibration, highlighting the importance of the fine-tuning setup.
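The abstract's description of the mechanism, a single linear layer mapping a sentence representation to a bias added to the LM's output logits, suggests roughly the following shape. This is a minimal PyTorch sketch, not the authors' implementation: the class name LitCabBias, the way the sentence representation is pooled, the tensor shapes, and the dimensions (4096 hidden size, 32000 vocabulary, matching Llama2-7B) are assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class LitCabBias(nn.Module):
    """Illustrative sketch of a lightweight calibration head: a single
    linear layer that maps a sentence representation to a bias term,
    which is added to the frozen LM's output logits."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        # The only trainable component: one linear projection from the
        # hidden dimension to the vocabulary dimension.
        self.bias_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, sentence_repr: torch.Tensor, lm_logits: torch.Tensor) -> torch.Tensor:
        # sentence_repr: (batch, hidden_size) pooled representation of a candidate answer
        # lm_logits:     (batch, vocab_size) logits produced by the frozen base LM
        bias = self.bias_head(sentence_repr)   # (batch, vocab_size)
        return lm_logits + bias                # calibrated logits


# Hypothetical usage with a frozen base LM (random tensors stand in for real
# hidden states and logits):
hidden_size, vocab_size = 4096, 32000          # e.g., Llama2-7B dimensions
calibrator = LitCabBias(hidden_size, vocab_size)
sentence_repr = torch.randn(2, hidden_size)    # pooled hidden states of two candidate answers
lm_logits = torch.randn(2, vocab_size)         # output logits from the frozen LM
calibrated_logits = calibrator(sentence_repr, lm_logits)
```

Because only the bias head is trained while the base LM stays frozen, the added parameters amount to a small fraction of the original model, consistent with the under-2% figure stated in the abstract.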
