Large language models (LLMs) have made huge strides in natural language understanding thanks to advances in deep learning. Models like ChatGPT can carry on engaging conversations and answer questions with nuanced, human-like responses. However, the lack of built-in confidence scores means we don’t always know when LLMs may be uncertain or provide incorrect answers. This poses risks, especially as they are applied in high-stakes domains like healthcare. To help address this challenge, Google AI researchers have developed a novel approach called ASPIRE, which uses advanced selective prediction techniques – help in making LLMs safer. Let’s explore how the Google ASPIRE framework works and what this means for the future of LLMs.
Table of Contents
What is Google ASPIRE
ASPIRE stands for Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs. The framework fine-tunes pre-trained LLMs to not only generate responses but also assign confidence scores indicating how certain they are in their predictions. Google’s ASPIRE framework aimed to develop a more robust selective prediction technique applicable to any LLM and task.
Across several question-answering datasets, ASPIRE outperformed prior selective prediction methods, demonstrating the potential of this technique to make LLMs’ predictions more trustworthy and their applications safer. Google applied ASPIRE using “soft prompt tuning” – optimizing learnable prompt embeddings to condition the model for specific goals. Two prompts guide the model to generate responses and then evaluate confidence in them.
Example of How Google ASPIRE Works
Let’s explore how ASPIRE works through an example. Consider the question, “Which vitamin helps regulate blood clotting?”. Without selective prediction capabilities, models may provide the most statistically likely answer of “Vitamin C”, which is incorrect.
However, with ASPIRE fine-tuning, the model’s response would look very different. It would output both an answer, such as “Vitamin K”, alongside a predictive confidence score ranging from 0-1.
If the score was high, around 0.9, we could have strong confidence in the prediction. But a low score of 0.1 suggests the model is uncertain, so the response should be carefully verified before being considered definitive.
In the above example, the Google ASPIRE framework helped to answer this question with the first soft prompt and then computed the learned self-evaluation score with the second soft prompt.
How Google ASPIRE Enables Selective Prediction
ASPIRE has three main stages – task tuning, answer sampling, and self-evaluation learning:
1. Task Tuning
The model is first fine-tuned on a task like question answering to improve its predictive performance. Parameter-efficient techniques freeze most weights while tuning a small subset for the task. This enhances the likelihood of correct outputs.
2. Answer Sampling
The tuned model generates multiple possible answers for training questions. High-probability answers based on metrics like ROUGE similarity are retained as a self-evaluation dataset.
3. Self-Evaluation Learning
Additional parameters are trained using this dataset to discriminate between right and wrong answers. Frozen original weights ensure prediction behaviour remains unchanged during this learning.
Evaluating ASPIRE Effectiveness
To evaluate ASPIRE, Google adapted several pre-trained transformer models like OPT-2.7B and OPT-30B using its framework. They found task tuning with ASPIRE boosted accuracy over larger baseline models on question-answering benchmarks like CoQA, TriviaQA and SQuAD.
More importantly, ASPIRE significantly improved models’ ability to distinguish correct from incorrect predictions, as measured by the AUROC score. On CoQA, AUROC increased from 51.3% to 80.3% compared to other methods.
Interestingly, while OPT-30B had higher baseline question answering accuracy on TriviaQA, applying traditional self-evaluation did little to help its selective prediction. However, tuning OPT-2.7B with ASPIRE outperformed both baselines, showing strategic adaptation can be more impactful than sheer model scale.
What This Means For the Future of LLMs
Google ASPIRE embodies a significant shift in the landscape of LLMs. It emphasizes that the capacity of a language model is not solely determined by its size but can be greatly improved through strategic adaptations. By allowing LLMs to assess their own certainty and make selective predictions, ASPIRE opens up new possibilities for precise and confident decision-making. By incorporating ASPIRE into LLMs, we can improve the reliability and accuracy of their responses, making them valuable tools in various applications, including healthcare, customer support, and education.
Last But Not Least
Google ASPIRE is not just another framework; it represents a vision of a future where LLMs can be trusted partners in decision-making. Google AI’s research in this field has laid the foundation for further advancements, and they invite the community to build upon this work. With ASPIRE, we are one step closer to unlocking the full potential of AI in critical applications.
| Read More from Google:
- Google Created A New AI Called AMIE Better At Medical Diagnostics Than Physicians
- What is Privacy Sandbox and How Google’s New Cookieless Policy Will Affect Users
- How Google DeepMind is Advancing Robotic Automation with AutoRT, SARA-RT and RT-Trajectory
- Google VideoPoet: A Groundbreaking LLM for Zero-Shot Video Generation
- Google Imagen 2: A Game-Changing AI Tool That Takes Photorealism to New Heights
- Google New Gemini 1.0 Model
- Google StyleDrop: A Game-Changing AI Image Generator
- MedLM by Google is Transforming the Healthcare Industry
- AI in Google Workspace: Google Sheets, Slides, Docs, and Gmail






