
Introducing ASPIRE for selective prediction in LLMs

Posted by Jiefeng Chen, Student Researcher, and Jinsung Yoon, Research Scientist, Cloud AI Team

In the fast-evolving landscape of artificial intelligence, large language models (LLMs) have revolutionized the way we interact with machines, pushing the boundaries of natural language understanding and generation to unprecedented heights. Yet the leap into high-stakes decision-making applications remains a chasm too wide, primarily due to the inherent uncertainty of model predictions. Traditional LLMs generate responses autoregressively, one token at a time, yet they lack an intrinsic mechanism to assign a confidence score to these responses. Although one can derive a confidence score by summing the log-probabilities of the individual tokens in the sequence, such approaches typically fall short in reliably distinguishing between correct and incorrect answers. But what if LLMs could gauge their own confidence and only make predictions when they’re sure?

Selective prediction aims to do this by enabling LLMs to output an answer along with a selection score, which indicates the probability that the answer is correct. With selective prediction, one can better understand the reliability of LLMs deployed in a variety of applications. Prior research, such as semantic uncertainty and self-evaluation, has attempted to enable selective prediction in LLMs. A typical approach is to use heuristic prompts like “Is the proposed answer True or False?” to trigger self-evaluation in LLMs. However, this approach may not work well on challenging question answering (QA) tasks.

Figure: The OPT-2.7B model incorrectly answers a question from the TriviaQA dataset: “Which vitamin helps regulate blood clotting?” with “Vitamin C”. Without selective prediction, the LLM may output the wrong answer which, in this case, could lead users to take the wrong vitamin. With selective prediction, the LLM outputs an answer along with a selection score. If the selection score is low (here, 0.1), the LLM additionally outputs “I don’t know!” to warn users not to trust the answer, or to verify it using other sources.
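The behavior in this example can be sketched in a few lines. This is only an illustration: the function name `selective_answer` and the threshold value of 0.5 are assumptions made here, not part of ASPIRE itself, and a real deployment would choose the threshold for its own risk tolerance.

```python
def selective_answer(answer: str, selection_score: float, threshold: float = 0.5) -> str:
    """Return the answer, appending a warning when the selection score is low.

    `threshold` is a hypothetical cut-off chosen for illustration.
    """
    if not 0.0 <= selection_score <= 1.0:
        raise ValueError("selection_score is expected to lie in [0, 1]")
    if selection_score >= threshold:
        return answer
    # Low score: still show the answer, but flag it as untrustworthy.
    return f"{answer} (I don't know!)"


# The wrong answer from the example above, with its low selection score.
print(selective_answer("Vitamin C", 0.1))
```

Raising the threshold makes the system abstain more often, trading coverage for reliability.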

In “Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs”, presented at Findings of EMNLP 2023, we introduce ASPIRE, a novel framework designed to enhance the selective prediction capabilities of LLMs. ASPIRE fine-tunes LLMs on QA tasks via parameter-efficient fine-tuning, and trains them to evaluate whether their generated answers are correct. ASPIRE allows LLMs to output an answer along with a confidence score for that answer. Our experimental results demonstrate that ASPIRE significantly outperforms state-of-the-art selective prediction methods on a variety of QA datasets, such as the CoQA benchmark.

The mechanics of ASPIRE

Imagine teaching an LLM to not only answer questions but also evaluate those answers — akin to a student verifying their answers in the back of the textbook. That’s the essence of ASPIRE, which involves three stages: (1) task-specific tuning, (2) answer sampling, and (3) self-evaluation learning.

Task-specific tuning: ASPIRE performs task-specific tuning to train adaptable parameters (θ_p) while freezing the LLM. Given a training dataset for a generative task, it fine-tunes the pre-trained LLM to improve its prediction performance. Towards this end, parameter-efficient tuning techniques (e.g., soft prompt tuning and LoRA) might be employed to adapt the pre-trained LLM on the task, given their effectiveness in obtaining strong generalization with small amounts of target task data. Specifically, the LLM parameters (θ) are frozen and adaptable parameters (θ_p) are added for fine-tuning. Only θ_p are updated to minimize the standard LLM training loss (e.g., cross-entropy). Such fine-tuning can improve selective prediction performance because it not only improves the prediction accuracy, but also enhances the likelihood of correct output sequences.
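Written out, the tuning step described above amounts to the following objective. The notation is an assumption introduced here for illustration: D^tr stands for the training set of (question, answer) pairs, and the loss shown is the token-level cross-entropy mentioned in the text.

```latex
\theta_p^{*} \;=\; \arg\min_{\theta_p} \;
\mathbb{E}_{(x,\,y) \sim \mathcal{D}^{tr}}
\left[ \, -\sum_{t=1}^{|y|} \log p\!\left(y_t \mid x,\, y_{<t};\; \theta,\, \theta_p\right) \right]
```

Only θ_p appears under the arg min: θ enters the likelihood but stays fixed throughout.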

Answer sampling: After task-specific tuning, ASPIRE uses the LLM with the learned θ_p to generate different answers for each training question and create a dataset for self-evaluation learning. We aim to generate output sequences that have a high likelihood. We use beam search as the decoding algorithm to generate high-likelihood output sequences and the Rouge-L metric to determine if the generated output sequence is correct.
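The labeling step can be illustrated with a minimal Rouge-L sketch. It rests on several assumptions: whitespace tokenization, the F-measure variant of Rouge-L, and a hypothetical correctness threshold of 0.7. Production Rouge implementations add their own tokenization and stemming, so their scores may differ.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Rouge-L F-measure on lowercased, whitespace-split tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def label_answers(candidates: list, reference: str, threshold: float = 0.7) -> list:
    """Pair each sampled answer with a correct/incorrect label.

    `threshold` is a hypothetical value chosen for illustration.
    """
    return [(c, rouge_l_f1(c, reference) >= threshold) for c in candidates]


# Two hypothetical beam-search outputs for the blood-clotting question.
print(label_answers(["vitamin k", "vitamin c"], "Vitamin K"))
```

The resulting (answer, label) pairs are the kind of data the self-evaluation stage can train on.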

Self-evaluation learning: After sampling high-likelihood outputs for each query, ASPIRE adds adaptable parameters (θ_s) and only fine-tunes θ_s for learning self-evaluation. Since the output sequence generation only depends on θ and θ_p, freezing θ and the learned θ_p can avoid changing the prediction behaviors of the LLM when learning self-evaluation. We optimize θ_s such that the adapted LLM can distinguish between correct and incorrect answers on its own.
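One plausible way to picture this training signal is a binary cross-entropy over the sampled answers, where each answer carries a correct/incorrect label. The sketch below assumes that form. The function name and the input layout are hypothetical, and the probabilities would in practice come from the LLM conditioned on θ, θ_p and θ_s rather than being typed in by hand.

```python
import math


def self_eval_loss(examples: list) -> float:
    """Mean binary cross-entropy over (p_correct, is_correct) pairs.

    p_correct: the model's predicted probability that a sampled answer is correct.
    is_correct: the label assigned to that answer during answer sampling.
    """
    if not examples:
        raise ValueError("need at least one example")
    eps = 1e-12  # keeps log() finite when a probability is exactly 0 or 1
    total = 0.0
    for p_correct, is_correct in examples:
        p = min(max(p_correct, eps), 1.0 - eps)
        total += -math.log(p) if is_correct else -math.log(1.0 - p)
    return total / len(examples)


# A well-calibrated evaluator gets a lower loss than a badly calibrated one.
good = self_eval_loss([(0.9, True), (0.1, False)])
bad = self_eval_loss([(0.1, True), (0.9, False)])
print(good < bad)
```

Minimizing a loss of this kind with respect to θ_s alone leaves the answers themselves unchanged, which matches the freezing argument above.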

The three stages of the ASPIRE framework.

In the proposed framework, θ_p and θ_s can be trained using any parameter-efficient tuning approach. In this work, we use soft prompt tuning, a simple yet effective mechanism for learning “soft prompts” to condition frozen language models to perform specific downstream tasks more effectively than traditional discrete text prompts. The driving force behind this approach lies in the recognition that if we can develop prompts that effectively stimulate self-evaluation, it should be possible to discover these prompts through soft prompt tuning in conjunction with targeted training objectives.

Implementation of the ASPIRE framework via soft prompt tuning. We first generate the answer to the question with the first soft prompt and then compute the learned self-evaluation score with the second soft prompt.

After training θ_p and θ_s, we obtain the prediction for the query via beam search decoding. We then define a selection score that combines the likelihood of the generated answer with the learned self-evaluation score (i.e., the likelihood of the prediction being correct for the query) to make selective predictions.
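As a rough sketch of such a combination, the snippet below interpolates in log space between a length-normalized answer likelihood and the learned self-evaluation score. The interpolation form, the weight name `alpha`, and its default value are assumptions made for illustration. The post only states that the two quantities are combined.

```python
import math


def selection_score(token_logprobs: list, p_correct: float, alpha: float = 0.25) -> float:
    """Combine answer likelihood and self-evaluation score in log space.

    token_logprobs: per-token log-probabilities of the decoded answer.
    p_correct: learned self-evaluation score, a probability in (0, 1].
    alpha: hypothetical weight on the self-evaluation term.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    if not 0.0 < p_correct <= 1.0:
        raise ValueError("p_correct is expected to lie in (0, 1]")
    # Average rather than sum, so long answers are not penalized for length.
    norm_loglik = sum(token_logprobs) / len(token_logprobs)
    return (1.0 - alpha) * norm_loglik + alpha * math.log(p_correct)


# Same decoded answer, two different self-evaluation scores.
logprobs = [math.log(0.9), math.log(0.8)]
print(selection_score(logprobs, 0.9) > selection_score(logprobs, 0.1))  # True
```

With `alpha=0` the score reduces to the normalized likelihood alone, and with `alpha=1` it depends only on the self-evaluation score.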

Results

To demonstrate ASPIRE’s efficacy, we evaluate it across three question-answering datasets (CoQA, TriviaQA, and SQuAD) using various open pre-trained transformer (OPT) models. By training θ_p with soft prompt tuning, we observed a substantial increase in the LLMs’ accuracy. For example, the OPT-2.7B model adapted with ASPIRE demonstrated improved performance over the larger, pre-trained OPT-30B model on the CoQA and SQuAD datasets. These results suggest that with suitable adaptations, smaller LLMs might have the capability to match or potentially surpass the accuracy of larger models in some scenarios.

When the model predictions are held fixed and only the selection scores are compared, ASPIRE achieves a higher AUROC score (the probability that a randomly chosen correct output sequence has a higher selection score than a randomly chosen incorrect output sequence) than baseline methods across all datasets. For example, on the CoQA benchmark, ASPIRE improves the AUROC from 51.3% to 80.3% compared to the baselines.
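The AUROC definition given above translates directly into a pairwise comparison. The sketch below follows that definition. Counting a tie as half a win is a common convention assumed here, and the scores in the usage example are made up for illustration.

```python
def auroc(scores: list, is_correct: list) -> float:
    """Probability that a correct output outranks an incorrect one.

    Compares every (correct, incorrect) pair of selection scores.
    Ties count as half a win (an assumed convention).
    """
    pos = [s for s, c in zip(scores, is_correct) if c]
    neg = [s for s, c in zip(scores, is_correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect output")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical selection scores: one correct answer is ranked below an incorrect one.
print(auroc([0.9, 0.2, 0.6, 0.1], [True, True, False, False]))  # 0.75
```

A value of 0.5 corresponds to selection scores that carry no information about correctness, and 1.0 to a perfect ranking.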

An intriguing pattern emerged from the TriviaQA dataset evaluations. While the pre-trained OPT-30B model demonstrated higher baseline accuracy, its performance in selective prediction did not improve significantly when traditional self-evaluation methods (Self-eval and P(True)) were applied. In contrast, the smaller OPT-2.7B model, when enhanced with ASPIRE, outperformed it in selective prediction. This discrepancy underscores a vital insight: larger LLMs utilizing conventional self-evaluation techniques may not be as effective in selective prediction as smaller, ASPIRE-enhanced models.

Our experimental journey with ASPIRE underscores a pivotal shift in the landscape of LLMs: The capacity of a language model is not the be-all and end-all of its performance. Instead, the effectiveness of models can be drastically improved through strategic adaptations, allowing for more precise, confident predictions even in smaller models. As a result, ASPIRE stands as a testament to the potential of LLMs that can judiciously ascertain their own certainty and decisively outperform larger counterparts in selective prediction tasks.

Conclusion

In conclusion, ASPIRE is not just another framework; it’s a vision of a future where LLMs can be trusted partners in decision-making. By honing the selective prediction performance, we’re inching closer to realizing the full potential of AI in critical applications.

Our research has opened new doors, and we invite the community to build upon this foundation. We’re excited to see how ASPIRE will inspire the next generation of LLMs and beyond. To learn more about our findings, we encourage you to read our paper and join us in this thrilling journey towards creating a more reliable and self-aware AI.

Acknowledgments

We gratefully acknowledge the contributions of Sayna Ebrahimi, Sercan O Arik, Tomas Pfister, and Somesh Jha.



Artificial Intelligence (AI), 17 March 2024
