Abstract / Description of output
Common self-improvement approaches for large language models (LLMs), such as STaR, iteratively fine-tune LLMs on self-generated solutions to improve their problem-solving ability. However, these approaches discard the large number of incorrect solutions generated during this process, potentially neglecting valuable information in such solutions. To address this shortcoming, we propose V-STaR, which uses both the correct and incorrect solutions generated during the self-improvement process to train a verifier with DPO that judges the correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidate solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.
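The two uses of generated solutions described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_preference_pairs` and `select_best` are hypothetical, the pairing of every correct solution with every incorrect one is an assumed way of preparing preference data for DPO, and the verifier is stood in for by an arbitrary scoring function.

```python
from typing import Callable, List, Sequence, Tuple


def build_preference_pairs(
    solutions: Sequence[Tuple[str, bool]],
) -> List[Tuple[str, str]]:
    """Turn labelled solutions for one problem into (preferred, rejected) pairs.

    Assumption: each correct solution is paired with each incorrect one.
    The abstract only states that both kinds are used to train the verifier.
    """
    correct = [text for text, is_correct in solutions if is_correct]
    incorrect = [text for text, is_correct in solutions if not is_correct]
    return [(good, bad) for good in correct for bad in incorrect]


def select_best(
    candidates: Sequence[str],
    verifier_score: Callable[[str], float],
) -> str:
    """Return the candidate that the verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=verifier_score)


if __name__ == "__main__":
    # Toy data: solution text plus a correctness label from an answer check.
    labelled = [("x = 4", True), ("x = 5", False), ("x = 3", False)]
    print(build_preference_pairs(labelled))

    # Toy verifier (hypothetical): a lookup table standing in for a trained model.
    toy_scores = {"x = 4": 0.9, "x = 5": 0.2, "x = 3": 0.1}
    print(select_best(list(toy_scores), toy_scores.__getitem__))
```

In practice the scoring function would be the trained verifier applied to a problem and a candidate solution; the lookup table here only keeps the sketch self-contained.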
Original language | English |
---|---|
Title of host publication | Proceedings of the 2024 Conference on Language Modeling |
Publication status | Accepted/In press - 10 Jul 2024 |
Event | Conference on Language Modeling, University of Pennsylvania, Philadelphia, United States. Duration: 7 Oct 2024 → 9 Oct 2024. https://colmweb.org/ |
Conference
Conference | Conference on Language Modeling |
---|---|
Abbreviated title | COLM 2024 |
Country/Territory | United States |
City | Philadelphia |
Period | 7 Oct 2024 → 9 Oct 2024 |
Internet address | https://colmweb.org/ |