PRompt Optimization in Multi-Step Tasks (PROMST): Integrating Human Feedback and Heuristic-based Sampling (EMNLP 2024, Oral, Top 3%)

MIT, Harvard, MIT-IBM Watson AI Lab


We introduce an automatic prompt optimization framework for complex, multi-step agent tasks: PROMST. To handle the challenges of task complexity, judging the long-horizon correctness of individual actions, the high cost of prompt exploration, and human preference alignment, we propose integrating human feedback, a learned score prediction model, and the modification of task score functions. Our approach generally outperforms representative baselines on eight different task environments. We show that learning a score prediction model improves overall performance. Finally, we argue that modifying score functions can help align optimized prompts with user preferences.

Abstract

Prompt optimization aims to find the best prompt to a large language model (LLM) for a given task. LLMs have been successfully used to help find and improve prompt candidates for single-step tasks. However, realistic tasks for agents are multi-step and introduce new challenges: (1) prompt content is likely to be more extensive and complex, making it more difficult for LLMs to analyze errors, (2) the impact of an individual step is difficult to evaluate, and (3) different people may have varied preferences about task execution. While humans struggle to optimize prompts, they are good at providing feedback about LLM outputs; we therefore introduce a new LLM-driven discrete prompt optimization framework that incorporates human-designed feedback rules about potential errors to automatically offer direct suggestions for improvement.
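For illustration, the snippet below sketches one way such human-designed feedback rules could be encoded and applied to an execution trajectory. The rule names, trajectory fields, and helper functions are hypothetical placeholders, not taken from our released code.

```python
# Minimal sketch (illustrative only) of human-designed feedback rules: each rule
# checks an execution step for a known error pattern and emits a textual
# suggestion that can be fed back into prompt generation.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FeedbackRule:
    name: str
    check: Callable[[dict], bool]   # returns True if the error occurred at this step
    suggestion: str                 # human-written advice for improving the prompt


# Example rules a task designer might write for a box-lifting agent task.
RULES: List[FeedbackRule] = [
    FeedbackRule(
        name="overweight_lift",
        check=lambda step: step.get("lifted_weight", 0) > step.get("capacity", 0),
        suggestion="Remind the agent to compare total box weight against lifter capacity before assigning.",
    ),
    FeedbackRule(
        name="invalid_action_format",
        check=lambda step: not step.get("action_parsed", True),
        suggestion="Ask the agent to output actions strictly in the required format.",
    ),
]


def collect_feedback(trajectory: List[dict]) -> List[str]:
    """Run every rule over every step and gather the triggered suggestions."""
    feedback = []
    for step in trajectory:
        for rule in RULES:
            if rule.check(step):
                feedback.append(f"[{rule.name}] {rule.suggestion}")
    return feedback
```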

Our framework is stylized as a genetic algorithm in which an LLM generates new candidate prompts from a parent prompt and its associated feedback; we use a learned heuristic function that predicts prompt performance to efficiently sample from these candidates. This approach significantly outperforms both human-engineered prompts and several other prompt optimization methods across eight representative multi-step tasks (an average 27.7% and 28.2% improvement over the current best methods with GPT-3.5 and GPT-4, respectively). We further show that the score function for tasks can be modified to better align with individual preferences. We believe our work can serve as a benchmark for automatic prompt optimization for LLM-driven multi-step tasks.
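As a rough illustration of this loop, the sketch below shows candidate generation from a parent prompt plus feedback, pre-filtering with a learned score predictor, and real evaluation of only the top-ranked candidates. The interfaces (`propose`, `run_task`, `predict_score`) are assumed placeholders, not the actual implementation.

```python
# Minimal sketch (assumed interfaces) of a PROMST-style optimization loop:
# an LLM proposes child prompts from the current best prompt and its feedback,
# a learned score model ranks the children, and only the top-ranked ones are
# evaluated on the real (expensive) multi-step task.

import heapq
from typing import Callable, List, Tuple


def optimize_prompt(
    initial_prompt: str,
    propose: Callable[[str, List[str]], List[str]],       # LLM: (parent, feedback) -> candidates
    run_task: Callable[[str], Tuple[float, List[str]]],    # env: prompt -> (score, feedback)
    predict_score: Callable[[str], float],                 # learned heuristic score model
    iterations: int = 10,
    top_k: int = 3,
) -> str:
    best_prompt = initial_prompt
    best_score, feedback = run_task(initial_prompt)
    for _ in range(iterations):
        candidates = propose(best_prompt, feedback)
        # Cheap pre-filter: keep only candidates the score model predicts to do well.
        shortlisted = heapq.nlargest(top_k, candidates, key=predict_score)
        for prompt in shortlisted:
            score, fb = run_task(prompt)                   # expensive real evaluation
            if score > best_score:
                best_prompt, best_score, feedback = prompt, score, fb
    return best_prompt
```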


The training of score models, using the BoxLift task as an example. The score model is trained to predict the score of an input prompt for one specific task.
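As a schematic example (not the exact architecture used in the paper), such a score model can be fit as a small regressor from prompt embeddings to previously observed task scores, so that new candidate prompts can be ranked before paying for a full task rollout.

```python
# Minimal sketch (assumed setup) of training a per-task score model on
# (prompt embedding, observed task score) pairs.

import torch
import torch.nn as nn


class PromptScoreModel(nn.Module):
    def __init__(self, embed_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, prompt_embedding: torch.Tensor) -> torch.Tensor:
        # Predict a scalar task score for each prompt embedding.
        return self.net(prompt_embedding).squeeze(-1)


def train_score_model(embeddings: torch.Tensor, scores: torch.Tensor,
                      epochs: int = 200, lr: float = 1e-3) -> PromptScoreModel:
    """Fit the regressor on embeddings of already-evaluated prompts and their scores."""
    model = PromptScoreModel(embeddings.shape[-1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(embeddings), scores)
        loss.backward()
        opt.step()
    return model
```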


Human preference alignment via tuning score functions.
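One simple way to realize this, shown below with purely illustrative metric names and weights, is to express the task score as a weighted sum of sub-objectives; changing the weights steers prompt optimization toward prompts that match a particular user's priorities.

```python
# Minimal sketch (illustrative names only) of preference alignment by reweighting
# the task score function: the same rollout is scored under different preference
# profiles, which changes which prompts the optimizer favors.

from typing import Dict


def task_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-rollout metrics, e.g. success, efficiency, safety."""
    return sum(weights.get(name, 0.0) * value for name, value in metrics.items())


# The same rollout, scored under two different preference profiles.
rollout = {"success": 1.0, "steps_efficiency": 0.6, "safety": 0.9}
speed_focused = {"success": 1.0, "steps_efficiency": 0.8, "safety": 0.2}
safety_focused = {"success": 1.0, "steps_efficiency": 0.2, "safety": 0.8}

print(task_score(rollout, speed_focused))   # favors fast task completion
print(task_score(rollout, safety_focused))  # favors cautious behavior
```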

Related Links

This paper presents work whose goal is to advance the field of Automatic Prompt Tuning. Its focus on multi-step agent tasks will advance research on Foundation Model-based Intelligent Agents. This work is also part of a broader research thread on language-instructed task and motion planning, which transforms natural language instructions into robot control signals.

Other work on Large Language Models for automatic prompt optimization includes:

Other work from our lab on Large Language Models for Robot Task and Motion Planning and LLM-based agents includes:

BibTeX

@article{chen2024prompt,
  title={Prompt optimization in multi-step tasks (promst): Integrating human feedback and preference alignment},
  author={Chen, Yongchao and Arkin, Jacob and Hao, Yilun and Zhang, Yang and Roy, Nicholas and Fan, Chuchu},
  journal={arXiv preprint arXiv:2402.08702},
  year={2024}
}