Machine Learning

On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment

With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering of the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient prompt filters: adversarial prompts that…

Related Posts

Software 2.0 Means Verifiable AI

Policy Maps: Tools for Guiding the Unbounded Space of LLM Behaviors

Search Live is expanding globally

Leave a Reply Cancel reply