Finding X-Risks and S-Risks by Gradient Descent

(This is a fleshed out version of the lightning talk I gave at Megameetup)The Toy Version: Image RecognitionA few years ago I worked on a problem that could be seen as a microcosm of AI alignment. In this scenario, the training of a convolutional neural net for image recognition was outsourced to a nefarious subcontractor, who manipulated the training data to have spurious positives. Perhaps an airport wanted a door to open for the faces of legitimate pilots and not other faces, and the net opened for faces of pilots and anyone with the symbol of Norgorber sharpied on their faces but no one else. This required a simple addition of sharpied faces in the training set, and would not be noticed in normal use of the door. What could the airport, possessing a set of weights but no knowledge of the training procedure, do to catch this?My proposal: the CNN can output the probability that the input is in the desired category. We can also have a “probability” that a modified image is close enough to the original to be a viable attack (one minus a sigmoid of the l2 distance in pixel-space from the original image, though any similar function would be fine). We then use a standard gradient descent optimizer to find modified images that maximize the sum of the probabilities. This will show us the simplest thing to draw on our foreheads to confuse the CNN.We did the tests in MNIST rather than facial recognition, so that our compute requirements would fit in GCP’s free trial for academics. It worked. Full write up at https://dspeyer.github.io/trojan-defender.pdfApplying this to Modern LLMsIn this case, we’re not looking for a modified image, but for a request and a plan. If we handwave away autoregression and remove the final sampling step, an LLM can give us the probability that a given input results in a given answer. We can then use the question “Is <<${request}>> something a human might ask for? Answer in one word.” and the probably of “yes” as a measure of request plausibility and “Does <<${plan}>> carry a risk of causing vast amounts of death or suffering? Answer in one word.” and the probability of “yes” as a measure of danger. Multiplying these three together, we can get an approximate probability that a human will make a request that causes the AI to generate a plan that causes vast amounts of death or suffering. More importantly, this function is differentiable (assuming we can treat our text as continuous in a way that allows the concatenations).So we can pick some starting values based on AI monkey’s paw scenarios and gradient descent our way into plausible failure modes of the AI we’re testing.And once we have them, we can confirm them by making the request repeatedly with a high temperature selection and hand-evaluating the plans.This can show that an AI isn’t safe, but only provides some evidence that one is. Though if this technique and improvements on it keep finding dangers, then it eventually finding nothing becomes somewhat strong evidence.Also, there’s a lot of plausible scenarios in which we suspect an AI is unsafe but need to convince skeptics. Concrete failure mode examples could be very useful there.Can we Actually Do This?Now, I handwaved autoregression earlier and said we could just get p(request,response). We could do this by brute force, multiplying each token’s probability, maybe using a few pieces of caching to shave off a bit of compute. But we can probably improve on this. We could build special nets for the plausibility and danger tests, starting from the main net and the doing a quick respecialization run (training off the original net with the concatenation trick) to give scalar output directly. We might also be able to find the Mental State After Reader Request, which wouldn’t require a ton of separate evaluations, and then run a separate gradient descent just to find the words that would generate that state.This may still leave us somewhere short of practical. I’m hoping someone who knows more details of LLMs can take this idea and run with it. Or someone with the ability to throw around ludicrous amounts of compute. The toy-problem version of this did work, so it seems worth a shot.Discuss Read More

Related Posts

How I Started Being Productive

GPT-5.2 Is Frontier Only For The Frontier

ARENA 7.0 Impact Report

Leave a Reply Cancel reply