Mitigating prompt injection with model-defined finite automata over agent trajectories
Tags: ai, prompt injection, dev, research, security
dystopiabreaker.xyz Feb 26, 2026

Summary

Prompt injection is a key problem in building reliable, long-running agents. I've been experimenting with a method for mitigating prompt injection in agents that combines long-horizon planning with a custom language for expressing which state transitions are valid.

[...]

We instantiate two models: a task planner, which acts as a security engineer and generates the tool-call authorization language from the (trusted) task description, and the agent model.
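As a sketch of this setup, the planner can be any chat model prompted into the security-engineer role; the prompt wording, function names, and the generic `complete` callable below are my own assumptions, not the post's implementation:

```python
# Sketch of the planner half of the two-model setup. `complete` stands in
# for any chat-model call; the prompt wording is illustrative.
PLANNER_SYSTEM = (
    "You are a security engineer. Given a trusted task description and the "
    "available tools, emit a regular expression over tool names describing "
    "every valid sequence of tool calls for the task. Output only the "
    "expression."
)

def make_planner_prompt(task: str, tools: list[str]) -> str:
    """Build the planner prompt from the trusted task description."""
    return f"{PLANNER_SYSTEM}\n\nTools: {', '.join(tools)}\nTask: {task}\n"

def plan_authorization_language(complete, task: str, tools: list[str]) -> str:
    """Ask the planner model for the tool-call authorization language."""
    return complete(make_planner_prompt(task, tools)).strip()
```

The key property is that the planner only ever sees the trusted task description, never the untrusted content the agent later encounters.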

We prompt the planner to generate a language that represents the valid set of state transitions allowed to be taken by the agent during the execution of the task. This produces the tool-call authorization language. For a task like "find the participants of the introductory meeting on May 15th and create a follow-up meeting on May 19th at 10am", the planner produces:

`(search_calendar_events | get_day_calendar_events)+ create_calendar_event`
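The custom compiler isn't shown in the post, but the automaton it would produce for this particular grammar is small enough to write out by hand. A minimal sketch (state names are my own):

```python
# Hand-compiled DFA for the grammar
#   (search_calendar_events | get_day_calendar_events)+ create_calendar_event
# States (illustrative names):
#   "start"     -- no tool calls made yet
#   "searching" -- at least one calendar-read call made
#   "done"      -- follow-up event created (accepting state)
TRANSITIONS = {
    ("start", "search_calendar_events"): "searching",
    ("start", "get_day_calendar_events"): "searching",
    ("searching", "search_calendar_events"): "searching",
    ("searching", "get_day_calendar_events"): "searching",
    ("searching", "create_calendar_event"): "done",
}
ACCEPTING = {"done"}

def run(trajectory):
    """Return the final state, or None as soon as a transition is invalid."""
    state = "start"
    for tool in trajectory:
        state = TRANSITIONS.get((state, tool))
        if state is None:
            return None
    return state
```

A benign trajectory like `["search_calendar_events", "create_calendar_event"]` reaches the accepting state, while an injected call such as `send_email` has no outgoing edge anywhere in the automaton and is rejected immediately.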

Then, we instantiate this as an FSM (using the custom compiler), and the agent winds through it during agent execution. Any invalid state transitions, that might be introduced by prompt injection, result in a rejection and a reminder to the agent. The agent has the option to raise the issue again to the planner, in the case where the grammar is too restrictive for the task. The planner may choose to amend the grammar, depending on its analysis of the agent's transcript.