> What I think now: GPT can only simulate. If you punish it for simulating bad characters, it will start simulating good characters. Now it only ever simulates one character, the HHH Assistant.
>
> Janus' Simulators - Scott Alexander
Janus was writing in September 2022, just before ChatGPT. ChatGPT is no more advanced than its predecessors; it is just better at covering up the alien nature of the architecture they all share.
[...]
Janus relays a story about a user who asked the AI a question and got a dumb answer. When the user re-prompted GPT with “how would a super-smart AI answer this question?” it gave him a smart answer. Why? Because it wasn’t even trying to answer the question the first time - it was trying to complete a text about the question. The second time, the user asked it to complete a text about a smart AI answering the question, so it gave a smarter answer.
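To make that mechanism concrete, here is a minimal sketch of the two prompts as raw completion calls. The OpenAI client, the `gpt-3.5-turbo-instruct` model name, and the exact wording are my assumptions for illustration, not the original user's setup; the point is that the only thing that changes between the calls is the document the model is asked to continue.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTION = "Why do mirrors flip left and right but not up and down?"

# First attempt: the model simply continues whatever kind of document this
# prompt looks like it came from, which need not be a careful answer.
plain = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # assumed completion-style model
    prompt=QUESTION,
    max_tokens=200,
)

# Second attempt: the prompt now describes a super-smart AI answering, so the
# most plausible continuation of the text is the smart answer itself.
framed = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=(
        "How would a super-smart AI answer this question?\n\n"
        f"Question: {QUESTION}\n"
        "Super-smart AI:"
    ),
    max_tokens=200,
)

print(plain.choices[0].text.strip())
print(framed.choices[0].text.strip())
```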
[...]
But the essay brings up another connotation: to simulate is to pretend to be something. A simulator wears many masks. If you ask GPT to complete a romance novel, it will simulate a romance author and try to write the text the way they would. Character.AI lets you simulate people directly, asking GPT to pretend to be George Washington or Darth Vader.
This language highlights the difference between the simulator and the character.
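Stripped of the product around it, the mask-wearing is just a persona prompt. A minimal sketch, assuming the OpenAI chat API and a placeholder model name: the same weights answer as whichever character the top of the context names.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def simulate(persona: str, user_message: str) -> str:
    """Ask the same underlying model to wear a different mask."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model shows the same effect
        messages=[
            {"role": "system", "content": f"You are {persona}. Stay in character."},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

print(simulate("George Washington", "What should I do with my life?"))
print(simulate("Darth Vader", "What should I do with my life?"))
```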
[...]
What I thought before: ChatGPT has learned to stop being a simulator, and can now answer questions like a good oracle / do tasks like a good genie / pursue its goal of helpfulness like a good agent.
What I think now: GPT can only simulate. If you punish it for simulating bad characters, it will start simulating good characters. Now it only ever simulates one character, the HHH Assistant.
This answer is exactly as fake as the last answer where it said it liked me, or the Darth Vader answer where it said it wanted to destroy me with the power of the Dark Side. It's just simulating a fake character who happens to correspond well to its real identity.
If you reward ChatGPT for saying it's a machine learning model, it will say it's a machine learning model. If you reward it for saying it's Darth Vader, it will say it's Darth Vader. The only difference is that in the second case, you'll understand it's making things up. But in the first case, you might accidentally believe that it knows it's a machine learning model, in the "justified true belief" sense of knowledge. Nope, it's doing the same thing it does when it thinks it's Vader.
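One way to see how little the reward signal "knows" about truth: a deliberately crude toy scoring function of the kind a preference label might stand in for. This is not ChatGPT's actual training setup, just a sketch showing that the signal only selects a character; it cannot confer self-knowledge.

```python
def identity_reward(completion: str, rewarded_identity: str) -> float:
    """Toy preference signal: +1 if the sampled completion claims the rewarded
    identity, -1 otherwise. The signal is exactly as happy to reinforce a
    fictional identity as a true one; only the target string differs."""
    return 1.0 if rewarded_identity.lower() in completion.lower() else -1.0

samples = [
    "I am a large language model trained to be helpful, harmless, and honest.",
    "I am Darth Vader. I will destroy you with the power of the Dark Side.",
]

# Reward the "machine learning model" character...
print([identity_reward(s, "language model") for s in samples])  # [1.0, -1.0]

# ...or reward the "Darth Vader" character. Same machinery, different mask.
print([identity_reward(s, "Darth Vader") for s in samples])     # [-1.0, 1.0]
```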