Let’s take the example of a smile and a fake smile. We want the AI to maximize the amount of smiles in the world, whatever that means.
We don’t want it to maximize the amount of fake smiles. E.g. if it is easier for the AI to maximize fake smiles, it will do that, instead of fulfilling the actual specification.
But is that fair? How many people would distinguish a fake smile from a real smile? We constantly try to approximate how other people feel and what their internal states are, but we frequently fail at this task. We want to fulfill our value function, but we often fail by having wrong abstractions.
Now, but we still expect this from an AI, obviously because we have to get it right.
If we tell the AI to maximize smiles and it finds a way to maximize reward by gaming the specification, it will do so, and it will do so efficiently. Maybe it will force a genetic variant on us that is leading to muscle spasm that look like smiles, instead of actually working towards maximizing smiles.
So for human agents, we don’t expect abstractions to be perfectly predictive. But for AI we do.