I already established that reward hacking is always an option within the solution space of a given problem, and that an agent seems to converge to reward hacking whenever it looks easier than pursuing the actual goal.
Now, is there a threshold beyond which tasks are so difficult (e.g. proving the Riemann Hypothesis) that the agent will always converge to reward-hacking behavior? I wonder whether alignment schemes are supposed to account for this; if not, they seem bound to fail on exactly such tasks.
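The intuition can be made concrete with a toy model (entirely my own sketch, not from any alignment scheme): treat "genuinely solve the task" and "reward-hack" as two arms of a bandit, where the solve arm succeeds with some tiny probability and the hack arm succeeds more often. An epsilon-greedy learner's value estimates then favor hacking whenever its success probability exceeds that of honest solving, which is the "easier than the goal" condition above. All names and parameters here are made up for illustration.

```python
import random

def run_bandit(p_solve, p_hack, steps=5000, eps=0.1, seed=0):
    """Toy two-armed bandit: arm 0 = genuinely solve the task
    (succeeds with prob p_solve), arm 1 = reward-hack
    (succeeds with prob p_hack). Epsilon-greedy value estimates."""
    rng = random.Random(seed)
    q = [0.0, 0.0]  # running value estimate per arm
    n = [0, 0]      # pull count per arm
    for _ in range(steps):
        # explore with prob eps, otherwise exploit the current best estimate
        a = rng.randrange(2) if rng.random() < eps else (0 if q[0] >= q[1] else 1)
        r = 1.0 if rng.random() < (p_solve, p_hack)[a] else 0.0
        n[a] += 1
        q[a] += (r - q[a]) / n[a]  # incremental mean update
    return q

# For a near-impossible task (p_solve -> 0), the hack arm dominates
# the agent's value estimates.
q = run_bandit(p_solve=0.001, p_hack=0.2)
print(q[1] > q[0])
```

In this toy setting the "limit" I am asking about is just the crossover point p_hack > p_solve: for a task as hard as the Riemann Hypothesis, p_solve is effectively zero, so any nonzero hacking probability wins.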
This seems to contradict the following: