Let’s evaluate how loosely behaviour is coupled to values. In other words: how hard, and how inaccurate, would it be for an AI to approximate human values solely by observing human behaviour?
We already know that the things we value are latent variables. Sometimes we are even unsure about our own values, so asking somebody what they value is fallible: we often contradict ourselves. We have to find another way.
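To see why self-report alone fails, here is a minimal sketch (the preference data is hypothetical, invented purely for illustration): when stated preferences form a cycle, no consistent ranking of values can explain them, so the answers cannot be taken at face value.

```python
from itertools import permutations

# Hypothetical stated preferences, as (preferred, dispreferred) pairs.
# The third pair contradicts the first two: the preferences form a cycle.
stated_preferences = [
    ("health", "wealth"),
    ("wealth", "leisure"),
    ("leisure", "health"),
]

def consistent_ranking_exists(prefs):
    """Return True if some total ordering of the items agrees with
    every stated pairwise preference, i.e. the answers are coherent."""
    items = {item for pair in prefs for item in pair}
    for order in permutations(items):
        rank = {item: i for i, item in enumerate(order)}
        if all(rank[a] < rank[b] for a, b in prefs):
            return True
    return False

print(consistent_ranking_exists(stated_preferences))  # False: the self-report contradicts itself
```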
We could instead examine a human in action and try to infer their values from what they do. But the things we care about are often not coupled to our actions. I might value a human in a distant country even if no action of mine ever influences that person. Watching my behaviour alone would give the impression that I don’t care about somebody in India, because my actions never reach them; or my behaviour does give that impression, but I claim otherwise.
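To make the decoupling concrete, here is a minimal sketch (all names and numbers are hypothetical): two value functions that disagree about a distant person, yet pick exactly the same action, so an observer who only watches behaviour cannot tell them apart.

```python
# Available actions; note that none of them can affect the distant person.
ACTIONS = ["work", "donate_locally", "rest"]

def values_parochial(action):
    """Cares only about locally observable outcomes."""
    return {"work": 2, "donate_locally": 3, "rest": 1}[action]

def values_cosmopolitan(action):
    """Additionally cares about a distant stranger. Since no available
    action influences them, the extra term is constant across actions."""
    distant_person_wellbeing = 5  # latent, and unchanged by any action here
    return values_parochial(action) + distant_person_wellbeing

def chosen_action(value_fn):
    """The behaviour an observer actually sees: the value-maximising act."""
    return max(ACTIONS, key=value_fn)

# Identical behaviour despite genuinely different values:
print(chosen_action(values_parochial))     # donate_locally
print(chosen_action(values_cosmopolitan))  # donate_locally, too
```

Any inference procedure that consumes only the observed action must assign both agents the same values, even though they differ about exactly the thing that matters in the example.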
Wicked, huh?