Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.
This short statement was signed earlier this year by many important people in the field of artificial intelligence (AI), including pioneers of deep-learning neural networks and CEOs of leading labs such as OpenAI, the company behind ChatGPT, as well as by other very notable thinkers. Why is advanced AI so dangerous? Here I attempt to highlight some crucial issues. I am not an expert, but perhaps other non-experts can therefore benefit from my experience of trying to understand this.
A good first analytical step, I would suggest, is splitting capability from motivation: how could the AI inflict harm on us, versus why would it want to do so?
That split would also be my entry point for approaching the view of a skeptic. For example, I was not surprised to see renowned security expert Bruce Schneier among the signatories — and then I was surprised when, shortly afterwards, he backtracked and clarified that his worries don’t extend to extinction outright. Does Schneier possess reassuring insights that I have misunderstood or overlooked so far? If I could speak to him, my first question in trying to understand his position would be whether he doubts AI capability or unfriendly AI motivation.
Now, I would have been surprised to see Steven Pinker on the list, even though it includes quite comparable names like Daniel Dennett and David Chalmers. For Pinker says things such as this:
a woolly conception of intelligence as a kind of wonder stuff [encourages, among other things] questionable extrapolations from the human case, such as imagining that an intelligent tool will develop an alpha-male lust for domination.
This comes from a reply last year in a debate with Scott Aaronson, another signatory, and complains about anthropomorphism, i.e. the attribution of human characteristics to something that doesn’t have them. Anthropomorphism is indeed a key intellectual hurdle in the way of a sound assessment of the danger from strong AI. But next to it I would single out something else as well, arguably even as the number one hurdle. My selection of key intellectual pitfalls thus becomes the following:
The default problem: for catastrophe to occur, nothing in particular has to go wrong; AI goals are lethal by default.
Anthropomorphism confusion: what is anthropomorphic and what is not?
Both the default problem and the anthropomorphism confusion relate to motivation rather than capability, but let’s take a quick look at the latter first. These days, in the age of ChatGPT, the ranks of those who hold machine intelligence of human level and beyond to be impossible are diminishing. A good many experts actually believe such intelligence to be imminent. But would such intelligence be enough for machines to “take over the world”?
Often that discussion takes an unfortunate form where the skeptic demands that a concrete takeover scenario be specified and then tries to poke holes in that specific scenario. But there are so many possible scenarios! And even in the relatively mild case that the new synthetic intelligence cannot “fundamentally” surpass the human kind, we would still be faced with an adversary that knows everything on the internet and can presumably make lots of copies of itself (and improved copies), each of which can think much, much faster than humans. How could one not be very worried?
I can’t really see particular pitfalls here, of the sort that triggered me to write this post. What is needed is just a willingness to think things through and take seriously the likelihood of qualitatively unprecedented scenarios playing out if the premises are qualitatively unprecedented.1 So I move on from the capability part to the motivation part. Why should the AI have unfriendly goals?
Well, to that there is a simple answer: because some idiot will hook it up to such goals! Still no interesting insights needed, it appears.
It looks almost as if we’re finished, before even reaching the default problem and the anthropomorphism confusion. A seemingly sensible conclusion at this point, not least from a military angle, would be that strong AI should be developed, but under strict security, so that idiots could not lay their hands on it and hook it up to bad goals.
That kind of view will naturally lead to an arms race.
Yet crucially, it is flawed.
How is it flawed? To approach this, let’s first imagine the following scenario. You are selected as the junior partner to the most evil human being you’ve ever encountered, to form a two-person ruling committee that will decide where the world should go from here. It might not be a comfortable scenario — but the point is that, actually, from a “bird perspective” that looks down on all sorts of possible futures, there is a lot that you two could agree on very easily. Take the composition of Earth’s atmosphere, for example. You would both agree that the atmosphere needs to be breathable, presumably, and that means ruling out the vast majority of possible compositions of gases allowed by the rules of chemistry.
Now, a practical agenda would consist only of interesting items, not of things that “go without saying”, like that the atmosphere should be breathable. But replace the evil person with an AI, and this changes. Completely inhuman futures join the agenda. In fact, they will constitute the vast majority. From the bird perspective, there is remarkably high agreement between any two human beings. Philosophically, what is going on is that the human species has a special relationship with the current particular world state. That’s not a religious statement, as in “God made the world specially for us”, because you can argue in a Darwinian way as well, just the other way round: evolution has ensured that humans are specially adapted to the arbitrary state the world happens to be in. Either way, the atmosphere is breathable, which is actually quite astonishing, what were the chances of that; the most common liquid (water) on Earth’s land surface provides nourishment, which is actually quite astonishing, what were the chances of that; and so on.
Given some random goal or goals, if you let imaginary humans, even bad ones, with unlimited capabilities optimise the world accordingly, the result should be compatible with human existence, because as humans they will respect all the constraints that go without saying, like that the atmosphere should be breathable. A synthetic optimiser would not respect those constraints, unless its goals happen to be compatible with human existence. But that means something has to go right. By default, strong AI is lethal; nothing has to go wrong for it to be so. This is what I mean by the default problem. (The problem is well known to many experts, of course.)
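To put a toy number on the “tiny speck” intuition, here is a small sketch of my own (not taken from the alignment literature, and certainly not a model of chemistry or physiology): sample random mixtures of a handful of gases and count how many pass even a single crude “breathable” check. The gas list and the thresholds are assumptions invented purely for illustration.

```python
import random

# Toy illustration only: pick a random atmosphere by assigning random
# proportions to a handful of gases, then apply a crude "breathable"
# criterion. Gases and thresholds are made-up assumptions for this sketch.
GASES = ["N2", "O2", "CO2", "CH4", "H2", "Ar"]

def random_atmosphere():
    weights = [random.random() for _ in GASES]
    total = sum(weights)
    return {gas: w / total for gas, w in zip(GASES, weights)}

def breathable(atmosphere):
    # Very rough: enough oxygen, but not too much, and little CO2.
    return 0.15 < atmosphere["O2"] < 0.30 and atmosphere["CO2"] < 0.01

def breathable_fraction(trials=100_000):
    hits = sum(breathable(random_atmosphere()) for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    print(f"Fraction of random atmospheres passing the check: {breathable_fraction():.4f}")
```

Even this one constraint rules out the large majority of random samples; stacking on all the other constraints that “go without saying” shrinks the compatible region towards nothing.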
How difficult is it to build such a compatible AI? Technical details are beyond the scope of my post, but clearly right now engineers don’t know how to solve this so-called AI-alignment problem. Let me note a couple of difficulties as I understand them. The form of AI that has become dominant is the neural network, taught by example rather than explicitly programmed (not that getting explicit programming correct would be easy), resulting in an opaque structure doing the cognition, where nobody knows what the AI really thinks. A trial-and-error approach to getting its goals right looks forlorn, because as soon as the AI is sufficiently advanced to constitute a real threat it will naturally deceive its operators about any goal mismatch. In fact, most current AI-alignment ideas may be doomed to fail at exactly that moment, when it counts.
So it looks like superhuman AI would not be militarily useful; it would be useful only to mass-suicidal cults. Sometimes it is compared with nuclear bombs, where humanity has experience with the problem of proliferation; but the “appropriate analogy here might be a nuclear weapon with a blast radius covering the entire planet” (quoted from here).
Before getting to the confusion around anthropomorphism, I should mention that my rather philosophical presentation of the default problem was not meant to assert that the human species, once a strong AI has escaped human control, will be merely collateral damage rather than deliberate damage. Once we know how hard it is to get AI goals right, we can note a more immediate problem. Here is a short comment that I’ve seen this year:
If we created an AGI, we would be able to create more, therefore be a risk to the original AI since other AI could be a threat to its existence and therefore conflict with whatever its goals were. I can’t think of any reason why we wouldn’t be destroyed.
That’s very succinct, isn’t it — perhaps I was overcomplicating things in the preceding paragraphs. But I think it is important to understand not only the simple strategic issue noted in that comment, but also how truly alien AI goals would be from a human perspective. Sometimes AI is compared to companies, for example: are they also “unaligned intelligent agents”? Not really (I claim). I will contest only the unalignedness here: companies are composed of human beings and thus share basically the same tiny speck in the vast overall space of possible goals. Judged from the bird perspective, they are incredibly well aligned with humanity.
Anyway, overcomplicating or not, my previous paragraphs may at least have been helpful for getting into the right mindset as we now move on to the topic of anthropomorphism. What is anthropomorphic and what is not? The following quotation is made up, but my impression is that people do say such things.
The way things are going AI will indeed take over. As its creators, we will surely be allowed to enjoy our retirement, though.
Anthropomorphic or not? I would certainly worry about anthropomorphism here. Tell me, why should the AI grant us a peaceful retirement? Perhaps because it would cost it only 0.1 percent of its available resources, someone might calculate, so it would still have 99.9 percent left to use as it pleases — yet unfortunately, 100 > 99.9.
Then again, come on: surely we “deserve” such a tiny percentage, as its creators? Yet unfortunately, such a sentiment would appeal to a sense of fairness, which appears to be a specifically human tendency.
Clearly, anthropomorphism can be a real hurdle in the understanding of AI.
On the other hand, one could try to refute the charge of anthropomorphism here by providing some further arguments, like perhaps claiming (an ambitious claim) that fairness is actually a tendency of the universe, rather than just a human one. Or one could show how fairness will be instilled in the AI during its construction. “There isn’t really a novel problem: just raise AI like a child.” Unfortunately, as far as I am aware, the human sense of fairness is known to be innate to a considerable degree, so that kind of alignment proposal is very dubious.
Now, the point is that Pinker’s charge of anthropomorphism above can be refuted! Is a quest for “domination” a specifically human trait, of “alpha males” in particular? No, because whatever the AI’s goal is, power would be helpful in achieving it! In technical language, the acquisition of power is a “convergent instrumental goal”.2 Self-preservation is another, for instance, as an agent can actively advance its goals, whatever they are, only as long as it exists.
This, then, is what I think is the second big challenge when it comes to understanding the danger from strong AI, besides the default problem: avoiding anthropomorphism traps but at the same time understanding why certain (other) human traits are actually not specifically human, but instead inherent aspects of intelligent agency, to be expected from AI as well.
Let me conclude this post, and the anthropomorphism topic in particular, with another exercise, so to speak. Some commonsensical readers may have stumbled over an apparent contradiction above when I mentioned deception resulting from goal mismatch. Why on earth would the machine stick to an inhuman goal even though it knows that A) the goal was instilled into it by its creators, and B) it is not what the creators actually want?
I admit I haven’t really grasped all this enough yet to be completely sure that those readers wouldn’t have a point . . . but unfortunately, they may well not have one. It looks like another of these things that go without saying only between humans. What is needed here, if we try to think rigorously about it, is some kind of “charitable re-interpretation” — and this appears to be a human thing, like fairness, as opposed to something that would be inherent in intelligent agency.3
I’m not claiming that I find that easy myself. Doubts kept creeping in regarding core parts of this post, because the AI-doom reasoning somehow feels unreal. Unfortunately, such a feeling is well accounted for by the unprecedentedness alone and hence is not an indication of flaws in the reasoning; I have to discount the feeling (as best I can).
It is a fairly simple argument, and surely Pinker has been confronted with it; so the crux might be something about how AI will stay on a “narrow path” towards its goals, according to Pinker (but I don’t understand his arguments). Or could it be that AI somehow lacks goals and agency in a more fundamental sense? Perhaps there isn’t even the danger noted earlier, that some idiot will hook it up to bad goals? Such claims (not from Pinker or Schneier) may have become more popular with the advent of large language models like ChatGPT, which have made it difficult for AI skeptics to specify cognitive tasks that AI definitely cannot do, while on the other hand appearing to lack agency themselves. However, take an old-fashioned chess algorithm instead: clearly even that simple AI can be usefully thought of as pursuing goals. Actually, the fact that ChatGPT can play some chess shows nicely that language models give rise to goals, too, at least in principle: see here.
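To make the chess example concrete, here is a minimal sketch of my own of the kind of old-fashioned program I have in mind, assuming the third-party python-chess package: a plain minimax search over a simple material count. Nothing mysterious is going on inside, yet describing it as an agent pursuing a goal is perfectly natural; the goal is simply whatever the evaluation function rewards, and the program never questions it.

```python
import chess  # third-party: pip install python-chess

# Crude material values; the king gets 0 because it never leaves the board.
PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def evaluate(board):
    """Material balance from White's point of view: the program's entire 'goal'."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == chess.WHITE else -value
    return score

def minimax(board, depth):
    """Plain minimax: White maximises the evaluation, Black minimises it."""
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    scores = []
    for move in list(board.legal_moves):
        board.push(move)
        scores.append(minimax(board, depth - 1))
        board.pop()
    return max(scores) if board.turn == chess.WHITE else min(scores)

def best_move(board, depth=2):
    """Pick the move that the evaluation, seen through the search, rates highest."""
    side = 1 if board.turn == chess.WHITE else -1
    def score(move):
        board.push(move)
        value = minimax(board, depth - 1)
        board.pop()
        return side * value
    return max(list(board.legal_moves), key=score)

if __name__ == "__main__":
    print(best_move(chess.Board()))
```

The point of the sketch is only that goal-directed talk applies cleanly even to so simple a program; nothing about that description requires human-like drives.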
For further help with the concluding exercise see section 1 of Complex Value Systems are Required to Realize Valuable Futures by Eliezer Yudkowsky, a pioneering thinker on the AI-alignment problem (and a signatory on the list). Of the texts I have read so far on the topics discussed here, this 2011 paper by Yudkowsky is the one I would recommend most. Among other things, it makes clearer than I did how, in the space of possible goals, not only will you miss by default, but near-misses will already lead to disaster — a miss is as good as a mile.
“can presumably make lots of copies of itself (and improved copies), each of which can think much, much faster than humans. How could one not be very worried?”
Major assumptions there. Particularly the “improved copies”.