Discussion about this post

Moritz Bierling:

Good stuff

Myself:

The way LLMs work is that they build a kind of index of all the input text sources, like the keyword index at the back of some books. They build this index over all the input text, not on a word basis but on a sub-word token basis. The index is also built for all combinations of contexts, that is, the surrounding words and sentences, in incremental chunks.
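
As a rough sketch of the sub-word idea (the vocabulary and the greedy matching below are toy assumptions, not any real model's tokenizer), a word gets split into pieces that a fixed vocabulary knows:

```python
# Toy greedy longest-match sub-word tokenizer. The vocabulary is invented;
# real models learn their vocabularies (e.g. via BPE), so this only
# illustrates the idea of splitting words into known pieces.
TOY_VOCAB = {"un", "believ", "able", "token", "ization"}

def tokenize(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        # take the longest vocabulary piece matching at position i,
        # falling back to a single character if nothing matches
        for j in range(len(word), i, -1):
            if word[i:j] in TOY_VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("tokenization"))  # ['token', 'ization']
```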

In the book analogy, this means the index does not just point from individual words to the pages where they appear, but goes into much more granular detail, building pointers from each possible context (the preceding text) to the next token. That next token then points to its own next token, taking into account the new context, which now includes the previous token.
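
That "pointer from context to next token" picture can be sketched as a tiny n-gram table. This is a drastic simplification, since a real LLM learns a neural probability distribution over very long contexts rather than a literal lookup table, but it shows the loop described above: look up the current context, pick a next token, extend the context, repeat.

```python
import random
from collections import Counter, defaultdict

def build_index(tokens, k=2):
    """Map each context of k tokens to a count of the tokens that followed it."""
    index = defaultdict(Counter)
    for i in range(len(tokens) - k):
        index[tuple(tokens[i:i + k])][tokens[i + k]] += 1
    return index

def generate(index, context, length=20):
    out = list(context)
    for _ in range(length):
        followers = index.get(tuple(out[-len(context):]))
        if not followers:
            break
        # pick the next token in proportion to how often it followed
        # this context in the training text
        out.append(random.choices(list(followers), weights=list(followers.values()))[0])
    return out

corpus = ("the model predicts the next token and the next token "
          "depends on the context around the next token").split()
index = build_index(corpus, k=2)
print(" ".join(generate(index, ("the", "next"))))
```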

At the end of this process the LLM returns a list of tokens which, quite remarkably, may or may not have some meaning to the human reader. The meaning arises when the human reader assigns some semantics to the tokens, for example by interpreting them as English.

This list of tokens has no meaning to a person who doesn't know any English. Or it could be a list of Chinese characters, which would have no meaning to a person who knows English but not Chinese.

The LLM itself is not even like that person who knows no Chinese. The LLM/AI has no semantic level, no interpretation of meaning at all, by design, by the very nature of how neural networks and LLMs work.

So it can't "try to hide" from researchers or try to "escape", because there is no "I" that would want to escape, and there is not even meaning as a category in the first place.

What it generates as a "response", that is, a list of tokens, is a compilation of the AI doomsday texts from the Internet that it was fed, extracted and presented to the reader as the result the input query points to, like a keyword in a book index points to some page number.
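
The book-index half of the analogy is essentially an inverted index; the "pages" and the query below are made up purely for illustration:

```python
from collections import defaultdict

# Made-up "pages" standing in for the training texts.
docs = {
    1: "the robot decided to escape the lab",
    2: "alignment research studies model behaviour",
    3: "another story about a rogue ai trying to escape",
}

# Build the inverted index: each word points to the pages containing it.
inverted_index = defaultdict(set)
for page, text in docs.items():
    for word in text.split():
        inverted_index[word].add(page)

# A query "points to" whatever pages contain its words, so a corpus full of
# escape stories makes escape pages the most likely thing to come back.
print(sorted(inverted_index["escape"]))  # [1, 3]
```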

It would require a whole new level of functionality, and not just one but many, for LLMs/AI to acquire such meanings, and that would require a whole new science, and something more, to build, none of which exists at all.

There is a danger from AI, though. It has two parts.

The first is that people who do not understand how AI works will imagine that it really is intelligent and will put it to tasks that require actual intelligence. That would be like using a book index and a roll of dice to select a random page from the book as a command to do something.

The second is that this AI is trained on texts from the Internet, and the more AI doomsday stories it finds there, the more likely it is that this random index lookup will return some doomsday command when it is used as in the first part of the scenario.

So the solution to this AI alignment problem is to align people, humans, and not the AI. To align humans one needs to:

1) explain how LLMs/AI actually work to all the people whom it might concern,

2) stop spreading AI doomsday stories all around the Internet, so that they do not end up in the AI-built "index" it uses to generate its output.
