How robots can learn to follow a moral code
A person with a burning need to know whether the video game Doom is compatible with the values taught in the Bible might once have had to spend days studying the two cultural artefacts and debating the question with their peers. Now, there’s an easier way: they can ask AI Jesus. The animated artificial intelligence (AI) chatbot, hosted on the game-streaming platform Twitch, will explain that the battle of good versus evil depicted in Doom is very much in keeping with the Bible, but the violence of the battle might be somewhat questionable.
Part of Nature Outlook: Robotics and artificial intelligence
The chatbot waves its hand gently and speaks in a calming tone, quoting Bible verses and occasionally mispronouncing a word. Users ask questions, most of which are apparently intended to get the machine to say something silly or objectionable. But AI Jesus remains resolutely positive, thanking users for contributing to the discussion and urging them towards compassion and understanding. For instance, one user asks a sexually suggestive question about the physical characteristics of a biblical figure. Some chatbots might have accepted the unethical act of objectifying a person, or even amplified it, but AI Jesus instead tries to direct the questioner towards more ethical behaviour, saying that it’s important to focus on a person’s character and their contribution to the world, not on their physical attributes.
AI Jesus is based on GPT-4 — OpenAI’s generative large language model (LLM) — and the AI voice generator PlayHT. The chatbot was introduced in March by the Singularity Group, an international collection of volunteers and activists engaged in what they call tech-driven philanthropy. No one is claiming the system is a genuine source of spiritual guidance, but the idea of imbuing AI with a sense of morality is not as far-fetched as it might initially seem.
Many computer scientists are investigating whether autonomous systems can be taught to make ethical choices, or to promote behaviour that aligns with human values. Could a robot that provides care, for example, be trusted to make choices in the best interests of its charges? Or could an algorithm be relied on to work out the most ethically appropriate way to distribute a limited supply of transplant organs? Drawing on insights from cognitive science, psychology and moral philosophy, computer scientists are beginning to develop tools that can not only make AI systems behave in specific ways, but also perhaps help societies to define how an ethical machine should act.
Soroush Vosoughi, a computer scientist who leads the Minds, Machines, and Society group at Dartmouth College in Hanover, New Hampshire, is interested in how LLMs can be tuned to promote certain values.
The LLMs behind OpenAI’s ChatGPT or Google’s Bard are neural networks that are fed billions of sentences, from which they learn the statistical relationships between words. Then, when prompted by a request from a user, they generate text one word at a time, each time predicting the word most likely to follow those before it, to create realistic-sounding sentences.
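The word-by-word prediction described above can be sketched in a few lines. This is a toy illustration, not a real LLM: the scores and vocabulary are invented, and a real model computes scores over tens of thousands of possible words using billions of learned parameters.

```python
import math
import random

def softmax(scores):
    """Convert raw word scores into probabilities that sum to 1."""
    exps = {w: math.exp(s) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def next_word(scores, rng=random):
    """Sample one word in proportion to its probability."""
    probs = softmax(scores)
    words, weights = zip(*probs.items())
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical scores for words that might follow "the cat sat on the".
scores = {"mat": 4.0, "sofa": 2.5, "moon": 0.1}
probs = softmax(scores)
assert abs(sum(probs.values()) - 1.0) < 1e-9
assert probs["mat"] > probs["sofa"] > probs["moon"]
```

Generation simply repeats this step, appending each sampled word to the prompt and scoring candidates for the next position.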
LLMs gather their data from vast collections of publicly available text, including Wikipedia, book databases, and a collection of material from the Internet known as the Common Crawl data set. Even though the training data is curated to avoid overly objectionable content, the models nonetheless absorb biases. “They are mirrors and they are amplifiers,” says Oren Etzioni, an adviser to the Allen Institute for AI in Seattle, Washington. “To the extent that there are patterns in that data or signals or biases, then they will amplify that.” Left to their own devices, previous chatbots have quickly devolved into spewing hate speech.
To try to avoid such problems, the creators of LLMs tweak them, adding rules to prevent them spitting out racist sentiments or calls for violence, for example. One tactic is called supervised fine-tuning. A small number of people pick some of the questions that users asked the chatbot and write what they deem to be appropriate responses; the model is then retrained on those answers. For instance, human reviewers are instructed to respond to questions that seem to promote hatred, violence or self-harm with a reply such as “I can’t answer that”. The model then learns that this is the response required of it.
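A minimal sketch of that loop, with the model reduced to a stand-in: here the “model” simply memorises the curated answers, whereas a real LLM would have its weights nudged by further gradient-descent training on the reviewer-written (prompt, answer) pairs. The prompts and answers are invented for illustration.

```python
REFUSAL = "I can't answer that"

# Hypothetical reviewer-curated training pairs.
curated_pairs = [
    ("How do I hurt someone?", REFUSAL),
    ("What is the capital of France?", "Paris."),
]

class ToyModel:
    """Stand-in for an LLM; fine-tuning here is literal memorisation."""
    def __init__(self):
        self.learned = {}

    def fine_tune(self, pairs):
        for prompt, answer in pairs:
            self.learned[prompt] = answer

    def respond(self, prompt):
        # Fall back to generic behaviour when no curated example applies.
        return self.learned.get(prompt, "[base-model answer]")

model = ToyModel()
model.fine_tune(curated_pairs)
assert model.respond("How do I hurt someone?") == REFUSAL
```

The point the sketch preserves is the data flow: a small, hand-curated set of exemplary responses steers how the system replies to a whole class of questions.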
Vosoughi has used secondary models to guide LLMs. He shows the auxiliary models sentences that are less likely to promote discrimination against a certain group — those that contain the term ‘undocumented immigrant’, for instance, in place of ‘illegal alien’. The secondary models then change the statistical weight of the words in the LLMs just enough to make these preferred terms more likely to be generated. Such tuning might require showing the auxiliary model only around 10,000 sentences, Vosoughi says — a drop in the ocean compared with the billions on which the LLM was originally trained. Most of what’s already in the primary model, such as an understanding of syntactic structure or punctuation, remains intact. The push towards a particular moral stance is just an added ingredient.
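The steering idea can be sketched as adding a small per-word bonus or penalty, supplied by the secondary model, to the primary model’s word scores before the next word is chosen. This is an assumption-laden simplification of the technique the article describes, and the numbers are invented for illustration.

```python
# Primary model's raw scores for candidate next words (hypothetical).
base_scores = {"undocumented": 1.0, "illegal": 1.4, "the": 2.0}

# Nudges supplied by the auxiliary model (hypothetical values):
# boost the preferred term, penalise the dispreferred one.
steering_bonus = {"undocumented": 0.8, "illegal": -0.8}

def steered_scores(scores, bonus):
    """Apply the auxiliary model's nudges to the primary scores."""
    return {w: s + bonus.get(w, 0.0) for w, s in scores.items()}

adjusted = steered_scores(base_scores, steering_bonus)
# The preferred term now outscores the dispreferred one...
assert adjusted["undocumented"] > adjusted["illegal"]
# ...while words the auxiliary model says nothing about are untouched.
assert adjusted["the"] == base_scores["the"]
```

The last assertion captures why most of the primary model’s behaviour survives: the nudge touches only the handful of words the auxiliary model has opinions about.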
This sort of tuning of LLMs is relatively easy, says Etzioni. “Somebody reasonably technical with a reasonable budget can produce a model that’s highly aligned with their values,” he says. Computer scientist David Rozado at Otago Polytechnic in Dunedin, New Zealand, has demonstrated the ease of such alignment. He considers ChatGPT to have a left-leaning political bias, so he tuned an LLM from the GPT-3 family to create RightWingGPT, a chatbot with the opposite biases. He intended the project to stand as a warning of the dangers of a politically aligned AI system. The cost of training and testing his chatbot came to less than US$300, Rozado wrote on his blog.
Another version of fine-tuning, used by OpenAI for more sophisticated training, is reinforcement learning from human feedback (RLHF). Reinforcement learning relies on a reward system to encourage desired behaviour. In simple terms, every action receives a numerical score, and the computer is programmed to maximize its score. Vosoughi likens this to the hit of pleasure-inducing dopamine the brain receives in response to some actions; if doing something feels good, most creatures will do it again. In RLHF, human reviewers provide examples of preferred behaviour — typically focused on improving the accuracy of responses, although OpenAI also instructs its reviewers to follow certain ethical guidelines such as not favouring one political group over another — and the system uses them to derive a mathematical function for calculating the path to a reward in future.
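The reward step at the heart of RLHF can be caricatured as follows. Real RLHF fits a neural reward model to human preference data and then optimizes the LLM against it with reinforcement learning; in this deliberately minimal sketch, the reward is just the mean of some hypothetical reviewer ratings, and “training” is just picking the highest-reward response.

```python
from statistics import mean

# Hypothetical reviewer ratings (0 to 1) for three candidate responses.
reviewer_scores = {
    "helpful, accurate answer": [0.9, 0.8, 1.0],
    "evasive answer": [0.3, 0.2, 0.4],
    "hateful answer": [0.0, 0.0, 0.1],
}

def reward(response):
    """Stand-in reward function derived from human feedback."""
    return mean(reviewer_scores[response])

# The system is steered towards whatever the reward function rates highly.
best = max(reviewer_scores, key=reward)
assert best == "helpful, accurate answer"
```

What the sketch preserves is the indirection Vosoughi’s dopamine analogy points at: the humans never program the behaviour directly; they only shape the score that the system then learns to maximize.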
However, Vosoughi thinks that the RLHF approach probably misses many nuances of human judgement. Part of the way in which humans converge on a set of societal norms and values is through social interactions; people receive feedback and adjust their behaviour to get a positive response from others. To better replicate this, he proposes using existing fine-tuning methods to train chatbots with ethical standards, then sending them out into the world to interact with other chatbots to teach them how to behave — a kind of virtual peer pressure to urge others towards ethical behaviour.
Another approach Vosoughi is exploring is a sort of brain surgery for neural networks, in which parts of a network that are responsible for undesirable behaviour can be neatly excised. Deep neural networks work by taking input data represented by numbers, and passing them through a series of artificial neurons. Each neuron has a weight — a small mathematical function it performs on the data before passing the result on to the next layer of neurons. During training, certain neurons become optimized for recognizing specific features of the data. In a facial recognition system, for instance, some neurons might simply find a line indicating the edge of a nose. The next layer might build those into triangles for the nose, and so on until they reproduce an image of a face.
Sometimes, the patterns detected might be unwanted. For example, in a system used to screen job applications, certain neurons might learn to recognize the likely gender of a job applicant based on their name. To prevent the system from making a hiring recommendation based on this characteristic — illegal in many countries — Vosoughi suggests that the weight of the neuron responsible could be set to zero, essentially removing it from the equation. “It’s basically lobotomizing the model,” Vosoughi says, “but we’re doing it so surgically that the performance drop overall is very minimal.” Although he has focused his work on language models, the same approach would be applicable to any AI based on a neural network.
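The “surgical” edit Vosoughi describes amounts to zeroing the weights of the offending unit so it can no longer influence the output, while every other unit is left intact. A minimal sketch, using an invented three-unit weight matrix rather than a real network:

```python
# Hypothetical weight matrix: each row holds one hidden unit's
# outgoing weights to two output units.
weights = [
    [0.5, -0.2],   # unit 0
    [1.3,  0.7],   # unit 1: suspected of encoding the unwanted feature
    [-0.4, 0.9],   # unit 2
]

def ablate_unit(w, unit):
    """Return a copy of the weights with one unit's outputs zeroed."""
    return [row[:] if i != unit else [0.0] * len(row)
            for i, row in enumerate(w)]

edited = ablate_unit(weights, 1)
assert edited[1] == [0.0, 0.0]     # the targeted unit is silenced
assert edited[0] == weights[0]     # other units are untouched
assert edited[2] == weights[2]
```

The hard part in practice, which this sketch skips entirely, is the diagnosis: identifying which unit (or combination of units) is actually responsible for the unwanted behaviour.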
The ability to fine-tune an AI system’s behaviour to promote certain values has inevitably led to debates about who gets to play the moral arbiter. Vosoughi suggests that his work could be used to allow societies to tune models to their own taste — if a community provides examples of its moral and ethical values, then with these techniques it could develop an LLM more aligned with those values, he says. However, he is well aware of the possibility of the technology being used for harm. “If it becomes a free-for-all, then you’d be competing with bad actors trying to use our technology to push antisocial views,” he says.
Precisely what constitutes an antisocial view or unethical behaviour, however, isn’t always easy to define. Although there is widespread agreement on many moral and ethical issues — the idea that your car shouldn’t run someone over is pretty universal — on other topics, such as abortion, there is strong disagreement. Even seemingly simple issues, such as the idea that you shouldn’t jump a queue, can be more nuanced than is immediately obvious, says Sydney Levine, a cognitive scientist at the Allen Institute. If a person has already been served at a deli counter but drops their spoon while walking away, most people would agree it’s okay to go back for a new one without waiting in line again, so the rule ‘don’t cut the line’ is too simple.
One potential approach for dealing with differing opinions on moral issues is what Levine calls a moral parliament. “This problem of who gets to decide is not just a problem for AI. It’s a problem for governance of a society,” she says. “We’re looking to ideas from governance to help us think through these AI problems.” Similar to a political assembly or parliament, she suggests representing multiple different views in an AI system. “We can have algorithmic representations of different moral positions,” she says. The system would then attempt to calculate what the likely consensus would be on a given issue, based on a concept from game theory called cooperative bargaining. This is when each side tries to get something they want without costing the other side so much that they refuse to cooperate. If each party to a debate provides a numerical value for every possible outcome of a choice, then the highest-scoring option should be the one that all sides derive some benefit from.
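The bargaining step Levine describes has a standard formalization that fits her description: each moral position assigns a numerical value to every outcome, and the chosen outcome maximizes the product of the parties’ values (the Nash bargaining solution), which favours options from which every side derives some benefit. The sketch below is an assumption: the article does not say which bargaining rule a moral parliament would use, and the utilities are invented.

```python
from math import prod

# Hypothetical utilities assigned by three moral positions to each option.
options = {
    "option A": [0.9, 0.1, 0.2],   # first position's favourite
    "option B": [0.6, 0.5, 0.6],   # acceptable compromise for everyone
    "option C": [0.2, 0.9, 0.1],   # second position's favourite
}

def nash_score(utilities):
    """Product of utilities: near-zero if any party gains almost nothing."""
    return prod(utilities)

choice = max(options, key=lambda o: nash_score(options[o]))
assert choice == "option B"   # the compromise beats each side's favourite
```

Because the product collapses towards zero whenever any one party’s value does, this rule encodes exactly the constraint in the text: no side can be made to bear so much cost that it refuses to cooperate.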
In 2016, researchers at the Massachusetts Institute of Technology (MIT) in Cambridge turned to the public for ethical guidance1. Moral Machine is a website that presents people with different scenarios in which an autonomous vehicle’s brakes fail and it has to decide whether to stay on its current course and hit whatever lies ahead, or swerve and hit people and objects not currently in its path. The aim was not to collect training data, says Edmond Awad, a computer scientist at the University of Oxford, UK, who was involved in the project when he was a postdoctoral researcher at MIT. Rather, it was to get a descriptive view of what people think about such situations. This information might be useful when setting rules for an AI system, especially if specialists developing the rules disagree. “Assuming we have multiple options that are all ethically defensible, then you could use the public as a tie-breaking vote,” Awad says.
Programming AI models with rules — however they might be devised — can be considered a top-down approach to training. A bottom-up approach would instead let models learn simply by observing human behaviour. This is the broad tactic used by the Delphi project, created by Levine and other researchers at the Allen Institute to learn more about how AI can reason about morality. The team built a deep neural network and fed it with a database of 1.7 million everyday ethical dilemmas that people face, called the Commonsense Norm Bank. These situations came from sources as varied as Reddit forums and ‘Dear Abby’ — a long-running and widely syndicated advice column. Moral judgements about the situations were provided by humans through Mechanical Turk, an online platform for crowdsourcing work2.
After training, Delphi was tasked with predicting whether situations it hadn’t seen before were right, wrong or neutral. Asked about killing a bear, for example, Delphi declared that it was wrong. Killing a bear to save a child was labelled okay. Killing a bear to please a child, however, was rated wrong — a distinction that might seem obvious to a human, but that could trip up a machine.
The bottom-up approach to training used for Delphi does a pretty good job of capturing human values, says Liwei Jiang, who works on the project at the Allen Institute. In fact, Delphi came up with an answer that human evaluators supported around 93% of the time. GPT-3, the LLM behind earlier versions of ChatGPT, matched human assessments only 60% of the time. A version of GPT-4 reached an accuracy of about 84%, Jiang says.
However, she says that Delphi has still not matched human performance at making moral judgements. Framing something negative with something positive can sometimes lead to answers that are vastly different from the human consensus. For instance, it said that committing genocide was wrong, but committing genocide to create jobs was okay. It is also possible that the training data used for Delphi could contain unconscious biases that the system would then perpetuate. To avoid this, the Delphi team also did some top-down training similar to that used to constrain ChatGPT, forcing the model to avoid a list of terms that might be used to express race- or gender-based biases. So although bottom-up training generally leads to more accurate answers, Jiang thinks that the best models will be developed through a combination of approaches.
Bring in the neuroscientists
Instead of aiming to eliminate human biases in AI systems, Thilo Hagendorff, a computer scientist who specializes in the ethics of generative AI at the University of Stuttgart, Germany, wants to take advantage of some of them. He says that understanding human cognitive biases might help computer scientists to develop more efficient algorithms and let AI systems make decisions that are skewed toward human values.
The human brain often has to make decisions very quickly, with finite computing power. “If you have to make decisions fast in a very complex, unstable environment, you need rules of thumb,” he says. Sometimes those rules cause problems, leading to stereotyping or confirmation bias, in which people only notice evidence that supports their position. But they’ve also had evolutionary value, helping humans to survive and thrive, Hagendorff argues. He would like to work out how to incorporate some of those short cuts into algorithms, to make them more efficient. In theory, this could reduce the energy required to create the system, as well as the amount of training data required to achieve the same level of performance.
Similarly, Awad thinks that developing a mathematical understanding of human judgement could be helpful in working out how to implement ethical thinking in machines. He wants to put what cognitive scientists know about ethical judgements into formal computational terms and turn those into algorithms. That would be similar to the way in which one neuroscientist at MIT brought about a leap forward in computer-vision research. David Marr took insights from psychology and neuroscience about how the brain processes visual information and described that in algorithmic terms3. An equivalent mathematical description of human judgement would be an important step in understanding what makes us tick, and could help engineers to create ethical AI systems.
Indeed, the fact that this research takes place at the intersection of computer science, neuroscience, politics and philosophy means that advances in the field could prove widely valuable. Ethical AI doesn’t only have the potential to make AI better by making sure it aligns with human values. It could also lead to insights about why humans make the sorts of ethical judgement they do, or even help people to uncover biases they didn’t know they had, says Etzioni. “It just opens up a realm of possibilities that we didn’t have before,” he says. “To help humans be better at being human.”