How to save your product UX when the AI goes wrong | by Gautham Srinivas | Feb, 2024


Managing prediction errors in AI products.

An error screen where a bot looks sad and the text says “Something went wrong”
ChatGPT prompt: Generate an image for an article about prediction errors. Include the message “Something went wrong”.

Disclaimer: The opinions stated here are my own, not necessarily those of my employer.

In the old world of rule-based systems, machine errors were considered edge cases. Whenever the user entered a state the rules didn’t account for, graceful handling meant showing a message like “Oops! Something went wrong”. That isn’t unreasonable when such errors are rare. But what changes when you’re building products on predictive systems like AI, where errors are far more common? What if you no longer know whether your output is an error? Let’s dive into a wide range of examples, from driverless cars to… pregnancy tests (?!).

Errors can have different frequencies and consequences. Managing them isn’t an AI problem that calls for a re-invention of the wheel; it has long been a core pillar of product development. A car’s engine failure is a lot more critical than, say, a stereo speaker issue (notwithstanding those of us who care more about the music than the ride). When a car with both these issues is left at the body shop, a mechanic might do a thorough inspection and share a report with the client that says something like “(P0) Engine issue needs to be fixed (P0) Oil change… (P1) Speakers need to be replaced”. There isn’t much for the mechanic to predict at this stage.

Instead, let’s say the car has a smart mechanic assistant that measures vital signs and you ask “What’s wrong with the car?”. Unlike the mechanic, the assistant doesn’t know for sure, but it can make predictions as to what might be wrong. What if the assistant said “Your speakers need to be replaced” as a response, while not flagging the potential engine issue? This is an example of a false negative error, where the assistant has failed to mention a real issue with the car.

False negatives and false positives are types of prediction errors. Broadly, a prediction error occurs when there’s a discrepancy between the predicted value and the actual value. Here’s the classic confusion matrix that explains it further:

A 2×2 matrix, with predicted values and actual values as the two labels. It explains true positive, true negative, false positive and false negative
TP = True Positive, FN = False Negative, etc.

For now, let’s take another example: the pregnancy test. When someone takes this test, one of four things can happen:

  1. The person is actually pregnant and they get a positive result OR a true positive
  2. The person is actually pregnant and they get a negative result OR a false negative
  3. The person is not pregnant and they get a positive result OR a false positive
  4. The person is not pregnant and they get a negative result OR a true negative
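The four outcomes above map directly onto the confusion matrix. As an illustrative sketch (not code from the article), here’s that mapping written out, with the actual condition and the test result as the two inputs:

```python
def classify(actual_pregnant: bool, test_positive: bool) -> str:
    """Map an (actual, predicted) pair to its confusion-matrix cell."""
    if actual_pregnant and test_positive:
        return "true positive"
    if actual_pregnant and not test_positive:
        return "false negative"
    if not actual_pregnant and test_positive:
        return "false positive"
    return "true negative"

# Enumerate the four possible outcomes of a single test:
for actual in (True, False):
    for positive in (True, False):
        print(f"pregnant={actual}, test positive={positive} "
              f"-> {classify(actual, positive)}")
```

The point of writing it this way: the error cells (false positive, false negative) only exist because the prediction and the ground truth are two separate inputs that can disagree.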

What are the consequences when a pregnancy test gives erroneous results? Well… it depends.

If you’re not looking to get pregnant, a false positive leads to a temporary false alarm. You’re likely to try another test or check with a doctor and find out the truth. But a false negative is a lot more dangerous because of how important early detection is. Hence, not all errors are created equal.

Compared to rule-based systems, AI leads to a much higher volume of prediction errors. Here’s how the previous framework can be applied to products that use AI:

  • “Next video” recommendation: This can be any system that decides what to show the user when they don’t know what they want (think swiping up on TikTok or YouTube Shorts). A false positive means the user didn’t like a recommended video. A false negative means a video the user would’ve liked was never recommended. Unlike the previous examples, one error is not always worse than the other: recommenders usually balance precision (what % of recommended videos were liked) against recall (what % of likeable videos were recommended) to optimize for engagement.
  • Driverless cars: A pedestrian is about to jaywalk across a busy intersection when an Autonomous Vehicle (AV) has the green light. What are the possibilities? The car stops and the pedestrian crosses the road (true positive), the car stops and the pedestrian doesn’t cross the road (false positive), the car doesn’t stop and the pedestrian crosses the road (false negative), the car doesn’t stop and the pedestrian doesn’t cross the road (true negative). It’s not hard to see why one type of error is worse than the other.
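To make the precision/recall tradeoff in the recommender example concrete, here’s a small sketch using made-up counts (the numbers are purely illustrative, not from any real system):

```python
def precision(tp: int, fp: int) -> float:
    """What fraction of recommended videos did the user like?"""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """What fraction of likeable videos did we actually recommend?"""
    return tp / (tp + fn)

# Hypothetical counts: tp = recommended and liked, fp = recommended but
# not liked, fn = likeable but never recommended.
tp, fp, fn = 60, 40, 20
print(f"precision = {precision(tp, fp):.2f}")  # 60 / 100 = 0.60
print(f"recall    = {recall(tp, fn):.2f}")     # 60 / 80  = 0.75
```

Notice the tension: recommending more videos tends to raise recall (fewer misses) but lower precision (more duds), which is why neither error type is universally “worse” here.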

Clearly, the tradeoffs made by someone building a driverless car will be vastly different from those made by someone building TikTok. It all depends on the consequences of false positives and false negatives, and their relative frequencies. But AI products do have some things in common: they try to account for prediction errors by (a) setting the right expectation with the user and (b) ceding control when needed.

Here’s another example: AI chatbots. We’ve all seen that LLMs have the tendency to hallucinate — a specific type of prediction error where the model has given a factually incorrect answer, but made it look like the truth. An unassuming (?) school kid trying to cheat on his homework with ChatGPT may not understand this nuance. So what does ChatGPT do?

ChatGPT window with a disclaimer that says “ChatGPT can make mistakes. Consider checking important information.”

It has a standing disclaimer. Hopefully, users see this and don’t use ChatGPT answers verbatim for everything. But do users actually read disclaimers? And when does it matter more? Here’s another example:

ChatGPT window where the user has asked the bot to play therapist. ChatGPT clarifies that it’s not qualified before giving the answer.

ChatGPT is trying to set the expectation that its answers come with a margin of error. It highlights this possibility based on the consequences of a likely prediction error. This is a nice way to mitigate risk because the disclaimers are shown in context instead of, say, a T&C page.

A driverless car could try to set the right expectation by informing passengers that it might drive slower or make sudden stops. But is that enough? What if it snows in San Francisco for the first time in 100 years and the car skids off the road because it has never been trained for this situation?

It needn’t snow in San Francisco for AVs to get into trouble, though. I witnessed a car get confused by a sudden road closure; it couldn’t reverse due to heavy traffic at the intersection. It tried for about two minutes before giving up and ceding control to a human operator, who was then able to steer the vehicle out of the situation.

The average passenger still feels a lot safer around human driver mistakes than around AI mistakes. Regardless of the root cause of this discrepancy, the product needs to know when to cede control and address the user’s anxiety. Here, control was ceded through an escape hatch that let another human take over.

Doordash applies the same pattern in a different context — its automated chatbot tries to address customer support queries, but a human operator gets involved if the bot isn’t able to help beyond a certain point.
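One hypothetical way to implement that handoff is a confidence threshold: if the model isn’t confident enough after a couple of attempts, escalate to a human. The `predict()` stub, the threshold, and the retry count below are all assumptions for illustration, not any real vendor’s API:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    answer: str
    confidence: float  # model's self-reported confidence, 0.0 to 1.0

def predict(query: str) -> Prediction:
    # Stub: a real system would call a model here.
    return Prediction(answer=f"Auto-reply to: {query}", confidence=0.42)

def handle(query: str, threshold: float = 0.7, max_attempts: int = 2) -> str:
    """Answer automatically if confident; otherwise cede control."""
    for _ in range(max_attempts):
        pred = predict(query)
        if pred.confidence >= threshold:
            return pred.answer
    return "Escalated to a human operator"  # the escape hatch

print(handle("Where is my order?"))
```

The design choice worth noting is that the escape hatch is part of the happy path, not an exception handler: the product treats “I’m not sure” as an expected, first-class outcome.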

As AI gets widely adopted across products, the importance of managing prediction errors will only increase. In the new world of AI-based products, it’s no longer sufficient to say “Oops! Something went wrong”. A better approach is to proactively manage these errors and treat them like features, not bugs.
