Measuring the experience of AI. Finding emergent relevance during…

There are many answers to this question, ranging from UX frameworks and methods to job descriptions, strategy exercises, design guidelines, and more.

I’ll just pick one. Having addressed the macro-level question of relevance, I’d like to visit the other end and make a practical, micro-level contribution. I’ll show how we’ve measured UX in an AI-driven content recommender, taking relevance into account.

Over the past year, we’ve been developing an application that curates content according to users’ interests. We had already done a round of discovery research and prototyping that highlighted the importance of emergent relevance and context in use. So we wanted to test whether it really worked for users, with real production data, real user profiles, and real preferences. We decided to test for the following criteria, all of which had emerged from the early discovery research:

  • Content relevance: How interesting do our users find the content?
  • Trust: How controllable, understandable, and transparent does it feel?

I set out to design a tool to measure these aspects. The result was an attitudinal questionnaire that can be integrated into user testing, whether moderated or not. It’s an evaluative tool that can also be used in an exploratory or generative way when administered in a moderated, open-ended setting.

In the early 2000s, I had some experience in implementing and evaluating information retrieval and search engines (Halabi, Islim & Kurdy, 2010). The convention at the time was to rely on formal performance metrics (such as precision and recall), measured against standard benchmark datasets, to tell whether a retrieval technique was good. Over time, algorithms improved to the point where the formal quality of retrieval and recommendation became less of an issue. Formal metrics therefore became less useful in predicting actual user adoption, and user experience played an increasingly important role.
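
For context, formal metrics like these are computed offline against a benchmark, with no user in the loop. Here is a minimal sketch of one classic formal metric, precision@k, with placeholder item IDs:

```python
# A minimal sketch of precision@k: the fraction of the top-k recommended items
# that appear in a benchmark's set of relevant items. Item IDs are placeholders.

def precision_at_k(recommended, relevant, k):
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

recommended = ["a", "b", "c", "d", "e"]  # ranked output of a retrieval technique
relevant = {"b", "d", "x"}               # ground-truth relevant items from the benchmark
print(precision_at_k(recommended, relevant, k=5))  # 0.4
```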

Despite this growing recognition of the importance of user experience (see reading list), I was surprised by the lack of robust, validated tools for measuring UX on retrieval platforms. Indeed, recent surveys still report an over-reliance on formal metrics (e.g., Bauer, Zangerle & Said, 2023). This resonated with how we worked in my organisation, where ML engineers provided recommendations as a black-box service. It was a simple, classic case where we needed more rapid prototyping with real data, not only to test the UI and interaction, but also to test how different algorithms contributed to the experience.

Fortunately, Pu, Chen & Hu (2011) had addressed a similar problem in UX measurement. They developed a validated questionnaire based on psychometric modelling and factor analysis, and showed how certain experience factors, such as relevance, explanation, and variety, contributed to satisfaction and intention to use.

This was a good starting point; these categories fit very well with the factors we wanted to measure. Using this framework, we could explain the intention to use that we measured by linking it to different aspects of the UX of the AI-driven solution. Relevance here corresponds to recommendation accuracy in Pu et al.’s model (Figure 1).

Causal model of UX, linking the highest-level constructs Purchase Intention and Use Intention with constructs based on User-Perceived Qualities, User Beliefs, and User Attitudes.
Figure 1. Structural model fit, used with permission of the publisher (ACM), from: Pu, Chen & Hu (2011)

With the help of my colleagues on the product team (all of whom have long experience in search and recommendation), I adapted Pu, Chen & Hu’s model. I consolidated similar constructs that showed a strong correlation in the original article, while preserving the causal relationships to maintain explanatory power. Specifically:

  • I merged Explanation and Transparency.
  • I merged Use Intention and Purchase Intention into Use & Convergence.
  • I merged Interface Adequacy and Ease of Use.
  • I merged Interaction Adequacy and Control.

In addition, we integrated Recommendation Timeliness into the model, as we had found it to be an important factor in our earlier research. Fortunately, this notion was also validated in a later study (Chen et al., 2019). See the resulting causal model that I adapted (Figure 2).

Causal model of UX, linking the highest-level construct, Use & Convergence, with two mid-level ones: Trust & Confidence and Usefulness & Satisfaction. These two are then linked to lower-level constructs: Transparency, Control, Recommendation Relevance, Recommendation Novelty, Recommendation Diversity, Recommendation Timeliness, Information Sufficiency, and Ease of Use.
Figure 2. Our simplified causal model for measuring UX in AI-driven applications
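
To make the model easier to work with later in analysis, it helps to write it down explicitly. Below is a minimal sketch in Python that encodes it as a plain mapping from each construct to its drivers; note that the exact assignment of the lower-level constructs to the two mid-level ones is an illustrative assumption on my part (guided by the examples discussed later), not a verbatim reading of the figure.

```python
# A sketch of the adapted causal model (Figure 2) as a mapping: construct -> its drivers.
# NOTE: the grouping of lower-level constructs under the two mid-level constructs is an
# assumption for illustration; adjust it to match your own model.
CAUSAL_MODEL = {
    "Use & Convergence": ["Trust & Confidence", "Usefulness & Satisfaction"],
    "Trust & Confidence": ["Transparency", "Control"],
    "Usefulness & Satisfaction": [
        "Recommendation Relevance",
        "Recommendation Novelty",
        "Recommendation Diversity",
        "Recommendation Timeliness",
        "Information Sufficiency",
        "Ease of Use",
    ],
}
```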

Questionnaire items

For the final wording of the questionnaire, I took most of the wording from Pu, Chen & Hu (2011), with the exceptions noted below. All questions are measured on a 5-point Likert scale:

  1. Transparency: I understood why the items were recommended to me.
  2. Control: I found it easy to tell the system what I like/dislike.
  3. Recommendation Relevance: The items recommended to me matched my interests.
    — Follow-up: How many items from this list of recommendations would you investigate further?*
    Note: I’ve added this follow-up to get more granular details on relevance.
  4. Recommendation Novelty: The items recommended to me are novel.
  5. Recommendation Diversity: The items recommended to me are diverse.
  6. Recommendation Timeliness: The items recommended to me are timely.
  7. Information Sufficiency: The information provided for the recommended items is sufficient for me to make a decision.
  8. Ease of Use: How difficult or easy did you find it to use the system?*
    Note: I used the standard wording of the Single Ease Question (SEQ)
  9. Trust & Confidence: I am convinced by the items recommended to me, and I trust them.
  10. Usefulness: I feel supported in finding what I like with the help of the recommender.
  11. Use & Convergence: I would use this recommender often.
    Note: Depending on the end result you want to measure, you can reword this item for convergence to action, recommendation, willingness to promote, etc.
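
As a minimal scoring sketch (with invented responses and assumed field names mirroring the construct labels above), per-construct means across participants can be computed like this:

```python
# Aggregate 5-point Likert responses per construct across participants.
# The responses below are invented; keys mirror the construct labels above.
from statistics import mean

responses = [
    {"Transparency": 4, "Control": 3, "Recommendation Relevance": 5, "Ease of Use": 4},
    {"Transparency": 2, "Control": 4, "Recommendation Relevance": 4, "Ease of Use": 5},
]

def construct_means(responses):
    scores = {}
    for response in responses:
        for construct, value in response.items():
            scores.setdefault(construct, []).append(value)
    return {construct: round(mean(values), 2) for construct, values in scores.items()}

print(construct_means(responses))
# {'Transparency': 3.0, 'Control': 3.5, 'Recommendation Relevance': 4.5, 'Ease of Use': 4.5}
```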

I also wanted to compare users’ attitudes to our new platform with what they were already using, so I added the following benchmark section:

  1. What tools do you use to keep up with your areas of interest?
  2. Overall, how satisfied are you with your existing tools and practices for keeping up to date?
  3. In comparison, how satisfied are you with the recommender you’ve seen in this experiment for keeping up to date?
  4. Imagine you had a development budget of 5000 to support tools to help professionals keep up to date. Where would you invest it? Options:
    — The tools you currently use to keep up to date
    — Improve this recommender to replace the tool(s) you use
    — Other: elaborate
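
A small sketch for summarising this benchmark section, with invented answers and assumed field names; the satisfaction delta and the budget tally give a quick read on whether the recommender is seen as worth switching to:

```python
# Summarise the benchmark section: per-respondent satisfaction delta between the
# tested recommender and existing tools, plus a tally of the budget-allocation answers.
# All values and field names below are invented for illustration.
from collections import Counter
from statistics import mean

benchmark = [
    {"existing_satisfaction": 3, "recommender_satisfaction": 4, "budget": "improve this recommender"},
    {"existing_satisfaction": 4, "recommender_satisfaction": 4, "budget": "current tools"},
]

deltas = [b["recommender_satisfaction"] - b["existing_satisfaction"] for b in benchmark]
print("mean satisfaction delta:", mean(deltas))                 # positive favours the recommender
print("budget votes:", Counter(b["budget"] for b in benchmark))
```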

How to use this questionnaire?

The structural model provides an explanatory framework for interpreting the data (Figure 2). For example, if you observe low intentions for Use & Convergence, you can trace this back to other poorly scored factors (e.g. was Usefulness poor because of low Relevance? Or was Trust low due to low Transparency?) Use this reasoning to determine which factors need to be improved, including algorithm design with the ML engineers.
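
Here is a minimal sketch of that tracing step, reusing the (assumed) model structure from the earlier sketch; the 3.5 cut-off on the 5-point scale and the scores are purely illustrative:

```python
# Trace a weak top-level construct back to its weak drivers in the causal model.
# The model grouping, the scores, and the 3.5 threshold are illustrative assumptions.
CAUSAL_MODEL = {
    "Use & Convergence": ["Trust & Confidence", "Usefulness & Satisfaction"],
    "Trust & Confidence": ["Transparency", "Control"],
    "Usefulness & Satisfaction": ["Recommendation Relevance", "Ease of Use"],
}

def weak_drivers(construct, scores, threshold=3.5):
    """Collect upstream constructs whose mean score falls below the threshold."""
    weak = []
    for driver in CAUSAL_MODEL.get(construct, []):
        if scores.get(driver, 5) < threshold:
            weak.append(driver)
        weak.extend(weak_drivers(driver, scores, threshold))
    return weak

scores = {
    "Trust & Confidence": 3.0, "Transparency": 2.5, "Control": 4.2,
    "Usefulness & Satisfaction": 4.0, "Recommendation Relevance": 4.4, "Ease of Use": 4.6,
}
print(weak_drivers("Use & Convergence", scores))
# ['Trust & Confidence', 'Transparency'] -> low trust appears driven by low transparency
```

In this invented example, the weak Use & Convergence signal traces back through Trust & Confidence to Transparency, which would steer the conversation with the ML engineers towards explanation features rather than the ranking itself.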

To improve the signal quality, we decided that it was best to integrate the questionnaire into the flow of a user test (and later into the actual pilot). The tested product can be at any level of fidelity, but it’s important that the content recommendations are real, as they are the core of the test.

Depending on the stage of your product, if you’re doing early testing or seeking qualitative feedback, you could do as we did and combine it with interviews or moderated testing, using it as a prompt to dig deeper qualitatively. In later stages, once you have a functional pilot, you can combine it with product analytics and run it as a non-moderated on-site survey to get an attitudinal signal that helps explain the behavioural data.

In our scenario, we planned a 3-phase approach to iterate and triangulate for greater confidence:

  1. Initial test with a functional POC: moderated, qualitative, small-sample user testing, culminating with the questionnaire.
  2. Pilot test with an alpha release: unmoderated, larger-sample test concluding with the questionnaire. This would give us a reliable signal of user attitudes.
  3. Online, real-world evaluation with A/B testing: Analytically measure actual behaviour, combined with the questionnaire, to correlate attitudinal responses with adoption, retention and churn.
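
For the third phase, here is a minimal sketch (with invented toy numbers, and assuming SciPy is available) of correlating one attitudinal item with a behavioural signal from analytics:

```python
# Correlate the Use & Convergence item with an observed retention signal.
# Numbers are invented; in practice they would come from the survey tool and analytics.
from scipy.stats import spearmanr

use_convergence = [4, 5, 2, 3, 4, 5, 1, 3]   # "I would use this recommender often" (1-5)
weeks_retained = [6, 8, 1, 3, 5, 7, 1, 4]    # weeks of observed activity per respondent

rho, p_value = spearmanr(use_convergence, weeks_retained)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")  # a strong positive rho supports the attitudinal signal
```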
