Measuring Machine Learning User Experience

Corinne Schillizzi
Published in DataDrivenInvestor
5 min read · Nov 20, 2023


Image generated by ChatGPT

Gartner reports that the use of AI by companies grew by 270% over the past four years. Alongside this rise in usage, however, AI projects have a high failure rate. Gartner’s 2018 report predicted a challenging road for AI and ML projects, forecasting that 85% would fail by 2022.

According to HBR, as of 2023 those predictions have been on point. AIMultiple Research reports that:

  • 70% of companies report minimal or no impact from their AI investments.
  • A staggering 87% of data science projects never transition into a production stage.

In this brief article, I will share why this happens, what UXers and product folks can do about it, and what they can measure.

The why behind the failure

While the commonly cited reasons for these shortcomings include poor datasets and a lack of skilled talent, my experience points to a more fundamental issue: inadequate scope definition caused by a tech-driven approach. There is so much enthusiasm around AI, and especially ML models, that the focus lands on simply building the best-optimized model possible. No wonder AI projects fail to deliver a return on investment. While we are watching amazing technological advancements happen, these projects still fail to deliver tangible business value because they overlook the human element.

The preoccupation with perfecting a specific model or algorithm leads to missing the forest for the trees. AI is just a component of a broader solution. It doesn’t need to be flawless. Its effectiveness should be measured within the context of the entire system and within the scope of addressing human needs.

This calls for a human-centered approach, recognizing that a product lacking customer or user value cannot yield a business return on investment.

Value is created when there is feature desirability for customers, viability at the business level, and feasibility for technology. During the last few years, innovation involving AI has been driven by the last point of feasibility for technology, with an approach that I would synthesize as “if we can build it, let’s do it.”

What is needed?

We need product folks (UXers, product designers, product managers, and whoever is working on those projects from a product perspective) to start defining the scope around users. It is critical to keep the focus on human needs. Starting with user research can pivot machine learning goals toward creating something that truly solves a problem (you can find a framework and tools in my book, linked later in this article).

On the other hand, launching a model into production is just the beginning; it initiates a new cycle of data collection, training, and monitoring. This is especially true for ML, which, at its core, is about learning, and learning happens through experience. Measuring how well that learning is happening is essential to understanding whether we are meeting needs and expectations.

Measuring UX metrics offers a tangible way to evaluate and guide ML development, iterating and evolving the system’s capabilities according to user needs. As Ben Shneiderman said in this podcast: “Feedback is the breakfast of champions. And companies know that.” Collecting human feedback on AI is fundamental. Collecting human-driven metrics for AI can be less straightforward than measuring the model itself, but that doesn’t make it any less important. As Don Norman says in Design for a Better World, we need to avoid measuring only what is easiest to measure.

Practices in measuring and tracking both qualitative and quantitative data on AI-human interactions still need to be defined (if you find resources I may have missed, feel free to share them in the comments).

In the following section, I synthesize three UX metrics that stand out for their ability to gauge the user experience of an AI model: AI Trust Score, Perceived Accuracy Score, and Behavioral CSAT. Each of these metrics offers a unique lens through which we can evaluate and improve AI systems.

How to measure?

1. AI Trust Score

Developed by Microsoft Research, the AI Trust Score is a metric that assesses the degree of trust users place in an AI system. This score is pivotal because trust is a fundamental component in the acceptance and successful integration of AI technologies into our daily lives.

How to Measure: The AI Trust Score is typically gauged through user surveys and feedback mechanisms. The measurement consists of several statements users respond to, covering dimensions such as understandability, control, and awareness of the AI feature. By analyzing the responses, teams can quantify the level of trust and identify areas for improvement in their AI systems.
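To make this concrete, here is a minimal sketch of how such survey responses could be aggregated into a single score. The statement wording, the 7-point scale, and the normalization below are my own illustrative assumptions, not Microsoft’s actual instrument:

```python
# Hypothetical sketch: aggregating Likert-style trust survey responses
# into a single 0-100 score. Statements and the 7-point scale are
# illustrative assumptions, not Microsoft's actual instrument.

STATEMENTS = [
    "I understand why the AI made this suggestion.",  # understandability
    "I feel in control when using the AI feature.",   # control
    "I know when the AI is (and isn't) being used.",  # awareness
]

def trust_score(responses: list[list[int]], scale_max: int = 7) -> float:
    """Average all ratings (1..scale_max) and normalize to 0-100.

    responses: one list of per-statement ratings per user.
    """
    ratings = [r for user in responses for r in user]
    return 100 * (sum(ratings) / len(ratings)) / scale_max

# Example: three users rating the three statements above
print(round(trust_score([[6, 5, 7], [4, 4, 5], [7, 6, 6]]), 1))  # 79.4
```

Tracking this score across releases shows whether changes to explanations or controls actually move user trust.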

2. Perceived Accuracy Score

The Perceived Accuracy Score measures users’ perceptions of an AI system’s accuracy. As utilized by the researchers of this paper, this subjective metric is critical because even technically accurate systems can fail if users perceive them as unreliable; conversely, it also captures over-reliance, where users trust what the machine says even when it is wrong.

How to Measure: This score is usually obtained by asking users to rate the accuracy of the AI system after an interaction. Questions might include how often the system provides correct and relevant results or meets user expectations in terms of performance. This feedback helps developers understand how users perceive the system’s accuracy and where it might diverge from actual performance metrics.
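As an illustration, here is a simple sketch that normalizes post-interaction ratings and compares them against the model’s measured accuracy; the question wording and 5-point scale are assumptions on my part. The gap between perception and measurement is often the most actionable number:

```python
# Hypothetical sketch: comparing perceived accuracy (post-interaction
# user ratings) with measured model accuracy. The 5-point scale and
# question wording are illustrative assumptions.

def perceived_accuracy(ratings: list[int], scale_max: int = 5) -> float:
    """Normalize 1..scale_max answers to 'How accurate were the results?' to 0-1."""
    return (sum(ratings) / len(ratings)) / scale_max

def perception_gap(ratings: list[int], model_accuracy: float) -> float:
    """Positive gap: users over-trust the model; negative: they under-trust it."""
    return perceived_accuracy(ratings) - model_accuracy

ratings = [4, 5, 3, 4, 4]  # post-interaction survey answers
print(round(perception_gap(ratings, model_accuracy=0.92), 2))  # -0.12 -> under-trust
```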

3. Behavioral CSAT

Customer Satisfaction Score (CSAT) is a widely used attitudinal metric that gauges customer satisfaction with a product or service by explicitly asking customers to rate their experience. For ML products, however, explicitly asking customers about their satisfaction might not be enough because of the dynamic nature of the output. Behavioral CSAT is instead calculated by treating behavior patterns as independent variables, with CSAT as the dependent metric. For example, you can compare the actions suggested by the AI model with the actual actions taken by users, or analyze which features of the GenAI system are used most and least. It is a tangible measure of how well the AI aligns with user behavior and decision-making.

How to Measure: To calculate this metric, track the number of times users follow the AI system versus when they deviate. This ratio gives a clear picture of the system’s effectiveness in guiding user decisions. For instance, in a navigation app, if users frequently ignore suggested routes, it might indicate a gap in the system’s understanding of user preferences or real-world conditions.
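Here is a minimal sketch of that follow-versus-deviate calculation. The event schema below is a hypothetical example, not a standard logging format:

```python
# Hypothetical sketch: Behavioral CSAT as the share of AI suggestions
# that users actually follow. The event schema is an illustrative
# assumption, not a standard logging format.

events = [  # one record per AI suggestion shown to a user
    {"user": "u1", "suggested": "route_A", "taken": "route_A"},
    {"user": "u2", "suggested": "route_A", "taken": "route_B"},  # deviated
    {"user": "u3", "suggested": "route_C", "taken": "route_C"},
    {"user": "u1", "suggested": "route_B", "taken": "route_B"},
]

def behavioral_csat(events: list[dict]) -> float:
    """Fraction of suggestions followed (0-1); higher means better alignment."""
    followed = sum(e["suggested"] == e["taken"] for e in events)
    return followed / len(events)

print(f"Behavioral CSAT: {behavioral_csat(events):.0%}")  # 75%
```

Segmenting this ratio by feature or user cohort can reveal where the model diverges most from real-world behavior.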

These three metrics can enable product folks to track and understand the user experience of an ML-powered product and, if needed, refine the scope to align with user expectations and perceived trust and accuracy.

Does this perspective resonate with you? If you’re intrigued and want to delve deeper into the processes, practices, and tools necessary for building effective Human-Machine Learning systems, you’ll find much more in my book >>> https://human-machinelearning.com/

Which metrics have you experimented with for ML-powered products? Do you want to share your experience in measuring AI? Reach out at hml@corinneschillizzi.com



I’m a User Experience Researcher and Designer, and the author of the Human-Machine Learning book.