Mariam Raashed

VoiceLab

User Research, User Interface, User Experience, App Design

Overview

Addressing bias in ASR systems through inclusivity and design

VoiceLab is a platform that seeks to lend a voice to those who have been rendered voiceless. Users can access two main features: one lets them immortalise their own voice or choose a voice from the database to speak with, and the other lets them contribute their voice to enrich the dataset by recording themselves. Depending on the individual, the application provides a speech-to-text and a text-to-speech model to use.

01

Problem Statement

Automated speech recognition (ASR) systems are becoming increasingly crucial to everyday life. However, they carry an inherent bias that can be deeply problematic for users who do not have full command over their voice, or who speak a language with a different accent, or even at a different pitch.
The ability to raise your voice is an absolute privilege, but even more so if you are actually heard. Being unheard when we raise our voice is one thing, but allowing technology to dismiss us completely, forcing us to change the way we speak into something more ‘acceptable’, is unthinkable.
This leads us to believe our initial hypothesis is true: the AI data gap creates a bias within ASR systems.

02

Solution

The proposed solution takes the form of an application, 'VoiceLab', that invites users to voluntarily contribute by lending their voice to create a more diverse dataset of voices. Leveraging nudge theory, the app gently nudges users towards contributing their voice to a more inclusive dataset.

03

Research

ASR applications turn up in places you might not expect, such as self-checkout kiosks in grocery stores. In the near future, voice interfaces may become more popular than touch-screen devices, changing the way people interact with the world.

Even state-of-the-art automatic speech recognition (ASR) algorithms struggle to recognize the accents of people from certain regions of the world. That is the top-line finding of one recent study.

Since the debut of IBM’s Shoebox in 1961, ASR systems have progressed far beyond its initial capabilities of performing simple arithmetic and recognizing sixteen spoken words (Ngueajio & Washington, 2022). However, it is crucial to question whether technology has truly advanced if it continues to perpetuate long-standing social stereotypes, particularly against marginalized communities.
A 2020 Stanford study of ASR systems from five major companies in the voice assistant space found racial bias in their output: the systems misidentified 35% of words from Black users, compared with only 19% of words from White users (Metz, 2020). These systems learn by analyzing vast amounts of data, so if a machine learning model analyzes only the voice patterns of White users, bias will inevitably occur.
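
Findings like these are typically reported as word error rate (WER): the share of reference words a system gets wrong. The study's own tooling is not shown here, but a minimal sketch of the standard WER calculation, using a made-up transcript pair, looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference length,
    computed as word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# Hypothetical pair: one wrong word out of five gives a WER of 0.2 (20%),
# the same way the 35% and 19% figures above are expressed.
print(word_error_rate("please play the next song", "please play the best song"))
```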

04

User Functionality

The concept is to create an interactive, online platform that engages users in a way that lets the AI learn and collect data while giving the user an output in return. In this case, the aim is to collect the different tones, pitches, octaves and accents of many users in order to widen the spectrum of the AI itself; a sketch of this interaction loop follows the feature list below.

  • Responsive conversational AI to engage the user
  • Visual responses during a conversation
  • User is prompted by choosing a category (Song, Story, News)
  • User records their voice by repeating the prompted sentences, widening the dataset
  • User can listen to the same prompt repeated in the different voices in the dataset
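
A minimal sketch of how this interaction loop might be structured; the categories come from the list above, while the class, prompt bank, and method names are hypothetical:

```python
import random

# Hypothetical prompt bank keyed by the categories named above.
PROMPTS = {
    "Song": ["Row, row, row your boat, gently down the stream."],
    "Story": ["Once upon a time, a quiet voice changed everything."],
    "News": ["Researchers report progress on inclusive speech datasets."],
}

class VoiceLabSession:
    """Sketch of one contribution loop: prompt, record, store, play back."""

    def __init__(self) -> None:
        # Maps each prompt to the recordings contributed for it so far.
        self.recordings: dict[str, list[bytes]] = {}

    def get_prompt(self, category: str) -> str:
        """User is prompted by choosing a category (Song, Story, News)."""
        return random.choice(PROMPTS[category])

    def submit_recording(self, prompt: str, audio: bytes) -> None:
        """User repeats the sentence; their recording widens the dataset."""
        self.recordings.setdefault(prompt, []).append(audio)

    def other_voices(self, prompt: str) -> list[bytes]:
        """User can listen to the same prompt in the other voices collected."""
        return self.recordings.get(prompt, [])
```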

Voice Input

Record Your Voice
The user will be able to contribute their voice by three means:
  1. Uploading a real-time recording of up to fifteen seconds
  2. Reading a prompted script while recording their voice
  3. Uploading snippets of voice recordings, videos or other audio files, also limited to fifteen seconds
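
All three routes share the fifteen-second cap. A minimal sketch of enforcing it, assuming WAV input read with Python's standard-library wave module (other containers, such as video files, would need a different reader):

```python
import wave

MAX_SECONDS = 15.0  # cap shared by all three contribution routes

def clip_duration_seconds(path: str) -> float:
    """Duration of a WAV clip in seconds."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()

def accept_contribution(path: str) -> bool:
    """Reject any clip over the fifteen-second limit before it enters the pipeline."""
    return clip_duration_seconds(path) <= MAX_SECONDS
```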

Voice Input Validation

Validate Voices

In order to be processed effectively, audio input must be validated by multiple users; this validation ensures that the voice input is accurate and reliable. Once a clip is in the system, other users can confirm, reject, or report it, depending on the quality and accuracy of its transcription. The listener’s goal is to judge how closely the spoken words match the transcription. Once a certain percentage of users, typically 50%, has confirmed or rejected a clip, it is either fully rejected or uploaded to the global database for others to access, ensuring that every clip is vetted and verified before being made available to a wider audience.
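
A minimal sketch of that decision rule, using the 50% threshold described above; the vote structure and the handling of reported clips are illustrative assumptions:

```python
from dataclasses import dataclass

APPROVAL_THRESHOLD = 0.50  # share of confirm votes needed, per the text above

@dataclass
class ClipVotes:
    confirms: int = 0
    rejects: int = 0
    reports: int = 0  # assumption: reported clips are held for manual review

def clip_status(votes: ClipVotes) -> str:
    """Decide whether a validated clip joins the global database."""
    if votes.reports > 0:
        return "held_for_review"
    total = votes.confirms + votes.rejects
    if total == 0:
        return "pending"
    if votes.confirms / total >= APPROVAL_THRESHOLD:
        return "published"  # uploaded to the global database
    return "rejected"

# Example: 6 confirms against 4 rejects clears the 50% bar.
print(clip_status(ClipVotes(confirms=6, rejects=4)))  # published
```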

Analysis

View Your Analysis
A data visualisation lets users see the collective contribution they have made to the database. The visualisation represents the user’s entire input: the hours invested in recording their voice, with or without prompts, and the hours spent validating other voices that eventually entered the database. From this, users can gauge the degree of change they have effected. The goal is to value both kinds of effort equally, whether a user chooses to record their own voice or to validate the voices of others.
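
A minimal sketch of how the totals behind such a visualisation might be aggregated, with recording and validation hours weighted equally; the record structure and field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Contribution:
    kind: str    # "recording" or "validation"
    hours: float

def contribution_summary(entries: list[Contribution]) -> dict[str, float]:
    """Total hours per activity, counted equally so neither effort is favoured."""
    summary = {"recording": 0.0, "validation": 0.0, "total": 0.0}
    for entry in entries:
        summary[entry.kind] += entry.hours
        summary["total"] += entry.hours
    return summary

# Example: 2 hours of recording plus 3 hours of validating other voices.
print(contribution_summary([Contribution("recording", 2.0),
                            Contribution("validation", 3.0)]))
# {'recording': 2.0, 'validation': 3.0, 'total': 5.0}
```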

05

Design Process

App Information Architecture

06

Wireframe

Sketches/ Ideation

VoiceLab

Users will be able to contribute to the database in two ways: by recording their voice and by validating other voices. Once they select “Give Voice” as their primary use for the application, their main screen upon opening the app will be the one that lets them record through the different ‘Voice Inputs’. These scenarios serve users who want to contribute to equal representation, as well as those helping to build the database for users searching for a voice.

08

Conclusion

The aim of this thesis design project was to create a platform that gives individuals a space to help bridge the growing data gap in AI, while also giving back to the people who earnestly contribute to a more inclusive database. In questioning the huge data gap that persists in Artificial Intelligence, and in Automated Speech Recognition specifically, it was important to highlight how neglected these decisions have been and to explore that neglect in detail.