Building inclusive NLP

Every day, millions of standard English speakers enjoy the benefits of natural language processing (NLP) models.
But for African American Vernacular English (AAVE) speakers, technologies like voice-controlled GPS systems, digital assistants, and voice-to-text software often pose problems, because large NLP models are frequently unable to understand or generate words in AAVE. Worse still, models are often trained on data pulled from the web and tend to incorporate the racial biases and stereotypical associations that are rampant online.
When these biased models are used by companies to help make high-stakes decisions, AAVE speakers may find themselves unfairly restricted from social media, inappropriately denied access to housing or lending opportunities, or treated unfairly in the law enforcement or judicial systems.
For the past 18 months, machine learning (ML) expert Jazmia Henry has focused on finding ways to responsibly incorporate AAVE into language models. As a fellow at the Stanford Institute for Human-Centered Artificial Intelligence (HAI) and the Center for Comparative Studies on Race and Ethnicity (CCSRE), she created an open-source corpus of more than 141,000 AAVE words to help researchers and builders design models that are both inclusive and less prone to bias.
“My hope with this project is that social and computational linguists, anthropologists, computer scientists, social scientists and other researchers will poke and prod at this corpus, study with it, wrestle with it and test its limits so we can develop it to become a true representative of AAVE and provide feedback and insights on our potential next steps in the algorithm,” said Henry.
In this interview, she describes early obstacles in developing this corpus, its ability to help computational linguists understand the origins of AAVE, and her plans after Stanford.
How would you describe African American Vernacular English?
For me, AAVE is the language of perseverance and uplift. It is the result of African languages, thought to have been lost during the slave trade migration, being incorporated into English to create a new language used by the descendants of those African peoples.
How did you become interested in including AAVE in NLP models?
When I was a child, both my father and mother spoke their native languages from time to time. For my Caribbean father it was Jamaican Patois, and for my mother it was Gullah Geechee, found in the coastal regions of the Carolinas and Georgia. Each language is a creole, which is a new language created by mixing different languages.
Everyone seemed to understand that my parents were speaking another language, and no one doubted their intelligence. But when I saw people in my community speaking AAVE, which I believe is another creole language, I could tell that there was a sense of shame and stigma associated with it, a feeling that if we used this language outside, we would be judged as less intelligent. When I started working in data science, I wondered what would happen if I tried to collect data on AAVE and incorporate it into NLP models so we could really start to understand it and improve the performance of these models.
How did your project develop and what obstacles did you face?
There were many obstacles, and in the end I had to change my goal. AAVE evolves a lot faster than many languages and often turns standardized English on its head, giving words entirely new meanings. For example, the word “mad” is often defined as “angry.” However, in AAVE, it is often used to mean “very,” as in “mad funny.”
AAVE can also be largely determined by the situation, the speaker, and the tone used, which language processing models do not take into consideration. I finally decided to create an AAVE corpus, divided into four collections. The lyrics collection includes lyrics from 15,000 songs by 105 artists, from Etta James and Muddy Waters to Lil Baby and DaBaby.
The leadership collection includes speeches by influential individuals ranging from Frederick Douglass and Sojourner Truth to Martin Luther King and Ketanji Brown Jackson. The hardest to put together was the book collection, because African Americans are heavily underrepresented in the literary canon, but I included works from historical Black book archival collections at universities.
Finally, the social media collection is the most robust and diverse, including video recordings, blog posts, and 15,000 tweets, all gathered from Black thought leaders.
How do you expect your project to be used?
I know the corpus is starting to be used, but I still don’t know by whom or for what purpose. It is my hope that this preliminary work will inspire researchers to step into this space, question it, and push it forward to ensure that AAVE is represented in the languages used in NLP. Social and computational linguists can use this to help determine whether AAVE is really its own language or a dialect, and to look for connections between this language and other African languages, especially those that have not been recorded or preserved in Western history.
Growing up, we learn about what was taken from our enslaved ancestors and from their descendants. AAVE can be proof that not everything was taken away and that we were able to retain some of who we are in the way we communicate with each other. That knowledge has the power to remove shame and raise pride. When I say “What’s the matter, brother?” I’m not being unintelligent; I’m strategizing and calling on our ancestors with that speech.
NLP that cannot handle AAVE not only fails to reflect the broader community but actively discriminates against it. Large language models that have difficulty understanding or generating words in AAVE are more likely to exacerbate stereotypes about Black people in general, and these biased associations are being systematized in these models. When they are commercialized, these models, and their biases, can lead to companies making unfair decisions that affect the lives of AAVE speakers. This can lead to everything from individuals disproportionately having their social media posts edited or removed from platforms to discrimination in areas like housing and banking as well as the law enforcement and judicial systems.
What should NLP developers think about when they build tools?
There have been several popular NLP models that incorporate a lot of biases. Companies are working to rein in these problematic models, but that often comes with a focus on minimizing risk rather than minimizing bias. Instead of trying to find a solution, companies sometimes take the approach of “Don’t touch AAVE or anything related to Blackness anymore, because we didn’t get it right the first time.”
Instead, they should ask how they can do it correctly now. This is the time to build better models, improve processes, and come up with new ways to work with languages like AAVE, so that larger companies don’t continue to cause harm.
What are your plans for the future when you leave Stanford?
I am starting a new job at Microsoft, where I will be working as a senior application engineer on the autonomous systems team with Project Bonsai. We’re enhancing deep reinforcement learning with what we call “machine teaching,” which is essentially teaching machines how to perform tasks that can help humans work more efficiently, improve safety and enable autonomous decision-making using AI. This job gives me the opportunity to improve people’s lives, and I am grateful for this opportunity.
Beth Jensen is a contributing writer for Stanford’s Human-Centered AI Institute.
This story originally appeared on Hai.stanford.edu. Copyright 2023