Contextualized word embeddings are the state of the art in NLP, learned from large corpora of everyday language so that models can adapt quickly to new tasks. Unsurprisingly, along with a general understanding of language, these embeddings pick up the biases common in society. Using them in important tasks (semantic search, sentiment recognition, predictive modelling, etc.) is therefore problematic, as the risk of perpetuating bias is high.
Lexalytics has been working with a team of students at UMass Amherst to produce debiased corpora, using a variety of techniques to reduce gender-specific associations within the text. We will discuss our efforts to reduce gender bias in contextualized word embeddings, the effects on accuracy in downstream NLP tasks, challenges in working with names, and future work, including applicability to other forms of latent bias.
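The abstract does not specify which debiasing techniques were used, but one widely known approach to reducing gender-specific associations in a corpus is counterfactual data augmentation: for each sentence, generate a copy with gendered word pairs swapped so the model sees both variants. The sketch below is a minimal, hypothetical illustration of that idea, not Lexalytics' actual method; the word list and function names are assumptions.

```python
# Hypothetical sketch of counterfactual data augmentation for a corpus:
# swap gendered word pairs to produce a mirrored sentence. This is an
# illustration of one common debiasing technique, not the specific
# method described in the talk.

GENDER_PAIRS = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    # "his"/"her" are ambiguous between determiner and pronoun uses;
    # a real system would need POS disambiguation here.
    "his": "her",
    "man": "woman", "woman": "man",
    "men": "women", "women": "men",
    "boy": "girl", "girl": "boy",
}

def swap_gendered_words(sentence: str) -> str:
    """Return a counterfactual copy of the sentence with gendered
    terms swapped, preserving simple capitalization and punctuation."""
    out = []
    for token in sentence.split():
        # Strip trailing punctuation so "him." still matches "him".
        core = token.rstrip(".,;:!?")
        tail = token[len(core):]
        swapped = GENDER_PAIRS.get(core.lower())
        if swapped is None:
            out.append(token)
        else:
            if core[:1].isupper():
                swapped = swapped.capitalize()
            out.append(swapped + tail)
    return " ".join(out)

print(swap_gendered_words("He gave his book to her."))
# prints "She gave her book to him."
```

Even this toy version hints at why names are hard, a challenge the talk calls out: a dictionary of pronoun pairs says nothing about whether "Jordan" or "Sam" should be swapped, so handling names requires a different strategy entirely.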