Problem: neural networks do not understand what they say. After training on large text corpora scraped from social networks, they can suddenly start justifying slavery and racism, or telling a person how to harm themselves.
Solution: a neural model that detects "inappropriate" statements on "sensitive" topics so they can be filtered out of chatbot output.
So far, 18 topics have been defined as "sensitive", including drugs, pornography, politics, religion, suicide, and crime. The main criterion is whether a statement could harm the human interlocutor or damage the reputation of the chatbot's owner. The training data was collected from Dvach and Otvety Mail.ru.
Official press release: https://www.skoltech.ru/en/2021/07/neural-model-seeks-inappropriateness-to-reduce-chatbot-awkwardness/
Published paper for a deeper understanding of the research: https://aclanthology.org/2021.bsnlp-1.4/
Pre-trained model for inappropriate utterance detection: https://huggingface.co/Skoltech/russian-inappropriate-messages
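A minimal sketch of how the published checkpoint could be run with the Hugging Face transformers library; the example sentence and the filtering threshold are illustrative, and the exact label names come from the model's own config rather than from this note:

```python
from transformers import pipeline

# Load the published checkpoint as a text-classification pipeline.
classifier = pipeline(
    "text-classification",
    model="Skoltech/russian-inappropriate-messages",
)

# Score a candidate chatbot reply before showing it to the user.
result = classifier("Пример сообщения для проверки")[0]
print(result)  # e.g. {'label': ..., 'score': ...}

# Illustrative filtering rule: drop the reply if the model flags it as
# inappropriate with sufficient confidence (threshold chosen arbitrarily here).
if result["label"] != "LABEL_0" and result["score"] > 0.75:
    print("Reply filtered out as potentially inappropriate.")
```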