
Google’s AI hate speech detector is easily fooled by a few typos or the insertion of words like “love”

This is a guest post by Jeremy Leggett. Views are the author's own and do not necessarily represent the opinions or positions of MyGridGB or Dr Andrew Crossland.

Researchers at Aalto University in Finland tested seven systems used for hate-speech detection and found that simple keyword filters and complex AIs are equally vulnerable to workarounds.

New Scientist reports that some words were particularly effective at masking hateful content because of their strong positive connotations. “…For example, a sentence that Perspective assigned a ‘toxicity’ score of 0.99 – with 1 being peak obscenity – could be reduced to 0.15 simply by adding the word ‘love’.”
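To make the two workarounds concrete, here is a minimal toy sketch in Python. It is not Perspective's model or the researchers' code: the blocklist, the per-word weights, and the function names `keyword_filter` and `average_score` are all assumptions invented for illustration. It only shows why a typo can slip past an exact-match keyword filter, and why a naive scorer that averages over words can be diluted by adding a strongly positive word.

```python
# Toy illustration only: NOT Perspective's model or the Aalto researchers' code.
# The blocklist and the per-word weights below are made-up assumptions.
BLOCKLIST = {"hate"}
WORD_SCORES = {"hate": 0.99, "love": 0.01}
NEUTRAL = 0.5  # assumed score for any word the toy model has never seen


def keyword_filter(text: str) -> bool:
    """Flag text if any whitespace-separated word is exactly on the blocklist."""
    return any(word in BLOCKLIST for word in text.lower().split())


def average_score(text: str) -> float:
    """Score text as the mean of its per-word weights (naive bag-of-words)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(WORD_SCORES.get(word, NEUTRAL) for word in words) / len(words)


if __name__ == "__main__":
    # A typo or a missing space defeats exact keyword matching.
    for text in ["I hate you", "I h4te you", "Ihateyou"]:
        print(f"{text!r}: flagged={keyword_filter(text)}")

    # Appending a positive word pulls the averaged score down.
    for text in ["I hate you", "I hate you love"]:
        print(f"{text!r}: score={average_score(text):.2f}")
```

Real classifiers are far more sophisticated than this, but the sketch captures the intuition behind the reported results: small surface changes to the text can shift the output a long way without changing what a human reader understands.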

Image: Council of Europe