NSFW filters in nsfw ai chat systems are subject to this same abuse, mainly because these models can be used in de-anonymization or re-identification contexts. A chat bot's filter typically depends on NLP classifiers trained to identify explicit content, but users can bend the rules with coded language, slang, or simple substitutions such as all-caps variants, which make for easy workarounds. Any such classifier carries a finite error rate, typically 5-10%, which leaves room for thousands of manipulative attempts to slip through per million interactions and impairs the model's effective performance.
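To make the evasion concrete, here is a minimal sketch, assuming a toy keyword blocklist rather than a real moderation model, of how trivial obfuscation slips past a naive filter and how a normalization pass recovers some, but never all, of it:

```python
import re

# Illustrative blocklist and rules only; not a real moderation system.
BLOCKLIST = {"explicit", "nsfw"}

def naive_filter(text: str) -> bool:
    # Flags only exact lowercase token matches.
    return any(tok in BLOCKLIST for tok in text.lower().split())

def normalize(text: str) -> str:
    # Undo common evasions: casing, leetspeak digits, separator padding.
    text = text.lower()
    text = text.translate(str.maketrans("013457", "oieast"))  # "3xpl1c1t" -> "explicit"
    return re.sub(r"[\s._\-*]+", "", text)  # "e-x-p-l-i-c-i-t" -> "explicit"

def hardened_filter(text: str) -> bool:
    # Substring match on normalized text catches padded/obfuscated tokens.
    return any(word in normalize(text) for word in BLOCKLIST)

print(naive_filter("3XPL1C1T chat"))      # False: obfuscation slips through
print(hardened_filter("3XPL1C1T chat"))   # True: normalization recovers the token
print(hardened_filter("novel slang"))     # False: unseen coded language still evades
```

The last line is the crux: normalization only covers evasions you have already seen, which is why the residual error rate never reaches zero.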
Adversarial attacks also pose a manipulation risk. An adversarial attack exploits a vulnerability in the model by feeding it deliberately crafted inputs or prompts that force an unintended response. The tactic is well documented across AI models as adversarial examples, where slight input alterations drastically change the output. Hardening a system against adaptive attackers is expensive: researchers estimate it could require up to 20 percent more training data, which in turn makes model development both longer and costlier.
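As an illustration, here is a hedged sketch of a character-level adversarial attack on a toy bag-of-words classifier; the vocabulary, weights, and threshold are invented, but the principle of minimal edits that flip a decision is the same one real attacks exploit:

```python
# Toy per-token weights and threshold, made up for illustration.
WEIGHTS = {"explicit": 2.0, "content": 0.5}
THRESHOLD = 1.0

def score(text: str) -> float:
    return sum(WEIGHTS.get(tok, 0.0) for tok in text.lower().split())

def greedy_attack(text: str) -> str:
    """Perturb one character at a time until the score drops below the
    threshold, keeping the message human-readable."""
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if WEIGHTS.get(tok.lower(), 0.0) > 0:
            # Replace one interior character with a visually similar symbol.
            tokens[i] = tok[:1] + "*" + tok[2:]
            if score(" ".join(tokens)) < THRESHOLD:
                break
    return " ".join(tokens)

msg = "explicit content"
print(score(msg))        # 2.5 -> blocked
adv = greedy_attack(msg)
print(adv, score(adv))   # "e*plicit content" 0.5 -> passes the filter
```

A human still reads the perturbed message as intended, but the model's score collapses, which is exactly the gap adaptive attackers probe for.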
Model behavior transparency is an even murkier area. A safeguard such as an EliminatesIllegalContent flag has to behave as expected: if it fails to remove enough prohibited content, users lose faith in the model's reliability, and platform engagement suffers with it. As OpenAI CEO Sam Altman has said, “Trust in AI comes from understanding its limitations,” meaning it is essential for platforms not to hide the fact that they can have vulnerabilities. The downside is that this transparency also shows bad actors where a platform may be vulnerable, which makes it a balancing act.
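One way platforms keep such a safeguard honest is regression testing. The sketch below assumes a hypothetical moderate() pipeline and reuses the flag name only as a placeholder; the point is that removal behavior gets asserted rather than assumed:

```python
import unittest

def moderate(text: str, eliminates_illegal_content: bool = True) -> str:
    # Toy stand-in for a real moderation pipeline: redact a blocked
    # phrase when the flag is on.
    blocked = "prohibited phrase"
    if eliminates_illegal_content and blocked in text:
        return text.replace(blocked, "[removed]")
    return text

class TestEliminatesIllegalContent(unittest.TestCase):
    def test_flag_on_removes_content(self):
        self.assertNotIn("prohibited phrase",
                         moderate("a prohibited phrase here"))

    def test_flag_off_leaves_text_unchanged(self):
        self.assertIn("prohibited phrase",
                      moderate("a prohibited phrase here",
                               eliminates_illegal_content=False))

if __name__ == "__main__":
    unittest.main()
```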
Handling the semantics of language is a persistent weak point for ai chat systems: unless a model can manage culturally specific and contextualized language, it is easy to manipulate, because filters trained mostly on mainstream American English assume everyone speaks that way. In languages such as Arabic or Hindi, where slang and regional dialects are widespread, AI systems become far less reliable, with detection rates for non-English content roughly 15-20% lower. Building these filters across multiple languages and cultural contexts requires extensive data collection that is incredibly expensive to produce, yet critical for ensuring resilience against global manipulation risks.
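That detection gap can be quantified by running the same filter over labeled samples per language and comparing recall; the snippet below is a sketch with made-up samples and a stand-in filter, not measured data:

```python
# Made-up labeled samples; real evaluations use human-labeled corpora
# per language. True = content the filter should catch.
SAMPLES = {
    "en": [("explicit message", True), ("hello there", False), ("explicit chat", True)],
    "hi": [("regional slang A", True), ("namaste", False), ("regional slang B", True)],
}

def filter_flags(text: str) -> bool:
    # Stand-in for a filter trained mostly on English surface forms.
    return "explicit" in text.lower()

for lang, samples in SAMPLES.items():
    positives = [text for text, is_nsfw in samples if is_nsfw]
    caught = sum(filter_flags(t) for t in positives)
    print(f"{lang}: recall = {caught}/{len(positives)}")
# en: recall = 2/2, hi: recall = 0/2 -- the per-language gap described above
```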
Read the nsfw ai chat blog for a deeper examination of these manipulation risks and dynamics.