I'm just wondering whether there are plans to add recommended datasets for languages other than English?
If not, I would love a short guide on how the base English dataset was built/collected so we can try building some for other languages, and ideally contribute them to the package.
Thanks for the question! Unfortunately, I am not planning on adding built-in multi-language support for the reasons indicated in #57 (comment). To summarize, unless someone volunteers to step up and actively maintain the dataset, it will be rather difficult for me to resolve any issues in non-English datasets myself, and so I would prefer to keep them out of the main library.
The other caveat mentioned in that earlier comment is also applicable:
The library was designed with English in mind, and I am not sure how nicely some of its foundations generalize to other languages. In particular, I am skeptical as to whether the current system (character-based transformations, plus a carefully curated set of patterns) for detecting variants of terms will remain effective. For your request in particular, this is less of an issue because English and French are somewhat closely related.
That said, with respect to the question on how the base English dataset was built, I started from existing collections of profanity (notably https://github.com/words/cuss, which is cited in the main dataset), and then manually developed patterns for common variations. To try to minimize false positives and Scunthorpe-esque issues, I checked added patterns against a large collection of English words to see if there were accidental matches (https://github.com/jo3-l/obscenity/tree/main/scripts).
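For anyone building a dataset for another language, the false-positive check described above might look roughly like the sketch below. It is not the code from the linked scripts directory and does not use the library's API: `findAccidentalMatches`, the plain `RegExp` candidates, and the toy word list are all hypothetical, chosen only to illustrate the idea of running each candidate pattern over a dictionary of ordinary words and reporting any word it hits by accident.

```typescript
// Hypothetical helper, not part of the library: reports dictionary words
// that a candidate pattern matches even though they are not intended targets.
interface AccidentalMatch {
  word: string; // the ordinary word that was hit
  matched: string; // the substring the pattern matched inside it
}

function findAccidentalMatches(
  candidate: RegExp,
  dictionary: readonly string[],
  intendedTargets: ReadonlySet<string>,
): AccidentalMatch[] {
  const hits: AccidentalMatch[] = [];
  for (const word of dictionary) {
    const lower = word.toLowerCase();
    // Words we actually want to flag are not false positives.
    if (intendedTargets.has(lower)) continue;
    // Reset state in case the caller passed a regex with the g or y flag.
    candidate.lastIndex = 0;
    const match = candidate.exec(lower);
    if (match !== null) hits.push({ word, matched: match[0] });
  }
  return hits;
}

// Toy data (assumption): a real check would load a large word list
// for the target language from a file.
const dictionary = ["hello", "shell", "seashell", "world", "hell"];
const intended = new Set(["hell"]);

// A naive substring pattern versus one anchored on word boundaries.
const naiveHits = findAccidentalMatches(/hell/, dictionary, intended);
const boundedHits = findAccidentalMatches(/\bhell\b/, dictionary, intended);

console.log(naiveHits.map((h) => h.word)); // [ 'hello', 'shell', 'seashell' ]
console.log(boundedHits.length); // 0
```

One caveat if you adapt this for other languages: JavaScript's `\b` treats only ASCII letters, digits, and underscore as word characters, so boundary-based patterns may behave unexpectedly around accented or non-Latin characters.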