
question/feature: Multi-language recommended datasets #99

Open
kkx64 opened this issue Mar 16, 2025 · 2 comments


kkx64 commented Mar 16, 2025

Summary

I'm just wondering whether there are plans to add multi-language recommended datasets other than the English one?

If not, I would love a short guide on how the base English dataset was built/collected so we can try building datasets for other languages and, ideally, contribute them to the package.


Geczy commented Mar 30, 2025

I'm also looking for this.

jo3-l (Owner) commented Apr 2, 2025

Thanks for the question! Unfortunately, I am not planning on adding built-in multi-language support for the reasons indicated in #57 (comment). To summarize, unless someone volunteers to step up and actively maintain the dataset, it will be rather difficult for me to resolve any issues in non-English datasets myself, and so I would prefer to keep them out of the main library.

The other caveat mentioned in that earlier comment is also applicable:

The library was designed with English in mind, and I am not sure how nicely some of its foundations generalize to other languages. In particular, I am skeptical as to whether the current system (character-based transformations, plus a carefully curated set of patterns) for detecting variants of terms will remain effective. For your request in particular, this is less of an issue because English and French are somewhat closely related.
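To make the "character-based transformations" idea concrete, here is a minimal sketch of how such a system can work: look-alike characters are normalized to plain letters before patterns are matched. The mapping table, function names, and substring-based matching below are illustrative assumptions, not obscenity's actual API.

```javascript
// Illustrative leet-speak normalization: map common character substitutions
// back to plain ASCII letters, then match patterns against the normalized text.
// LEET_MAP, normalize, and matchesVariant are hypothetical names for this sketch.
const LEET_MAP = { "4": "a", "3": "e", "1": "i", "0": "o", "5": "s", "7": "t", "$": "s", "@": "a" };

function normalize(text) {
  // Lowercase the input and replace each mapped character with its letter.
  return [...text.toLowerCase()].map((ch) => LEET_MAP[ch] ?? ch).join("");
}

function matchesVariant(pattern, text) {
  // A pattern matches if it appears as a substring of the normalized text.
  return normalize(text).includes(normalize(pattern));
}

console.log(normalize("h3ll0")); // "hello"
console.log(matchesVariant("hello", "H3LL0 there")); // true
```

This kind of mapping is inherently tied to the Latin alphabet and English substitution habits, which is one reason the approach may not transfer cleanly to other languages.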

That said, with respect to the question on how the base English dataset was built, I started from existing collections of profanity (notably https://github.com/words/cuss, which is cited in the main dataset), and then manually developed patterns for common variations. To try to minimize false positives and Scunthorpe-esque issues, I checked added patterns against a large collection of English words to see if there were accidental matches (https://github.com/jo3-l/obscenity/tree/main/scripts).
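The false-positive check described above can be sketched roughly as follows: run each candidate pattern against a large list of ordinary English words and report any accidental substring matches. The word list, pattern format, and function name here are illustrative; the actual checks live in the linked scripts directory.

```javascript
// Hypothetical sketch of a Scunthorpe-style false-positive check: find
// dictionary words that contain a candidate pattern as a substring.
function accidentalMatches(pattern, words) {
  // Flag any word that contains the pattern but is not the pattern itself.
  return words.filter((word) => word.includes(pattern) && word !== pattern);
}

// A tiny stand-in for a large English word list.
const dictionary = ["assassin", "class", "pass", "grape", "therapist"];

console.log(accidentalMatches("ass", dictionary));
// ["assassin", "class", "pass"] — each would need to be whitelisted or the
// pattern tightened (e.g. with word boundaries) before shipping.
```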
