
question/feature: Multi-language recommended datasets #99

Open
kkx64 opened this issue Mar 16, 2025 · 2 comments


kkx64 commented Mar 16, 2025

Summary

I'm just wondering whether there are plans to add multi-language recommended datasets other than the English one?

If not, I would love a short guide on how the base English dataset was built/collected so we can try building datasets for other languages and, ideally, contribute them to the package.


Geczy commented Mar 30, 2025

I'm also looking for this.

jo3-l (Owner) commented Apr 2, 2025

Thanks for the question! Unfortunately, I am not planning on adding built-in multi-language support for the reasons indicated in #57 (comment). To summarize, unless someone volunteers to step up and actively maintain the dataset, it will be rather difficult for me to resolve any issues in non-English datasets myself, and so I would prefer to keep them out of the main library.

The other caveat mentioned in that earlier comment is also applicable:

The library was designed with English in mind, and I am not sure how nicely some of its foundations generalize to other languages. In particular, I am skeptical as to whether the current system (character-based transformations, plus a carefully curated set of patterns) for detecting variants of terms will remain effective. For your request in particular, this is less of an issue because English and French are somewhat closely related.
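To make the "character-based transformations" idea concrete, here is a minimal sketch of how such a system can work: look-alike characters are normalized to plain letters before patterns are matched. The mapping table, function names, and substring-based matching below are illustrative assumptions, not obscenity's actual API.

```javascript
// Illustrative leet-speak normalization: map common character substitutions
// back to plain ASCII letters, then match patterns against the normalized text.
// LEET_MAP, normalize, and matchesVariant are hypothetical names for this sketch.
const LEET_MAP = { "4": "a", "3": "e", "1": "i", "0": "o", "5": "s", "7": "t", "$": "s", "@": "a" };

function normalize(text) {
  // Lowercase the input and replace each mapped character with its letter.
  return [...text.toLowerCase()].map((ch) => LEET_MAP[ch] ?? ch).join("");
}

function matchesVariant(pattern, text) {
  // A pattern matches if it appears as a substring of the normalized text.
  return normalize(text).includes(normalize(pattern));
}

console.log(normalize("h3ll0")); // "hello"
console.log(matchesVariant("hello", "H3LL0 there")); // true
```

This kind of mapping is inherently tied to the Latin alphabet and English substitution habits, which is one reason the approach may not transfer cleanly to other languages.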

That said, with respect to the question on how the base English dataset was built, I started from existing collections of profanity (notably https://github.com/words/cuss, which is cited in the main dataset), and then manually developed patterns for common variations. To try to minimize false positives and Scunthorpe-esque issues, I checked added patterns against a large collection of English words to see if there were accidental matches (https://github.com/jo3-l/obscenity/tree/main/scripts).
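The false-positive check described above can be sketched roughly as follows: run each candidate pattern against a large list of ordinary English words and report any accidental substring matches. The word list, pattern format, and function name here are illustrative; the actual checks live in the linked scripts directory.

```javascript
// Hypothetical sketch of a Scunthorpe-style false-positive check: find
// dictionary words that contain a candidate pattern as a substring.
function accidentalMatches(pattern, words) {
  // Flag any word that contains the pattern but is not the pattern itself.
  return words.filter((word) => word.includes(pattern) && word !== pattern);
}

// A tiny stand-in for a large English word list.
const dictionary = ["assassin", "class", "pass", "grape", "therapist"];

console.log(accidentalMatches("ass", dictionary));
// ["assassin", "class", "pass"] — each would need to be whitelisted or the
// pattern tightened (e.g. with word boundaries) before shipping.
```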
