Implement actual bad word list download and caching

rameerez · rameerez · commit 6269b613bea7 · 2024-11-03T19:36:23.000Z
diff --git a/README.md b/README.md
@@ -1,21 +1,28 @@
-# 🗑️ `moderate` - Moderate and block bad words from your Rails app
+# 👮‍♂️ `moderate` - Moderate and block bad words from your Rails app
 
-`moderate` is a Ruby gem that moderates user-generated content by adding a simple validation to block bad words in any text field.
+`moderate` is a Ruby gem that moderates user-generated text content by adding a simple validation to block bad words in any text field.
 
 Simply add this to your model:
 
 ```ruby
 validates :text_field, moderate: true
 ```
 
-That's it! You're done.
+That's it! You're done. `moderate` will work seamlessly with your existing validations and error messages.
+
+> [!WARNING]
+> This gem is under development. It currently supports only a limited set of English profanity words. Word matching is still very basic and may produce both false positives and false negatives. I use it for simple things like blocking new submissions that contain bad words, but the gem could be improved to support more complex use cases and more sophisticated matching and content moderation. Please consider contributing if you have good ideas for additional features.
 
 # Why
 
-Any text field where users can input text may be a place where bad words can be used. This gem blocks records from being created if they contain bad words.
+Any text field where users can input text may be a place where bad words can be used. This gem blocks records from being created if they contain bad words, profanity, naughty / obscene words, etc.
 
 It's good for Rails applications where you need to maintain a clean and respectful environment in comments, posts, or any other user input.
 
+# How
+
+`moderate` currently downloads a list of ~1k English profanity words from the [google-profanity-words](https://github.com/coffee-and-fun/google-profanity-words) repository and caches it in your Rails app's tmp directory.
+
 ## Installation
 
 Add this line to your application's Gemfile:
@@ -30,6 +37,32 @@ And then execute:
 bundle install
 ```
 
+Then, just add the `moderate` validation to any model with a text field:
+
+```ruby
+validates :text_field, moderate: true
+```
+
+`moderate` will add a validation error if a bad word is found in the text field, so the record is invalid and is not saved.
+
+It works seamlessly with your existing validations and error messages.
+
+## Configuration
+
+You can configure the `moderate` gem behavior by adding a `config/initializers/moderate.rb` file:
+```ruby
+Moderate.configure do |config|
+  # Custom error message when bad words are found
+  config.error_message = "contains inappropriate language"
+
+  # Add your own words to the blacklist
+  config.additional_words = ["badword1", "badword2"]
+
+  # Exclude words from the default list (false positives)
+  config.excluded_words = ["good"]
+end
+```
+
 ## Development
 
 After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
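
The `Moderate.configure` block documented in the README above follows a common Ruby pattern: a module-level configuration object that is yielded to the caller. The gem's actual configuration class is not part of this diff, so the following is only a minimal sketch of that pattern. The `ModerateSketch` name and all default values are assumptions made for illustration; only the three setting names come from the README.

```ruby
# Hypothetical stand-in for the gem's configuration code (not shown in this diff).
module ModerateSketch
  class Configuration
    # Setting names taken from the README; the defaults below are assumptions.
    attr_accessor :error_message, :additional_words, :excluded_words

    def initialize
      @error_message = "contains inappropriate language"
      @additional_words = []
      @excluded_words = []
    end
  end

  class << self
    # Lazily build a single shared configuration object.
    def configuration
      @configuration ||= Configuration.new
    end

    # Yield the shared configuration so callers can set values in a block.
    def configure
      yield(configuration)
    end
  end
end

ModerateSketch.configure do |config|
  config.additional_words = ["badword1", "badword2"]
  config.excluded_words = ["good"]
end
```

With this shape, an initializer only mutates one shared object, and any code that later reads `ModerateSketch.configuration` sees the same values.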
diff --git a/lib/moderate.rb b/lib/moderate.rb
@@ -3,6 +3,7 @@
 require_relative "moderate/version"
 require_relative "moderate/text"
 require_relative "moderate/text_validator"
+require_relative "moderate/word_list"
 
 module Moderate
   class Error < StandardError; end
diff --git a/lib/moderate/text.rb b/lib/moderate/text.rb
@@ -12,15 +12,26 @@ def bad_words?(text)
 
       private
 
-      DEFAULT_BAD_WORDS = Set.new(["asdf"]).freeze
-
       def compute_word_list
-        (DEFAULT_BAD_WORDS + Moderate.configuration.additional_words -
-         Moderate.configuration.excluded_words).to_set
+        @default_words ||= begin
+          words = WordList.load
+          logger.info("[moderate gem] Loaded #{words.size} words from word list")
+          words
+        end
+
+        result = (@default_words + Moderate.configuration.additional_words -
+                 Moderate.configuration.excluded_words).to_set
+        logger.debug("[moderate gem] Final word list size: #{result.size}")
+        result
       end
 
       def reset_word_list!
         @words_set = nil
+        @default_words = nil
+      end
+
+      def logger
+        @logger ||= (defined?(Rails) && Rails.respond_to?(:logger) && Rails.logger) || Logger.new($stdout)
       end
     end
   end
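
The expression in `compute_word_list` relies on `Set#+` (union) and `Set#-` (difference), both of which accept any enumerable, and it evaluates left to right. A small sketch of that composition follows, using made-up word lists in place of `WordList.load` and the configuration values; the `compose_word_list` helper is hypothetical and exists only for this illustration.

```ruby
require "set"

# Hypothetical helper mirroring the set arithmetic in compute_word_list:
# union with the additional words first, then remove the excluded words.
def compose_word_list(default_words, additional_words, excluded_words)
  default_words.to_set + additional_words - excluded_words
end

# Made-up sample data standing in for the downloaded list and the config.
defaults   = Set.new(%w[asdf good])
additional = %w[badword1 good2]
excluded   = %w[good good2]

final_words = compose_word_list(defaults, additional, excluded)
```

Because the exclusions are subtracted last, a word listed in both `additional_words` and `excluded_words` ends up excluded.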
diff --git a/lib/moderate/word_list.rb b/lib/moderate/word_list.rb
@@ -0,0 +1,67 @@
+# frozen_string_literal: true
+
+require 'net/http'
+require 'uri'
+require 'tmpdir'
+require 'logger'
+
+module Moderate
+  class WordList
+    WORD_LIST_URL = 'https://raw.githubusercontent.com/coffee-and-fun/google-profanity-words/main/data/en.txt'
+
+    class << self
+      def load
+        cache_path = cache_file_path
+
+        begin
+          if File.exist?(cache_path)
+            words = File.read(cache_path, encoding: 'UTF-8').split("\n").map(&:strip).reject(&:empty?).to_set
+            return words unless words.empty?
+          end
+
+          download_and_cache(cache_path)
+        rescue StandardError => e
+          logger.error("[moderate gem] Error loading word list: #{e.message}")
+          logger.debug("[moderate gem] #{e.backtrace.join("\n")}")
+          raise Moderate::Error, "Failed to load bad words list: #{e.message}"
+        end
+      end
+
+      private
+
+      def cache_file_path
+        if defined?(Rails)
+          Rails.root.join('tmp', 'moderate_bad_words.txt')
+        else
+          File.join(Dir.tmpdir, 'moderate_bad_words.txt')
+        end
+      end
+
+      def download_and_cache(cache_path)
+        uri = URI(WORD_LIST_URL)
+        response = Net::HTTP.get_response(uri)
+
+        unless response.is_a?(Net::HTTPSuccess)
+          raise Moderate::Error, "Failed to download word list. HTTP Status: #{response.code}"
+        end
+
+        content = response.body.force_encoding('UTF-8')
+        words = content.split("\n").map(&:strip).reject(&:empty?).to_set
+
+        if words.empty?
+          raise Moderate::Error, "Downloaded word list is empty"
+        end
+
+        logger.info("[moderate gem] Downloaded #{words.size} words from #{WORD_LIST_URL}")
+        File.write(cache_path, content, encoding: 'UTF-8')
+        logger.debug("[moderate gem] Cached word list to: #{cache_path}")
+
+        words
+      end
+
+      def logger
+        @logger ||= (defined?(Rails) && Rails.respond_to?(:logger) && Rails.logger) || Logger.new($stdout)
+      end
+    end
+  end
+end
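
The cache-then-download flow in `WordList.load` can be exercised without network access by injecting the fetch step. The following is a minimal sketch under that assumption: `parse_words` and `load_words` are hypothetical names, the block stands in for the `Net::HTTP` call, and the error class is simplified to `ArgumentError`. It parses cached and freshly fetched content the same way, so both paths yield the same set.

```ruby
require "set"
require "tmpdir"

# Turn raw text into a set of stripped, non-empty lines.
def parse_words(text)
  text.split("\n").map(&:strip).reject(&:empty?).to_set
end

# Return the cached words when the cache file holds any; otherwise call the
# block to fetch fresh content, write it to the cache, and return the words.
def load_words(cache_path)
  if File.exist?(cache_path)
    words = parse_words(File.read(cache_path, encoding: "UTF-8"))
    return words unless words.empty?
  end

  content = yield
  words = parse_words(content)
  raise ArgumentError, "fetched word list is empty" if words.empty?

  File.write(cache_path, content, encoding: "UTF-8")
  words
end

# Usage: the first call fetches and caches, the second is served from the cache.
Dir.mktmpdir do |dir|
  path = File.join(dir, "words.txt")
  load_words(path) { "foo\nbar\n" }
  load_words(path) { raise "fetch should not run when the cache is warm" }
end
```

Passing the fetch step as a block keeps the caching logic testable in isolation; the real method could wrap the HTTP request in that block.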