Fix: Anonymize domains and email addresses in diagnostics #2412

Merged
limetech merged 5 commits into unraid:master from Squidly271:fix/diagnostics on Oct 15, 2025

@Squidly271 commented Oct 2, 2025

Contributor

Summary by CodeRabbit

  • New Features

    • Diagnostics now anonymize domain-like patterns across exported server configs, aggregated URLs, logs, and related outputs.
    • Email anonymization added for exported logs and GraphQL-related outputs to protect user privacy.
    • Anonymization is applied earlier and consistently in the export flow, with per-output hooks and a global/opt-in privacy flag.
  • Bug Fixes

    • Export pipeline now reliably masks raw email and domain data across all diagnostic files.

coderabbitai Bot commented Oct 2, 2025

Contributor

Walkthrough

Adds three anonymization utilities—anonymize_domain_file($file), anonymize_domain(&$text), and anonymize_email($file)—and integrates them throughout the diagnostics script to mask domains and emails in generated outputs when not running in full ($all) mode.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Diagnostics script (new helpers & integrations): emhttp/plugins/dynamix/scripts/diagnostics | Added functions anonymize_domain_file($file), anonymize_domain(&$text), and anonymize_email($file). Incorporated anonymization into URL generation (geturls()), ident.cfg, system/servers.conf.txt, log-derived text files (e.g., $log.txt), GraphQL outputs, and other diagnostic outputs, conditional on $all. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Diag as diagnostics script
  participant geturls as geturls()
  participant ADtext as anonymize_domain()
  participant ADfile as anonymize_domain_file()
  participant AEmail as anonymize_email()
  participant FS as Filesystem

  User->>Diag: run diagnostics (with/without $all)
  Diag->>geturls: collect URLs
  alt not $all
    geturls->>ADtext: anonymize_domain($urls)
    ADtext->>Diag: masked URLs
  else $all
    geturls->>Diag: raw URLs
  end
  Diag->>FS: write urls.txt

  Diag->>ADfile: anonymize_domain_file("/$diag/config/ident.cfg") (if not $all)
  ADfile->>FS: write ident.cfg

  Diag->>ADfile: anonymize_domain_file("/$diag/system/servers.conf.txt") (if not $all)
  ADfile->>FS: write servers.conf.txt

  Diag->>AEmail: anonymize_email($graphql) (if not $all)
  AEmail->>FS: write graphql-api.txt

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I hop through folders, whiskers keen,
I tuck domain tails out of the scene.
Emails muffled, secrets kept tight,
Diagnostics sleep sound through the night.
A carrot wink — privacy done right. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately and succinctly describes the main change of adding domain and email anonymization in diagnostics outputs. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |

Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions Bot commented Oct 2, 2025

🔧 PR Test Plugin Available

A test plugin has been generated for this PR that includes the modified files.

Version: 2025.10.03.0439
Build: View Workflow Run

📥 Installation Instructions:

Install via Unraid Web UI:

  1. Go to Plugins → Install Plugin
  2. Copy and paste this URL: https://preview.dl.unraid.net/pr-plugins/pr-2412/webgui-pr-2412.plg
  3. Click Install

Alternative: Direct Download

⚠️ Important Notes:

  • Testing only: This plugin is for testing PR changes
  • Backup included: Original files are automatically backed up
  • Easy removal: Files are restored when plugin is removed
  • Conflicts: Remove this plugin before installing production updates

📝 Modified Files:

emhttp/plugins/dynamix/scripts/diagnostics

🔄 To Remove:

Navigate to Plugins → Installed Plugins and remove webgui-pr-2412, or run:

plugin remove webgui-pr-2412

🤖 This comment is automatically generated and will be updated with each new push to this PR.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (1)
emhttp/plugins/dynamix/scripts/diagnostics (1)

363-363: Email regex lacks trailing word boundary.

The regex uses \b at the start but not at the end. Because the final character class is greedy, punctuation that immediately follows an address is absorbed into the match (e.g., the ; in user@example.com; is replaced along with the address).

Consider adding \b at the end of the pattern:

-    run("sed -ri 's/\b[^[:space:]\/<>\"'\'']+@[^[:space:]\/<>\"'\'']+\.[^[:space:]\/<>\"'\'']+/email@removed.com/g' ".escapeshellarg($log));
+    run("sed -ri 's/\b[^[:space:]\/<>\"'\'']+@[^[:space:]\/<>\"'\'']+\.[^[:space:]\/<>\"'\'']+(,|;|\\.|\\))?\\b/email@removed.com/g' ".escapeshellarg($log));

Actually, that's getting complex. A simpler approach:

-    run("sed -ri 's/\b[^[:space:]\/<>\"'\'']+@[^[:space:]\/<>\"'\'']+\.[^[:space:]\/<>\"'\'']+/email@removed.com/g' ".escapeshellarg($log));
+    run("sed -ri 's/\b[^[:space:]\/<>\"'\'']+@[^[:space:]\/<>\"'\'']+\.[^[:space:]\/<>\"'\'']+(\\b|[[:punct:]])/email@removed.com\\1/g' ".escapeshellarg($log));

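Wait, the character class [^[:space:]\/<>\"'\'']+ already stops at whitespace, slashes, angle brackets, and quotes, so the lack of \b at the end may not be critical: trailing punctuation such as ; or ) is consumed by the replacement, but the address itself is still masked. The current pattern should work for most cases.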

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 95bc896 and 76b44ce.

📒 Files selected for processing (1)
  • emhttp/plugins/dynamix/scripts/diagnostics (3 hunks)
🔇 Additional comments (2)
emhttp/plugins/dynamix/scripts/diagnostics (2)

3-3: LGTM!

Copyright year update is appropriate for 2025.


740-744: LGTM!

The change correctly applies email anonymization to the GraphQL log as stated in the PR objective. The function call properly passes the $graphql file path for processing.

Note: This inherits the output filename issue from anonymize_email (see previous comment).

Comment thread emhttp/plugins/dynamix/scripts/diagnostics
@Squidly271 changed the title from "Fix: Anonymize email addresses in graphql-api.log" to "Fix: Anonymize domains and email addresses in diagnostics" Oct 2, 2025
@Squidly271 requested a review from ljm42 October 2, 2025 16:53

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (2)
emhttp/plugins/dynamix/scripts/diagnostics (2)

360-369: Simplify email regex and align with existing usage.

Reuse the simpler pattern already used in anonymize_syslog to reduce quoting pitfalls.

-  if (!$all) {
-    run("sed -ri 's/\\b[^[:space:]\\/<>\\"'\\'']+@[^[:space:]\\/<>\\"'\\'']+\\.[^[:space:]\\/<>\\"'\\'']+/email@removed.com/g' ".escapeshellarg($log));
-  }
+  if (!$all) {
+    run("sed -ri 's|\\b\\S+@\\S+\\.\\S+\\b|email@removed.com|g' ".escapeshellarg($log));
+  }
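
To compare how the two expressions in the diff above treat trailing punctuation, here is a rough sketch using preg_replace(). It assumes a PCRE translation of each sed expression; sed -r and PCRE are different engines, so this approximates the behavior rather than reproducing it. The constant names and the mask_email() helper are hypothetical and not part of the PR.

```php
<?php
// Assumption: PCRE translations of the two sed expressions from the diff.
// The shell quoting ('\'' and \") is dropped because no shell is involved here.
const CURRENT_PATTERN = '/\b[^[:space:]\/<>"\']+@[^[:space:]\/<>"\']+\.[^[:space:]\/<>"\']+/';
const SIMPLER_PATTERN = '/\b\S+@\S+\.\S+\b/';

// Hypothetical helper: masks every address in $line using the given pattern.
function mask_email(string $pattern, string $line): string {
    return preg_replace($pattern, 'email@removed.com', $line);
}

$line = 'login by user@example.com; ok';

// The greedy final class swallows the trailing ";".
echo mask_email(CURRENT_PATTERN, $line), "\n"; // login by email@removed.com ok

// The trailing \b makes the match stop after "com", so ";" survives.
echo mask_email(SIMPLER_PATTERN, $line), "\n"; // login by email@removed.com; ok
```

Both variants mask the address itself; they differ only in whether punctuation directly after it is kept.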

93-93: Only domain‑anonymize when handling text payloads.

This runs for select==2 (paths) unnecessarily. Gate it by select to avoid confusing behavior.

-  anonymize_domain($text);
+  if ($select === 1) anonymize_domain($text);
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9622db5 and a530ff1.

📒 Files selected for processing (1)
  • emhttp/plugins/dynamix/scripts/diagnostics (6 hunks)
🔇 Additional comments (6)
emhttp/plugins/dynamix/scripts/diagnostics (6)

357-358: Syslog anonymization will corrupt words until the TLD fix lands.

anonymize_domain_file("$log.txt") will mangle common words (e.g., “balance”) when TLD is “lan”. Safe once the boundary-aware fix in anonymize_domain() is applied (see Lines 383-396). Otherwise, skip domain anonymization for syslog.

Consider checking a generated syslog for unintended “removed_TLD” inside normal words (e.g., “btrfs balance”).
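
A minimal sketch of such a check, assuming the TLD is "lan" and that masking is a plain substring replacement. The helper find_mangled_lines() is hypothetical and not part of the diagnostics script; it flags lines where the placeholder is glued to a letter or digit.

```php
<?php
// Hypothetical helper (not in the PR): returns the lines in which the
// "removed_TLD" placeholder directly touches a letter or digit, which
// indicates that an ordinary word was partially replaced.
function find_mangled_lines(string $text): array {
    $hits = [];
    foreach (preg_split('/\r?\n/', $text) as $line) {
        if (preg_match('/[A-Za-z0-9]removed_TLD|removed_TLD[A-Za-z0-9]/', $line)) {
            $hits[] = $line;
        }
    }
    return $hits;
}

// Reproduce the problem: substring replacement with an assumed TLD of "lan".
$sample = "btrfs balance started\nconnected to tower.lan";
$masked = str_ireplace('lan', 'removed_TLD', $sample);

// Only the "balance" line is reported; "tower.removed_TLD" is a clean match.
print_r(find_mangled_lines($masked));
```

Running the same check against a real diagnostics archive would mean reading each generated .txt file and passing its contents to the helper.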


324-324: LGTM: Masking TLDs in URLs block.

Calling anonymize_domain($urls) here is correct and keeps urls.txt consistent with anonymized mode.


544-546: LGTM: ident.cfg TLD masking.

Processing the copied ident.cfg via anonymize_domain_file keeps diagnostics sanitized without touching the live file.


776-776: LGTM: GraphQL log email anonymization hook.

Switching to anonymize_email($graphql) is consistent with other log handlers and respects $all.


812-812: LGTM: servers.conf TLD masking.

Masking TLDs before IP redaction and other sed rules is fine; newline() afterwards preserves CRLF format.


371-381: LGTM with a minor note.

Reads, transforms, and writes back safely. Once anonymize_domain() is boundary-safe (see Lines 383-396), this will avoid unintended word corruption across files.

Confirm file sizes are reasonable before loading into memory if very large files are later added to this path.

Comment thread emhttp/plugins/dynamix/scripts/diagnostics
@Squidly271 Squidly271 marked this pull request as draft October 2, 2025 19:42
@Squidly271 Squidly271 marked this pull request as ready for review October 2, 2025 23:52

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 0

♻️ Duplicate comments (1)
emhttp/plugins/dynamix/scripts/diagnostics (1)

383-396: Overbroad substring replacement and duplicate null-coalesce remain unaddressed.

The previous review identified critical issues that are only partially fixed:

  1. Substring corruption risk: Line 393's str_ireplace($domain, ...) still performs substring replacement. While line 392 now skips single-label TLDs (e.g., "lan"), any multi-label TLD like "my.lan" will still cause corruption in words containing those substrings.

  2. Duplicate null-coalesce (line 390): ($ident['LOCAL_TLD'] ?? ($ident['LOCAL_TLD'] ?? "local")) has redundant nesting.

Apply the previous review's regex-based solution to match TLDs only at boundaries:

 function anonymize_domain(&$text) {
   global $all;
-  static $domain = "";
+  static $domain;
+  if ($all) return;
 
-  if (!$all) {
-    if ($domain == "") {
-      $ident = @parse_ini_file('/boot/config/ident.cfg');
-      $domain = strtolower(is_array($ident) ? ($ident['LOCAL_TLD'] ?? ($ident['LOCAL_TLD'] ?? "local")) : "local");
-    }
-    if (strpos($domain,".") !== false) {
-      $text = str_ireplace($domain,"removed_TLD",$text);
-    }
+  if (!isset($domain)) {
+    $ident  = @parse_ini_file('/boot/config/ident.cfg');
+    $domain = strtolower(is_array($ident) ? ($ident['LOCAL_TLD'] ?? 'local') : 'local');
   }
+  if ($domain === 'local') return;
+
+  $q = preg_quote($domain, '/');
+  // Replace ".tld" or "@tld"
+  $text = preg_replace("/(?<=\\.|@){$q}\\b/i", "removed_TLD", $text);
+  // Replace standalone token (e.g., 'Local TLD: lan'), avoiding substrings in larger words
+  $text = preg_replace("/(^|[^A-Za-z0-9_-]){$q}($|[^A-Za-z0-9_-])/i", "\\1removed_TLD\\2", $text);
 }
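
The two preg_replace() calls suggested above can be tried in isolation. The following sketch is an assumption-laden variant: mask_tld() is a hypothetical name, and the TLD is passed as a parameter instead of being read from /boot/config/ident.cfg, so it runs without an Unraid system.

```php
<?php
// Hypothetical stand-alone variant of the suggested logic (not the merged code).
// $tld is supplied by the caller rather than parsed from ident.cfg.
function mask_tld(string $text, string $tld): string {
    $q = preg_quote($tld, '/');
    // Replace the TLD when it directly follows "." or "@" and ends at a word boundary.
    $text = preg_replace("/(?<=\\.|@){$q}\\b/i", 'removed_TLD', $text);
    // Replace the TLD as a standalone token, keeping the surrounding delimiters.
    return preg_replace(
        "/(^|[^A-Za-z0-9_-]){$q}(\$|[^A-Za-z0-9_-])/i",
        '${1}removed_TLD${2}',
        $text
    );
}

// Ordinary words are left alone, hostnames and quoted values are masked.
echo mask_tld('btrfs balance on tower.lan', 'lan'), "\n";   // btrfs balance on tower.removed_TLD
echo mask_tld('LOCAL_TLD="lan"', 'lan'), "\n";              // LOCAL_TLD="removed_TLD"
// A multi-label TLD no longer matches inside an unrelated token.
echo mask_tld('tommy.lancer host.my.lan', 'my.lan'), "\n";  // tommy.lancer host.removed_TLD
```

With str_ireplace(), the last input would instead become "tomremoved_TLDcer host.removed_TLD", which is the corruption the review describes for multi-label TLDs.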
🧹 Nitpick comments (2)
emhttp/plugins/dynamix/scripts/diagnostics (2)

324-326: LGTM!

The domain anonymization is correctly applied after building the complete URLs output and before returning.

The empty line 325 can be removed for cleaner code:

   $urls .= "\n\n".implode("\n", $dirlisting)."\n";
   anonymize_domain($urls);
-  
   return str_replace("\n", "\r\n", $urls);

371-381: LGTM!

The function provides a clean wrapper for file-based domain anonymization with appropriate error handling via the false check.

For defensive coding, consider adding an explicit file_exists() check:

 function anonymize_domain_file($file) {
   global $all;
   
-  if ( !$all ) {
+  if (!$all && file_exists($file)) {
     $text = @file_get_contents($file);
-    if ( $text !== false) {
+    if ($text !== false) {
       anonymize_domain($text);
       file_put_contents($file, $text);
     }
   }
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a530ff1 and 478ba60.

📒 Files selected for processing (1)
  • emhttp/plugins/dynamix/scripts/diagnostics (6 hunks)
🔇 Additional comments (6)
emhttp/plugins/dynamix/scripts/diagnostics (6)

93-93: LGTM!

The placement of anonymize_domain($text) at the beginning of the anonymize() function ensures consistent domain anonymization across all code paths.


357-358: LGTM!

The domain anonymization is correctly applied as the final step after all other syslog processing, ensuring domains are masked in the output files.


360-369: LGTM!

The function correctly generates output filenames with .txt extension using pathinfo($file, PATHINFO_FILENAME), addressing the previous review concern about inconsistent naming.

The email regex pattern excludes common delimiters (<>"'/ and whitespace) to avoid false matches in structured contexts.


545-545: LGTM!

Anonymizing ident.cfg is appropriate as this file contains the LOCAL_TLD configuration that should be masked in diagnostics.


776-776: LGTM!

Applying email anonymization to the GraphQL API log is appropriate as this log likely contains user email addresses from API operations.


812-812: LGTM!

Anonymizing servers.conf.txt is appropriate as nginx server configuration typically contains domain names that should be masked in diagnostics.

  if (!$all) {
    if ($domain == "") {
      $ident = @parse_ini_file('/boot/config/ident.cfg');
      $domain = strtolower(is_array($ident) ? ($ident['LOCAL_TLD'] ?? ($ident['LOCAL_TLD'] ?? "local")) : "local");

Member

Let's simplify this to:
$domain = strtolower(is_array($ident) ? ($ident['LOCAL_TLD'] ?? 'local') : 'local');

@@ -90,6 +90,7 @@ function shareDisks($share) {
function anonymize($text, $select) {

Member

I struggled to understand the $select param. How about adding this DocBlock?

/**
 * Anonymizes sensitive data in text based on the specified anonymization type
 * 
 * @param string $text The text content to anonymize
 * @param int $select Anonymization type:
 *                   1 = general text
 *                   2 = file paths (specifically share cfg files)
 * @return string The anonymized text
 */

@ljm42 ljm42 added the 7.2 label Oct 13, 2025
@limetech limetech merged commit a9a769d into unraid:master Oct 15, 2025
2 checks passed
@Squidly271 Squidly271 deleted the fix/diagnostics branch November 5, 2025 01:36
3 participants