Fix: Anonymize domains and email addresses in diagnostics #2412
Conversation
WalkthroughAdds three anonymization utilities— Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User
participant Diag as diagnostics script
participant geturls as geturls()
participant ADtext as anonymize_domain()
participant ADfile as anonymize_domain_file()
participant AEmail as anonymize_email()
participant FS as Filesystem
User->>Diag: run diagnostics (with/without $all)
Diag->>geturls: collect URLs
alt not $all
geturls->>ADtext: anonymize_domain($urls)
ADtext->>Diag: masked URLs
else $all
geturls->>Diag: raw URLs
end
Diag->>FS: write urls.txt
Diag->>ADfile: anonymize_domain_file("/$diag/config/ident.cfg") (if not $all)
ADfile->>FS: write ident.cfg
Diag->>ADfile: anonymize_domain_file("/$diag/system/servers.conf.txt") (if not $all)
ADfile->>FS: write servers.conf.txt
Diag->>AEmail: anonymize_email($graphql) (if not $all)
AEmail->>FS: write graphql-api.txt
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches🧪 Generate unit tests
Comment |
🔧 PR Test Plugin AvailableA test plugin has been generated for this PR that includes the modified files. Version: 📥 Installation Instructions:Install via Unraid Web UI:
Alternative: Direct Download
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
emhttp/plugins/dynamix/scripts/diagnostics (1)
363-363: Email regex lacks trailing word boundary.The regex uses
\bat the start but not at the end. This could cause issues with emails followed immediately by certain punctuation (e.g.,user@example.com;might not match correctly).Consider adding
\bat the end of the pattern:- run("sed -ri 's/\b[^[:space:]\/<>\"'\'']+@[^[:space:]\/<>\"'\'']+\.[^[:space:]\/<>\"'\'']+/email@removed.com/g' ".escapeshellarg($log)); + run("sed -ri 's/\b[^[:space:]\/<>\"'\'']+@[^[:space:]\/<>\"'\'']+\.[^[:space:]\/<>\"'\'']+(,|;|\\.|\\))?\\b/email@removed.com/g' ".escapeshellarg($log));Actually, that's getting complex. A simpler approach:
- run("sed -ri 's/\b[^[:space:]\/<>\"'\'']+@[^[:space:]\/<>\"'\'']+\.[^[:space:]\/<>\"'\'']+/email@removed.com/g' ".escapeshellarg($log)); + run("sed -ri 's/\b[^[:space:]\/<>\"'\'']+@[^[:space:]\/<>\"'\'']+\.[^[:space:]\/<>\"'\'']+(\\b|[[:punct:]])/email@removed.com\\1/g' ".escapeshellarg($log));Wait, the character class
[^[:space:]\/<>\"'\'']+already excludes common punctuation boundaries, so the lack of\bat the end may not be critical. The current pattern should work for most cases.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
emhttp/plugins/dynamix/scripts/diagnostics(3 hunks)
🔇 Additional comments (2)
emhttp/plugins/dynamix/scripts/diagnostics (2)
3-3: LGTM!Copyright year update is appropriate for 2025.
740-744: LGTM!The change correctly applies email anonymization to the GraphQL log as stated in the PR objective. The function call properly passes the
$graphqlfile path for processing.Note: This inherits the output filename issue from
anonymize_email(see previous comment).
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
emhttp/plugins/dynamix/scripts/diagnostics (2)
360-369: Simplify email regex and align with existing usage.Reuse the simpler pattern already used in anonymize_syslog to reduce quoting pitfalls.
- if (!$all) { - run("sed -ri 's/\\b[^[:space:]\\/<>\\"'\\'']+@[^[:space:]\\/<>\\"'\\'']+\\.[^[:space:]\\/<>\\"'\\'']+/email@removed.com/g' ".escapeshellarg($log)); - } + if (!$all) { + run("sed -ri 's|\\b\\S+@\\S+\\.\\S+\\b|email@removed.com|g' ".escapeshellarg($log)); + }
93-93: Only domain‑anonymize when handling text payloads.This runs for select==2 (paths) unnecessarily. Gate it by select to avoid confusing behavior.
- anonymize_domain($text); + if ($select === 1) anonymize_domain($text);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
emhttp/plugins/dynamix/scripts/diagnostics(6 hunks)
🔇 Additional comments (6)
emhttp/plugins/dynamix/scripts/diagnostics (6)
357-358: Syslog anonymization will corrupt words until the TLD fix lands.anonymize_domain_file("$log.txt") will mangle common words (e.g., “balance”) when TLD is “lan”. Safe once the boundary-aware fix in anonymize_domain() is applied (see Lines 383-396). Otherwise, skip domain anonymization for syslog.
Consider checking a generated syslog for unintended “removed_TLD” inside normal words (e.g., “btrfs balance”).
324-324: LGTM: Masking TLDs in URLs block.Calling anonymize_domain($urls) here is correct and keeps urls.txt consistent with anonymized mode.
544-546: LGTM: ident.cfg TLD masking.Processing the copied ident.cfg via anonymize_domain_file keeps diagnostics sanitized without touching the live file.
776-776: LGTM: GraphQL log email anonymization hook.Switching to anonymize_email($graphql) is consistent with other log handlers and respects $all.
812-812: LGTM: servers.conf TLD masking.Masking TLDs before IP redaction and other sed rules is fine; newline() afterwards preserves CRLF format.
371-381: LGTM with a minor note.Reads, transforms, and writes back safely. Once anonymize_domain() is boundary-safe (see Lines 383-396), this will avoid unintended word corruption across files.
Confirm file sizes are reasonable before loading into memory if very large files are later added to this path.
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
emhttp/plugins/dynamix/scripts/diagnostics (1)
383-396: Overbroad substring replacement and duplicate null-coalesce remain unaddressed.The previous review identified critical issues that are only partially fixed:
Substring corruption risk: Line 393's
str_ireplace($domain, ...)still performs substring replacement. While line 392 now skips single-label TLDs (e.g., "lan"), any multi-label TLD like "my.lan" will still cause corruption in words containing those substrings.Duplicate null-coalesce (line 390):
($ident['LOCAL_TLD'] ?? ($ident['LOCAL_TLD'] ?? "local"))has redundant nesting.Apply the previous review's regex-based solution to match TLDs only at boundaries:
function anonymize_domain(&$text) { global $all; - static $domain = ""; + static $domain; + if ($all) return; - if (!$all) { - if ($domain == "") { - $ident = @parse_ini_file('/boot/config/ident.cfg'); - $domain = strtolower(is_array($ident) ? ($ident['LOCAL_TLD'] ?? ($ident['LOCAL_TLD'] ?? "local")) : "local"); - } - if (strpos($domain,".") !== false) { - $text = str_ireplace($domain,"removed_TLD",$text); - } + if (!isset($domain)) { + $ident = @parse_ini_file('/boot/config/ident.cfg'); + $domain = strtolower(is_array($ident) ? ($ident['LOCAL_TLD'] ?? 'local') : 'local'); } + if ($domain === 'local') return; + + $q = preg_quote($domain, '/'); + // Replace ".tld" or "@tld" + $text = preg_replace("/(?<=\\.|@){$q}\\b/i", "removed_TLD", $text); + // Replace standalone token (e.g., 'Local TLD: lan'), avoiding substrings in larger words + $text = preg_replace("/(^|[^A-Za-z0-9_-]){$q}($|[^A-Za-z0-9_-])/i", "\\1removed_TLD\\2", $text); }
🧹 Nitpick comments (2)
emhttp/plugins/dynamix/scripts/diagnostics (2)
324-326: LGTM!The domain anonymization is correctly applied after building the complete URLs output and before returning.
The empty line 325 can be removed for cleaner code:
$urls .= "\n\n".implode("\n", $dirlisting)."\n"; anonymize_domain($urls); - return str_replace("\n", "\r\n", $urls);
371-381: LGTM!The function provides a clean wrapper for file-based domain anonymization with appropriate error handling via the
falsecheck.For defensive coding, consider adding an explicit
file_exists()check:function anonymize_domain_file($file) { global $all; - if ( !$all ) { + if (!$all && file_exists($file)) { $text = @file_get_contents($file); - if ( $text !== false) { + if ($text !== false) { anonymize_domain($text); file_put_contents($file, $text); } } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
emhttp/plugins/dynamix/scripts/diagnostics(6 hunks)
🔇 Additional comments (6)
emhttp/plugins/dynamix/scripts/diagnostics (6)
93-93: LGTM!The placement of
anonymize_domain($text)at the beginning of theanonymize()function ensures consistent domain anonymization across all code paths.
357-358: LGTM!The domain anonymization is correctly applied as the final step after all other syslog processing, ensuring domains are masked in the output files.
360-369: LGTM!The function correctly generates output filenames with
.txtextension usingpathinfo($file, PATHINFO_FILENAME), addressing the previous review concern about inconsistent naming.The email regex pattern excludes common delimiters (
<>"'/and whitespace) to avoid false matches in structured contexts.
545-545: LGTM!Anonymizing
ident.cfgis appropriate as this file contains the LOCAL_TLD configuration that should be masked in diagnostics.
776-776: LGTM!Applying email anonymization to the GraphQL API log is appropriate as this log likely contains user email addresses from API operations.
812-812: LGTM!Anonymizing
servers.conf.txtis appropriate as nginx server configuration typically contains domain names that should be masked in diagnostics.
| if (!$all) { | ||
| if ($domain == "") { | ||
| $ident = @parse_ini_file('/boot/config/ident.cfg'); | ||
| $domain = strtolower(is_array($ident) ? ($ident['LOCAL_TLD'] ?? ($ident['LOCAL_TLD'] ?? "local")) : "local"); |
There was a problem hiding this comment.
Let's simplify this to:
$domain = strtolower(is_array($ident) ? ($ident['LOCAL_TLD'] ?? 'local') : 'local');
| @@ -90,6 +90,7 @@ function shareDisks($share) { | |||
| function anonymize($text, $select) { | |||
There was a problem hiding this comment.
I struggled to understand the $select param, how about adding this DocBlock
/**
* Anonymizes sensitive data in text based on the specified anonymization type
*
* @param string $text The text content to anonymize
* @param int $select Anonymization type:
* 1 = general text
* 2 = file paths (specifically share cfg files)
* @return string The anonymized text
*/
Summary by CodeRabbit
New Features
Bug Fixes