-
-
Couldn't load subscription status.
- Fork 1.5k
check mail encoding and format if charset is not specified #21575
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 11.0/bugfixes
Are you sure you want to change the base?
Conversation
|
Please add a test case |
|
Test added for case handled in fix |
src/MailCollector.php
Outdated
| if ($detected_charset !== false) { | ||
| $charset = $detected_charset; | ||
| } else { | ||
| // Fallback to ISO-8859-1 as it's the most common for mail headers without charset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really not sure :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following our discussion, here are some thoughts on improving charset handling for emails without a specified charset parameter.
Current Issue
Original bug (#21540): Yahoo DMARC emails without charset → MySQL error Incorrect string value because quoted-printable accented chars (e.g., =E8 for è) aren't converted to UTF-8.
Current approach uses mb_detect_encoding() with ISO-8859-1 fallback, but can be improved.
Option A: RFC 2045 Compliant Cascade Detection
According to RFC 2045 - 5.2, default charset should be US-ASCII. Here's a more robust approach:
if ($charset === null) {
// 1. Check if already valid UTF-8 (avoid unnecessary conversion)
if (mb_check_encoding($contents, 'UTF-8')) {
$charset = 'UTF-8';
}
// 2. Check if pure ASCII (RFC 2045 compliant)
elseif (mb_check_encoding($contents, 'ASCII')) {
$charset = 'US-ASCII';
}
// 3. Try detection for non-ASCII content
else {
$detected = mb_detect_encoding($contents,
['UTF-8', 'ISO-8859-1', 'ISO-8859-15', 'Windows-1252', 'Windows-1251'],
true
);
if ($detected !== false) {
$charset = $detected;
} else {
// 4. Pragmatic fallback (accepts all bytes, prevents MySQL errors)
$charset = 'ISO-8859-1';
}
}
}Pros: RFC compliant when possible, prevents MySQL errors, no data loss
Cons: More complex
Option B: Strict Fallback - silent
Alternative: Use US-ASCII as strict RFC-compliant default, which may fail for non-ASCII content but respects standards:
if ($charset === null) {
// RFC 2045: default to US-ASCII when charset not specified
// This may cause conversion errors for non-ASCII content,
// which alerts users to fix malformed emails at the source
$charset = 'US-ASCII';
// Note: Will fail gracefully if content has non-ASCII chars,
// preventing silent data corruption
}Pros: RFC compliant, enforces standards, reveals malformed emails
Cons: May reject legitimate emails (Yahoo DMARC), less pragmatic
Recommendation
Option A is more pragmatic for real-world emails while respecting standards when possible. Option B enforces strict RFC compliance but may lose legitimate mail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waiting for @cedric-anne feedback but option A seems fine IMO.
I guess this will be a very rare case anyway, and we can always keep improving the design if we find out that some mails are still not handled as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also vote for option A.
Just a note: why not using mb_list_encodings() (and maybe strict detection) in mb_detect_encoding() instead of listing only a few possible encodings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The php.net manual has a big warning saying that mb_detect_encoding is unreliable so I guess it should be avoided.
The name of this function is misleading, it performs "guessing" rather than "detection".
The guesses are far from accurate, and therefore you cannot use this function to accurately detect the correct character encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no real alternative to mb_detect_encoding() :(
To be discussed probably but it seems given example is not a real example of a mail that must be collected; maybe should we just "properly" reject mails without encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems calling mb_check_encoding manually multiple time is the alternative (which is what option A suggest I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To reproduce the behaviour from GLPI 10.0 removed in commit a655c63.
if ($charset === null) {
$charset = mb_check_encoding($contents, 'UTF-8') ? 'UTF-8' : 'ISO-8859-1';
}
src/MailCollector.php
Outdated
| if ($detected_charset !== false) { | ||
| $charset = $detected_charset; | ||
| } else { | ||
| // Fallback to ISO-8859-1 as it's the most common for mail headers without charset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waiting for @cedric-anne feedback but option A seems fine IMO.
I guess this will be a very rare case anyway, and we can always keep improving the design if we find out that some mails are still not handled as expected.
|
(Lint and tests are failing) |
tests/imap/MailCollectorTest.php
Outdated
| } | ||
|
|
||
| $this->assertNotNull($body_text, 'No text/plain part found in email'); | ||
| $this->assertStringContainsString('ATTENTION', $body_text); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe special characters must be tested too?
Also, maybe tests with several encodings could be done aswell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, i'll review all new tests once we have an answer for the chosen strategy above.
Description
Add a new check in case charset is not specified as in the .eml file shared in GLPI 11.0.1 : Collector mail error charset #21540 also add a new test .eml file with same configuration.