check mail encoding and format if charset is not specified #21575

f2cmb · 2025-10-21T12:58:32Z

Description

Fixes GLPI 11.0.1 : Collector mail error charset #21540.
Here is a brief description of what this PR does:
It adds a check for the case where the charset is not specified, as in the .eml file shared in GLPI 11.0.1 : Collector mail error charset #21540, and adds a new test .eml file with the same configuration.

trasher · 2025-10-21T13:06:28Z

Please add a test case

f2cmb · 2025-10-21T13:49:24Z

Test added for case handled in fix

trasher · 2025-10-22T08:30:47Z

src/MailCollector.php

+            if ($detected_charset !== false) {
+                $charset = $detected_charset;
+            } else {
+                // Fallback to ISO-8859-1 as it's the most common for mail headers without charset


I'm really not sure :/

Following our discussion, here are some thoughts on improving charset handling for emails without a specified charset parameter.

Current Issue

Original bug (#21540): Yahoo DMARC emails without a charset parameter trigger the MySQL error "Incorrect string value", because quoted-printable accented characters (e.g., =E8 for è) are not converted to UTF-8.

The current approach uses mb_detect_encoding() with an ISO-8859-1 fallback, but it can be improved.
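For illustration, a minimal sketch of the failure mode, assuming the body part is quoted-printable and its bytes are actually ISO-8859-1. The sample string is made up and is not taken from the .eml attached to #21540.

```php
<?php
// Hypothetical quoted-printable body; "=E8" is the single byte 0xE8 ("è" in ISO-8859-1).
$raw = 'Probl=E8me';
$decoded = quoted_printable_decode($raw);

// A lone 0xE8 byte is not a valid UTF-8 sequence, which is the kind of
// input MySQL rejects with "Incorrect string value".
$is_utf8_before = mb_check_encoding($decoded, 'UTF-8');

// Converting from the assumed source charset yields valid UTF-8.
$converted = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1');
$is_utf8_after = mb_check_encoding($converted, 'UTF-8');

var_dump($is_utf8_before, $is_utf8_after, $converted === "Probl\u{E8}me");
```

The decoding step itself is not the problem here; the missing piece is knowing which source charset to convert from when the header does not say.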

Option A: RFC 2045 Compliant Cascade Detection

According to RFC 2045 - 5.2, default charset should be US-ASCII. Here's a more robust approach:

if ($charset === null) {
    // 1. Check if already valid UTF-8 (avoid unnecessary conversion)
    if (mb_check_encoding($contents, 'UTF-8')) {
        $charset = 'UTF-8';
    }
    // 2. Check if pure ASCII (RFC 2045 compliant)
    elseif (mb_check_encoding($contents, 'ASCII')) {
        $charset = 'US-ASCII';
    }
    // 3. Try detection for non-ASCII content
    else {
        $detected = mb_detect_encoding(
            $contents,
            ['UTF-8', 'ISO-8859-1', 'ISO-8859-15', 'Windows-1252', 'Windows-1251'],
            true
        );
        if ($detected !== false) {
            $charset = $detected;
        } else {
            // 4. Pragmatic fallback (accepts all bytes, prevents MySQL errors)
            $charset = 'ISO-8859-1';
        }
    }
}

Pros: RFC compliant when possible, prevents MySQL errors, no data loss
Cons: More complex
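One point that may be worth double-checking in the cascade above: every pure ASCII string is also valid UTF-8, so with the UTF-8 test first the US-ASCII branch would not be reached. The sketch below shows this and a simplified variant with the ASCII test first. The function name is hypothetical, and the mb_detect_encoding() step is deliberately left out to keep the example short.

```php
<?php
// Pure ASCII input passes both checks.
$ascii_only = 'ATTENTION: plain text';
$passes_utf8_check = mb_check_encoding($ascii_only, 'UTF-8');
$passes_ascii_check = mb_check_encoding($ascii_only, 'ASCII');
var_dump($passes_utf8_check, $passes_ascii_check);

// Hypothetical helper: same idea as Option A, but ASCII is tested first
// and anything that is not valid UTF-8 falls back to ISO-8859-1.
function guess_charset_ascii_first(string $contents): string
{
    if (mb_check_encoding($contents, 'ASCII')) {
        return 'US-ASCII';
    }
    if (mb_check_encoding($contents, 'UTF-8')) {
        return 'UTF-8';
    }
    return 'ISO-8859-1';
}

var_dump(
    guess_charset_ascii_first('ATTENTION'),
    guess_charset_ascii_first("Probl\u{E8}me"), // UTF-8 encoded "è"
    guess_charset_ascii_first("Probl\xE8me")    // raw 0xE8 byte
);
```

Whether distinguishing US-ASCII from UTF-8 matters in practice is debatable, since converting ASCII content from either charset to UTF-8 gives the same bytes.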

Option B: Strict Fallback - silent

Alternative: Use US-ASCII as strict RFC-compliant default, which may fail for non-ASCII content but respects standards:

if ($charset === null) {
    // RFC 2045: default to US-ASCII when charset not specified
    // This may cause conversion errors for non-ASCII content,
    // which alerts users to fix malformed emails at the source
    $charset = 'US-ASCII';
    // Note: Will fail gracefully if content has non-ASCII chars,
    // preventing silent data corruption
}

Pros: RFC compliant, enforces standards, reveals malformed emails
Cons: May reject legitimate emails (Yahoo DMARC), less pragmatic

Recommendation

Option A is more pragmatic for real-world emails while respecting standards when possible. Option B enforces strict RFC compliance but may lose legitimate mail.

Waiting for @cedric-anne feedback but option A seems fine IMO.

I guess this will be a very rare case anyway, and we can always keep improving the design if we find out that some mails are still not handled as expected.

I'd also vote for option A.
Just a note: why not use mb_list_encodings() (and maybe strict detection) in mb_detect_encoding() instead of listing only a few possible encodings?

The php.net manual has a big warning saying that mb_detect_encoding is unreliable so I guess it should be avoided.

The name of this function is misleading, it performs "guessing" rather than "detection".
The guesses are far from accurate, and therefore you cannot use this function to accurately detect the correct character encoding.
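A small sketch of why this can only ever be a guess for single-byte charsets: the same byte is legal in several of them and decodes to a different character in each. The two charsets below are picked for illustration from the list proposed in Option A.

```php
<?php
// The byte 0xE8 is a defined character in both charsets.
$byte = "\xE8";

$valid_latin1 = mb_check_encoding($byte, 'ISO-8859-1');
$valid_cp1251 = mb_check_encoding($byte, 'Windows-1251');

// Decoding the same byte from each charset gives different text:
// a Latin "e with grave" in one case, a Cyrillic letter in the other.
$as_latin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
$as_cp1251 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1251');

var_dump($valid_latin1, $valid_cp1251, $as_latin1 === $as_cp1251);
```

Both validity checks pass while the decoded results differ, so nothing in the bytes alone tells the two charsets apart. Passing the full mb_list_encodings() list would presumably add more such candidates rather than resolve the ambiguity.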

There is no real alternative to mb_detect_encoding() :(

This probably needs discussion, but the given example does not seem to be a real example of a mail that must be collected; maybe we should just "properly" reject mails without an encoding.

It seems that calling mb_check_encoding manually multiple times is the alternative (which is what option A suggests, I think).

To reproduce the behaviour from GLPI 10.0, which was removed in commit a655c63:

if ($charset === null) {
    $charset = mb_check_encoding($contents, 'UTF-8') ? 'UTF-8' : 'ISO-8859-1';
}
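As a rough usage sketch, the snippet above can be wrapped in a helper and run against a few inputs. The helper name and the sample strings are hypothetical and this is not the actual MailCollector code.

```php
<?php
// Hypothetical helper wrapping the GLPI 10.0 style fallback.
function resolve_missing_charset(?string $charset, string $contents): string
{
    if ($charset === null) {
        $charset = mb_check_encoding($contents, 'UTF-8') ? 'UTF-8' : 'ISO-8859-1';
    }
    return $charset;
}

$samples = [
    'ascii'  => 'ATTENTION',
    'utf8'   => "Probl\u{E8}me", // "è" encoded as UTF-8
    'latin1' => "Probl\xE8me",   // "è" as the raw byte 0xE8
];

foreach ($samples as $label => $contents) {
    $charset = resolve_missing_charset(null, $contents);
    $utf8 = mb_convert_encoding($contents, 'UTF-8', $charset);
    printf(
        "%s: %s, valid UTF-8 after conversion: %s\n",
        $label,
        $charset,
        var_export(mb_check_encoding($utf8, 'UTF-8'), true)
    );
}
```

Since ISO-8859-1 assigns a character to every byte value, the conversion always produces valid UTF-8, which is what avoids the MySQL error. It does not guarantee that the resulting text is the one the sender intended.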

AdrienClairembault · 2025-10-23T14:19:58Z

src/MailCollector.php

+            if ($detected_charset !== false) {
+                $charset = $detected_charset;
+            } else {
+                // Fallback to ISO-8859-1 as it's the most common for mail headers without charset


Waiting for @cedric-anne feedback but option A seems fine IMO.

I guess this will be a very rare case anyway, and we can always keep improving the design if we find out that some mails are still not handled as expected.

AdrienClairembault · 2025-10-23T14:20:45Z

(Lint and tests are failing)

trasher · 2025-10-24T05:42:51Z

tests/imap/MailCollectorTest.php

+        }
+
+        $this->assertNotNull($body_text, 'No text/plain part found in email');
+        $this->assertStringContainsString('ATTENTION', $body_text);


Maybe special characters should be tested too?
Also, tests with several encodings could be added as well.

Sure, I'll review all new tests once we have an answer on the strategy chosen above.
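In the meantime, a rough idea of what such checks could look like, written as plain PHP rather than as PHPUnit test methods. The helper, the sample text, and the list of encodings are assumptions for illustration, and the fallback shown is the UTF-8 / ISO-8859-1 one discussed in the review thread.

```php
<?php
// Hypothetical stand-in for the charset fallback under discussion.
function fallback_charset(string $contents): string
{
    return mb_check_encoding($contents, 'UTF-8') ? 'UTF-8' : 'ISO-8859-1';
}

// UTF-8 reference text containing accented characters.
$expected = "ATTENTION: \u{E9}t\u{E9} \u{E0} No\u{EB}l";

$results = [];
foreach (['UTF-8', 'ISO-8859-1', 'Windows-1252'] as $source_encoding) {
    // Encode the reference text as a mail body in the given charset...
    $body = mb_convert_encoding($expected, $source_encoding, 'UTF-8');
    // ...then decode it as if no charset parameter had been provided.
    $decoded = mb_convert_encoding($body, 'UTF-8', fallback_charset($body));
    $results[$source_encoding] = ($decoded === $expected);
}
var_dump($results);

// A known limit: the euro sign is byte 0x80 in Windows-1252,
// which ISO-8859-1 treats as a control character.
$euro_body = mb_convert_encoding("\u{20AC}", 'Windows-1252', 'UTF-8');
$euro_decoded = mb_convert_encoding($euro_body, 'UTF-8', fallback_charset($euro_body));
var_dump($euro_decoded === "\u{20AC}");
```

The accented letters used here share the same byte values in ISO-8859-1 and Windows-1252, so they round-trip under the fallback. The euro sign does not, which may be a useful edge case to cover or to document as unsupported.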

f2cmb added 2 commits October 21, 2025 14:24

check mail encoding and format if charset is not specified

519d617

add missing charset .eml file in tests

5fcabf3

f2cmb requested review from AdrienClairembault, cedric-anne and trasher October 21, 2025 12:58

trasher added the need unit tests label Oct 21, 2025

add test

b5bcee7

f2cmb removed the need unit tests label Oct 21, 2025

f2cmb marked this pull request as ready for review October 21, 2025 14:29

modify call for Mbox in test

482a055

trasher reviewed Oct 22, 2025

AdrienClairembault approved these changes Oct 23, 2025

trasher reviewed Oct 24, 2025

f2cmb added 4 commits October 27, 2025 12:10

remove redundant / always true cond

b7c2420

readd ancient behavior for non-specified charset

46f0e4f

simplify eml test file, refine test for more accuracy with fix

18fe2bc

adjust test and eml file

94bfcd7

4 participants