Skip to content

Conversation

@f2cmb
Copy link
Contributor

@f2cmb f2cmb commented Oct 21, 2025

Description

@trasher
Copy link
Contributor

trasher commented Oct 21, 2025

Please add a test case

@f2cmb
Copy link
Contributor Author

f2cmb commented Oct 21, 2025

Test added for case handled in fix

@f2cmb f2cmb marked this pull request as ready for review October 21, 2025 14:29
if ($detected_charset !== false) {
$charset = $detected_charset;
} else {
// Fallback to ISO-8859-1 as it's the most common for mail headers without charset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not sure :/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following our discussion, here are some thoughts on improving charset handling for emails without a specified charset parameter.

Current Issue

Original bug (#21540): Yahoo DMARC emails without charset → MySQL error Incorrect string value because quoted-printable accented chars (e.g., =E8 for è) aren't converted to UTF-8.

Current approach uses mb_detect_encoding() with ISO-8859-1 fallback, but can be improved.

Option A: RFC 2045 Compliant Cascade Detection

According to RFC 2045 - 5.2, default charset should be US-ASCII. Here's a more robust approach:

if ($charset === null) {
    // 1. Check if already valid UTF-8 (avoid unnecessary conversion)
    if (mb_check_encoding($contents, 'UTF-8')) {
        $charset = 'UTF-8';
    }
    // 2. Check if pure ASCII (RFC 2045 compliant)
    elseif (mb_check_encoding($contents, 'ASCII')) {
        $charset = 'US-ASCII';
    }
    // 3. Try detection for non-ASCII content
    else {
        $detected = mb_detect_encoding($contents,
            ['UTF-8', 'ISO-8859-1', 'ISO-8859-15', 'Windows-1252', 'Windows-1251'],
            true
        );

        if ($detected !== false) {
            $charset = $detected;
        } else {
            // 4. Pragmatic fallback (accepts all bytes, prevents MySQL errors)
            $charset = 'ISO-8859-1';
        }
    }
}

Pros: RFC compliant when possible, prevents MySQL errors, no data loss
Cons: More complex

Option B: Strict Fallback - silent

Alternative: Use US-ASCII as strict RFC-compliant default, which may fail for non-ASCII content but respects standards:

if ($charset === null) {
    // RFC 2045: default to US-ASCII when charset not specified
    // This may cause conversion errors for non-ASCII content,
    // which alerts users to fix malformed emails at the source
    $charset = 'US-ASCII';

    // Note: Will fail gracefully if content has non-ASCII chars,
    // preventing silent data corruption
}

Pros: RFC compliant, enforces standards, reveals malformed emails
Cons: May reject legitimate emails (Yahoo DMARC), less pragmatic

Recommendation

Option A is more pragmatic for real-world emails while respecting standards when possible. Option B enforces strict RFC compliance but may lose legitimate mail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waiting for @cedric-anne feedback but option A seems fine IMO.

I guess this will be a very rare case anyway, and we can always keep improving the design if we find out that some mails are still not handled as expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also vote for option A.
Just a note: why not using mb_list_encodings() (and maybe strict detection) in mb_detect_encoding() instead of listing only a few possible encodings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The php.net manual has a big warning saying that mb_detect_encoding is unreliable so I guess it should be avoided.

The name of this function is misleading, it performs "guessing" rather than "detection".
The guesses are far from accurate, and therefore you cannot use this function to accurately detect the correct character encoding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no real alternative to mb_detect_encoding() :(

To be discussed probably but it seems given example is not a real example of a mail that must be collected; maybe should we just "properly" reject mails without encoding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems calling mb_check_encoding manually multiple time is the alternative (which is what option A suggest I think).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reproduce the behaviour from GLPI 10.0 removed in commit a655c63.

if ($charset === null) {
    $charset = mb_check_encoding($contents, 'UTF-8') ? 'UTF-8' : 'ISO-8859-1';
}

if ($detected_charset !== false) {
$charset = $detected_charset;
} else {
// Fallback to ISO-8859-1 as it's the most common for mail headers without charset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waiting for @cedric-anne feedback but option A seems fine IMO.

I guess this will be a very rare case anyway, and we can always keep improving the design if we find out that some mails are still not handled as expected.

@AdrienClairembault
Copy link
Contributor

(Lint and tests are failing)

}

$this->assertNotNull($body_text, 'No text/plain part found in email');
$this->assertStringContainsString('ATTENTION', $body_text);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe special characters must be tested too?
Also, maybe tests with several encodings could be done aswell.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, i'll review all new tests once we have an answer for the chosen strategy above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GLPI 11.0.1 : Collector mail error charset

4 participants