Body character coding problem #54

meehi · 2012-03-19T15:40:48Z

Hi!

I'm working on a program in VS2010 (C#) and using your component to receive unseen messages. It's working nice but today I received a mail that is not shown correctly and containing '?' characters (Unicode characters).
Here is the code I use:

System.Lazy<AE.Net.Mail.MailMessage>[] myMessages = _client2.SearchMessages(AE.Net.Mail.SearchCondition.Unseen());
foreach (System.Lazy<AE.Net.Mail.MailMessage> message in myMessages)
{
AE.Net.Mail.MailMessage msg = message.Value;
}

I have added a watch on "msg" variable and can see the following:
msg.AlternativeViews[0].Content -> A h�ten virtu�lis
msg.AlternativeViews[0].ContentEncoding -> 8bit
msg.BodyHtml -> the same as above, but with html tags

msg.AlternativeViews[0].Content should be look like this -> A héten virtuális...

As you can see the message body contains accute (Unicode) characters (they are Hungarian characters, and might be in ISO-8859-1, or ISO-8859-2).

andyedinborough · 2012-03-20T03:53:26Z

Can you forward an example message to me? [email protected]

meehi · 2012-03-20T08:14:37Z

hi,
yeah sure. I have forwarded the email from my gmail account: [email protected]. The mail came from [email protected].

Thanks for the help,
Mihaly Sogorka

-----Original Message-----
From: Andy Edinborough [email protected]
To: meehi [email protected]
Sent: Tue, Mar 20, 2012 4:53 am
Subject: Re: [aenetmail] Body character coding problem (#54)

Can you forward an example message to me? [email protected]

Reply to this email directly or view it on GitHub:
#54 (comment)

meehi · 2012-03-20T09:10:54Z

Hi!

I modified and complied your code in ImapClient.cs class in the GetMessages() procedure with the following code:

    string pattern = "charset=";
    string temp_charset = Encoding.Default.BodyName;
    string charset = "";
    while (remaining > 0) {
      read = _Stream.Read(buffer, 0, Math.Min(remaining, buffer.Length));
      if (temp_charset != charset)
      {
          string temp_body = System.Text.Encoding.GetEncoding(temp_charset).GetString(buffer, 0, read);
          temp_body = temp_body.Replace("\"", "").Replace(" ", "");
          if (temp_body.ToLower().Contains(pattern))
          {
              int start_pos = temp_body.IndexOf(pattern) + pattern.Length;
              int end_pos = temp_body.IndexOf("\r\n", start_pos);
              if (end_pos == -1)
              {
                  end_pos = temp_body.IndexOf("\r", start_pos);
                  if (end_pos == -1)
                      end_pos = temp_body.IndexOf(";", start_pos);
              }
              if (end_pos != -1)
              {
                  int length = end_pos - start_pos;
                  charset = temp_body.Substring(start_pos, length).Replace(";", "");
                  if (charset != "" && charset != temp_charset)
                      temp_charset = charset;
              }
          }
      }
      body.Append(System.Text.Encoding.GetEncoding(temp_charset).GetString(buffer, 0, read));
      remaining -= read;
    }

I have tested it with Unicode mails and they now displaying correctly. The code is not nice but working. It might give you an idea how to workaround this problem.

Best regards,
Mihaly Sogorka

-----Original Message-----
From: Andy Edinborough [email protected]
To: meehi [email protected]
Sent: Tue, Mar 20, 2012 4:53 am
Subject: Re: [aenetmail] Body character coding problem (#54)

Can you forward an example message to me? [email protected]

Reply to this email directly or view it on GitHub:
#54 (comment)

meehi · 2012-03-20T12:53:24Z

Related to this problem I found another one in HeaderObject.cs class in SetBody(string value) procedure:
This line carries on the same character decoding issue:

var data = Convert.FromBase64String(value);

Previous line should look like something like this:

value = System.Text.Encoding.GetEncoding(??BODY_CHARSET??).GetString(Convert.FromBase64String(value));

So the character encoding of the body (charset) should be stored in the object model somehow. This way you could refer to it any time when you need to use a decoder function.

piher · 2012-03-29T09:12:42Z

I can see two cases where your code will fail :

if the "charset=" bytes have been truncated because of the buffer's limited length (tmpBody can end with "char" and next one will start with "set=.....")
if tmpBody actually contains several parts with different charsets

And i think the charset header is not even mandatory.

meehi · 2012-03-29T09:18:38Z

well I know it's not perfect but working. If you check the code then you can see it will use your OS default codepage if it cant find the charset in body: string temp_charset = Encoding.Default.BodyName.
But in most cases it finds the charset I made many test with different mails.

meehi · 2012-03-29T09:22:56Z

Can you please provide me with a codeblock what you have changed?
thanks

piher · 2012-03-31T20:04:05Z

Hi, could you try with this code ?
It seems to be working for me.

[Edited : the part where I decoded quoted-printable and base64 was just stupid........]
[EDITED 2 : removed the getMessages code which was taking too much space. See a few post down for a rtf containing the function

Added in utilities :

      {
          int indexOf=-1;
          int tmp = -1;
          int j = 0;
          end = end == -1 ? fullArray.Length - 1 : end;
          for (int i = start; i <= end; i++)
          {
              if (fullArray.GetValue(i).Equals(innerArray.GetValue(j)))
              {
                  if (j == 0) { tmp = i;}
                  else { if (j == innerArray.Length - 1) { indexOf = tmp; break; } }
                  j++;
              }
              else { j = 0; }
          }
          return indexOf;
      }

internal static int LastIndexOfArray(Array fullArray, Array innerArray, int start = 0, int end = -1)
{
int lastIndexOf = -1;
int tmp = -1;
int j = innerArray.Length - 1;
end = end == -1 ? fullArray.Length - 1 : end;
for (int i = end; i > start; i--)
{
if (fullArray.GetValue(i).Equals(innerArray.GetValue(j)))
{
if (j == innerArray.Length - 1) { tmp = i; }
else { if (j == 0) { lastIndexOf = tmp; break; } }
j--;
}
else { j = innerArray.Length - 1; }
}
return lastIndexOf;
}```

meehi · 2012-03-31T20:17:16Z

Please send me the whole GetMessages() function because some of the declaration is missing from the above code.

meehi · 2012-03-31T20:25:44Z

what is cancellationPending variable?

piher · 2012-03-31T20:27:12Z

Sorry, you should just remove the lines "if (cancellationPending) throw new GetMessagesCancelledException();" i forgot to take them off, they are not part of the library.

piher · 2012-03-31T20:29:23Z

I think the regex needs to be adjusted because i found a few emails that would make it fail.
E.g. when the charset end with a ";".

piher · 2012-03-31T20:34:42Z

Full method :

http://cjoint.com/12ma/BCFwIsp7wjC_getmessageswithcharsetparsing.rtf

meehi · 2012-03-31T20:34:54Z

It's working with UTF8 mails but not with Latin1 charachters.
It throw an exception in Utitilities, this line:

if (fullArray.GetValue(i).Equals(innerArray.GetValue(j)))

Index was outside the bounds of the array.

piher · 2012-03-31T20:44:54Z

Have you tried with the method in the rtf file i sent ?
I had inverted two lines, which caused the issue.
About latin charachters, with me it seems to work, i can see all accents fine.

So far i noticed to bugs :

"UTF-8;" and encoding.getencoding exception
Some multipart emails are badly parsed, i have to change the regex to ensure it's not possible

meehi · 2012-03-31T20:51:58Z

Yes with the rtf file it's working. Nice.
I just bulk pasted the code didn't examine it yet :)

meehi · 2012-03-31T21:01:09Z

In some cases it still appears that it can't find the right decoder. Just received a mail that still contains '?' (question marks)

piher · 2012-03-31T21:01:23Z

Cool !
I think it's better if we use a code based on a regex because we can make improvements without changing the whole method.

Just a thought :
I'm wondering if we should keep trying to parse the encoding of body and parts while downloading.
Maybe this should be optional since most of the times when we download emails, the info we want while downloading is in the headers, the rest should only be parsed when needed. Therefor, perhaps the MailMessages should store the raw body bytes rather than the raw body string.

piher · 2012-03-31T21:05:25Z

It's possible, as an example my code won't work if there is no charset specified at all (it happens !) or if the headers of a part are like that:
Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: quoted-printable Content-Disposition: inline
Rather than :

Content-Transfer-Encoding: quoted-printable

The regex has to be updated but i think the general idea is there.

meehi · 2012-03-31T21:08:23Z

yes, good thought.
And in HeaderObject.cs there is also a function called SetBody() which "hard codes" the encoder or uses the default that can be a problem:
var data = Convert.FromBase64String(value);
This overrides the function what you have write above..

andyedinborough · 2012-04-02T17:23:04Z

The code in HeaderObject.cs should work just fine. The value is converted to a byte and then it uses a StreamReader to decode it -- using detectEncodingFromByteOrderMarks=true. If it is failing, send me an example message to [email protected] and I'll add it to the unit tests.

piher · 2012-04-03T11:37:06Z

Much simpler, better and faster version of the previous custom GetMessages :
http://cjoint.com/12av/BDdnH6aFbJs_getmessageswithcharsetparsingv2.rtf
Some stuff I was decoding wasn't actually useful (like the non us-ASCII headers since they are encoded in quoted printable which is only ASCII).The regex should now take into account every kind of content-type header fields.
Also, I tried to limit as much as I could the extraction of information to keep the complexity low.

meehi · 2012-04-03T11:55:28Z

Thanks, I'm going to test it as soon as I can.

meehi · 2012-04-05T11:36:54Z

So, I have tested your last GetMessages() function with different mails, here is the result: some of the mails with Latin charset are displaying well but in 1 case there is still '?' character instead of unicode chars. This is a text/html mail.

There is another problem: line breaks are missing. In original version there were \n\r line breaks but now they gone.

meehi · 2012-04-05T11:44:24Z

I have forwarded a mail that is displaying incorrectly to: [email protected]

meehi · 2012-04-06T13:12:58Z

Here is an example where your code fail:

Mail snippet:

--b1_03e47a899ab3248e32619e3ddc24148c
Content-Type: text/html; charset = "iso-8859-2"
Content-Transfer-Encoding: 8bit

Regards,
Mihaly Sogorka

meehi · 2012-04-06T14:02:12Z

Conclusion: returning to roots.

So I have tested your code block with 5 different mails and got the following result: 4 of 5 mails displaying correctly while 1 mail is still incorrect. I changed back my GetMessages() function to my previous version explained here: #54 (comment) and this code displaying 5/5 mails correctly. So I'm returning to this version.

piher · 2012-04-07T12:24:38Z

Could you send me the full problematic email and tell me where exactly it fails parsing ? (use github messages)

Are the line breaks missing in every message ?

meehi · 2012-04-07T13:24:40Z

I have forwarded the mail.

…ent is used, and default the charset to `ISO-8859-1`.

andyedinborough · 2012-04-08T16:27:40Z

Hey guys, I think this is simply an issue of not using the right default character set. I've updated the code to use ISO-8859-1, Latin 1, as the default, and then updated the SetBody method to obey the charset specified in headers.

meehi · 2012-04-09T07:17:02Z

Hi Andy,

so this fix works not just for ISO-8859-1 rather every kind of messages?

meehi · 2012-04-09T07:20:46Z

You might want to use this code, instead of hard coding ISO-8859-1?

--modifiied--

public string DEFAULT_CHARSET = System.Text.Encoding.Default.BodyName;

meehi · 2012-04-09T07:55:01Z

There is 1 case when your code fails: I've got a mail containing the following line:

charset=binary

Never seen this before but it seems like it's possible.

public static Encoding ParseCharsetToEncoding(string characterSet)
{
if (string.IsNullOrEmpty(characterSet))
characterSet = DEFAULT_CHARSET;

        string charSetUpper = characterSet.ToUpperInvariant();
        if (charSetUpper.Contains("WINDOWS") || charSetUpper.Contains("CP"))
        {
            // It seems the character set contains an codepage value, which we should use to parse the encoding
            charSetUpper = charSetUpper.Replace("CP", ""); // Remove cp
            charSetUpper = charSetUpper.Replace("WINDOWS", ""); // Remove windows
            charSetUpper = charSetUpper.Replace("-", ""); // Remove - which could be used as cp-1554

            // Now we hope the only thing left in the characterSet is numbers.
            int codepageNumber = int.Parse(charSetUpper, System.Globalization.CultureInfo.InvariantCulture);

            return Encoding.GetEncoding(codepageNumber);
        }

        // It seems there is no codepage value in the characterSet. It must be a named encoding
        return Encoding.GetEncoding(characterSet);
    }

when you call return Encoding.GetEncoding(characterSet); characterSet variable value holds this "binary" and it fails here.

meehi · 2012-04-09T08:07:39Z

possible workaround:

if (string.IsNullOrEmpty(characterSet) || charset.Contains("binary"))
  characterSet = DEFAULT_CHARSET;

or

try
{
   return Encoding.GetEncoding(characterSet);
}
catch
{
  return Encoding.GetEncoding(DEFAULT_CHARSET);
}

…ader value when reading in attachments (closes #63)

meehi · 2012-04-09T15:20:37Z

Stack overflow in HeaderObject.cs, here:

return _Headers ?? (_Headers = HeaderDictionary.Parse(RawHeaders, Encoding));

By the way this improvements looking promising :)

andyedinborough · 2012-04-09T15:26:38Z

Oy. Can't believe I missed that. I should really learn to run the unit tests before checking in. :]

meehi · 2012-04-09T15:34:20Z

Great work!

All the messages are displaying correctly. Decoding can't be a problem anymore :).
But I would be interested getting a letter from China or from Japan.

meehi · 2012-04-09T17:06:17Z

There is still 1 case when the message is displaying incorrectly:

Message has a Charset=UTF-8
OS has default Charset=ISO-8859-2 (or else, but not UTF-8)

This time the message is being decoded with ISO-8859-2 while it was encoded with UTF-8 so unicode characters will be displayed wrong.

piher · 2012-04-09T22:22:19Z

I don't understand how using the system's default encoding can be a solution. This probably works fine on simple cases where emails are exchanged within people using the same environnment but i'm sure soon enough someone will open an issue saying "hey I received an email from this unknown country with that particular encoding and it's not displaying correctly"

andyedinborough · 2012-04-09T22:27:51Z

It's a solution because as far as I can tell, SMTP headers are supposed to be encoded with Latin1. Anything that isn't Latin1 should be encoded in place with the =?encoding-name?tobedecoded?= syntax.

andyedinborough · 2012-04-09T22:29:21Z

@meehi I've set TextClient to specifically request Latin1. Let me know how that works for that last case.

piher · 2012-04-09T22:38:09Z

Fair enough but I was more thinking about the body, would it work in every case ?

andyedinborough · 2012-04-09T22:45:15Z

Hmmm... I think there's still an issue. Shoot. To set the body, it switches to the encoding specified in the headers, but by that point, it's already been decoded. I've got an idea. I'll work on it tonight.

meehi · 2012-04-10T07:17:01Z

Ok, I will wait for that idea and then test it.

jstedfast · 2014-01-11T17:38:46Z

Looks like you guys seriously need to use MimeKit to solve this problem.

meehi mentioned this issue Mar 26, 2012

Charset in imap doesn't work correctly #48

Open

meehi closed this as completed Mar 29, 2012

meehi reopened this Mar 29, 2012

piher mentioned this issue Mar 31, 2012

From is null or has character set issues #62

Open

andyedinborough added a commit that referenced this issue Apr 8, 2012

Potential fix for #54; ensures the Encoding property of the ImapCli…

e186bd8

…ent is used, and default the charset to `ISO-8859-1`.

andyedinborough added a commit that referenced this issue Apr 8, 2012

Make sure the correct charset is used -- #54

ab4975f

andyedinborough added a commit that referenced this issue Apr 9, 2012

updates for default encoding #54

43508f8

andyedinborough added a commit that referenced this issue Apr 9, 2012

Additional updates for proper encoding (#54); also respects name he…

a049cf5

…ader value when reading in attachments (closes #63)

andyedinborough added a commit that referenced this issue Apr 9, 2012

Add default value for encoding (#54)

71c4440

andyedinborough added a commit that referenced this issue Apr 9, 2012

*slaps forehead* #54

956ec06

meehi closed this as completed Apr 9, 2012

meehi reopened this Apr 9, 2012

andyedinborough added a commit that referenced this issue Apr 9, 2012

require Latin1 -- #54

804a9f6

Body character coding problem #54

Body character coding problem #54

Comments

meehi commented Mar 19, 2012

andyedinborough commented Mar 20, 2012

meehi commented Mar 20, 2012

meehi commented Mar 20, 2012

meehi commented Mar 20, 2012

piher commented Mar 29, 2012

meehi commented Mar 29, 2012

meehi commented Mar 29, 2012

piher commented Mar 31, 2012

meehi commented Mar 31, 2012

meehi commented Mar 31, 2012

piher commented Mar 31, 2012

piher commented Mar 31, 2012

piher commented Mar 31, 2012

meehi commented Mar 31, 2012

piher commented Mar 31, 2012

meehi commented Mar 31, 2012

meehi commented Mar 31, 2012

piher commented Mar 31, 2012

piher commented Mar 31, 2012

meehi commented Mar 31, 2012

andyedinborough commented Apr 2, 2012

piher commented Apr 3, 2012

meehi commented Apr 3, 2012

meehi commented Apr 5, 2012

meehi commented Apr 5, 2012

meehi commented Apr 6, 2012

meehi commented Apr 6, 2012

piher commented Apr 7, 2012

meehi commented Apr 7, 2012

andyedinborough commented Apr 8, 2012

meehi commented Apr 9, 2012

meehi commented Apr 9, 2012

meehi commented Apr 9, 2012

meehi commented Apr 9, 2012

meehi commented Apr 9, 2012

andyedinborough commented Apr 9, 2012

meehi commented Apr 9, 2012

meehi commented Apr 9, 2012

piher commented Apr 9, 2012

andyedinborough commented Apr 9, 2012

andyedinborough commented Apr 9, 2012

piher commented Apr 9, 2012

andyedinborough commented Apr 9, 2012

meehi commented Apr 10, 2012

jstedfast commented Jan 11, 2014