Charset in imap doesn't work correctly #48

paolosanchi · 2012-03-10T09:25:57Z

I think something should be done in the internal void SetBody(string value) method,
because the value assigned to Body contains wrong characters:
Latin characters like 'à' and 'ò' are converted to '?'.

paolosanchi · 2012-03-11T15:53:39Z

OK, I analyzed the problem and studied a solution
(this article explains how charsets and text encoding work: http://www.joelonsoftware.com/articles/Unicode.html).
The general problem is that the email is decoded from the stream as if it were always encoded in UTF-8.
ImapClient.cs, line 422:

    while (remaining > 0) {
      read = _Stream.Read(buffer, 0, Math.Min(remaining, buffer.Length));
      body.Append(System.Text.Encoding.UTF8.GetString(buffer, 0, read));
      remaining -= read;
    }

This should be true for the headers, but the email content is encoded using the encoding specified in the Content-Type header of the email,
like this:
Content-Type: text/html; charset=UTF-8

That's not all, because the content could be of this type:
Content-Type: multipart/alternative;
which means that the body can have several representations, such as text/plain or text/html, and each one can use a different encoding, such as ISO-8859-1:

Content-Type: text/plain; charset="ISO-8859-1"

The real problem is that if we decode content encoded in ISO-8859-1 with the UTF-8 decoder we lose information: if the body contains culture-specific characters (like òèàùàè), they come out as '?'.
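
To make the information loss concrete, here is a minimal sketch. The `CharsetDemo` class and its `DecodeAs` helper are made up for illustration and are not part of the library. It decodes the same bytes with two different decoders. In .NET the UTF-8 decoder substitutes U+FFFD for bytes it cannot decode, which typically ends up as '?' once the string is displayed or re-encoded to a narrower charset.

```csharp
using System.Text;

// Hypothetical helper, only for illustrating the mismatch.
public static class CharsetDemo
{
    // Decodes raw bytes using the encoding registered under the given charset name.
    public static string DecodeAs(byte[] raw, string charset)
    {
        return Encoding.GetEncoding(charset).GetString(raw);
    }
}
```

For example, the two bytes 0xE0 0xF2 are "àò" in ISO-8859-1. `CharsetDemo.DecodeAs(raw, "ISO-8859-1")` gives back "àò", while `CharsetDemo.DecodeAs(raw, "UTF-8")` gives a string containing U+FFFD, and the original bytes can no longer be recovered from it.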

Storing the RawBody as a string is not bad in itself, since C# strings use 16 bits per char (they are Unicode), but just before the mail.Load(body.ToString(), headersonly); call in the GetMessages() method
we have to use the right decoder for each part, so that no character is decoded wrongly.
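
One way to get there, as a rough sketch and not the library's actual code: collect the raw bytes first and decode only once the charset is known. `RawBodyReader` and `ReadRaw` are hypothetical names. Unlike the loop above, it also stops if the stream ends early, and since nothing is decoded per chunk, a multi-byte character split across two reads is not corrupted.

```csharp
using System;
using System.IO;

// Hypothetical helper: reads the literal bytes without decoding them.
public static class RawBodyReader
{
    // Reads up to `remaining` bytes from the stream and returns them undecoded.
    public static byte[] ReadRaw(Stream stream, int remaining)
    {
        byte[] buffer = new byte[8192];
        using (MemoryStream raw = new MemoryStream())
        {
            while (remaining > 0)
            {
                int read = stream.Read(buffer, 0, Math.Min(remaining, buffer.Length));
                if (read <= 0) break; // stream ended before all bytes arrived
                raw.Write(buffer, 0, read);
                remaining -= read;
            }
            return raw.ToArray();
        }
    }
}
```

The caller can then pick the decoder per part, for example `Encoding.GetEncoding("ISO-8859-1").GetString(bytes)`, once the Content-Type of that part has been read.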

At this point there is another problem: the implicit operator that casts a MailMessage does not take the encoding into account at all. The Attachment.GetData() method is wrong, and attachment.ContentType is wrong too, because neither takes the original encoding of the various parts into account.

I found a working solution for my purposes (a workaround); it was simple because UTF-8 has the characters of my language.

I hope these considerations help someone find a smarter solution, because unfortunately I do not have time to do it right now.

piher · 2012-03-15T14:35:03Z

So you're saying you found a way to handle accented characters?

meehi · 2012-03-19T17:57:35Z

reporcello, you are right.

This line is wrong:
body.Append(System.Text.Encoding.UTF8.GetString(buffer, 0, read));

It should look something like this:
body.Append(System.Text.Encoding.GetEncoding(charset).GetString(buffer, 0, read));

charset is a string variable and should take its value from the Content-Type of the body.
I have tested and compiled the component again, and now Latin1 characters (such as accented characters) look fine.
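
As a sketch of where that charset value could come from: the helper below pulls the charset parameter out of a Content-Type value and falls back to a default when it is missing or unknown. `CharsetLookup` and `FromContentType` are names invented here, not something the library provides, and the regular expression only covers the simple quoted and unquoted forms shown in this thread.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: maps a Content-Type value to an Encoding.
public static class CharsetLookup
{
    // Matches charset=UTF-8 as well as charset="ISO-8859-1".
    private static readonly Regex CharsetPattern = new Regex(
        "charset\\s*=\\s*\"?([^\";\\s]+)\"?", RegexOptions.IgnoreCase);

    // Returns the encoding named in the header value, or the fallback
    // when no charset is present or the name is not supported.
    public static Encoding FromContentType(string contentType, Encoding fallback)
    {
        Match m = CharsetPattern.Match(contentType ?? string.Empty);
        if (!m.Success) return fallback;
        try
        {
            return Encoding.GetEncoding(m.Groups[1].Value);
        }
        catch (ArgumentException)
        {
            return fallback; // unknown or unsupported charset label
        }
    }
}
```

With this, the line above would become something like `CharsetLookup.FromContentType(contentTypeValue, Encoding.UTF8).GetString(buffer, 0, read)`, where `contentTypeValue` is assumed to hold the header value of the part being read.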

nakhli · 2012-03-26T13:04:41Z

This seems to be related to closed issue 49. Do you still have this problem with the latest version?

meehi · 2012-03-26T14:24:17Z

I still have this problem with the latest version. Issue #49 does not fix it. In my previous comment I added sample code showing how it should work. You might want to check it out.

meehi · 2012-03-26T14:27:04Z

And I think I have duplicated the problem here: #54

paolosanchi · 2012-03-29T06:13:33Z

I made some changes in my local version that solved the problem for Western European languages, because UTF-8 is compatible with them.
My solution is pretty brutal: I read the email twice, the first time just to search for the string ISO-8859-1. If I find it I use the UTF-8 decoder, otherwise I use ISO-8859-1 (from pages).
The email shouldn't be read using just one decoder; we should be able to switch to the proper one when we find the "Content-Type:" label.
Let me know.

meehi · 2012-03-29T09:05:30Z

reporcello:
I use the same approach as you do, but with a small tweak: I don't hard-code the code page, I search for it in the body and use it dynamically. Here you can find the complete solution that I use locally: #54

piher · 2012-03-31T06:19:05Z

Maybe we could start by reading the bytes as ASCII; then, when we encounter a "=?something?" or a "charset=" (or any other header specifying an encoding), we switch to the specified encoding and read the bytes.
We could do some sort of byte matching, since we know the bytes representing the end of line in headers and the bytes representing "charset=".
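
A rough sketch of that byte-matching idea for a simple single-part message (`HeaderBodySplitter` and its methods are hypothetical names): look for the blank line (CR LF CR LF) that ends the headers, decode the headers as ASCII, and decode the rest with whatever encoding the headers named. A multipart message would need the same treatment for each part, and any Content-Transfer-Encoding such as base64 or quoted-printable would still have to be undone before the charset is applied.

```csharp
using System.Text;

// Hypothetical helper: splits a raw message on the header/body boundary.
public static class HeaderBodySplitter
{
    // Returns the index of the first body byte, or -1 if no blank line is found.
    public static int FindBodyStart(byte[] message)
    {
        for (int i = 0; i + 3 < message.Length; i++)
        {
            if (message[i] == 13 && message[i + 1] == 10 &&
                message[i + 2] == 13 && message[i + 3] == 10)
            {
                return i + 4;
            }
        }
        return -1;
    }

    // Returns { headers, body }: headers decoded as ASCII, body with bodyEncoding.
    public static string[] Split(byte[] message, Encoding bodyEncoding)
    {
        int start = FindBodyStart(message);
        if (start < 0)
        {
            return new[] { Encoding.ASCII.GetString(message), string.Empty };
        }
        string headers = Encoding.ASCII.GetString(message, 0, start);
        string body = bodyEncoding.GetString(message, start, message.Length - start);
        return new[] { headers, body };
    }
}
```

The body encoding is passed in on purpose: the caller reads the charset from the decoded headers first and only then decodes the body bytes.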

meehi · 2012-03-31T12:36:12Z

This is a working solution: #54 (comment)

I have tested on many Latin1 and UTF8 character encoded mails and it has decoded all of them without problem.

It needs further testing and some adjustment.

jstedfast · 2014-01-11T17:41:28Z

The only way to truly solve issues like this is to write a parser that doesn't require the message data to be converted into a unicode string first. In other words, the MIME parser needs to parse byte arrays.

See MimeKit for an example of a MIME parser that does this.

piher mentioned this issue Mar 31, 2012

From is null or has character set issues #62

Open
