Thread: TIdMessage Hebrew subject decoding issue


This question is answered.


John May

Posts: 81
Registered: 6/25/10
TIdMessage Hebrew subject decoding issue  
  Posted: Dec 5, 2016 10:50 AM
This message:

From: a <a@a.a>
Subject:
 =?UTF-8?B?15HXlden16gg15jXldeRINec15vXnCDXlNeX15HXqNeZ150sINee15Qg16DX?=
 =?UTF-8?B?qdee16I=?=
Date: Sun, 4 Dec 2016 08:31:35 +0200
 
a


Is decoded as:
בוקר טוב לכל החברים, מה נ��מע


Instead of the correct version:
בוקר טוב לכל החברים, מה נשמע


I tested it in the usual email programs (Outlook, WLM, Thunderbird, Gmail webmail) and they all decode the subject as expected. I also tried https://www.base64decode.org/, which decodes it correctly once the two B64 pieces are merged. So the B64 data itself is not damaged; Indy is doing something to damage it.

The subject is decoded well if I merge the two lines into one and remove the =?UTF-8?B? from the second one of course.

I can't see the reason why it is not decoded in Indy, can you? I use the latest Indy.
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: TIdMessage Hebrew subject decoding issue
Helpful
  Posted: Dec 5, 2016 3:12 PM   in response to: John May
John wrote:

Tested in a few email programs, the usual gang, Outlook, WLM,
Thunderbird, Gmail webmail... they all decode the subject as expected
and also tried https://www.base64decode.org/ which too decodes it
well, after merging the B64 data so B64 data is not damaged, Indy is
doing something to damage it.

The email header is malformed to begin with, and Indy does not attempt to
correct the malformation.

The header is breaking up the base64 data into two separate pieces, which
must be decoded and treated individually, per RFC 2047. The question marks
are appearing where a Unicode character ('ש', codepoint U+05E9) has been
encoded as a 2-byte UTF-8 sequence (D7 A9) which has then been split between
the two pieces, so the sequence is incomplete and thus invalid within each
piece. Splitting a multi-byte sequence mid-sequence like that is forbidden
by RFC 2047:

encoded-word = "=?" charset "?" encoding "?" encoded-text "?="

...

The 'encoded-text' in an 'encoded-word' must be self-contained;
'encoded-text' MUST NOT be continued from one 'encoded-word' to
another. This implies that the 'encoded-text' portion of a "B"
'encoded-word' will be a multiple of 4 characters long; for a "Q"
'encoded-word', any "=" character that appears in the 'encoded-text'
portion will be followed by two hexadecimal characters.

Each 'encoded-word' MUST encode an integral number of octets. The
'encoded-text' in each 'encoded-word' must be well-formed according
to the encoding specified; the 'encoded-text' may not be continued in
the next 'encoded-word'. (For example, "=?charset?Q?=?=
=?charset?Q?AB?=" would be illegal, because the two hex digits "AB"
must follow the "=" in the same 'encoded-word'.)

Each 'encoded-word' MUST represent an integral number of characters.
A multi-octet character may not be split across adjacent 'encoded-
word's.

Your issue falls within the last paragraph. Indy does not attempt to recover
from this (and that is allowed by RFC 2047). And frankly, I'm surprised
any RFC-compliant reader does attempt it, given the strictness of RFC 2047.
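
To make the split concrete, here is a minimal standalone C++ sketch (not Indy code; the names `base64Decode`, `isLeadByte`, `isContinuation` and `checkOriginalSubject` are made up for illustration). It base64-decodes the two encoded-words from the original post and checks the bytes on either side of the break.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decode standard base64 (RFC 4648 alphabet); stops at the first '=' padding.
std::vector<uint8_t> base64Decode(const std::string& in) {
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::vector<uint8_t> out;
    uint32_t buffer = 0;
    int bits = 0;
    for (char c : in) {
        if (c == '=') break;
        const std::size_t pos = alphabet.find(c);
        if (pos == std::string::npos) continue;  // ignore unexpected characters
        buffer = (buffer << 6) | static_cast<uint32_t>(pos);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<uint8_t>((buffer >> bits) & 0xFF));
        }
    }
    return out;
}

// 11xxxxxx: first byte of a multi-byte UTF-8 sequence.
bool isLeadByte(uint8_t b) { return (b & 0xC0) == 0xC0; }

// 10xxxxxx: continuation byte of a multi-byte UTF-8 sequence.
bool isContinuation(uint8_t b) { return (b & 0xC0) == 0x80; }

// Inspect the two payloads from the Subject header in the first post.
void checkOriginalSubject() {
    const std::vector<uint8_t> first = base64Decode(
        "15HXlden16gg15jXldeRINec15vXnCDXlNeX15HXqNeZ150sINee15Qg16DX");
    const std::vector<uint8_t> second = base64Decode("qdee16I=");

    // The first word ends on a lone lead byte (D7) ...
    assert(first.size() == 45 && first.back() == 0xD7);
    assert(isLeadByte(first.back()));
    // ... and the second word starts with its continuation byte (A9).
    assert(second.size() == 5 && second.front() == 0xA9);
    assert(isContinuation(second.front()));
}
```

Calling `checkOriginalSubject()` passes all four assertions: the D7 A9 sequence for U+05E9 is cut exactly at the boundary between the two encoded-words.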

The subject is decoded well if I merge the two lines into one and
remove the =?UTF-8?B? from the second one of course.

Yes, because the two base64 pieces are now one piece and the affected UTF-8
sequence is no longer broken.

I can't see the reason why it is not decoded in Indy, can you?

This is by design in accordance with the RFC 2047, and I have no intention
of changing it at this point. You should contact the author of the sender's
email app and tell them that their software has an encoding bug.

--
Remy Lebeau (TeamB)
John May

Posts: 81
Registered: 6/25/10
Re: TIdMessage Hebrew subject decoding issue
  Posted: Dec 6, 2016 7:41 AM   in response to: Remy Lebeau (Te...
Remy Lebeau (TeamB) wrote:
This is by design in accordance with the RFC 2047, and I have no intention
of changing it at this point. You should contact the author of the sender's
email app and tell them that their software has an encoding bug.

Well, you are the author and I am contacting you because the message was generated by Indy. Try putting
בוקר טוב לכל החברים, מה נשמע


into the subject and use TIdMessageBuilderHtml to generate it...
So not only does Indy generate a malformed header, it is also unable to read it back, while other programs can.

Example code to reproduce it: place two buttons and an Edit box on a form, then use the following for the first and second button.

First button code (saves the Edit box contents to an .eml file):

// Build the message; both body charsets are set to UTF-8.
boost::scoped_ptr<TIdMessage>            IdMsg(new TIdMessage(this));
boost::scoped_ptr<TIdMessageBuilderHtml> IdMsgBldrHtml(new TIdMessageBuilderHtml);
IdMsgBldrHtml->PlainTextCharSet = "utf-8";
IdMsgBldrHtml->HtmlCharSet      = "utf-8";
 
IdMsg->From->Name    = "a";
IdMsg->From->Address = "aa@aa.aa";
IdMsg->Subject       = Edit1->Text; // Put the text in the Edit...
IdMsgBldrHtml->FillMessage(IdMsg.get());
 
// Save the message to disk (warning 8111 is silenced around the helper call).
#pragma warn -8111
Idmessagehelper::TIdMessageHelper_SaveToFile(IdMsg.get(), "testmsg.eml", false, false);
#pragma warn .8111


Second button code (loads Editbox contents from file):

boost::scoped_ptr<TIdMessage>			 IdMsg(new TIdMessage(this));
 
#pragma warn -8111
Idmessagehelper::TIdMessageHelper_LoadFromFile(IdMsg.get(), "testmsg.eml", false, false);
#pragma warn .8111
 
Edit1->Text = IdMsg->Subject;


First button generates this (identical thing to above Subject example):
From: "a" <aa@aa.aa>
Subject:
 =?UTF-8?B?15HXlden16gg15jXldeRINec15vXnCDXlNeX15HXqNeZ150sINee15Qg16DX?=
 =?UTF-8?B?qdee16I=?=
Content-Type: text/plain; charset=us-ascii
Date: Tue, 6 Dec 2016 15:59:05 +0100
 


However, I have to report one other strange thing: on Windows 7 and 10 the code above shows these question marks, while on Windows XP the result is just one letter shorter than the original text (probably the question marks are dropped). This is probably just a small difference in how WinXP and Win7/10 handle invalid codepoints. The end result is probably that Indy generates the header incorrectly and reads it incorrectly, while other programs are able to correct the problem in their decoders.
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: TIdMessage Hebrew subject decoding issue
Correct
  Posted: Dec 6, 2016 10:19 AM   in response to: John May
John wrote:

Well, you are the author and I am contacting you because the message
was generated by Indy.

Then you should have led with that :-p You said the problem was with Indy
decoding an email, not with encoding an email.

In that case, this is a known issue with Indy's EncodeHeader() function:

// TODO: this function needs to take encoded codeunits into account when
// deciding where to split the encoded data between adjacent encoded-words,
// so that a single encoded character does not get split between encoded-words
// thus corrupting that character...

That being said, try using the TIdMessage.OnInitializeISO event to set the
header encoding to 'Q' (quoted-printable) instead of 'B' (base64). The problem
goes away for the example given (though I suppose the issue could still happen
for other cases involving long strings).

--
Remy Lebeau (TeamB)
John May

Posts: 81
Registered: 6/25/10
Re: TIdMessage Hebrew subject decoding issue
  Posted: Dec 6, 2016 2:06 PM   in response to: Remy Lebeau (Te...
Remy Lebeau (TeamB) wrote:
That being said, try using the TIdMessage.OnInitializeISO event to set the
header encoding to 'Q' (quoted-printable) instead of 'B' (base64). The problem
goes away for the example given (though I suppose the issue could still happen
for other cases involving long strings).

I actually wasn't aware myself until I tested it that Indy produces this.

OK my friend, but I want to use B64 encoding as it is more efficient than quoted-printable, and there is not much point in having B64 if it cannot be used.
Can this be fixed? It is a serious deficiency, as it renders Indy incapable of producing proper output... Can I help with that?
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: TIdMessage Hebrew subject decoding issue
  Posted: Dec 6, 2016 2:42 PM   in response to: John May
John wrote:

Can this be fixed?

Hmm, not easily. Right now, Indy charset-encodes the entire header string
to a single byte array, and then qp/base64-encodes the bytes, inserting breaks
where the encoded lines would exceed 75 characters. At that level, charset
information is already gone, so the code doesn't know if a break is being
inserted in the middle of a character sequence or at a character boundary.
To really fix this problem, I suppose Indy's encoder would have to be re-written
to step through the input header string one character at a time, charset-encoding
and byte-encoding along the way, inserting breaks only at character boundaries.
I don't know what that will do to performance, not to mention that Indy
has no charset streaming capabilities. I'm not saying it can't be done,
but I don't have that kind of time to do it myself. Maybe in a few weeks
when I'm on vacation...
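
The boundary-aware approach described above can be sketched in standalone C++ as follows. This is only an illustration under stated assumptions, not Indy's EncodeHeader(): it takes input that is already well-formed UTF-8 (rather than a UTF-16 string), handles only the 'B' encoding, and uses a 45-byte budget per word (75 characters minus the 12 characters of `=?UTF-8?B?` and `?=` leaves room for 60 base64 characters). The names `base64Encode`, `utf8SequenceLength`, `encodeWords` and `demoBoundarySplit` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Encode bytes as standard base64 with '=' padding.
std::string base64Encode(const std::string& bytes) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); i += 3) {
        const std::size_t remaining = bytes.size() - i;
        uint32_t triple = static_cast<uint32_t>(static_cast<uint8_t>(bytes[i])) << 16;
        if (remaining > 1) triple |= static_cast<uint32_t>(static_cast<uint8_t>(bytes[i + 1])) << 8;
        if (remaining > 2) triple |= static_cast<uint32_t>(static_cast<uint8_t>(bytes[i + 2]));
        out += alphabet[(triple >> 18) & 0x3F];
        out += alphabet[(triple >> 12) & 0x3F];
        out += remaining > 1 ? alphabet[(triple >> 6) & 0x3F] : '=';
        out += remaining > 2 ? alphabet[triple & 0x3F] : '=';
    }
    return out;
}

// Length of the UTF-8 sequence introduced by this lead byte.
std::size_t utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;  // invalid lead byte: pass it through as a single unit
}

// Split UTF-8 text into 'B' encoded-words, breaking only between characters.
std::vector<std::string> encodeWords(const std::string& utf8, std::size_t maxBytes = 45) {
    assert(maxBytes >= 4);  // must fit the longest UTF-8 sequence
    std::vector<std::string> words;
    std::string chunk;
    std::size_t i = 0;
    while (i < utf8.size()) {
        std::size_t len = utf8SequenceLength(static_cast<unsigned char>(utf8[i]));
        if (len > utf8.size() - i) len = utf8.size() - i;  // truncated input
        if (!chunk.empty() && chunk.size() + len > maxBytes) {
            words.push_back("=?UTF-8?B?" + base64Encode(chunk) + "?=");
            chunk.clear();
        }
        chunk.append(utf8, i, len);  // the whole character goes into one word
        i += len;
    }
    if (!chunk.empty()) words.push_back("=?UTF-8?B?" + base64Encode(chunk) + "?=");
    return words;
}

// 44 ASCII bytes followed by U+05E9 (D7 A9): the character must not be split.
void demoBoundarySplit() {
    const std::vector<std::string> words = encodeWords(std::string(44, 'a') + "\xD7\xA9");
    assert(words.size() == 2);
    assert(words[0].size() == 72);            // still within the 75-character limit
    assert(words[1] == "=?UTF-8?B?16k=?=");   // D7 A9 kept together
}
```

A byte-oriented splitter would have put D7 into the first word to fill it to 45 bytes; here the first word stops at 44 bytes so that the two-byte character lands whole in the second word.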

The issue only really matters for charsets that can use multiple bytes for
Unicode codepoints, such as UTFs. In your example, you could use ISO-8859-8
or Windows-1255 instead (settable in the TIdMessage.OnInitializeISO event),
which are single-byte charsets for Hebrew.

It is a serious deficiency....

Only for whitespace-delimited strings that are longer than 75 characters
after encoding.

--
Remy Lebeau (TeamB)
John May

Posts: 81
Registered: 6/25/10
Re: TIdMessage Hebrew subject decoding issue
  Posted: Jun 8, 2017 6:18 AM   in response to: Remy Lebeau (Te...
Is there a chance we can expect some kind of fix for this? Or maybe it has been done already?
This is a serious defect in the message encoder: not only does it ruin subject lines, it also ruins attachment names...

Is it possible to use this solution:
https://stackoverflow.com/questions/24651339/indy-message-with-unicode-subject

And fold the header later?

Will the same work on attachment names?

Also, I have re-made the decoder which extracts To/Cc/Bcc emails. Are you interested? (It is C++ code.) It fixes these bugs:
https://github.com/IndySockets/Indy/issues/123
https://github.com/IndySockets/Indy/issues/13

Properly decodes all of these:
//		RawHdr = "abc <=?ISO8859-1?B?YWJjQGV4YW1wbGUuY29t=?=>, =?ISO8859-1?B?YWJjQGV4YW1wbGUuY29t=?=,"
//				 "\"Sm\\\"art\\\"POP@JAM\" <smart@pop.de>,"
//				 "<someuser@somehost.somehost2.com>,"
//				 "<<someuser@somehost.somehost2.com>>,"
//				 "\"name\" <\"user\"@domain>,"
//				 "\"This is \\\\\" <this@another.addr>,"
//				 "\"This is \\\" \\\\\" <this@another.addr>,"
//				 "\"a<,\" <\"a,\"@a.a> (a\\\\,a), "
//				 " \"b\\\\\\\",b\" <b@b.b>,"
//				 " c@c.c, d@d.d (ddd), <e@e.e>, (fff) f@f.f, (ggg)111<g@g.g>,222<h@h.h>(hhh), (iii)333<i@i.i>(jjj), kkk<k@k.k>(11),lll<l(22)@l.l>    ,   ,   <>,   ,   (aaa), @, a@, aa@, @a, @aa, a@a, \"a@a\""
//				 ;
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: TIdMessage Hebrew subject decoding issue
  Posted: Jun 8, 2017 12:22 PM   in response to: John May
John May wrote:

Is there a chance we can expect some kind of fix for this?

Are you referring to this issue?

https://github.com/IndySockets/Indy/issues/157

This is a serious defect in the message encoder and not only it ruins
the subject lines but also it ruins attachment names...

In what way exactly? Please be more specific. Can you provide an
example?

Is it possible to use this solution:

https://stackoverflow.com/questions/24651339/indy-message-with-unicode-subject

That "solution" only applies to pre-Unicode versions of
Delphi/C++Builder/FreePascal, not to Unicode versions.

And fold the header later?

No.

Will the same work on attachment names?

Any "solution" would apply to all headers that are encoded using the
EncodeHeader() function.

Also, I have re-made the decoder which extracts To/Cc/Bcc emails -
are you interested (it is C++ code). It fixes these bugs:

https://github.com/IndySockets/Indy/issues/123
https://github.com/IndySockets/Indy/issues/13

I can review whatever you choose to offer.

--
Remy Lebeau (TeamB)
John May

Posts: 81
Registered: 6/25/10
Re: TIdMessage Hebrew subject decoding issue
  Posted: Jun 9, 2017 7:25 AM   in response to: Remy Lebeau (Te...
Remy Lebeau (TeamB) wrote:
Are you referring to this issue?
https://github.com/IndySockets/Indy/issues/157

Yes, that's the one. No need for further examples, the ones there cover it just fine; it is the same issue. Any chance of having a fix for this critical issue?
It would be more than welcome to fix the existing issues before adding new enhancements!

Any "solution" would apply to all headers that are encoded using the
EncodeHeader() function.

But the problem occurs during folding, not during encoding?
Is the preparation (character counting) done by using UnicodeString::ByteType function (to fold by the character and not split character in half)?
Maybe I am ignorant, but it doesn't seem like much of a challenge to implement this and fix the above bug?
Or perhaps a conversion into UTF32 and then counting offsets?

I can review whatever you choose to offer.

I'll send via PM.
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: TIdMessage Hebrew subject decoding issue
  Posted: Jun 9, 2017 9:36 AM   in response to: John May
John May wrote:

Yes, that's the one, no need for further examples, the ones cover it
just fine, it is the same issue. Any chance of having a fix for this
critical issue?

Eventually, yes. But I have no ETA on that.

But the problem occurs during folding, not during encoding?

They are related, because the encoder needs to know where the folding
will occur so the appropriate delimiters can be placed at the correct
positions. If encoding and folding were to be separated, the folder
would have to decode and re-encode the data. Unless the encoding
process were moved into the TIdHeaderList class directly when its
QuoteType is QuoteMIME and its FoldLines property is true. But that is
a much bigger re-write than simply fixing the EncodeHeader() function
to not split multi-byte codepoint sequences anymore.

Is the preparation (character counting) done by using
UnicodeString::ByteType function (to fold by the character and not
split character in half)?

At this time, neither encoding nor folding logic takes ByteType into
account, no. It is not a big deal for folding, because MIME headers
are ASCII-only, so splitting can occur on any character. It is
RFC2047-style MIME encoding (what EncodeHeader() implements) that
needs to be made to be more codepoint-aware.

Maybe I am ignorant, but it doesn't seem like much of a challenge to
implement this and fix the above bug?

It is not so much a matter of challenge as of time. I have very
little free time available to work on projects outside of my day job.
I have a LOT of things that need to be worked on in Indy, I get to them
whenever I can. This issue has been on my TODO list for a long while.

Or perhaps a conversion into UTF32 and then counting offsets?

Something like that, yes. But not in UTF-32 itself. The encoder's
processing already begins in UTF-16, which is sufficient for Unicode
handling, so it would be more a matter of scanning through the UTF-16
codeunit sequences, charset-encoding and MIME encoding each sequence
individually and inserting folds at the appropriate lengths.

Part of the problem is that the encoder currently charset-encodes the
entire UTF-16 string into a single byte array, and then MIME-encodes
those bytes. Any information about where each Unicode codepoint exists
in the byte array is lost, so accurate folding is not possible. That
would have to change, but probably at the cost of slower performance
since charset-encoding would be applied on a per-codepoint basis
instead of a per-string basis.

--
Remy Lebeau (TeamB)
John May

Posts: 81
Registered: 6/25/10
Re: TIdMessage Hebrew subject decoding issue
  Posted: Jun 13, 2017 9:32 AM   in response to: Remy Lebeau (Te...
After examining this for a LONG time (a few days) I found one important thing: this is not an issue with the encoder but with the DECODER!
Indy-encoded data is read just fine in other mail programs. I tested a bunch.

Encoding as a stream of bytes (after converting to UTF-8) seems to be just fine for Base64.

Again, the issue is NOT in the encoder. The encoder seems to work fine: all the examples on https://github.com/IndySockets/Indy/issues/157 are readable in various mail programs.

However, for this example from the above page I do not get the same result:
であるか、シェイクスピア1606質問ですそのようにしない。彼の本から引用
 
//RESULT - this is actual result I get when running through Indy encoder - this one decodes well in other email programs but does not decode well in indy, but it is obviously WELL ENCODED
=?UTF-8?B?44Gn44GC44KL44GL44CB44K344Kn44Kk44Kv44K544OU44KiMTYwNuizquWV?=
=?UTF-8?B?j+OBp+OBmeOBneOBruOCiOOBhuOBq+OBl+OBquOBhOOAguW9vOOBruacrOOB?=
=?UTF-8?B?i+OCieW8leeUqA==?=
 
//Supposed RESULT FROM as suggested by https://github.com/IndySockets/Indy/issues/157 - I don't get this result - this one does not decode well in other email programs
=?UTF-8?B?44Gn44GC44KL44GL44CB44K344Kn44Kk44Kv44K544OU44KiMTYwNuizqu+/?=
=?UTF-8?B?ve+/veOBp+OBmeOBneOBruOCiOOBhuOBq+OBl+OBquOBhOOAguW9vOOBruac?=
=?UTF-8?B?rO+/ve+/veOCieW8leeUqA==?=
 
// THE ABOVE IS NOT THE RESULT which Indy gets - is it because the encoder was changed but the issue on the above page wasn't updated, can't tell


Clearly - the encoder is working FINE, only the decoder is problem here.

So, is there a possibility to fix the decoder which PROBABLY doesn't merge the lines before decoding them but decodes them line by line and then merges strings... which is clearly not what other email programs do - as they decode the string well.

Also consider these examples:

// The Subject: is generated by encoding であるか、シェイクスピア1606質問ですそのようにしない。彼の本から引用 in Indy
UnicodeString TEST =	"Subject:"
					" =?UTF-8?B?44Gn44GC44KL44GL44CB44K344Kn44Kk44Kv44K544OU44KiMTYwNuizquWV?="
					" =?UTF-8?B?j+OBp+OBmeOBneOBruOCiOOBhuOBq+OBl+OBquOBhOOAguW9vOOBruacrOOB?="
					" =?UTF-8?B?i+OCieW8leeUqA==?=";
 
Memo1->Lines->Add(DecodeHeader(TEST)); // Decoded incorrectly - であるか、シェイクスピア1606質��ですそのようにしない。彼の本��ら引用
// But it IS CORRECTLY decoded in other email programs.
 
 
// Just an unfolded and merged version of the same thing above (with the inner ?= and =?UTF-8?B? removed):
UnicodeString TEST2 = "Subject: =?UTF-8?B?44Gn44GC44KL44GL44CB44K344Kn44Kk44Kv44K544OU44KiMTYwNuizquWVj+OBp+OBmeOBneOBruOCiOOBhuOBq+OBl+OBquOBhOOAguW9vOOBruacrOOBi+OCieW8leeUqA==?=";
 
Memo1->Lines->Add(DecodeHeader(TEST2)); // Decoded correctly - であるか、シェイクスピア1606質問ですそのようにしない。彼の本から引用


So the problem is only in how decoder merges the above folded lines. Could be unfolding issue.

I could write a code to properly unfold and merge things if they start with the same encoding e.g. if =?UTF-8?B? is in a few lines then merge that as a single string... if you are low on time... or is the fix simpler than that?
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: TIdMessage Hebrew subject decoding issue
  Posted: Jun 13, 2017 10:08 AM   in response to: John May
John May wrote:

Again, the issue is NOT in the encoder - encoder seems to work fine

Yes, actually, the ENCODER has problems, as described in issue #157.

all the examples on https://github.com/IndySockets/Indy/issues/157
are readable in various mail programs.

Because their DECODERS are more lenient to errors in bad ENCODINGS.

Indy is fairly strict when it comes to RFC compliance, it is not very
lenient to bad input data (and this affects many aspects of Indy, not
just MIME encodings). In this case, the DECODER currently does not
decode the 1st and 3rd examples shown in issue #157 (oddly, it does
decode the 2nd example fine). And that is because it decodes each MIME
encoded-word individually into a String, and then concats the strings
together. If a codepoint is split between multiple MIME encoded-words,
the trailing and leading characters of adjacent encoded-words get
decoded as U+FFFD instead of being pieced together before decoding.
MIME encoded-words are supposed to be self-contained, the RFC does not
provide for handling errors in bad encodings. Some implementations
handle it, some don't.
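
The effect of decoding each piece on its own can be reproduced with a small standalone C++ sketch. This is not Indy's decoder: `decodeUtf8` and `demoSplitShin` are hypothetical names, and the decoder ignores overlong forms and surrogates for brevity. It replaces every malformed or incomplete sequence with U+FFFD, then compares the two halves of a split character with the merged bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decode UTF-8 into code points, emitting U+FFFD for each byte that cannot
// start or complete a valid sequence (simplified: no overlong/surrogate checks).
std::u32string decodeUtf8(const std::vector<uint8_t>& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const uint8_t lead = bytes[i];
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray continuation byte

        bool ok = i + len <= bytes.size();  // false if the sequence is cut off
        for (std::size_t k = 1; ok && k < len; ++k) {
            if ((bytes[i + k] & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (bytes[i + k] & 0x3F);
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// The bytes around the break in the Hebrew subject from the first post.
void demoSplitShin() {
    // End of the first encoded-word: U+05E0, then a lone lead byte D7.
    const std::u32string head = decodeUtf8({0xD7, 0xA0, 0xD7});
    assert(head.size() == 2 && head[0] == 0x05E0 && head[1] == 0xFFFD);

    // Start of the second encoded-word: a lone continuation byte A9, then U+05DE.
    const std::u32string tail = decodeUtf8({0xA9, 0xD7, 0x9E});
    assert(tail.size() == 2 && tail[0] == 0xFFFD && tail[1] == 0x05DE);

    // The same two bytes decoded together give the intended character.
    const std::u32string whole = decodeUtf8({0xD7, 0xA9});
    assert(whole.size() == 1 && whole[0] == 0x05E9);
}
```

Decoded separately, each half contributes one U+FFFD, which matches the two replacement characters seen in the decoded subject.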

The problem is only in decoder and not even DecodeHeader function.

Wrong. The problem is in the ENCODER. It is still not splitting the
input data correctly when creating MIME encoded-words.

であるか、シェイクスピア1606質問ですそのようにしない。彼の本から引用

//RESULT REAL - this is actual result I get when running through Indy
encoder:
=?UTF-8?B?44Gn44GC44KL44GL44CB44K344Kn44Kk44Kv44K544OU44KiMTYwNuizquWV?=
=?UTF-8?B?j+OBp+OBmeOBneOBruOCiOOBhuOBq+OBl+OBquOBhOOAguW9vOOBruacrOOB?=
=?UTF-8?B?i+OCieW8leeUqA==?=

I get the same result, and that result is still WRONG, as evident by
DECODING it produces this result:

'であるか、シェイクスピア1606質��ですそのようにしない。彼の本��ら引用'


See how '問' becomes '��' and 'か' becomes '��'. That is because those
codepoints are split incorrectly during ENCODING, not during DECODING.

//Supposed RESULT FROM
https://github.com/IndySockets/Indy/issues/157 - I don't get this
result
=?UTF-8?B?44Gn44GC44KL44GL44CB44K344Kn44Kk44Kv44K544OU44KiMTYwNuizqu+/?=
=?UTF-8?B?ve+/veOBp+OBmeOBneOBruOCiOOBhuOBq+OBl+OBquOBhOOAguW9vOOBruac?=
=?UTF-8?B?rO+/ve+/veOCieW8leeUqA==?=

// THE ABOVE IS NOT THE RESULT which Indy gets

It is the result that Indy used to get at the time issue #157 was
first written. There have been changes made since then, but the root
issue still remains open. I have now updated the examples in ticket
#157 to reflect the current results.

is it because the encoder was changed but the issue on the above page
wasn't updated

Yes.

Clearly - the encoder is working FINE

No, it is not.

only the decoder is problem here.

No, it is not.

--
Remy Lebeau (TeamB)

John May

Posts: 81
Registered: 6/25/10
Re: TIdMessage Hebrew subject decoding issue
  Posted: Jun 23, 2017 6:53 AM   in response to: Remy Lebeau (Te...
Sorry, but I still think the decoder should treat these as continuous stream.

Examples in https://www.ietf.org/rfc/rfc2047.txt

   (=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)     (ab)
 
           White space between adjacent 'encoded-word's is not
           displayed.
 
   (=?ISO-8859-1?Q?a?=  =?ISO-8859-1?Q?b?=)    (ab)
 
        Even multiple SPACEs between 'encoded-word's are ignored
        for the purpose of display.
 
   (=?ISO-8859-1?Q?a?=                         (ab)
       =?ISO-8859-1?Q?b?=)
 
           Any amount of linear-space-white between 'encoded-word's,
           even if it includes a CRLF followed by one or more SPACEs,
           is ignored for the purposes of display.


Even though above examples do work - if the decoder would treat these as continuous stream, the top message example would be decoded correctly.

There are also additional experimental RFCs which propose the use of UTF-8 without encoding in message headers (https://tools.ietf.org/html/rfc5335). So US-ASCII may not be a good choice of default, especially because codepoints above 0x7F are not defined in it and show up as question marks in emails that (admittedly incorrectly) label Windows-1252 text as US-ASCII. That incorrect use is relatively COMMON, however, so Windows-1252 should be considered as a default for the decoder.

Even furthermore - US-ASCII is completely unnecessary as a default charset. It can be entirely replaced with Windows-1252 with practically no drawbacks. Windows-1252 will behave much better due to more codepoints being defined and especially due to the use of smart quotes 0x91 and 0x92. I tested this on a number of subjects and found zero benefits from keeping US-ASCII as a default in the decoder. It is also a better choice than ISO 8859-1; this is explained in https://en.wikipedia.org/wiki/Windows-1252

I understand your viewpoint that the messages should be well encoded. But the reality is - a certain percentage is not. I still think Indy as a whole is a great tool and you've done fantastic job (along with other developers) on it - but for the purpose of decoding messages, it is simply not flexible enough as it gives no choice to user to "force" the encoding and always assumes its own decoding.

But - don't spend more time on this - I've started working on my own decoder which handles all the misbehaving headers and also works faster than Indy's.
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: TIdMessage Hebrew subject decoding issue
Helpful
  Posted: Jun 23, 2017 11:29 AM   in response to: John May
John May wrote:

Sorry, but you are wrong. The decoder is not working properly.

I know for a fact that the decoder works just fine (provided the input
is not malformed, which includes codepoints split across encoded-words,
which is strictly forbidden by RFC 2047).


Those examples you quoted decode fine for me. They all return '(ab)',
as documented. The decoder has been tested against the RFC's examples
(and many more) over the years.

Any white spaces between encoded words should be ignored.

Which the decoder does. That is not the issue.

If that would be the case, *even the incorrectly split codepoints
would behave properly* if they would be accepted as single stream of
data (and they do in all the mail clients I've tested).

Not true.

By RFC 2047's definition, encoded-words must be **self-contained**.
Data cannot span across multiple adjacent encoded-words. The RFC is
very strict about that (and Indy's encoder is not fully compliant with
that rule).

The 'encoded-text' in an 'encoded-word' must be self-contained;
'encoded-text' MUST NOT be continued from one 'encoded-word' to
another. This implies that the 'encoded-text' portion of a "B"
'encoded-word' will be a multiple of 4 characters long; for a "Q"
'encoded-word', any "=" character that appears in the 'encoded-text'
portion will be followed by two hexadecimal characters.

Each 'encoded-word' MUST encode an integral number of octets. The
'encoded-text' in each 'encoded-word' must be well-formed according
to the encoding specified; the 'encoded-text' may not be continued in
the next 'encoded-word'. (For example, "=?charset?Q?=?=
=?charset?Q?AB?=" would be illegal, because the two hex digits "AB"
must follow the "=" in the same 'encoded-word'.)

Each 'encoded-word' MUST represent an integral number of characters.
A multi-octet character may not be split across adjacent 'encoded-
word's.

...

Any 'encoded-word's so recognized are decoded, and if possible, the
resulting **unencoded text** is displayed in the original character
set.

Adjacent encoded-words may have different charsets and byte encodings.
And in fact, the very first example in RFC 2047 Section 8 shows exactly
that. In the "Subject" header, there are 2 adjacent encoded-words,
where the first one uses ISO-8859-1 and the second uses ISO-8859-2.

So the decoder is definitely not working as defined by the RFC.

Actually, it is working exactly as RFC 2047 defines.

If decoder would do them as a single, continuous stream, the above
example (in the first post) would work properly as it does in various
email clients which respect the RFC2047.

RFC 2047 requires the content of an encoded-word to be decoded as-is
before being displayed in the final output. It does not allow
codepoints erroneously split across adjacent encoded-words to be
recoverable. In practice, that may be possible, but that is not how
RFC 2047 defines it.

RFC 2047 does not have any provisions to allow adjacent encoded-words
using the same charset to be decoded into their respective raw bytes,
then merged together into a single byte buffer before finally
charset-decoding the bytes. That is what you are proposing as a "fix"
to decode erroneously-split codepoints. Indy could feasibly
implement that to address the split issue, but that is not a
MIME-compliant solution, and will not work when different charsets are
used.
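
For illustration only, the lenient merge described above might look like the following standalone C++ sketch. It is not Indy code and, as noted, not what RFC 2047 specifies: it handles only 'B' encoded-words, merges raw bytes only when adjacent words carry the same charset label, and leaves the final charset decoding to the caller. The names `ByteRun`, `parseBase64Word`, `mergeAdjacentWords` and `demoMerge` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

struct ByteRun {
    std::string charset;         // lower-cased charset label
    std::vector<uint8_t> bytes;  // raw bytes, not yet charset-decoded
};

// Decode standard base64; stops at the first '=' padding.
std::vector<uint8_t> base64Decode(const std::string& in) {
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::vector<uint8_t> out;
    uint32_t buffer = 0;
    int bits = 0;
    for (char c : in) {
        if (c == '=') break;
        const std::size_t pos = alphabet.find(c);
        if (pos == std::string::npos) continue;
        buffer = (buffer << 6) | static_cast<uint32_t>(pos);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<uint8_t>((buffer >> bits) & 0xFF));
        }
    }
    return out;
}

// Parse "=?charset?B?text?=" into its charset and raw payload bytes.
ByteRun parseBase64Word(const std::string& word) {
    const std::size_t n = word.size();
    if (n < 8 || word.compare(0, 2, "=?") != 0 || word.compare(n - 2, 2, "?=") != 0)
        throw std::invalid_argument("not an encoded-word");
    const std::size_t q1 = word.find('?', 2);  // '?' that ends the charset
    if (q1 == std::string::npos || q1 + 3 > n - 2 || word[q1 + 2] != '?')
        throw std::invalid_argument("malformed encoded-word");
    if (word[q1 + 1] != 'B' && word[q1 + 1] != 'b')
        throw std::invalid_argument("only the B encoding is handled in this sketch");
    ByteRun run;
    run.charset = word.substr(2, q1 - 2);
    std::transform(run.charset.begin(), run.charset.end(), run.charset.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    run.bytes = base64Decode(word.substr(q1 + 3, n - 2 - (q1 + 3)));
    return run;
}

// Concatenate the raw bytes of adjacent words that share a charset.
std::vector<ByteRun> mergeAdjacentWords(const std::vector<std::string>& words) {
    std::vector<ByteRun> runs;
    for (const std::string& word : words) {
        ByteRun run = parseBase64Word(word);
        if (!runs.empty() && runs.back().charset == run.charset) {
            runs.back().bytes.insert(runs.back().bytes.end(),
                                     run.bytes.begin(), run.bytes.end());
        } else {
            runs.push_back(run);
        }
    }
    return runs;
}

void demoMerge() {
    // The two encoded-words from the first post become one 50-byte run in
    // which D7 A9 (U+05E9) is contiguous again.
    const std::vector<std::string> subject = {
        "=?UTF-8?B?15HXlden16gg15jXldeRINec15vXnCDXlNeX15HXqNeZ150sINee15Qg16DX?=",
        "=?UTF-8?B?qdee16I=?="};
    const std::vector<ByteRun> merged = mergeAdjacentWords(subject);
    assert(merged.size() == 1 && merged[0].bytes.size() == 50);
    assert(merged[0].bytes[44] == 0xD7 && merged[0].bytes[45] == 0xA9);

    // Words with different charsets are left as separate runs.
    const std::vector<std::string> mixedWords = {"=?ISO-8859-1?B?YQ==?=",
                                                 "=?ISO-8859-2?B?Yg==?="};
    assert(mergeAdjacentWords(mixedWords).size() == 2);
}
```

The second case in `demoMerge` shows the limitation mentioned above: when adjacent encoded-words use different charsets there is nothing to merge, so a character split across them stays broken.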

Furthermore - there are additional experimental RFC-s which propose
the use of UTF-8 without encoding in message headers
(https://tools.ietf.org/html/rfc5335).

I was not aware of that RFC. Though it doesn't matter since it
requires cooperation with the SMTP 'UTF8SMTP' extension, which Indy
does not implement yet.

However, Indy's encoder does support the ability to encode a header as
raw UTF-8, by using the TIdMessage.OnInitializeISO event to set the
VCharset parameter to `utf-8` and the VHeaderEncoding parameter to '8'
(8bit) rather than 'Q' (quoted-printable) or 'B' (base64).

Even furthermore - US-ASCII is completely unnecessary as default
charset. It can be entirely replaced with Windows-1252. Windows-1252
will behave much better due to more codepoints defined and especially
due to use of smart quotes 0x91 and 0x92. I tested this on a number
of subjects and found zero benefits from keeping US-ASCII as a
default in decoder.

The decoder doesn't have a default charset of its own. Outside of MIME
encoded-words, which specify their own charsets, raw bytes are decoded
to Unicode text using Indy's global charset (see the
GIdDefaultTextEncoding variable in the IdGlobal unit, which yes, is
intentionally US-ASCII by default) if not overridden by TIdIOHandler
(via the DefStringEncoding property, or the AByteEncoding parameter of
various reading methods).

--
Remy Lebeau (TeamB)