Thread: Ansistring / RawByteString code page madness




Arthur Hoornweg

Posts: 414
Registered: 6/2/98
Ansistring / RawByteString code page madness  
  Posted: Dec 14, 2017 2:21 AM
Hello all,

I've just stumbled upon something really weird in Delphi XE. As we all know, AnsiStrings are 8-bit strings that carry a code page. Assigning an AnsiString with a known code page X to another AnsiString with code page Y causes a code page conversion of the data. Now consider the following test procedure:


procedure TForm1.Button1Click(Sender: TObject);
var
  a: UTF8String;
  b: RawByteString;
  c: AnsiString;
begin
  a := 'üöä';
  c := a;
  b := a;
  ShowMessage(Format('Code page of A is %u,length is %d, contents=%s', [StringCodePage(a), Length(a), a]));
  ShowMessage(Format('Code page of B is %u,length is %d, contents=%s', [StringCodePage(b), Length(b), b]));
  ShowMessage(Format('Code page of C is %u,length is %d, contents=%s', [StringCodePage(c), Length(c), c]));
  c := b;
  ShowMessage(Format('Code page of C is %u,length is %d, contents=%s', [StringCodePage(c), Length(c), c]));
end;

On my system, the assignment "c:=a" behaves as expected, it converts the UTF8 string to code page 1252 with length=3.

The assignment "b:=a" also works as expected, it keeps the UTF8 codepage intact and also the length of 6. After this assignment, B has code page UTF8 which can be verified using StringCodePage(B). The string contents are displayed correctly so Delphi is using the correct code page.

But the assignment "c:=b" does something really weird, it behaves differently from "c:=a". Instead of performing the same UTF8->CP1252 conversion as before, C suddenly gets the UTF8 code page and length 6.

What's going on here? Strings A and B have the same code page, which I can verify using function StringCodePage(). So IMHO any assignments having A or B as source should produce the same result. Can anyone enlighten me?

Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: Ansistring / RawByteString code page madness  
  Posted: Dec 14, 2017 10:52 AM   in response to: Arthur Hoornweg
Arthur Hoornweg wrote:

But the assignment "c:=b" does something really weird, it behaves
differently from "c:=a". Instead of performing the same UTF8->CP1252
conversion as before, C suddenly gets the UTF8 code page and length 6.

Assigning a RawByteString to an AnsiString does not perform a codepage
conversion, it just copies the data pointer and increments its
reference count, just like when assigning an AnsiString to another
AnsiString.

You really shouldn't be using RawByteString outside of function
parameters. It is not designed to be a standalone string type like the
other string types. It is a helper type to facilitate writing
codepage-agnostic functions.
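
As a rough illustration of a codepage-agnostic function, here is a minimal sketch, assuming a Windows Delphi 2009+ compiler. CountHighBytes is a hypothetical helper invented for this example; it is not from the RTL or from this thread. Its RawByteString parameter should accept both argument types below without forcing a conversion at the call.

```pascal
program RawParamSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper (an assumption for illustration only): counts the
// bytes >= $80 in S. Because the parameter is a const RawByteString, any
// AnsiString(N) value can be passed and its bytes are inspected as-is.
function CountHighBytes(const S: RawByteString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if Ord(S[I]) >= $80 then
      Inc(Result);
end;

var
  U: UTF8String;
  A: AnsiString;
begin
  // U+00FC, U+00F6, U+00E4 written as escapes, so the result does not
  // depend on the encoding of the source file.
  U := #$00FC#$00F6#$00E4;
  // Different compile-time affinities: this assignment converts the data
  // to the system ANSI code page.
  A := AnsiString(U);

  Writeln('UTF8String: code page ', StringCodePage(U),
          ', length ', Length(U), ', high bytes ', CountHighBytes(U));
  Writeln('AnsiString: code page ', StringCodePage(A),
          ', length ', Length(A), ', high bytes ', CountHighBytes(A));
end.
```

The values printed for the AnsiString depend on the system code page, so no expected output is given here.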

--
Remy Lebeau (TeamB)
Arthur Hoornweg

Posts: 414
Registered: 6/2/98
Re: Ansistring / RawByteString code page madness  
  Posted: Dec 15, 2017 12:39 AM   in response to: Remy Lebeau (Te...
Remy Lebeau (TeamB) wrote:

Assigning a RawByteString to an AnsiString does not perform a codepage
conversion, it just copies the data pointer and increments its
reference count, just like when assigning an AnsiString to another
AnsiString.

Hi Remy,

There's something I don't understand.

Rawbytestrings normally don't have a code page. In my test code, I assign a string with a well-defined code page to the rawbytestring and afterwards I test if the rawbytestring has a code page. And guess what? It does.

So after the assignment it's like any other ansistring with a code page and it should be completely indistinguishable from other ansistrings. But it turns out that that's not the case. So inside the memory layout of ansistring there must be an undocumented field somewhere that holds the desired/pre-defined code page as opposed to the currently assigned code page. Or how else should Delphi know that this string requires special treatment?

[Edit]

I just did a disassembly and I see that Delphi inserts the target code page as a literal constant in the ECX register when it converts from one code page to another. First the source string gets converted to UTF16, then the widestring gets converted to the target code page. So the code page affinity of the string is not stored inside the string variable itself. The compiler only "knows" the codepage affinity if the string is in direct scope.

That has interesting implications. For example, the compiler won't let me pass utf8string, rawbytestring or any other string with a pre-defined codepage to the following procedure:

Procedure  dosomething (VAR b:Ansistring);
Begin
  b:='test';
End;


... but it is possible to write overloaded procedures for the desired code pages. Somehow that's messy.

Edited by: Arthur Hoornweg on Dec 15, 2017 1:20 AM
Lajos Juhasz

Posts: 801
Registered: 3/14/14
Re: Ansistring / RawByteString code page madness  
  Posted: Dec 15, 2017 1:29 AM   in response to: Arthur Hoornweg
Arthur Hoornweg wrote:

Remy Lebeau (TeamB) wrote:

Assigning a RawByteString to an AnsiString does not perform a
codepage conversion, it just copies the data pointer and increments
its reference count, just like when assigning an AnsiString to
another AnsiString.

Hi Remy,

There's something I don't understand.

Rawbytestrings normally don't have a code page. In my test code, I
assign a string with a well-defined code page to the rawbytestring
and afterwards I test if the rawbytestring has a code page. And guess
what? It does.

So after the assignment it's like any other ansistring with a code
page and it should be completely indistinguishable from other
ansistrings. But it turns out that that's not the case. So inside
the memory layout of ansistring there must be an undocumented field
somewhere that holds the desired/pre-defined code page as opposed
to the currently assigned code page. Or how else should Delphi know
that this string requires special treatment?


It's documented that every AnsiString has an associated code page stored
at a negative offset. It's also documented that when you assign an
AnsiString to a RawByteString, the code page is copied to the
RawByteString as well. That's the reason why you should use RawByteString
only as a parameter of a function/method. There you can take an AnsiString
with any associated code page, process it, and return the value as an
AnsiString with the same code page.
Arthur Hoornweg

Posts: 414
Registered: 6/2/98
Re: Ansistring / RawByteString code page madness  
  Posted: Dec 15, 2017 1:58 AM   in response to: Lajos Juhasz
Lajos Juhasz wrote:

It's documented that every AnsiString has an associated code page stored
at a negative offset.

NO ! That's just the code page of the string's contents. It's not the codepage affinity of the string. If the string is totally empty then the pointer is NIL and still the compiler knows what code page the string defaults to.

So it's not so that all Ansistrings are created equal and have just a different value in some codepage field. The compiler treats them as different types and the code page affinity is hardwired into the generated assembler code, not into the variable itself. If you have a procedure with a VAR parameter of type Ansistring, you can't even call the procedure passing a variable of type Rawbytestring or UTF8String, it won't compile.


My reason for messing with this stuff in the first place is that I need to convert some older Delphi 2007 code that uses Turbopower Lockbox 2 (an encryption library) to a newer Delphi version without breaking compatibility with existing data. The Lockbox code uses Ansistrings and the last thing I need is the encryption breaking due to some auto-magic codepage conversion. Therefore I want to know exactly what's going on under the hood when I pass Ansistrings or Rawbytestrings to and fro.

Lajos Juhasz

Posts: 801
Registered: 3/14/14
Re: Ansistring / RawByteString code page madness  
  Posted: Dec 15, 2017 8:46 AM   in response to: Arthur Hoornweg
Arthur Hoornweg wrote:

The compiler treats them as different types and the code page
affinity is hardwired into the generated assembler code, not into the
variable itself.

Are you sure about this? Embarcadero documents it this way
(http://docwiki.embarcadero.com/Libraries/Tokyo/en/System.AnsiString):

The AnsiString structure contains a 32-bit length indicator, a 32-bit
reference count, a 16-bit data length indicating the number of bytes
per character, and a 16-bit code page. This code page is set, by
default, to the operating system's code page. It can be changed by
calling SetMultiByteConversionCodePage.
Arthur Hoornweg

Posts: 414
Registered: 6/2/98
Re: Ansistring / RawByteString code page madness  
  Posted: Dec 18, 2017 1:14 AM   in response to: Lajos Juhasz
Lajos Juhasz wrote:

Are you sure about this? Embarcadero documents it this way
(http://docwiki.embarcadero.com/Libraries/Tokyo/en/System.AnsiString):

The AnsiString structure contains a 32-bit length indicator, a 32-bit
reference count, a 16-bit data length indicating the number of bytes
per character, and a 16-bit code page. This code page is set, by
default, to the operating system's code page. It can be changed by
calling SetMultiByteConversionCodePage.


I am absolutely sure. Because the structure mentioned by you only exists when the string has contents. If the string has no contents there is no data structure and the variable is a NIL pointer.

When the string gets a value, the runtime has to know which code page the string defaults to (=code page affinity). And that codepage is just a hardwired constant that you define yourself when you declare the string variable. Normally it defaults to 0 (=codepage of the running system) but you can select different values. But it is just a constant.

Look at this example (my system code page is 1252).

Type Str1250=Type Ansistring(1250);
 
VAR a,b:Ansistring;   //Two strings with codepage affinity 0 (=Default )
        c: Str1250; //String with codepage affinity 1250
Begin
    a:= 'üüü';    //CP=1252, length is 3
    a:= utf8encode('üüü');    //CP=65001 , length is 6
 
    b:=a;    //  B points to same data as A (same affinity) , CP=65001, length is 6
    c:=a;    //  CP=1250  (conversion 65001->1250), length=3
    a:=c;   //  CP=1252  (conversion 1250 -> 1252)
    b:=a;   //  B points to same data as A (same affinity) CP=1252
End;
 
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: Ansistring / RawByteString code page madness  
  Posted: Dec 18, 2017 11:11 AM   in response to: Lajos Juhasz
Lajos Juhasz wrote:

Are you sure about this? Embarcadero documents it this way
(http://docwiki.embarcadero.com/Libraries/Tokyo/en/System.AnsiString):

What Embarcadero documents is the same as what Lajos stated.

The codepage field inside a non-empty string indicates the actual
codepage of the character data at runtime, not the string's affinity at
compile-time (the two are usually the same, but the field can be
overwritten with SetCodePage() at runtime).

--
Remy Lebeau (TeamB)
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: Ansistring / RawByteString code page madness [Edit]  
  Posted: Dec 18, 2017 11:09 AM   in response to: Arthur Hoornweg
Arthur Hoornweg wrote:

Rawbytestrings normally don't have a code page.

At compile-time, yes. Assigning strings with different compile-time
codepage affinities will force a conversion at runtime instead.

In my test code, I assign a string with a well-defined code page to
the rawbytestring and afterwards I test if the rawbytestring has a
code page. And guess what? It does.

Yes, because RawByteString is special in that it has no codepage
affinity at compile-time. It inherits the codepage of whatever source
string is assigned to it at runtime. Calling StringCodePage() on a
non-empty RawByteString returns whatever codepage is stored in the
string's allocated StrRec header. If the string is empty, there is no
data allocated, so StringCodePage() will simply return
DefaultSystemCodePage instead.
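
To check this on a given compiler, a small probe along the following lines could be used. It is a minimal sketch, assuming a Windows Delphi 2009+ console project; the program and variable names are made up for illustration, and the output is deliberately not stated because it depends on the system.

```pascal
program CodePageProbe;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  U: UTF8String;
  R: RawByteString;
begin
  // R is still empty here, so no string data has been allocated for it
  // and there is no header to read a code page from.
  Writeln('Empty RawByteString reports : ', StringCodePage(R));
  Writeln('DefaultSystemCodePage       : ', DefaultSystemCodePage);

  // ASCII content stored in a UTF8String.
  U := 'abc';
  // No conversion: R now refers to the same data block as U.
  R := U;
  Writeln('RawByteString after R := U  : ', StringCodePage(R));
  Writeln('UTF-8 code page number      : ', 65001);
end.
```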

So after the assignment it's like any other ansistring with a code
page and it should be completely indistinguishable from other
ansistrings.

At runtime, yes, because the RawByteString is just pointing to the same
data block as the source string, incrementing the data's refcount. But
RawByteString is still its own data type separate from other string
types. And the compiler knows that.

But it turns out that that's not the case. So inside the memory
layout of ansistring there must be an undocumented field somewhere
that holds the desired/pre-defined code page as opposed to the
currently assigned code page.

Nope. There is a known and documented field in the StrRec header for
the actual codepage of the character data. There is nothing else
hidden. StringCodePage() returns that codepage field at runtime.

Or how else should Delphi know that this string requires special
treatment?

Um, because the compiler knows what RawByteString is (it is still a
unique string type, after all) and has special rules for RawByteString
that it doesn't have for other string types.

I just did a disassembly and I see that Delphi inserts the target
code page as a literal constant in the ECX register when it converts
from one code page to another.

When assigning a string of one type to a string of another type, the
compiler knows the codepage affinity of the destination string, so it
tells the RTL to convert the source data from its current codepage to
that destination codepage. Except in the case when RawByteString is
the destination, then there is no conversion (unless the source string
is a (Unicode|Wide)String, then it has to convert to a temp AnsiString
first before then doing the assignment to RawByteString).

First the source string gets converted to UTF16, then the widestring
gets converted to the target code page.

Yes, that is how a codepage conversion works, to avoid data loss as
much as possible, in case both codepages support the same characters
just in different formats.

So the code page affinity of the string is not stored inside the
string variable itself.

The affinity itself, no. Nor does it need to (though it is possible to
get it with RTTI at runtime). Assigning data to a string with a given
codepage affinity will force the data to that codepage at runtime, and
that codepage is stored in the string's StrRec header for later
conversions to use.

For example, the compiler won't let me pass utf8string, rawbytestring
or any other string with a pre-defined codepage to the following
procedure:

Procedure  dosomething (VAR b:Ansistring);
Begin
  b:='test';
End;

Of course not, because AnsiString is its own unique string type (with a
codepage affinity of 0, which gets filled in with DefaultSystemCodePage
at runtime). When you pass a variable to a 'var' parameter, the
variable must match the parameter type exactly. That has nothing to
do with strings specifically, that is just how 'var' parameters work in
general.

... but it is possible to write overloaded procedures for the desired
code pages.

Why would you want to, though? Your business logic shouldn't be
messing around with codepages to begin with. You should be using
UnicodeString for everything, and only perform conversions where
codepaged data enters/leaves your app.

--
Remy Lebeau (TeamB)
Arthur Hoornweg

Posts: 414
Registered: 6/2/98
Re: Ansistring / RawByteString code page madness [Edit]  
  Posted: Dec 19, 2017 4:44 AM   in response to: Remy Lebeau (Te...
Remy Lebeau (TeamB) wrote:

The affinity itself, no. Nor does it need to (though it is possible to
get it with RTTI at runtime). Assigning data to a string with a given
codepage affinity will force the data to that codepage at runtime

In theory, but it is not a reliable mechanism. If ansistrings A and B have the same affinity then an assignment B:=A will never perform a codepage conversion because Delphi does not bother to check if the contents actually match the affinity.

Type aStr=Type Ansistring(1252);
VAR a,b:  aStr;
Begin
   a:=UTF8Encode('test');
   b:=a; // -> code page is not 1252  but 65001
End;

Why would you want to, though? Your business logic shouldn't be
messing around with codepages to begin with. You should be using
UnicodeString for everything, and only perform conversions where
codepaged data enters/leaves your app.

I am porting some legacy Delphi 2007 code that uses Turbopower Lockbox 2 into the unicode world and I need to ensure compatibility with existing data. What I needed to know was under which conditions exactly a codepage conversion takes place.

The existing data is AES encrypted license stuff burned into the eeprom of a USB copy protection dongle and the decoder uses pAnsichar. Since any auto-magic codepage conversion would be fatal, I simply needed to verify how Delphi 2009+ ansistrings behave. Whilst your general approach is correct, for this special purpose (copy protection) it is necessary that encrypted data remain encrypted in memory; converting everything into legible unicodestring is a no-no. Finalization sections in methods must even guarantee that decoded strings are wiped from RAM...

It sure would have been nice if Delphi had more comprehensive documentation about codepage behaviour. By trial and error I discovered some interesting details:

- Having an Ansistring with codepage affinity "cp" really does not mean that the string's contents will always have codepage(cp), it is just a default affinity.
- Converting an Ansistring(cp) to Unicode does not use codepage (cp) but rather the true code page of the contents.
- Assigning Ansistring(cp=x) to an Ansistring (cp=y) does not simply convert from codepage X to Y. It rather converts from the content's true code page to UTF16 and then to code page Y
- A codepage conversion never takes place between two ansistrings with the same affinity. Not even if the contents are in a totally different code page than the affinity.
- Rawbytestrings do have a code page after assignment! Just no affinity.
- After assignment to a rawbytestring the target simply points to the same data as the source.
- After assignment from a rawbytestring the target Ansistring simply points to the same data as the source. Even if its default affinity gets violated because of this. Unfortunately this situation persists when the ansistring is assigned to other ansistrings.
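
The observations above can be reproduced with a short console program. This is only a sketch, assuming a Windows Delphi 2009+ compiler; TStr1252 and the variable names are hypothetical. According to the behaviour reported in this thread, A and B should keep code page 65001 with length 6, while C should be converted to code page 1252 with length 3, but that is worth confirming on your own compiler version.

```pascal
program AffinitySketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical type name: an AnsiString with compile-time affinity 1252.
  TStr1252 = type AnsiString(1252);

var
  W: UnicodeString;
  Raw: RawByteString;
  U: UTF8String;
  A, B, C: TStr1252;
begin
  // U+00FC, U+00F6, U+00E4 as escapes, independent of source file encoding.
  W := #$00FC#$00F6#$00E4;

  Raw := UTF8Encode(W);   // UTF-8 bytes in a RawByteString
  A := Raw;               // from RawByteString: reported as no conversion
  B := A;                 // same affinity: reported as no conversion

  U := UTF8String(W);     // UTF-8 bytes in a true UTF8String
  C := U;                 // different affinities: conversion expected

  Writeln('A: code page ', StringCodePage(A), ', length ', Length(A));
  Writeln('B: code page ', StringCodePage(B), ', length ', Length(B));
  Writeln('C: code page ', StringCodePage(C), ', length ', Length(C));
end.
```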

Lajos Juhasz

Posts: 801
Registered: 3/14/14
Re: Ansistring / RawByteString code page madness [Edit]  
  Posted: Dec 19, 2017 7:47 AM   in response to: Arthur Hoornweg
Arthur Hoornweg wrote:

Remy Lebeau (TeamB) wrote:

The affinity itself, no. Nor does it need to (though it is
possible to get it with RTTI at runtime). Assigning data to a
string with a given codepage affinity will force the data to that
codepage at runtime

In theory, but it is not a reliable mechanism. If ansistrings A and B
have the same affinity then an assignment B:=A will never perform a
codepage conversion because Delphi does not bother to check if the
contents actually match the affinity.

Type aStr=Type Ansistring(1252);
VAR a,b:  aStr;
Begin
   a:=UTF8Encode('test');
   b:=a; // -> code page is not 1252  but 65001
End;

Yes, you can change the code page of an AnsiString. Here it's
copy-on-write, so b will point to the same string data as a. Maybe it
should convert.


I am porting some legacy Delphi 2007 code that uses Turbopower
Lockbox 2 into the unicode world and I need to ensure compatibility
with existing data. What I needed to know was under which conditions
exactly a codepage conversion takes place.

You should make a new version for Delphi 2009+. Keeping binary data in
a unicode version of Delphi is a bad idea.

What you could do is write methods that use AnsiStrings for ANSI versions
and methods that use TBytes for Unicode versions of Delphi. Then the
classes could operate using TBytes.


It sure would have been nice if Delphi had more comprehensive
documentation about codepage behaviour. By trial and error I
discovered some interesting details:

[snip]

You have my vote on this one. There was a white paper for Delphi 2009,
but I cannot find it and I forget how detailed it was. Personally I
rarely use AnsiStrings in XE5. I just do it the way Microsoft and
Embarcadero recommend: convert AnsiStrings to UTF-16 as soon as possible
and convert back when I have to write to a file or database. If I need
binary data I always use TBytes. It's easy and works as designed.
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: Ansistring / RawByteString code page madness [Edit]  
  Posted: Dec 19, 2017 12:51 PM   in response to: Lajos Juhasz
Lajos Juhasz wrote:

You have my vote on this one. There was a white paper for Delphi 2009,
but I cannot find it and I forget how detailed it was.

There are several Unicode whitepapers in Embarcadero's Migration Center:

https://www.embarcadero.com/rad-in-action/migration-upgrade-center

Personally I rarely use AnsiStrings in XE5.

As well you should.

I just do it the way Microsoft and Embarcadero recommend: convert
AnsiStrings to UTF-16 as soon as possible and convert back when I have
to write to a file or database. If I need binary data I always use TBytes.

Yes, exactly.

--
Remy Lebeau (TeamB)
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: Ansistring / RawByteString code page madness [Edit]  
  Posted: Dec 19, 2017 12:48 PM   in response to: Arthur Hoornweg
Arthur Hoornweg wrote:

In theory, but it is not a reliable mechanism. If ansistrings A and B
have the same affinity then an assignment B:=A will never perform a
codepage conversion

Of course not, because under normal conditions, there is no need for a
conversion in that situation. The data's codepage should always match
the string's affinity. You would have to use SetCodePage() or other
trickery at runtime to circumvent that.

because Delphi does not bother to check if the contents actually
match the affinity.

Because under normal conditions, it shouldn't have to.

Type aStr=Type Ansistring(1252);
VAR a,b:  aStr;
Begin
   a:=UTF8Encode('test');
   b:=a; // -> code page is not 1252  but 65001

UTF8Encode() (which is deprecated in D2009+, BTW, so you shouldn't be
using it at all) returns a 65001-encoded RawByteString, not a
UTF8String. As I stated earlier, assigning a RawByteString to an
AnsiString does not perform a conversion, regardless of affinity.

There would be a conversion if UTF8Encode() returned a UTF8String
instead (like it used to prior to D2009). For example:

function MyUTF8Encode(const S: string): UTF8String;
begin
  Result := UTF8String(S);
end;
 
a := MyUTF8Encode('test');


Or simply:

a := UTF8String('test');


Will produce the result you are expecting.

RawByteString is special, so you have to be careful in how you use
it. The only place it is MEANT to be used is as a function parameter.
If you use it anywhere else, you are asking for trouble. Just don't do
it.

I am porting some legacy Delphi 2007 code that uses Turbopower
Lockbox 2 into the unicode world and I need to ensure compatibility
with existing data.

RawByteString is not the way to do that.

What I needed to know was under which conditions exactly a codepage
conversion takes place.

There should be a conversion performed whenever any string type is
assigned to another string type, EXCEPT when one of the strings is a
RawByteString and the other string is any AnsiString(N) type; then it
is just a straight assignment without conversion.

The existing data is AES encrypted license stuff burned into the
eeprom of a USB copy protection dongle and the decoder uses
pAnsichar.

But, what character encoding does that PAnsiChar data use exactly?

Since any auto-magic codepage conversion would be fatal, I simply
needed to verify how Delphi 2009+ ansistrings behave.

Plain AnsiString (aka: AnsiString(N) where N=0) behaves exactly the
same as it always has. Since the legacy code would have been using
plain AnsiString to begin with, continuing to use AnsiString without an
explicit codepage affinity should behave the same as it did before.
You just have to watch out that you don't assign a plain AnsiString to
any string type other than plain AnsiString or RawByteString, otherwise
a conversion will occur.

Whilst your general approach is correct, for this special purpose
(copy protection) it is necessary that encrypted data remain
encrypted in memory; converting everything into legible unicodestring
is a no-no.

You shouldn't be using strings for encrypting in the first place.
Encryption deals in bytes, not in string characters. Use a byte array
instead of an AnsiString.
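
A byte-array approach could look roughly like the following. This is a minimal sketch, assuming Delphi 2009+; BufferToBytes and WipeBytes are hypothetical helpers and Sample merely stands in for data coming from a driver. Since TBytes has no code page, the copy cannot trigger a string conversion.

```pascal
program BytesSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: copies Len raw bytes from a C-style buffer into a
// TBytes array. No code page is involved at any point.
function BufferToBytes(P: PAnsiChar; Len: Integer): TBytes;
begin
  SetLength(Result, Len);
  if Len > 0 then
    Move(P^, Result[0], Len);
end;

// Hypothetical helper: overwrites the array contents with zeros, then
// releases this reference to the array.
procedure WipeBytes(var B: TBytes);
begin
  if Length(B) > 0 then
    FillChar(B[0], Length(B), 0);
  SetLength(B, 0);
end;

const
  // Stand-in for a buffer returned by a driver API (an assumption).
  Sample: array[0..3] of Byte = ($C3, $BC, $00, $FF);

var
  Data: TBytes;
begin
  Data := BufferToBytes(PAnsiChar(@Sample[0]), Length(Sample));
  Writeln('Bytes copied: ', Length(Data));
  WipeBytes(Data);
  Writeln('Length after wipe: ', Length(Data));
end.
```

The explicit FillChar is there because the sketch does not assume that releasing a dynamic array overwrites its old contents.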

Finalization sections in methods must even guarantee that decoded
strings are wiped from RAM...

Any compiler-managed type will do that, including dynamic arrays.

- Having an Ansistring with codepage affinity "cp" really does not
mean that the string's contents will always have codepage(cp), it is
just a default affinity.

Under normal conditions, it WILL be the same codepage. It is only when
you do funny things that the codepage may vary, then you have to be
prepared to handle the consequences.

- Converting an Ansistring(cp) to Unicode does not use codepage (cp)
but rather the true code page of the contents.

Yes, which 99.9999% of the time will be the specified codepage.

- Assigning Ansistring(cp=x) to an Ansistring (cp=y) does not simply
convert from codepage X to Y. It rather converts from the content's
true code page to UTF16 and then to code page Y

Yes, which 99.9999% of the time will be a conversion from X to Y (via
UTF-16).

- A codepage conversion never takes place between two ansistrings with
the same affinity.

Yes, because 99.9999% of the time no conversion is needed.

Not even if the contents are in a totally different code page than
the affinity.

Makes sense. Otherwise, if the compiler can't trust the data to match
the affinity, it has to force a runtime check on every string
assignment, even for assignments between the same type. That is just
wasted overhead, especially when porting legacy code.

- Rawbytestrings do have a code page after assignment! Just no
affinity.

Yes, by design. RawByteString is a special case, primarily intended
to allow function AnsiString parameters to accept any AnsiString(N)
value regardless of the actual value of N. Otherwise, you would have
to overload the function on each individual N, which is not desirable
(and not feasible for RTL functions that accept AnsiString parameters).

- After assignment to a rawbytestring the target simply points to the
same data as the source.

Yes, by design. RawByteString inherits the codepage of the source
string, by simply pointing it at the same data block and incrementing its
refcount. This avoids needing to make unnecessary copies in memory.

- After assignment from a rawbytestring the target Ansistring simply
points to the same data as the source. Even if its default affinity
gets violated because of this.

Yes, by design. For instance, let's take the UTF8Encode() example
again. UTF8String and UTF8(En|De)code() were first introduced in D6,
but UTF8String was just an alias for a plain AnsiString, the RTL had no
real concept of UTF-8. So UTF8Encode() returned a 65001-encoded
AnsiString.

Then, codepaged AnsiStrings were introduced in D2009, and UTF8String
became a true UTF-8 string type. But that caused a problem that needed
to be solved - legacy code that assigned the output of UTF8Encode() to
an AnsiString variable instead of a UTF8String variable (which was
perfectly legal prior to D2009). If UTF8Encode() returned a
UTF8String, there would be a conversion performed from codepage 65001
to codepage 0. To maintain the legacy behavior, UTF8Encode() returns a
65001-encoded RawByteString, and assigning that RawByteString to an
AnsiString is a straight assignment without conversion. Thus you end
up with a 65001-encoded AnsiString, same as before.

Unfortunately this situation persists when the ansistring is assigned
to other ansistrings.

You can thank legacy code support for that. Modify the code to use
strings properly and then you don't have to worry about this anymore.

--
Remy Lebeau (TeamB)
Arthur Hoornweg

Posts: 414
Registered: 6/2/98
Re: Ansistring / RawByteString code page madness [Edit]  
  Posted: Dec 20, 2017 1:25 AM   in response to: Remy Lebeau (Te...
Remy Lebeau (TeamB) wrote:

Of course not, because under normal conditions, there is no need for a
conversion in that situation. The data's codepage should always match
the string's affinity.

What's "normal" for some may be "exceptional" for others. I often work with streams of 8-bit character data received from devices, serial ports and networks (data interchange with various third-parties on oil wells). You know that you're gonna get text, but it's always a surprise what character set is used. How am I supposed to match affinity and code page if I don't know beforehand what code page I'm going to get? That's "normal" to me...


Remy, I'm not advertising or condoning the use of Ansistrings for binary data. I was merely asked to verify if an existing piece of decryption code would work reliably if used in a Unicode version of a Delphi application. No more, no less.
The original C API of the driver uses "char *" to get at the decrypted data and the Delphi wrapper simply translates it as "pchar" (which I changed into pAnsichar upon converting to Delphi XE).

The developer of the decryption code kept the data in strings, presumably because the Turbopower Lockbox decryption library used strings as well. This was all written in the Delphi 2007 era. So I looked at the Turbopower Lockbox code and noticed a few hairy things. It does stuff like performing a SetLength() on an empty string and then pulling in data from a tMemoryStream. Since SetLength() is not an assignment, I had no way of knowing what the resulting string code page would be because that is not documented. My initial hunch was to modify the return type of the function to "rawbytestring" but I would rather not change a piece of third-party source code. So I had to investigate what really happens to Ansistrings and their code pages, because encryption shuffles bits around and any code page conversion would break the data.
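
The SetLength-then-read pattern described above can be probed directly. The following is a minimal sketch, assuming a Windows Delphi 2009+ compiler; it is not the Lockbox code, and Sample is made-up test data. It makes no claim about which code page SetLength() assigns. It only prints what StringCodePage() reports and checks that the bytes survive an AnsiString-to-AnsiString assignment.

```pascal
program SetLengthProbe;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  // Made-up "binary" test data containing bytes above $7F.
  Sample: array[0..3] of Byte = ($C3, $BC, $80, $FF);

var
  MS: TMemoryStream;
  S, T: AnsiString;
  I: Integer;
  Same: Boolean;
begin
  MS := TMemoryStream.Create;
  try
    MS.WriteBuffer(Sample, SizeOf(Sample));
    MS.Position := 0;
    // Allocate the string first, then fill it straight from the stream.
    SetLength(S, Integer(MS.Size));
    MS.ReadBuffer(S[1], Length(S));
  finally
    MS.Free;
  end;

  // Plain AnsiString to plain AnsiString: same compile-time affinity.
  T := S;

  // Compare T byte by byte with the original data.
  Same := Length(T) = Length(Sample);
  if Same then
    for I := 0 to High(Sample) do
      if Byte(T[I + 1]) <> Sample[I] then
        Same := False;

  Writeln('Code page reported after SetLength: ', StringCodePage(S));
  Writeln('Bytes unchanged after T := S: ', Same);
end.
```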

because Delphi does not bother to check if the contents actually
match the affinity.

Because under normal conditions, it shouldn't have to.

If only I could work under your normal conditions...

If a string is assigned to an ansistring with code page affinity X then Delphi should convert the text to code page X if the string has a different code page. That's what the docs say but it's not what the code does. If the user is free to change the code page of a string then it is not correct to simply assume that the user never does that. It would have taken only a few extra bytes of code to verify it properly. I consider this to be a bug.

If "by design" is "as intended" then the documentation should be updated. If not, then the design has a flaw.

.......
Will produce the result you are expecting.

I am not expecting anything. Just verifying and documenting.

RawByteString is special, so you have to be careful in how you use
it. The only place it is MEANT to be used is as a function parameter.

http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/devcommon/stringtypes_xml.html :
"RawByteString should only be used as a const or value type parameter or a return type from a function."

https://www.embarcadero.com/images/old/pdf/Delphi-Unicode181213.pdf : "As such, it can become a handy replacement of the string (or AnsiString) type in code that uses strings for generic and custom data processing which you want to keep with a 1-character per byte representation".

If you use it anywhere else, you are asking for trouble. Just don't do
it.

Like I said, I'm merely trying to verify if a piece of legacy code will work reliably. I'm not going to re-write it using tBytes, the risk of breaking something is too big.

The existing data is AES encrypted license stuff burned into the
eeprom of a USB copy protection dongle and the decoder uses
pAnsichar.

But, what character encoding does that PAnsiChar data use exactly?

None. AES shuffles bits around and whatever code page was there gets obfuscated. The C declaration of the driver says "char *" and the Delphi developer translated it as pChar.
In C, everything is a char. In Delphi, it seems as if nothing is nowadays ...

You shouldn't be using strings for encrypting in the first place.
Encryption deals in bytes, not in string characters. Use a byte array
instead of an AnsiString.

I would if I were to re-write the bloody thing. But I'm not.

Finalization sections in methods must even guarantee that decoded
strings are wiped from RAM...

Any compiler-managed type will do that, including dynamic arrays.

To the best of my knowledge, no. You know, wiping and freeing memory are two entirely different things. Wiping overwrites legible data with zeroes before freeing it - this makes sure that no legible confidential data winds up in a process dump or in the swap file. In copy protection code, such precautions need to be taken to make sure nothing legible remains behind.

- Having an Ansistring with codepage affinity "cp" really does not
mean that the string's contents will always have codepage(cp), it is
just a default affinity.

Under normal conditions, it WILL be the same codepage. It is only when
you do funny things that the codepage may vary, then you have to be
prepared to handle the consequences.

Forget normal conditions, I just happen to "receive" funny strings. That's what happens when people interchange data. If the computer at the sending end of the line is Polish then it's quite likely that the strings I receive will contain a character or two that's outside my normal code page. I would be a very happy man if the world would finally agree on using UTF8.
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: Ansistring / RawByteString code page madness
  Posted: Dec 20, 2017 10:33 AM   in response to: Arthur Hoornweg
Arthur Hoornweg wrote:

What's "normal" for some may be "exceptional" for others. I often
work with streams of 8-bit character data received from devices,
serial ports and networks (data interchange with various
third-parties on oil wells). You know that you're gonna get text, but
it's always a surprise what character set is used.

Why aren't your interchange protocols defining specific charsets when
exchanging text? That is a recipe for disaster.

How am I supposed to match affinity and code page if I don't know
beforehand what code page I'm going to get?

You need to know the actual encoding of the character data in order to
operate on it without risk of data loss, ESPECIALLY when dealing with
interop with other processes. RawByteString is NOT a general-purpose
string type that you should be using just because you don't know the
encoding beforehand.

Remy, I'm not advertising or condoning the use of Ansistrings for
binary data. I was merely asked to verify if an existing piece of
decryption code would work reliably if used in a Unicode version
of a Delphi application. No more, no less.

Again, encryption operates on bytes, not on characters. If you are
(en|de)crypting strings, CHARSETS MATTER! DO NOT use strings with
encryption without deciding on a definitive and consistent charset for
string<->byte translations. Otherwise the results will not always be
accurate.

The original C api of the driver uses "char *" to get at the decrypted
data

C doesn't have a distinct byte type, like Delphi does. 'unsigned char'
is typically used for that purpose, but some naive C APIs may use
'(signed) char' instead (apparently, yours does). 'char' is the only
data type in C that is guaranteed to be 1 byte in size (but 1 byte is
not guaranteed to be 8 bits on all platforms!).

and the Delphi wrapper simply translates it as "pchar" (which I
changed into pAnsichar upon converting to Delphi XE).

Use PByte instead. And then convert input strings to bytes using a
given charset, and convert output bytes to strings using the same
charset. That is the only way to avoid data loss when (en|de)crypting
strings.
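
A minimal sketch of that round trip, assuming the text was agreed to be
UTF-8 (the charset, the helper names TextToBytes/BytesToText and the
sample text are illustrative assumptions, not part of the driver or of
Lockbox):

```delphi
program CryptBytesSketch;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: string -> bytes using an explicit, agreed charset.
function TextToBytes(const S: string; Encoding: TEncoding): TBytes;
begin
  Result := Encoding.GetBytes(S);
end;

// Hypothetical helper: bytes -> string using the SAME charset.
function BytesToText(const B: TBytes; Encoding: TEncoding): string;
begin
  if Length(B) = 0 then
    Result := ''
  else
    Result := Encoding.GetString(B);
end;

var
  Plain: TBytes;
begin
  Plain := TextToBytes('license-1234', TEncoding.UTF8);
  // A cipher would be handed PByte(Plain) and Length(Plain) here (not shown).
  Writeln(BytesToText(Plain, TEncoding.UTF8) = 'license-1234');
end.
```

The point of the design is that the only place a charset is involved is
at the two explicit conversion calls; everything in between is bytes.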

The developer of the decryption code kept the data in strings,
presumably because the Turbopower Lockbox decryption library used
strings as well. This was all written in the Delphi 2007 era.

It wasn't uncommon back then to use AnsiString as a binary container.
That is bad news nowadays.

So I looked at the Turbopower Lockbox code and noticed a few hairy
things. It does stuff like performing a SetLength() on an empty string
and then pulling in data from a tMemoryStream.

That is "safe" provided the encoding of the TMemoryStream data matches
the string's affinity, or SetCodePage() is called after filling the
string.
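
As a rough sketch of the second option (the function name ReadTagged and
the sample bytes are made up for illustration), the string can be filled
from the stream first and labelled afterwards with SetCodePage() using
Convert=False, so the bytes themselves are not touched:

```delphi
program StreamToStringSketch;
{$APPTYPE CONSOLE}
uses
  Classes;

// Hypothetical helper: read the rest of the stream into a string and then
// label it with the code page the caller knows the data to be in.
function ReadTagged(Stream: TStream; CodePage: Word): RawByteString;
var
  Count: Integer;
begin
  Result := '';
  Count := Integer(Stream.Size - Stream.Position);
  if Count <= 0 then
    Exit;
  SetLength(Result, Count);
  Stream.ReadBuffer(PAnsiChar(Result)^, Count);
  // Convert=False: only the code page in the string header changes.
  SetCodePage(Result, CodePage, False);
end;

const
  EuroUtf8: array[0..2] of Byte = ($E2, $82, $AC);
var
  M: TMemoryStream;
  Data: RawByteString;
begin
  M := TMemoryStream.Create;
  try
    M.WriteBuffer(EuroUtf8, SizeOf(EuroUtf8));
    M.Position := 0;
    Data := ReadTagged(M, 65001);
    Writeln(StringCodePage(Data), ' ', Length(Data));
  finally
    M.Free;
  end;
end.
```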

Since SetLength() is not an assignment, I had no way of knowing what
the resulting string code page would be because that is not
documented.

Resizing any AnsiString(N) type will preserve the data's current
codepage only if the data's refcount is 1. Otherwise, a new data block
must be allocated, and its codepage will be set to the string's
compile-time affinity (if N=0 then System.DefaultSystemCodePage is
used), not the codepage of the previous data (even though the char data
is copied as-is). Look at the source code for System._LStrSetLength()
to verify this.

In the case of resizing a RawByteString, there is no affinity, so N=0
(and thus DefaultSystemCodePage) is used instead.

For example, this produces a bad result:

var
  u: UTF8String;
  r: RawByteString;
begin
  u := '€€'; // length=6 because '€' is 3 bytes in UTF-8
  r := u;
  ShowMessage(r + ' ' + IntToStr(StringCodePage(r))); // shows '€€ 65001' as expected
  SetLength(r, 3); // CODEPAGE LOST!
  ShowMessage(r + ' ' + IntToStr(StringCodePage(r))); // shows '€ 1252'!
end;
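
One possible way around that, sketched here with a hypothetical helper
name (ResizeKeepingCodePage), is to remember the data's codepage before
resizing and put it back afterwards without converting:

```delphi
program ResizeSketch;
{$APPTYPE CONSOLE}

// Hypothetical helper: resize a RawByteString and restore whatever code
// page its data carried before the resize.
procedure ResizeKeepingCodePage(var S: RawByteString; NewLength: Integer);
var
  CP: Word;
begin
  CP := StringCodePage(S);
  SetLength(S, NewLength);
  if NewLength > 0 then
    SetCodePage(S, CP, False); // relabel only, bytes stay as they are
end;

var
  u: UTF8String;
  r: RawByteString;
begin
  u := '€€';
  r := u;
  ResizeKeepingCodePage(r, 3);
  Writeln(StringCodePage(r), ' ', Length(r));
end.
```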


My initial hunch was to modify the return type of the function to
"rawbytestring" but I would rather not change a piece of third-party
source code. So I had to investigate what really happens to
Ansistrings and their code pages, because encryption shuffles bits
around and any code page conversion would break the data.

This is why you can't rely on implicit/unknown charsets when dealing
with encryption. You have to know the charset of the encrypted string
data so you can process the decrypted string correctly.

If only I could work under your normal conditions...

Then you have your work cut out for you.

If a string is assigned to an ansistring with code page affinity X
then Delphi should convert the text to code page X if the string has
a different code page.

Which it does, when it sees a non-RawByteString being assigned to
another non-RawByteString. It sees two distinct affinities, and if
they are different then it will perform a conversion accordingly. But,
if one of the strings is RawByteString, then no conversion is done,
because that is what RawByteString does.

That's what the docs say but it's not what the code does.

Yes, it does - UNLESS one of the strings is RawByteString. Again,
RawByteString is SPECIAL, so it requires extra care to use.

If the user is free to change the code page of a string

The only way to change a string's codepage at runtime is with the
SetCodePage() function. But it expects a reference to a RawByteString
as input, so it won't accept any other string type without an explicit
type-cast. Which means the user must REALLY know what they are doing,
and accept the responsibility of dealing with the consequences, when
changing the codepage at runtime.
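
To illustrate the type-cast and the difference between relabelling and
converting, here is a small sketch (the code page numbers are arbitrary
examples chosen for the demonstration):

```delphi
program RelabelVsConvert;
{$APPTYPE CONSOLE}
var
  S: AnsiString;
begin
  S := AnsiString('abc');
  // Relabel only: the bytes are untouched, the header now claims 1250.
  SetCodePage(RawByteString(S), 1250, False);
  Writeln(StringCodePage(S));
  // Convert: the bytes are re-encoded from the current code page to UTF-8.
  SetCodePage(RawByteString(S), 65001, True);
  Writeln(StringCodePage(S));
end.
```

After either call, S holds data whose codepage no longer matches the
AnsiString affinity, which is exactly the situation that needs care.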

then it is not correct to simply assume that the user never does
that. It would have taken only a few extra bytes of code to verify
it properly.

Multiplied by the potentially thousands of string usages in a process
over its lifetime. Even those few extra bytes could multiply to big
performance decreases, unexpected implicit data conversions, etc.

I consider this to be a bug.

I don't, because users shouldn't be messing around with RawByteString
as a general-purpose string type to begin with. It is not intended for
that purpose. It is just a helper type in specific circumstances.

RawByteString is special, so you have to be careful in how you use
it. The only place it is MEANT to be used is as a function
parameter.

http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/devcommon/stringtypes_xml.html :
"RawByteString should only be used as a const or value type parameter
*or a return type from a function*."


Exactly. Its primary and intended purpose is as an input parameter
type. But while it can be used as a return type, that doesn't mean
data can't be converted once the function exits. But it does make
sense in some functions for the return type's codepage to match the
same codepage as the input type, and RawByteString is the only way to
do that. Just be careful what you assign that return value to.

https://www.embarcadero.com/images/old/pdf/Delphi-Unicode181213.pdf :
"As such, it can become a handy replacement of the string (or
AnsiString) type in code that uses strings for generic and custom
data processing which you want to keep with a 1-character per byte
representation".

RawByteString is good to use only in code that needs to be
codepage-agnostic and does not need to perform data conversions while
processing. Outside of that, RawByteString shouldn't be used AT ALL.

Like I said, I'm merely trying to verify if a piece of legacy code
will work reliably. I'm not going to re-write it using tBytes, the
risk of breaking something is too big.

Then I suggest you contact the component/library author for an updated
version that works in Unicode environments.

But, what character encoding does that PAnsiChar data use exactly?

None.

That is impossible. A charset is what decides how characters are
represented in byte format. There is ALWAYS a charset involved in
string data, whether you specify one or not. The EEPROM is encrypted
bytes, but the decrypted string data MUST have SOME kind of charset
associated with it.

AES shuffles bits around and whatever code page was there gets
obfuscated.

Encryption has no concept of strings, only bytes. The encrypter must
have known the charset of the string data being encrypted, so it will
know the charset of the decrypted data coming back out. Since you are
decrypting someone else's encrypted data, you need to know the charset
of the data that was encrypted, or else you risk data corruption/loss.

To the best of my knowledge, no. You know, wiping and freeing memory
are two entirely different things. Wiping overwrites legible data
with zeroes before freeing it - this makes sure that no legible
confidential data winds up in a process dump or in the swap file. In
copy protection code, such precautions need to be taken to make sure
nothing legible remains behind.

Nothing in C does that wipe implicitly, it has to be done explicitly
(at least in C++, it could be handled in class destructors, but C
doesn't have that). The same is true in Delphi. So the particular
data type doesn't matter.

Forget normal conditions, I just happen to "receive" funny strings.

Then you are SOL.

That's what happens when people interchange data.

No, that is what happens when people interchange data without agreeing
on its format beforehand.

If the computer at the sending end of the line is Polish then it's
quite likely that the strings I receive will contain a character or
two that's outside my normal code page.

Any interchange worth its salt WILL specify the exact format, the exact
charset for strings, etc. Otherwise, the data can't be exchanged
reliably.

--
Remy Lebeau (TeamB)
Arthur Hoornweg

Posts: 414
Registered: 6/2/98
Re: Ansistring / RawByteString code page madness
  Posted: Dec 21, 2017 1:46 AM   in response to: Remy Lebeau (Te...
Why aren't your interchange protocols defining specific charsets when
exchanging text? That is a recipe for disaster.

First of all, they aren't "my" protocols. Most of them were designed in the US in a time that pre-dated Unicode and where people just used ASCII. In these data interchange formats the character set was either completely unmentioned (tacitly assumed to be ASCII) or, in the best case, specified to be ASCII on the assumption that that was enough for everybody. Windows didn't exist yet. But once such formats and protocols started being applied outside of the US or UK, ASCII became unworkable: the very least that people need to be able to do is to spell the name of their customer correctly... So character values > 127 sneaked in, unavoidably. People don't even notice that it happens.

Worse even: If data exchange over a wire isn't possible then many companies that we work with will hand us data converted from Microsoft Excel to some kind of text format (for example tab-delimited or csv). These files are the worst of the worst: Excel doesn't even bother to ask the user what character set the file should be saved in. The creator of the file doesn't know the character set. And yet everybody calls it an "ascii" file ...

So we receive a lot of crummy 8-bit textual data, both as files and as volatile over-the-wire data. It is always "Ascii plus extra". We only notice that it's a different code page if it doesn't display correctly on our systems.

Anyway, we have to swallow whatever character set the data is in. If the data is converted to Unicode prematurely using a wrong code page guesstimation, that can't be un-done. Rawbytestrings would be totally handy containers because they allow a codepage change on the fly without damaging the octet values of the raw data so it can be un-done.

Unfortunately it turns out that well-defined unicode data exchange formats aren't always foolproof either. We recently had the case that a big British company was unable to read some of our files (in WitsML format) because some of the contained names contained German umlauts. Somewhere in their conversion chain, unicode data got Asciified. They were as shocked as we were.

Any interchange worth its salt WILL specify the exact format, the exact
charset for strings, etc. Otherwise, the data can't be exchanged
reliably.

I can't travel back in time to change these formats and correct their flaws. They are very English-centric.

Again, encryption operates on bytes, not on characters. If you are
(en|de)crypting strings, CHARSETS MATTER! DO NOT use strings with
encryption without deciding on a definitive and consistent charset for
string<->byte translations. Otherwise the results will not always be
accurate.

Perfectly clear. But in this specific case, I only need to make sure that the encrypted data comes out the way it was intended to come out. Checksums on character data must match. The strings will never be displayed, a conversion to unicode is not necessary.

Nothing in C does that wipe implicitly, it has to be done explicitly
(at least in C++, it could be handled in class destructors, but C
doesn't have that). The same is true in Delphi. So the particular
data type doesn't matter.

Which is why I said that the finalization sections of methods that work with temporarily decrypted data must wipe this data. You said that it was done automatically for managed types. It is not. I meant something like:

procedure CheckLicenseData;
var
  Sensitive: String;
begin
  Sensitive := TopSecretDecryptedStuff;
  try
    OKtoProceed := VerifyIntegrity(Sensitive);
  finally
    OverwriteMemory(Sensitive);
  end;
end;
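
For what it's worth, a helper like OverwriteMemory could look roughly like the sketch below. This is only an illustration under assumptions: it zeroes the buffer only when this variable is the sole owner (reference count 1), and it cannot reach other copies that assignments or conversions may have left elsewhere in memory.

```delphi
// Hypothetical helper: zero the characters in place, then release the string.
procedure OverwriteMemory(var S: string);
begin
  // Only wipe a buffer that this variable owns exclusively. Shared buffers
  // (refcount > 1) and constants (refcount -1) are deliberately left alone.
  if (S <> '') and (StringRefCount(S) = 1) then
    FillChar(Pointer(S)^, Length(S) * SizeOf(Char), 0);
  S := '';
end;
```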


Edited by: Arthur Hoornweg on Dec 21, 2017 2:17 AM
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: Ansistring / RawByteString code page madness
  Posted: Dec 21, 2017 9:19 AM   in response to: Arthur Hoornweg
Arthur Hoornweg wrote:

First of all, they aren't "my" protocols. Most of them were designed
in the US in a time that pre-dated Unicode and where people just used
ASCII.

Then assume ASCII, or at least Windows-1252 and/or ISO-8859-1 (Latin-1).

So we receive a lot of crummy 8-bit textual data, both as files and
as volatile over-the-wire data. It is always "Ascii plus extra".
We only notice that it's a different code page if it doesn't display
correctly on our systems.

Anyway, we have to swallow whatever character set the data is in.

That will not work in a Unicode environment. You have to know the
correct charset, or you risk data loss. Otherwise, you could try using
heuristics to guess the charset (and there are libraries for that),
but guessing is always risky.
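
A very crude sketch of such a heuristic, assuming Windows and assuming
that "valid UTF-8, otherwise a fallback codepage" is good enough for the
data at hand (the function names and the 1252 fallback are illustrative
choices, not a recommendation):

```delphi
program GuessCharsetSketch;
{$APPTYPE CONSOLE}
uses
  Windows;

// True if the bytes form valid UTF-8 (plain ASCII also passes this test).
function LooksLikeUtf8(const Data: RawByteString): Boolean;
begin
  Result := (Length(Data) = 0) or
    (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
       PAnsiChar(Data), Length(Data), nil, 0) > 0);
end;

// Hypothetical helper: label a private copy with the guessed code page,
// then let the RTL convert it to Unicode. Data itself is left untouched.
function DecodeWithGuess(const Data: RawByteString;
  FallbackCP: Word): UnicodeString;
var
  Tmp: RawByteString;
begin
  Tmp := Data;
  if LooksLikeUtf8(Tmp) then
    SetCodePage(Tmp, CP_UTF8, False)
  else
    SetCodePage(Tmp, FallbackCP, False);
  Result := UnicodeString(Tmp);
end;

var
  Raw: RawByteString;
begin
  SetLength(Raw, 3);
  Raw[1] := AnsiChar($E2);
  Raw[2] := AnsiChar($82);
  Raw[3] := AnsiChar($AC);
  Writeln(Length(DecodeWithGuess(Raw, 1252)));
end.
```

Because the raw bytes are kept, a wrong guess can still be corrected
later by decoding the same bytes again with a different codepage.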

If the data is converted to Unicode prematurely using a wrong code
page guesstimation, that can't be un-done. Rawbytestrings would be
totally handy containers because they allow a codepage change on the
fly without damaging the octet values of the raw data so it can be
un-done.

But you still have to specify what the codepage is, one way or the other.

Or, you could simply bite the bullet and stop using codepage-sensitive
strings at all.

Unfortunately it turns out that well-defined unicode data exchange
formats aren't always foolproof either. We recently had the case that
a big British company was unable to read some of our files (in WitsML
format) because some of the contained names contained German umlauts.
Somewhere in their conversion chain, unicode data got Asciified.
They were as shocked as we were.

OK, but that is not a problem with the exchange of the data, just the
processing of it. That is something they could fix on their end
without affecting your end.

I can't travel back in time to change these formats and correct their
flaws. They are very English-centric.

Then assume very English-centric charsets are being used.

in this specific case, I only need to make sure that the encrypted
data comes out the way it was intended to come out. Checksums on
character data must match. The strings will never be displayed, a
conversion to unicode is not necessary.

Checksums also operate on bytes, not characters. Anything and
everything related to encryption should be done in terms of bytes. I
know you don't like the idea of converting your strings to bytes and
back, but that is the way the world works.

Which is why I said that the finalization sections of methods that
work with temporarily decrypted data must wipe this data. You said
that it was done automatically for managed types. It is not.

I said the freeing of memory was done automatically, I didn't say
wiping the memory beforehand was done automatically.

--
Remy Lebeau (TeamB)