Thread: bcc32, XE8 Enterprise, inline assembly question



Replies: 6 - Last Post: Jan 2, 2017 2:44 AM Last Post By: Jan Dijkstra
Jan Dijkstra

Posts: 206
Registered: 11/4/99
bcc32, XE8 Enterprise, inline assembly question
  Posted: Dec 16, 2016 4:31 AM
I have a little routine that executes a tight loop, like so:

int __fastcall TRawASCII::GetChars (System::PByte Bytes, int ByteCount, System::WideChar *Chars, int CharCount)
{
  // In our specific situation, byte count and char count should be identical.
  // In case something went wrong, we'll just take the lower of the two values,
  // if they happen to differ. We don't want buffer overruns
  //
  // This is the upscale routine, which will zero-extend the value of each byte
  // into a WideChar
  int count = CharCount;
  if (count > ByteCount) count = ByteCount;
 
  if (count)
  {
      asm mov esi,[Bytes]
      asm mov edi,Chars
      asm mov ecx,count
      asm dec ecx
loop: asm movzx ax,[esi][ecx]
      asm mov [edi][ecx*2],ax
      asm dec ecx
      asm jns loop
 
  }
 
  return count;
}


When I "compile" this into assembly source, this is what comes out (highlighting only the inline assembly bit):
 ;	      asm mov esi,Chars
 ;
	?debug L 185
@44:
@46:
 	mov	 esi,dword ptr [ebp-4]
 ;	
 ;	      asm mov edi,Bytes
 ;	
	?debug L 186
 	mov	 edi,dword ptr [ebp+12]
 ;	
 ;	      asm mov ecx,count
 ;
	?debug L 187
 	mov	 ecx,dword ptr [ebp-8]
 ;	
 ;	      asm dec ecx
 ;	
	?debug L 188
 	dec	 ecx
 ;	
 ;	loop: asm mov al,[esi][ecx*2]
 ;	
	?debug L 189
@47:
 	mov	 al,[esi][ecx*2]
 ;	
 ;	      asm mov [edi][ecx],al
 ;	
	?debug L 190
 	mov	 [edi][ecx],al
 ;	
 ;	      asm dec ecx
 ;	
	?debug L 191
 	dec	 ecx
 ;	
 ;	      asm jns loop
 ;	
	?debug L 192
	jns       short @47

All looks well.

However, when I compile this for real (with the same settings; nothing has changed in the project), there is a problem.

The instruction
 	mov	 esi,dword ptr [ebp-4]

is now replaced with
 	mov	 esi,dword ptr [esp-4]

and
 	mov	 edi,dword ptr [ebp+12]

is now replaced with
 	mov	 edi,dword ptr [esp+12]

The strange part is that the third reference (mov ecx,dword ptr [ebp-8]) is correctly generated.

Given that the function's entry code consists of
	push      ebp
	mov       ebp,esp
	add       esp,-8
	push      esi
	push      edi
	mov       dword ptr [ebp-4],edx
	mov       eax,dword ptr [ebp+8]

the switch from ebp to esp causes the wrong values to be loaded, and the routine subsequently crashes with an access violation.
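The offset arithmetic can be checked with a small sketch. It assumes a 32-bit target and that the prolog quoted above is everything that adjusts ESP before the loop; the helper names are hypothetical and exist only for illustration.

```cpp
#include <cassert>

// Offsets are relative to EBP as it stands after "mov ebp,esp".
// Assumption: a 32-bit target, so every push lowers ESP by 4 bytes.
constexpr int kPushSize = 4;

// ESP relative to EBP after the quoted prolog:
//   add esp,-8 ; push esi ; push edi
constexpr int EspAfterProlog()
{
    return -8 - kPushSize - kPushSize;
}

// Translate an ESP-relative displacement into the EBP-relative
// displacement that addresses the same memory.
constexpr int EspToEbpOffset(int espDisplacement)
{
    return EspAfterProlog() + espDisplacement;
}

static_assert(EspAfterProlog() == -16, "esp = ebp-16");
static_assert(EspToEbpOffset(-4) == -20, "[esp-4] is [ebp-20]");
static_assert(EspToEbpOffset(12) == -4, "[esp+12] is [ebp-4]");
```

Under those assumptions, [esp-4] reads from ebp-20, which lies below the current top of the stack, and [esp+12] reads the local slot at [ebp-4] instead of the stack-passed parameter at [ebp+12]. Neither register ends up with the intended pointer, which is consistent with the access violation.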

What is going on here? Why is the compiler generating different actual code, compared to the assembly source it generates?

And secondly, what can I do to prevent this from happening?
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: bcc32, XE8 Enterprise, inline assembly question
  Posted: Dec 16, 2016 3:44 PM   in response to: Jan Dijkstra
Jan wrote:

I have a little routine that executes a tight loop, like so

Why are you using inline assembly at all? You are not really benefitting
from it. I would suggest coding it in plain C++, which is much more readable
and maintainable. Let the compiler handle the generation of assembly for
you:

int __fastcall TRawASCII::GetChars (System::PByte Bytes, int ByteCount, System::WideChar *Chars, int CharCount)
{
    // This is the upscale routine, which will zero-extend the value of each byte
    // into a WideChar
 
    int count = 0;
    while ((ByteCount > 0) && (CharCount > 0))
    {
        *Chars++ = *Bytes++;
        --ByteCount;
        --CharCount;
        ++count;
    }
    return count;
}


Alternatively:

int __fastcall TRawASCII::GetChars (System::PByte Bytes, int ByteCount, System::WideChar *Chars, int CharCount)
{
    // This is the upscale routine, which will zero-extend the value of each byte
    // into a WideChar
 
    System::WideChar *start = Chars;
    while ((ByteCount > 0) && (CharCount > 0))
    {
        *Chars++ = *Bytes++;
        --ByteCount;
        --CharCount;
    }
    return (int)(Chars - start);
}
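For trying the loop outside C++Builder, here is a portable sketch of the same idea. It is only an illustration: `WidenBytes` is a hypothetical free function, and `std::uint8_t` and `char16_t` are assumed stand-ins for the bytes behind `System::PByte` and for `System::WideChar`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical free function, not part of TRawASCII.
// Copies min(byteCount, charCount) elements, zero-extending each byte
// into a 16-bit character, and returns the number of elements written.
inline int WidenBytes(const std::uint8_t* bytes, int byteCount,
                      char16_t* chars, int charCount)
{
    // Take the lower of the two counts so neither buffer is overrun.
    int count = std::min(byteCount, charCount);
    if (count < 0) count = 0;

    for (int i = 0; i < count; ++i)
    {
        // uint8_t is unsigned, so the conversion zero-extends.
        chars[i] = static_cast<char16_t>(bytes[i]);
    }
    return count;
}
```

Indexing with `i` instead of bumping both pointers keeps the count in one variable, which most optimizers handle at least as well as the pointer form.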


--
Remy Lebeau (TeamB)
Jan Dijkstra

Posts: 206
Registered: 11/4/99
Re: bcc32, XE8 Enterprise, inline assembly question
  Posted: Dec 17, 2016 10:38 PM   in response to: Remy Lebeau (Te...
I know how to code it in c++

I'm doing it in inline assembly because the compiler generates more than twice as many assembly instructions as needed, making it more than twice as slow. This code is needed to process text files, which can be megabytes in size, and at that scale the speed differences start to add up.
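A claim like this is easiest to settle by timing both versions on the same buffer. Below is a minimal timing sketch, assuming `std::chrono::steady_clock` is available; `WidenFn`, `WidenPlain` and `SecondsPerPass` are hypothetical names, and `char16_t` stands in for `System::WideChar`.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical signature shared by the candidates being compared.
using WidenFn = int (*)(const std::uint8_t*, int, char16_t*, int);

// Plain C++ candidate: zero-extends each byte into a 16-bit character.
inline int WidenPlain(const std::uint8_t* bytes, int byteCount,
                      char16_t* chars, int charCount)
{
    int count = byteCount < charCount ? byteCount : charCount;
    if (count < 0) count = 0;
    for (int i = 0; i < count; ++i)
        chars[i] = static_cast<char16_t>(bytes[i]);
    return count;
}

// Runs fn over a buffer of bufferSize bytes, `passes` times, and returns
// the average wall-clock seconds per pass. Requires passes > 0.
inline double SecondsPerPass(WidenFn fn, int bufferSize, int passes)
{
    std::vector<std::uint8_t> in(static_cast<std::size_t>(bufferSize),
                                 std::uint8_t{0x41});
    std::vector<char16_t> out(static_cast<std::size_t>(bufferSize));

    volatile int sink = 0;  // keeps the calls from being optimized away
    const auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p)
        sink = sink + fn(in.data(), bufferSize, out.data(), bufferSize);
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count() / passes;
}
```

Passing the inline-assembly version through the same `WidenFn` pointer would give a like-for-like comparison; the numbers only mean something in an optimized build with a buffer of realistic size.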
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: bcc32, XE8 Enterprise, inline assembly question
  Posted: Dec 19, 2016 10:28 AM   in response to: Jan Dijkstra
Jan wrote:

I know how to code it in c++

I'm doing it in inline assembly, because the compiler generates more
than twice as much assembly instructions as needed, making it more
than twice as slow.

I seriously doubt that. And also, if you ever port the project to 64bit,
you will have to either code the entire function in assembly, or get rid
of the assembly altogether, as the clang-based compilers do not allow mixing
C++ and inline assembly together in the same function.

This code is needed to process text files, which can be megabytes
in size. Then those speed differences start to add up

Are you reading the files into memory buffers before calling GetChars()?
Because that, in and of itself, is slow. If you use memory-mapped files, that
will be much faster, and then you might not have to worry as much about the speed
of GetChars() itself. But if the speed of the function is important, I would
suggest using __declspec(naked) to disable the function prolog and code the
entire function in assembly so you have full control over it.

--
Remy Lebeau (TeamB)
Alex Belo

Posts: 626
Registered: 10/8/06
Re: bcc32, XE8 Enterprise, inline assembly question
  Posted: Dec 18, 2016 7:48 AM   in response to: Jan Dijkstra
Jan Dijkstra wrote:

What is going on here?

Another compiler bug and nothing more?

And secondly, what can I do to prevent this from happening?

1) Try moving the class's code into a separate unit and see if it makes a
difference.

2) Try declaring the function as static (AFAICS this function does not
use the 'this' pointer) or moving it out of the class altogether.

--
Alex
Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: bcc32, XE8 Enterprise, inline assembly question
  Posted: Dec 19, 2016 10:28 AM   in response to: Jan Dijkstra
Jan wrote:

However, when I compile this for real (with the same settings,
nothing has changed at the project) there is a problem

Are you compiling with stack frames enabled during one compile, and with
stack frames disabled for another compile?

When accessing function parameters, the EBP register should be used only
when stack frames are enabled, and ESP is used instead when stack frames
are disabled. Unless the compiler has a serious bug, it should not be mixing
EBP and ESP together like you claim.

When you compile for debug, stack frames are enabled by default, and when
you compile for release, stack frames are disabled by default. Have you
tried using #pragma to force enable/disable stack frames in code? Or, you
could use __declspec(naked) to remove the function prolog and just code the
entire function in assembly (which you will have to do anyway if you ever
port the project to 64bit), then you have full control over the usage of
CPU registers.

--
Remy Lebeau (TeamB)
Jan Dijkstra

Posts: 206
Registered: 11/4/99
Re: bcc32, XE8 Enterprise, inline assembly question
  Posted: Jan 2, 2017 2:44 AM   in response to: Remy Lebeau (Te...
Standard stack frames are enabled.

And I'm not talking about switching between debug and release mode. I'm talking about actual compiling, and the "compile to assembly" option. The project is in debug target mode the whole time. Actual compile messes up with references to ESP, whereas "compile to assembly" generates the correct code.