Thread: real case test MM parallel -> 4x more scalable (i7 6700)



Replies: 9 - Last Post: Jul 27, 2017 2:48 PM by Roberto Della P...
Roberto Della P...

Posts: 83
Registered: 4/8/12
real case test MM parallel -> 4x more scalable (i7 6700)
  Posted: Jul 25, 2017 10:37 PM
I ran a small test with a real-world scenario:
parallel zlib with my patch, calling ZCompress in a loop 1000 times on a 1100 KB text file:

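uses
  System.Classes, System.Threading, System.ZLib;

threadvar
  // One set of working variables per worker thread
  INS: TMemoryStream;
  OUTS: Pointer;
  SizeIn: Integer;
  SizeOUT: Integer;

procedure TForm.CompressClick(Sender: TObject);
var
  Count: Cardinal; // GetTickCount returns an unsigned value
begin
  Count := GetTickCount;
  TParallel.For(1, 1000, procedure(I: Integer)
    begin
      // Load the input file and compress it once per iteration
      INS := TMemoryStream.Create;
      INS.LoadFromFile('c:\teststream.txt');
      SizeIn := INS.Size;
      // Note: the stock System.ZLib ZCompress declares OUTS as an out
      // parameter and allocates it internally, so this block is likely
      // never freed.
      GetMem(OUTS, SizeIn);
      SizeOUT := SizeIn;
      ZCompress(INS.Memory, SizeIn, OUTS, SizeOUT, zcFastest);
      INS.Free;
      FreeMem(OUTS);
    end);
  // Elapsed time in milliseconds
  ShowMessage(IntToStr(GetTickCount - Count));
end;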

- FastMM4: 900-1000 msec
- BrainMM: 563 msec
- msheap: 532 msec
- my patch (Intel IPP + TBB): 281 msec
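
To put these timings side by side, the ratios against the 281 msec result work out roughly as follows. This is plain arithmetic on the figures above, taking both ends of the FastMM4 range:

```latex
\begin{align*}
\text{FastMM4: } & \frac{900}{281} \approx 3.2 \quad\text{to}\quad \frac{1000}{281} \approx 3.6 \\
\text{BrainMM: } & \frac{563}{281} \approx 2.0 \\
\text{msheap: }  & \frac{532}{281} \approx 1.9
\end{align*}
```

So in this run the gap to FastMM4 is between about 3.2x and 3.6x, and about 2x to the other two.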

www.dellapasqua.com
www.dellapasqua.com/intelTBB.rar (put a teststream.txt file on c:\ and run files)
Luigi Sandon

Posts: 353
Registered: 10/15/99
Re: real case test MM parallel -> 4x more scalable (i7 6700)
  Posted: Jul 26, 2017 6:54 AM   in response to: Roberto Della P...
For completeness, how did you set up FastMM4 in FastMM4.inc?
Eric Fleming Bo...

Posts: 48
Registered: 8/11/02
Re: real case test MM parallel -> 4x more scalable (i7 6700)
  Posted: Jul 26, 2017 7:17 AM   in response to: Roberto Della P...
Would you run the same test again, but using FastMM 4.992 (the latest version) with the UseReleaseStack option?

You will have to add {$define UseReleaseStack} manually to the FastMM4Options.inc file, since it's not there.

Just make sure you use version 4.992 with Release Stack.

This option was developed to minimize multi-thread contention. I tested it in our application and am now using it for all our projects because the result is amazing.
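
A minimal sketch of that change, assuming the define simply goes alongside the other option defines in FastMM4Options.inc (its exact position in the file is an assumption here):

```pascal
{ FastMM4Options.inc }

{ Added by hand, since the option is not listed in the shipped file. }
{$define UseReleaseStack}
```

Rebuild the project afterwards so that FastMM4 is recompiled with the option enabled.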
Eric Fleming Bo...

Posts: 48
Registered: 8/11/02
Re: real case test MM parallel -> 4x more scalable (i7 6700)
  Posted: Jul 26, 2017 7:27 AM   in response to: Eric Fleming Bo...
Multi-thread contention was a big problem for us. Our application is highly multi-threaded. After updating the library, some of our customers saw a 30 to 40% decrease in CPU usage. We never knew that our application had so much thread contention.

We tried ScaleMM2 in the past, but it used too much memory. Now we are using FastMM4 with Release Stack and it couldn't be better.

But I'm curious to see the difference between FastMM with Release Stack and Intel TBB.
Roberto Della P...

Posts: 83
Registered: 4/8/12
Re: real case test MM parallel -> 4x more scalable (i7 6700)
  Posted: Jul 26, 2017 9:14 AM   in response to: Eric Fleming Bo...
Eric Fleming Bonilha wrote:
Multi-thread contention was a big problem for us. Our application is highly multi-threaded. After updating the library, some of our customers saw a 30 to 40% decrease in CPU usage. We never knew that our application had so much thread contention.

We tried ScaleMM2 in the past, but it used too much memory. Now we are using FastMM4 with Release Stack and it couldn't be better.

But I'm curious to see the difference between FastMM with Release Stack and Intel TBB.

TParallel.For ZCompress, 10k size files:
latest FastMM4 with your option enabled: 3812 msec
my IPP+TBB: 2040 msec

In David Heffernan's detailed benchmark (without parallel loops), this FastMM4 is still slower than ScaleMM2; you can compare the results on my page http://www.dellapasqua.com/snappy64/

Why don't you test it and tell us your results?

Roberto Della P...

Posts: 83
Registered: 4/8/12
Re: real case test MM parallel -> 4x more scalable (i7 6700)
  Posted: Jul 26, 2017 9:25 AM   in response to: Roberto Della P...
IMHO we would need to test a real, heavily multithreaded client/server application and see the differences in terms of CPU ratio and I/O speed, but at the moment I don't have such a setup.
Roberto Della P...

Posts: 83
Registered: 4/8/12
Re: real case test MM parallel -> 4x more scalable (i7 6700)
  Posted: Jul 27, 2017 7:22 AM   in response to: Roberto Della P...
Look at this benchmark, which uses only low-level operations without streams, etc.: https://codingforspeed.com/integer-performance-comparison-for-c-c-delphi/ (shorter = faster)

TBB+IPP
Serial = 3812
Parallel1 = 875
Parallel2 = 750
Parallel3 = 750

ScaleMM2
Serial = 4000
Parallel1 = 890
Parallel2 = 782
Parallel3 = 781

Default FastMM4
Serial = 4688
Parallel1 = 1172
Parallel2 = 1062
Parallel3 = 1063
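
As a rough reading of these figures, dividing each default memory manager result by the corresponding TBB+IPP result gives the ratios below. This is plain arithmetic on the numbers above and makes no assumption about what each row measures:

```latex
\begin{align*}
\text{Serial: }    & \frac{4688}{3812} \approx 1.23 \\
\text{Parallel1: } & \frac{1172}{875}  \approx 1.34 \\
\text{Parallel2: } & \frac{1062}{750}  \approx 1.42 \\
\text{Parallel3: } & \frac{1063}{750}  \approx 1.42
\end{align*}
```

By the same calculation, the ScaleMM2 figures are within about 2 to 5% of the TBB+IPP ones on every row.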
Eric Fleming Bo...

Posts: 48
Registered: 8/11/02
Re: real case test MM parallel -> 4x more scalable (i7 6700)
  Posted: Jul 27, 2017 7:46 AM   in response to: Roberto Della P...
IMHO we would need to test a real, heavily multithreaded client/server application and see the differences in terms of CPU ratio and I/O speed, but at the moment I don't have such a setup.

I can try it in a real-world application, but do you have a 32-bit build of TBB as well?
Roberto Della P...

Posts: 83
Registered: 4/8/12
Re: real case test MM parallel -> 4x more scalable (i7 6700)
  Posted: Jul 27, 2017 10:16 AM   in response to: Eric Fleming Bo...
Eric Fleming Bonilha wrote:
IMHO we would need to test a real, heavily multithreaded client/server application and see the differences in terms of CPU ratio and I/O speed, but at the moment I don't have such a setup.

I can try it in a real-world application, but do you have a 32-bit build of TBB as well?

At the moment I have only done the 64-bit version. Do you compile for 64-bit?
Do you use the zlib deflate algorithm to compress packets sent over the network between client and server?
Also check out my Cloudflare zlib version.
Roberto Della P...

Posts: 83
Registered: 4/8/12
Re: real case test MM parallel -> 4x more scalable (i7 6700)
  Posted: Jul 27, 2017 2:48 PM   in response to: Roberto Della P...
I did a test with Win32 (but without one routine). The results are only slightly better, probably because i686 lacks the newer SIMD extensions, and also because the 32-bit Delphi RTL is already well optimized with a lot of assembler routines.

Tomorrow I will try to find time to finish.