Parallelizing Loops on the Face of a Program is Not Enough for Multicore Computing

Parallelizing Loops on the Face of a Program is not enough for Multicore Computing

[Posted by Jenn-Ching Luo on July 19, 2011 ]

        Multicore becomes a standard feature of personal computers. We have a common dream to speed up applications on multicores. The question is how to make it possible.

        Some people rely on parallelizing loops on the face of a program. The following uses an example to show it is impractical to parallelize loops on the face of a program.

THE EXAMPLE

        We use a common example, matrix multiplication (e.g. [C]=[A][B]) to show a failure from parallelizing loops.

        The example matrix [A] is of order (NRAxNCA), and matrix [B] is order (NCAxNCB), and matrix [C] is of order (NRAxNCB), where NRA=2400, and NCA=1850, and NCB=1900. Coefficients of matrices [A] and [B] were generated without a particular meaning, and are declared as 8-byte REAL variables in FORTRAN statements. The loops are parallelized in openMP as:

!
! Spawn a parallel region
!
!$OMP PARALLEL SHARED(A,B,C,CHUNK) &
!$OMP& PRIVATE(I,J,K,II,JJ)

!! where CHUNK is an integer and CHUNK=10;
!! A,B,C are matrices as described;
!! I,J,K,II,JJ: integer, temporary variables

!
! Initialize matrices
!
!$OMP DO SCHEDULE(STATIC, CHUNK)
DO J = 1, NCA
DO I = 1, NRA
CALL DATE_AND_TIME &
&(VALUES=TIME_ARRAY)
!! where TIME_ARRAY: integer array
!! with 8 elements
II = MOD(TIME_ARRAY(7),I)
JJ = MOD(TIME_ARRAY(8),J)
A(I,J) = (II-1)+(JJ-1)
END DO
END DO

!$OMP DO SCHEDULE(STATIC, CHUNK)
DO J = 1, NCB
DO I = 1, NCA
CALL DATE_AND_TIME &
&(VALUES=TIME_ARRAY)
II = MOD(TIME_ARRAY(7),I)
JJ = MOD(TIME_ARRAY(8),J)
B(I,J) = (II-1)*(JJ-1)
END DO
END DO

!$OMP DO SCHEDULE(STATIC, CHUNK)
DO J = 1, NCB
DO I = 1, NRA
C(I,J) = 0
END DO
END DO

!
! Do matrix multiply sharing iterations on outer loop
! Demonstrate who does iterations
!
!$OMP DO SCHEDULE(STATIC, CHUNK)
DO J = 1, NCB
DO K = 1, NCA
DO I = 1, NRA
C(I,J) = C(I,J) + A(I,K) * B(K,J)
END DO
END DO
END DO

!
! End of parallel region
!
!$OMP END PARALLEL

The above example includes several loops, which are completely parallelized. From the face of the program, it seems the example program could be parallelized well. We are going to see if it could work.

TEST PLATFORM

We tested the example on a SunFire v40z with quad dual-core opteron 870 chips on Windows Server 2008 R2, a total of 8 cores. The program was compiled against gfortran 4.7.

IMPLEMENTATION WITHOUT OPTIMIZATION

First, we executed the program without an optimization, e.g. option -O0. We get a set of performance:

With option -O0
number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	193.83	1.00	100.00
2	98.61	1.97	98.28
3	66.13	2.93	97.70
4	52.82	3.67	91.74
5	45.69	4.24	84.85
6	42.42	4.57	76.16
7	40.45	4.79	68.45
8	37.33	5.19	64.90

At a glance, it seems the raw program that parallelizes loops on the face of a program has a good performance.

        The elapsed time could be reduced when more cores were used. Two cores reduced the elapsed time from 193.83 seconds to 98.61 seconds, a speedup of 1.97x and 98% efficiency. The raw code shows an efficient performance; four cores completed the example in 52.82 seconds, a speedup of 3.67x, about 92% of efficiency; that is also an efficient performance; eight cores could complete the example in 37.33 seconds, a speedup 5.19. The raw code yields an acceptable performance.

        However, it is impractical to run a program without an optimization, which is too slow. Un-optimized executable is for debugging purpose, not for applications.

        Next, we see performance after the program was optimized with compiler options.

IMPLEMENTATION WITH OPTIMIZATION

        Now, we re-compiled the example with optimization (e.g., with option -O1 -O2 -O3, respectively). We get a set of performance in the following:

With Option -O1
number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	48.45	1.00	100.00
2	29.75	1.63	81.43
3	25.96	1.87	62.21
4	23.98	2.02	50.51
5	23.85	2.03	40.63
6	27.42	1.77	29.45
7	29.41	1.65	23.53
8	31.25	1.55	19.38

With Option -O2
number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	48.72	1.00	100.00
2	33.71	1.45	72.26
3	24.63	1.98	65.94
4	21.72	2.24	56.08
5	25.49	1.91	38.23
6	27.60	1.77	29.42
7	30.22	1.61	23.03
8	31.57	1.54	19.29

With Option -O3
number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	44.44	1.00	100.00
2	25.44	1.75	87.34
3	20.23	2.20	73.22
4	21.96	2.02	50.59
5	22.73	1.96	39.10
6	26.38	1.68	28.08
7	29.25	1.52	21.70
8	29.97	1.48	18.53

        Look at the second column for the elapsed time. In the range of 1 to 4 cores, the elapsed time can be reduced when using more cores; While in the range of 5~8 cores, it not only could not speed up but also got worse. Conversion of loops into parallelism only can take advantage of a small number of cores, which does not meet our needs. A chip gets more and more cores. For the time being, there are 12-core chips on the market. The example that parallelizes loops on the face of a program can take advantage of about 4 cores. That is far behind our demand.

        From this example, we can see it is impractical to parallelize loops on the face of a program. We cannot solely rely on conversion of loops into parallelism from the face of a program.

        If we don't rely on parallelizing loops, what can we do? That needs to find a parallelism from the problem, not to identify loops from the program. Performance could be quite different. For example, we copy a performance from "Grandpa on Multicores (2)", which was obtained with an option of optimization -O3 on the same testing platform:

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	1588.65	1.00	100.00
2	795.04	2.00	99.91
3	531.32	2.99	99.67
4	399.61	3.98	99.39
5	320.27	4.96	99.21
6	268.23	5.93	98.71
7	231.91	6.85	97.86
8	202.46	7.85	98.08

Grandpa LAIPE shows an almost perfect performance. The above performance speaks itself. On 8 cores, grandpa LAIPE can speed up to 7.85x.

Grandpa LAIPE almost linearly reduced the elapsed time when using more cores. What's the secret? Grandpa LAIPE does not parallelize loops on the face of program, but has asynchronous parallelism that was from mathematical derivation. We can see grandpa LAIPE's performance is quite different from parallelizing loops on the face of program.

How to take advantage of multicores? Apparently, parallelizing loops on the face of a program might be not enough.