Parallel Matrix Multiplication on Soft Cores

[Posted by Jenn-Ching Luo on Mar. 01, 2012 ]

        Starting with this post, this writer introduces the parallel performance of LAIPE2 on multicore machines.

        LAIPE2 follows grandpa LAIPE in providing a library to "Link and in Parallel Execute" programs. The big difference is that grandpa LAIPE is programmed in MTASK, whereas LAIPE2 is based on soft core computing (e.g., neuLoop). This post gives a first look at LAIPE2 performance.

        This post presents the parallel performance of the function laipe$matmul, which performs matrix multiplication (e.g., [c]=[a][b]). The function implements standard matrix multiplication, restructured from its original form into a new form suited to parallel processing.
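
        The post does not show how laipe$matmul is restructured, so the following is only a minimal sketch of one common way to make standard matrix multiplication parallel-friendly; it is not the LAIPE2 implementation. It assumes the columns of [c] are split into contiguous ranges, each of which can be computed independently. The names column_split_matmul, multiply_columns, and nparts are hypothetical, and the parts run one after another here rather than on separate cores.

```fortran
program column_split_matmul
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nra = 6, nca = 5, ncb = 7
  integer, parameter :: nparts = 3     ! hypothetical number of workers
  real(dp) :: a(nra,nca), b(nca,ncb), c(nra,ncb)
  integer :: part, jlo, jhi

  call random_number(a)
  call random_number(b)

  ! Each part owns a contiguous range of columns of [c]. No two parts
  ! write the same element, so the parts could run on separate cores.
  do part = 1, nparts
     jlo = (part - 1) * ncb / nparts + 1
     jhi = part * ncb / nparts
     call multiply_columns(jlo, jhi)
  end do

  ! compare against the intrinsic as a sanity check
  write(*,'(a,es10.2)') ' max difference from MATMUL: ', &
       maxval(abs(c - matmul(a, b)))

contains

  ! standard triple loop restricted to columns first..last of [c]
  subroutine multiply_columns(first, last)
    integer, intent(in) :: first, last
    integer :: i, j, k
    do j = first, last
       c(:, j) = 0.0_dp
       do k = 1, nca
          do i = 1, nra
             c(i, j) = c(i, j) + a(i, k) * b(k, j)
          end do
       end do
    end do
  end subroutine multiply_columns

end program column_split_matmul
```

        The innermost loop runs down a column, which matches Fortran's column-major storage; the printed difference from MATMUL should be at rounding-error level.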

TEST PROGRAM AND PLATFORM

        The program gets the number of physical cores on the computer, initializes matrices [a] and [b], and then repeatedly executes the function laipe$matmul_8, collecting timing results with 1 soft core, 2 soft cores, and so on up to the number of physical cores. The function laipe$matmul_8 is for 8-byte REAL variables, i.e., double precision.

        After timing the function laipe$matmul_8, the program also times the FORTRAN intrinsic function MATMUL for comparison. The FORTRAN program follows. Interested users can link the program against LAIPE2 and see the performance on their own computers.

program mx_8
!! default attribute
implicit none

!! parameters
integer*4, parameter :: NRA = 2400
integer*4, parameter :: NCA = 1850
integer*4, parameter :: NCB = 1900

!! common variables
integer*4 :: cores
integer*4 :: i, j, ii, jj
integer*4 :: time_array(8)
real*8, save :: a(nra,nca), b(nca,ncb)
real*8, save :: c(nra,ncb)
real*4 :: elapsedTime
real*4 :: userTime
real*4 :: kernelTime
real*4 :: totalTime

!
! number of physical cores
!
call laipe$getce(cores)

!
! arbitrarily initialize matrices
!
! time_array(7) is seconds, time_array(8) is milliseconds
do j = 1, nca
   do i = 1, nra
      call date_and_time(values=time_array)
      ii = mod(time_array(7),i)
      jj = mod(time_array(8),j)
      a(i,j) = (ii-1)+(jj-1)
   end do
end do

do j = 1, ncb
   do i = 1, nca
      call date_and_time(values=time_array)
      ii = mod(time_array(7),i)
      jj = mod(time_array(8),j)
      b(i,j) = (ii-1)*(jj-1)
   end do
end do

!
! collect timing results
!
do i = 1,cores
   call laipe$use(i)
   write(*,*)
   write(*,'(''number of cores:'',i3)') i

   !
   ! initialize timer -- start collecting time
   !
   call laipe$resetUserTimes

   !
   ! do multiplication
   !
   call laipe$matmul_8(a,b,c,nra,nca,ncb)

   !
   ! output user times
   !
   call laipe$getUserTimes( &
      & elapsedTime, userTime,&
      & kernelTime,totalTime)
   write(*,'('' Elapsed Time (Seconds): '', &
      & F8.2)') elapsedTime
   write(*,'('' CPU Time in User Mode &
      & (Seconds): '', F8.2)') userTime
   write(*,'('' CPU Time in Kernel Mode &
      &(Seconds): '', F8.2)') kernelTime
   write(*,'('' Total CPU Time (Seconds): &
      & '',F8.2)') totalTime
end do

!
! collect timing in implementing Fortran
! intrinsic function MATMUL
!
write(*,*)
write(*,'(a)') ' Timing in implementing Fortran intrinsic MATMUL'
call laipe$resetUserTimes
c = matmul(a,b)
call laipe$getUserTimes(elapsedTime, &
& userTime,kernelTime,totalTime)
write(*,'('' Elapsed Time (Seconds): '', &
& F8.2)') elapsedTime
write(*,'('' CPU Time in User Mode &
& (Seconds): '', F8.2)') userTime
write(*,'('' CPU Time in Kernel Mode &
&(Seconds): '', F8.2)') kernelTime
write(*,'('' Total CPU Time (Seconds): &
& '',F8.2)') totalTime

!
! deallocate soft cores
!
call laipe$done

!
! end of program
!
end program mx_8

The test platform remains the same as in the previous post, "Programming Language is Not a Drug to Treat Poor Parallelism". It is a Windows Server 2008 R2 machine with eight Opteron 870 cores. The hardware this writer uses is outdated. The program is compiled with GFORTRAN at optimization level -O3.

TIMING RESULT

We collected the following set of timing results, including elapsed time, CPU time in user mode, CPU time in kernel mode, and total CPU time.

number of cores: 1
Elapsed Time (Seconds): 19.89
CPU Time in User Mode (Seconds): 19.87
CPU Time in Kernel Mode (Seconds): 0.02
Total CPU Time (Seconds): 19.89

number of cores: 2
Elapsed Time (Seconds): 9.98
CPU Time in User Mode (Seconds): 19.84
CPU Time in Kernel Mode (Seconds): 0.09
Total CPU Time (Seconds): 19.94

number of cores: 3
Elapsed Time (Seconds): 6.76
CPU Time in User Mode (Seconds): 20.14
CPU Time in Kernel Mode (Seconds): 0.06
Total CPU Time (Seconds): 20.20

number of cores: 4
Elapsed Time (Seconds): 5.15
CPU Time in User Mode (Seconds): 20.40
CPU Time in Kernel Mode (Seconds): 0.08
Total CPU Time (Seconds): 20.48

number of cores: 5
Elapsed Time (Seconds): 4.21
CPU Time in User Mode (Seconds): 20.61
CPU Time in Kernel Mode (Seconds): 0.06
Total CPU Time (Seconds): 20.67

number of cores: 6
Elapsed Time (Seconds): 3.59
CPU Time in User Mode (Seconds): 21.20
CPU Time in Kernel Mode (Seconds): 0.09
Total CPU Time (Seconds): 21.29

number of cores: 7
Elapsed Time (Seconds): 3.17
CPU Time in User Mode (Seconds): 21.90
CPU Time in Kernel Mode (Seconds): 0.06
Total CPU Time (Seconds): 21.96

number of cores: 8
Elapsed Time (Seconds): 2.92
CPU Time in User Mode (Seconds): 22.90
CPU Time in Kernel Mode (Seconds): 0.08
Total CPU Time (Seconds): 22.98

Timing in implementing Fortran intrinsic MATMUL

Elapsed Time (Seconds): 78.42
CPU Time in User Mode (Seconds): 78.42
CPU Time in Kernel Mode (Seconds): 0.00
Total CPU Time (Seconds): 78.42

First, consider the FORTRAN intrinsic function MATMUL, which ran on one core and required 78.42 seconds. The function laipe$matmul_8 on one core took only 19.89 seconds to complete the identical job, about 3.9 times faster. Even on a single core, laipe$matmul_8 significantly outperforms the FORTRAN intrinsic function MATMUL.
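
For readers who want to check the MATMUL baseline without LAIPE2, here is a minimal sketch that times the intrinsic using only the standard SYSTEM_CLOCK and CPU_TIME intrinsics. It is not the timer used in this post: laipe$getUserTimes reports user and kernel CPU time separately, while CPU_TIME gives a single CPU figure. As a further assumption, the matrices are filled with RANDOM_NUMBER instead of the date_and_time scheme of the test program.

```fortran
program time_matmul
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: nra = 2400, nca = 1850, ncb = 1900
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
  integer(i8) :: tick0, tick1, rate
  real(dp) :: cpu0, cpu1

  allocate(a(nra,nca), b(nca,ncb), c(nra,ncb))
  call random_number(a)
  call random_number(b)

  ! wall-clock and CPU time around the intrinsic
  call system_clock(tick0, rate)
  call cpu_time(cpu0)
  c = matmul(a, b)
  call cpu_time(cpu1)
  call system_clock(tick1)

  write(*,'(a,f8.2)') ' Elapsed Time (Seconds): ', &
       real(tick1 - tick0, dp) / real(rate, dp)
  write(*,'(a,f8.2)') ' CPU Time (Seconds):     ', cpu1 - cpu0

  ! use the result so the multiplication cannot be optimized away
  write(*,'(a,es12.4)') ' checksum of [c]: ', sum(c)
end program time_matmul
```

The absolute numbers depend on the compiler, its options, and the machine, so they are only comparable with the results here when measured on the same platform.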

We summarize the timing results to see speedup and efficiency of laipe$matmul_8.

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	19.89	1.00	100.00
2	9.98	1.99	99.65
3	6.76	2.94	98.08
4	5.15	3.86	96.55
5	4.21	4.72	94.49
6	3.59	5.54	92.34
7	3.17	6.27	89.63
8	2.92	6.81	85.15

From the above table, we can see that the elapsed time falls almost in inverse proportion to the number of soft cores. For example, the elapsed time is reduced from 19.89 seconds to 9.98 seconds when using 2 soft cores, a 1.99x speedup at 99.65% efficiency; 4 soft cores cut the elapsed time to 5.15 seconds, a 3.86x speedup at 96.55% efficiency. Efficiency declines gradually as cores are added, reaching 85.15% (a 6.81x speedup) on 8 cores, so the speedup is almost linear in the number of cores.
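
The figures in the table are consistent with the usual definitions: speedup on p cores is the one-core elapsed time divided by the p-core elapsed time, and efficiency is the speedup divided by p, expressed as a percentage. The short program below is a sketch that recomputes both columns from the elapsed times reported above; the program name speedup_table is just an illustrative choice.

```fortran
program speedup_table
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 8
  ! elapsed times (seconds) of laipe$matmul_8 on 1..8 soft cores,
  ! copied from the timing results in this post
  real(dp), parameter :: t(n) = [19.89_dp, 9.98_dp, 6.76_dp, 5.15_dp, &
                                 4.21_dp, 3.59_dp, 3.17_dp, 2.92_dp]
  real(dp) :: speedup, efficiency
  integer :: p

  write(*,'(a)') ' cores  elapsed  speedup  efficiency(%)'
  do p = 1, n
     speedup    = t(1) / t(p)                      ! S(p) = T(1) / T(p)
     efficiency = 100.0_dp * speedup / real(p, dp) ! E(p) = 100 * S(p) / p
     write(*,'(i6,f9.2,f9.2,f15.2)') p, t(p), speedup, efficiency
  end do
end program speedup_table
```

Replacing the values in t with the elapsed times of another data type gives the corresponding speedup and efficiency columns.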

The above example uses 8-byte REAL variables, i.e., double precision. The following tables show the parallel performance of laipe$matmul for other data types. All of them show efficient parallel performance, with efficiency on 8 cores ranging from 73.25% (laipe$matmul_z10) to 98.59% (laipe$matmul_16).

LAIPE$MATMUL_4 (FOR 4-BYTE REAL VARIABLE)

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	16.29	1.00	100.00
2	8.20	1.99	99.33
3	5.71	2.85	95.10
4	4.40	3.70	92.56
5	3.57	4.56	91.26
6	3.01	5.41	90.20
7	2.65	6.15	87.82
8	2.36	6.90	86.28

LAIPE$MATMUL_10 (FOR 10-BYTE REAL VARIABLE)

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	80.36	1.00	100.00
2	40.26	2.00	99.80
3	27.35	2.94	97.94
4	20.83	3.86	96.45
5	17.08	4.70	94.10
6	14.51	5.54	92.30
7	12.71	6.32	90.32
8	11.47	7.01	87.58

LAIPE$MATMUL_Z4 (FOR 8-BYTE COMPLEX VARIABLE)

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	44.62	1.00	100.00
2	22.32	2.00	99.96
3	14.90	2.99	99.82
4	11.25	3.97	99.16
5	9.03	4.94	98.83
6	7.58	5.89	98.11
7	6.54	6.82	97.47
8	5.79	7.71	96.33

LAIPE$MATMUL_Z8 (FOR 16-BYTE COMPLEX VARIABLE)

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	71.28	1.00	100.00
2	35.90	1.99	99.28
3	24.54	2.90	96.82
4	18.74	3.80	95.09
5	15.41	4.63	92.51
6	13.18	5.41	90.14
7	11.61	6.14	87.71
8	10.62	6.71	83.90

LAIPE$MATMUL_Z10 (FOR 20-BYTE COMPLEX VARIABLE)

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	171.23	1.00	100.00
2	83.62	2.05	102.39
3	57.03	3.00	100.08
4	44.13	3.88	97.00
5	37.05	4.62	92.43
6	32.59	5.25	87.57
7	30.23	5.66	80.92
8	29.22	5.86	73.25

LAIPE$MATMUL_16 (FOR 16-BYTE REAL VARIABLE)
WITH A SMALLER PROBLEM SIZE (NRA=960, NCA=740, NCB=760)

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	71.14	1.00	100.00
2	35.93	1.98	99.00
3	23.92	2.97	99.14
4	18.00	3.95	98.81
5	14.63	4.86	97.25
6	12.35	5.76	96.01
7	10.42	6.83	97.53
8	9.02	7.89	98.59

LAIPE$MATMUL_Z16 (FOR 32-BYTE COMPLEX VARIABLE)
WITH A SMALLER PROBLEM SIZE (NRA=960, NCA=740, NCB=760)

number of cores	elapsed time (sec.)	speedup	efficiency (%)
1	243.38	1.00	100.00
2	122.91	1.98	99.01
3	81.95	2.97	99.00
4	61.53	3.96	98.89
5	49.97	4.87	97.41
6	42.26	5.76	95.99
7	35.85	6.79	96.98
8	30.89	7.88	98.49