Equation Solution
High Performance by Design

Parallel Matrix Multiplication on Soft Cores

[Posted by Jenn-Ching Luo on Mar. 01, 2012 ]

Starting with this post, this writer introduces the parallel performance of LAIPE2 on multicore machines.

LAIPE2 succeeds grandpa LAIPE as a library to "Link and in Parallel Execute" programs. The big difference between the two is that grandpa LAIPE is programmed in MTASK, while LAIPE2 is based on soft core computing (e.g., neuLoop). This post gives a first look at LAIPE2 performance.

This post presents the parallel performance of the function laipe$matmul, which performs matrix multiplication (e.g., [c]=[a][b]). The function implements standard matrix multiplication, not in its original form but in a new form suited to parallel processing.
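
The post does not show the "new form" used inside the library. As a rough illustration only, and not LAIPE2's actual algorithm, the sketch below shows one common way to reorganize [c]=[a][b] for parallel execution: the columns of [c] are split into independent blocks, so each block could be handed to a different core without write conflicts. The names blocked_matmul_sketch, matmul_column_block, blocked_matmul, and nblocks are hypothetical, and the blocks here run one after another rather than in parallel.

```fortran
module blocked_matmul_sketch
   implicit none
   integer, parameter :: dp = kind(1.0d0)
contains
   ! Compute columns jlo..jhi of c = a*b. Each call writes only its own
   ! columns of c, so calls for disjoint column ranges are independent.
   subroutine matmul_column_block(a, b, c, jlo, jhi)
      real(dp), intent(in)    :: a(:,:), b(:,:)
      real(dp), intent(inout) :: c(:,:)
      integer,  intent(in)    :: jlo, jhi
      integer :: j, k
      do j = jlo, jhi
         c(:,j) = 0.0_dp
         do k = 1, size(a, 2)
            ! column-oriented update: walks a and c with stride 1
            c(:,j) = c(:,j) + a(:,k) * b(k,j)
         end do
      end do
   end subroutine matmul_column_block

   ! Split the columns of c into nblocks contiguous ranges and process each.
   ! The ranges are processed sequentially here; a parallel library would
   ! dispatch them to different cores.
   subroutine blocked_matmul(a, b, c, nblocks)
      real(dp), intent(in)  :: a(:,:), b(:,:)
      real(dp), intent(out) :: c(:,:)
      integer,  intent(in)  :: nblocks
      integer :: ib, jlo, jhi, ncols
      ncols = size(b, 2)
      do ib = 1, nblocks
         jlo = ((ib - 1) * ncols) / nblocks + 1
         jhi = (ib * ncols) / nblocks
         call matmul_column_block(a, b, c, jlo, jhi)
      end do
   end subroutine blocked_matmul
end module blocked_matmul_sketch

program demo_blocked_matmul
   use blocked_matmul_sketch
   implicit none
   real(dp) :: a(6,5), b(5,7), c(6,7)
   call random_number(a)
   call random_number(b)
   call blocked_matmul(a, b, c, 3)
   ! compare against the intrinsic as a sanity check
   print '(a, es10.2)', 'max abs difference from MATMUL: ', maxval(abs(c - matmul(a, b)))
end program demo_blocked_matmul
```

Because every block touches a disjoint set of columns, no synchronization is needed between blocks; only the load balance depends on how the column ranges are chosen.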

TEST PROGRAM AND PLATFORM

The program gets the number of physical cores on the computer, initializes matrices [a] and [b], and then repeatedly executes the function laipe$matmul_8, collecting timing results with 1 soft core, 2 soft cores, and so on up to the number of physical cores. The function laipe$matmul_8 is for 8-byte REAL variables, i.e., double precision.

After timing the function laipe$matmul_8, the program also times the FORTRAN intrinsic function MATMUL for comparison. The FORTRAN program follows. Interested readers can link the program against LAIPE2 and see the performance on their own computers.

      program mx_8
      !! default attribute
      implicit none

      !! parameters
      integer*4, parameter :: NRA = 2400
      integer*4, parameter :: NCA = 1850
      integer*4, parameter :: NCB = 1900

      !! common variables
      integer*4 :: cores
      integer*4 :: i, j, ii, jj
      integer*4 :: time_array(8)
      real*8, save :: a(nra,nca), b(nca,ncb)
      real*8, save :: c(nra,ncb)
      real*4 :: elapsedTime
      real*4 :: userTime
      real*4 :: kernelTime
      real*4 :: totalTime
!
! number of physical cores
!
      call laipe$getce(cores)
!
! arbitrarily initialize matrices
!
      do j = 1, nca
            do i = 1, nra
                  call date_and_time(values=time_array)
                  ii = mod(time_array(7),i)
                  jj = mod(time_array(8),j)
                  a(i,j) = (ii-1)+(jj-1)
            end do
      end do

      do j = 1, ncb
            do i = 1, nca
                  call date_and_time(values=time_array)
                  ii = mod(time_array(7),i)
                  jj = mod(time_array(8),j)
                  b(i,j) = (ii-1)*(jj-1)
            end do
      end do
!
! collect timing results
!
      do i = 1,cores
            call laipe$use(i)
            write(*,*)
            write(*,'(''number of cores:'',i3)') i
      !
      ! initialize timer -- start collecting time
      !
            call laipe$resetUserTimes
      !
      ! do multiplication
      !
            call laipe$matmul_8(a,b,c,nra,nca,ncb)
      !
      ! output user times
      !
            call laipe$getUserTimes( &
                & elapsedTime, userTime,&
                & kernelTime,totalTime)
            write(*,'('' Elapsed Time (Seconds): '', F8.2)') elapsedTime
            write(*,'('' CPU Time in User Mode (Seconds): '', F8.2)') userTime
            write(*,'('' CPU Time in Kernel Mode (Seconds): '', F8.2)') kernelTime
            write(*,'('' Total CPU Time (Seconds): '', F8.2)') totalTime
      end do
!
! collect timing in implementing Fortran
! intrinsic function MATMUL
!
      write(*,*)
      write(*,'('' Timing in implementing Fortran intrinsic MATMUL'')')
      call laipe$resetUserTimes
      c = matmul(a,b)
      call laipe$getUserTimes(elapsedTime, &
          & userTime,kernelTime,totalTime)
      write(*,'('' Elapsed Time (Seconds): '', F8.2)') elapsedTime
      write(*,'('' CPU Time in User Mode (Seconds): '', F8.2)') userTime
      write(*,'('' CPU Time in Kernel Mode (Seconds): '', F8.2)') kernelTime
      write(*,'('' Total CPU Time (Seconds): '', F8.2)') totalTime
!
! deallocate soft cores
!
      call laipe$done
!
! end of program
!
      end program mx_8
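
Readers without LAIPE2 can still reproduce the MATMUL baseline. The sketch below is not part of the original test program; it times the intrinsic MATMUL using the standard SYSTEM_CLOCK (elapsed time) and CPU_TIME (processor time) intrinsics in place of the LAIPE2 timer calls. The program name, the allocatable arrays, and the use of random_number for initialization are assumptions made for illustration, and CPU_TIME does not separate user-mode from kernel-mode time.

```fortran
program time_intrinsic_matmul
   use iso_fortran_env, only: int64, real64
   implicit none
   ! same problem size as the test program above
   integer, parameter :: nra = 2400, nca = 1850, ncb = 1900
   real(real64), allocatable :: a(:,:), b(:,:), c(:,:)
   integer(int64) :: count_start, count_end, count_rate
   real(real64) :: cpu_start, cpu_end, elapsed

   allocate(a(nra,nca), b(nca,ncb), c(nra,ncb))
   call random_number(a)
   call random_number(b)

   ! start both timers, multiply, then stop both timers
   call system_clock(count_start, count_rate)
   call cpu_time(cpu_start)
   c = matmul(a, b)
   call cpu_time(cpu_end)
   call system_clock(count_end)

   ! convert clock ticks to seconds
   elapsed = real(count_end - count_start, real64) / real(count_rate, real64)

   write(*,'('' Elapsed Time (Seconds): '', F8.2)') elapsed
   write(*,'('' Total CPU Time (Seconds): '', F8.2)') cpu_end - cpu_start
   ! print one entry so the compiler cannot discard the multiplication
   write(*,'('' c(1,1) = '', ES12.4)') c(1,1)
end program time_intrinsic_matmul
```

For a single-threaded run the two numbers should be close to each other; a large gap between elapsed time and CPU time usually points to other load on the machine.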

The test platform remains the same as in the previous post, "Programming Language is Not a Drug to Treat Poor Parallelism". It is a Windows Server 2008 R2 machine with eight Opteron 870 cores. The hardware this writer uses is outdated. The program was compiled with GFORTRAN at optimization level -O3.

TIMING RESULT

We collected the following timing results, including elapsed time, CPU time in user mode, CPU time in kernel mode, and total CPU time.

number of cores: 1
Elapsed Time (Seconds): 19.89
CPU Time in User Mode (Seconds): 19.87
CPU Time in Kernel Mode (Seconds): 0.02
Total CPU Time (Seconds): 19.89

number of cores: 2
Elapsed Time (Seconds): 9.98
CPU Time in User Mode (Seconds): 19.84
CPU Time in Kernel Mode (Seconds): 0.09
Total CPU Time (Seconds): 19.94

number of cores: 3
Elapsed Time (Seconds): 6.76
CPU Time in User Mode (Seconds): 20.14
CPU Time in Kernel Mode (Seconds): 0.06
Total CPU Time (Seconds): 20.20

number of cores: 4
Elapsed Time (Seconds): 5.15
CPU Time in User Mode (Seconds): 20.40
CPU Time in Kernel Mode (Seconds): 0.08
Total CPU Time (Seconds): 20.48

number of cores: 5
Elapsed Time (Seconds): 4.21
CPU Time in User Mode (Seconds): 20.61
CPU Time in Kernel Mode (Seconds): 0.06
Total CPU Time (Seconds): 20.67

number of cores: 6
Elapsed Time (Seconds): 3.59
CPU Time in User Mode (Seconds): 21.20
CPU Time in Kernel Mode (Seconds): 0.09
Total CPU Time (Seconds): 21.29

number of cores: 7
Elapsed Time (Seconds): 3.17
CPU Time in User Mode (Seconds): 21.90
CPU Time in Kernel Mode (Seconds): 0.06
Total CPU Time (Seconds): 21.96

number of cores: 8
Elapsed Time (Seconds): 2.92
CPU Time in User Mode (Seconds): 22.90
CPU Time in Kernel Mode (Seconds): 0.08
Total CPU Time (Seconds): 22.98

Timing in implementing Fortran intrinsic MATMUL

Elapsed Time (Seconds): 78.42
CPU Time in User Mode (Seconds): 78.42
CPU Time in Kernel Mode (Seconds): 0.00
Total CPU Time (Seconds): 78.42

First, consider the time taken by the FORTRAN intrinsic function MATMUL: it required 78.42 seconds on one core. Next, consider the function laipe$matmul_8 on one core, which took only 19.89 seconds to complete the identical job. On this platform, laipe$matmul_8 is therefore about 3.9 times faster than the FORTRAN intrinsic function MATMUL, even before any parallelism is applied.

We summarize the timing results to see the speedup and efficiency of laipe$matmul_8.

number of cores   elapsed time (sec.)   speedup   efficiency (%)
              1                 19.89      1.00           100.00
              2                  9.98      1.99            99.65
              3                  6.76      2.94            98.08
              4                  5.15      3.86            96.55
              5                  4.21      4.72            94.49
              6                  3.59      5.54            92.34
              7                  3.17      6.27            89.63
              8                  2.92      6.81            85.15

From the table above, we can see that the elapsed time fell almost linearly as soft cores were added. For example, the elapsed time is reduced from 19.89 seconds to 9.98 seconds when using 2 cores, a 1.99x speedup at 99.65% efficiency; 4 soft cores cut the elapsed time to 5.15 seconds, a 3.86x speedup at 96.55% efficiency. Efficiency declines gradually as more cores are used, reaching 85.15% on 8 cores, so the speedup stays close to linear over the whole range.
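
The speedup and efficiency columns follow from the elapsed times alone: the speedup on p cores is the one-core elapsed time divided by the p-core elapsed time, and the efficiency is the speedup divided by p. The short program below recomputes both from the 8-byte REAL timings listed above. The module and function names are illustrative only, and because the program starts from the rounded elapsed times, the last printed digit may occasionally differ from the table.

```fortran
module scaling_metrics
   implicit none
   integer, parameter :: dp = kind(1.0d0)
contains
   ! speedup = (elapsed time on 1 core) / (elapsed time on p cores)
   pure function speedup(t1, tp) result(s)
      real(dp), intent(in) :: t1, tp
      real(dp) :: s
      s = t1 / tp
   end function speedup

   ! efficiency (%) = 100 * speedup / p
   pure function efficiency(t1, tp, p) result(e)
      real(dp), intent(in) :: t1, tp
      integer,  intent(in) :: p
      real(dp) :: e
      e = 100.0_dp * speedup(t1, tp) / real(p, dp)
   end function efficiency
end module scaling_metrics

program show_scaling
   use scaling_metrics
   implicit none
   ! elapsed times (seconds) on 1..8 soft cores, copied from the timing results
   real(dp), parameter :: t(8) = [19.89_dp, 9.98_dp, 6.76_dp, 5.15_dp, &
                                  4.21_dp, 3.59_dp, 3.17_dp, 2.92_dp]
   integer :: p
   write(*,'(a)') ' cores  elapsed  speedup  efficiency(%)'
   do p = 1, size(t)
      write(*,'(i6, f9.2, f9.2, f12.2)') p, t(p), speedup(t(1), t(p)), efficiency(t(1), t(p), p)
   end do
end program show_scaling
```

The same two functions can be applied to the elapsed times of the other data types to check their tables.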

The example above uses 8-byte REAL variables, i.e., double precision. The following tables give the parallel performance of laipe$matmul for other data types. All of them show consistently efficient parallel performance.

LAIPE$MATMUL_4 (FOR 4-BYTE REAL VARIABLE)

number of cores   elapsed time (sec.)   speedup   efficiency (%)
              1                 16.29      1.00           100.00
              2                  8.20      1.99            99.33
              3                  5.71      2.85            95.10
              4                  4.40      3.70            92.56
              5                  3.57      4.56            91.26
              6                  3.01      5.41            90.20
              7                  2.65      6.15            87.82
              8                  2.36      6.90            86.28

LAIPE$MATMUL_10 (FOR 10-BYTE REAL VARIABLE)

number of cores   elapsed time (sec.)   speedup   efficiency (%)
              1                 80.36      1.00           100.00
              2                 40.26      2.00            99.80
              3                 27.35      2.94            97.94
              4                 20.83      3.86            96.45
              5                 17.08      4.70            94.10
              6                 14.51      5.54            92.30
              7                 12.71      6.32            90.32
              8                 11.47      7.01            87.58

LAIPE$MATMUL_Z4 (FOR 8-BYTE COMPLEX VARIABLE)

number of cores   elapsed time (sec.)   speedup   efficiency (%)
              1                 44.62      1.00           100.00
              2                 22.32      2.00            99.96
              3                 14.90      2.99            99.82
              4                 11.25      3.97            99.16
              5                  9.03      4.94            98.83
              6                  7.58      5.89            98.11
              7                  6.54      6.82            97.47
              8                  5.79      7.71            96.33

LAIPE$MATMUL_Z8 (FOR 16-BYTE COMPLEX VARIABLE)

number of cores   elapsed time (sec.)   speedup   efficiency (%)
              1                 71.28      1.00           100.00
              2                 35.90      1.99            99.28
              3                 24.54      2.90            96.82
              4                 18.74      3.80            95.09
              5                 15.41      4.63            92.51
              6                 13.18      5.41            90.14
              7                 11.61      6.14            87.71
              8                 10.62      6.71            83.90

LAIPE$MATMUL_Z10 (FOR 20-BYTE COMPLEX VARIABLE)

number of cores   elapsed time (sec.)   speedup   efficiency (%)
              1                171.23      1.00           100.00
              2                 83.62      2.05           102.39
              3                 57.03      3.00           100.08
              4                 44.13      3.88            97.00
              5                 37.05      4.62            92.43
              6                 32.59      5.25            87.57
              7                 30.23      5.66            80.92
              8                 29.22      5.86            73.25

LAIPE$MATMUL_16 (FOR 16-BYTE REAL VARIABLE)
WITH A SMALLER PROBLEM SIZE (NRA=960, NCA=740, NCB=760)

number of cores   elapsed time (sec.)   speedup   efficiency (%)
              1                 71.14      1.00           100.00
              2                 35.93      1.98            99.00
              3                 23.92      2.97            99.14
              4                 18.00      3.95            98.81
              5                 14.63      4.86            97.25
              6                 12.35      5.76            96.01
              7                 10.42      6.83            97.53
              8                  9.02      7.89            98.59

LAIPE$MATMUL_Z16 (FOR 32-BYTE COMPLEX VARIABLE)
WITH A SMALLER PROBLEM SIZE (NRA=960, NCA=740, NCB=760)

number of cores   elapsed time (sec.)   speedup   efficiency (%)
              1                243.38      1.00           100.00
              2                122.91      1.98            99.01
              3                 81.95      2.97            99.00
              4                 61.53      3.96            98.89
              5                 49.97      4.87            97.41
              6                 42.26      5.76            95.99
              7                 35.85      6.79            96.98
              8                 30.89      7.88            98.49