Equation Solution  
    High Performance by Design

Parallel Matrix Multiplication on Soft Cores


[Posted by Jenn-Ching Luo on Mar. 01, 2012 ]

        Starting with this post, this writer introduces the parallel performance of LAIPE2 on multicore computers.

        LAIPE2 succeeds the original LAIPE as a library to "Link and in Parallel Execute" programs. The main difference between the two is that the original LAIPE is programmed in MTASK, whereas LAIPE2 is based on soft core computing (e.g., neuLoop). This post gives a first look at LAIPE2 performance.

        This post presents the parallel performance of the function laipe$matmul, which performs matrix multiplication (e.g., [c]=[a][b]). The function implements standard matrix multiplication, not in its original form but in a form rearranged for parallel processing.
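        The internals of laipe$matmul are not shown in this post. As a rough illustration only of how standard matrix multiplication can be rearranged for parallel processing, the following sketch assigns whole columns of [c] to different workers. It uses OpenMP as a stand-in for soft cores, which is an assumption and not how LAIPE2 is programmed; the names matmul_sketch and matmul_columns are hypothetical.

```fortran
module matmul_sketch
      implicit none
      integer, parameter :: dp = kind(1.0d0)
contains
      ! Hypothetical helper: c = a*b, one column of c per loop iteration.
      ! Columns are independent, so the outer loop can run in parallel.
      subroutine matmul_columns(a, b, c, m, n, p)
            integer, intent(in) :: m, n, p
            real(dp), intent(in) :: a(m,n), b(n,p)
            real(dp), intent(out) :: c(m,p)
            integer :: i, j, k

            ! OpenMP directive (assumption); ignored when OpenMP is off
            !$omp parallel do private(i, k)
            do j = 1, p
                  c(:,j) = 0.0_dp
                  do k = 1, n
                        ! innermost loop runs down a column (unit stride)
                        do i = 1, m
                              c(i,j) = c(i,j) + a(i,k)*b(k,j)
                        end do
                  end do
            end do
            !$omp end parallel do
      end subroutine matmul_columns
end module matmul_sketch

program try_matmul_columns
      use matmul_sketch
      implicit none
      integer, parameter :: nra = 300, nca = 200, ncb = 250
      real(dp), allocatable :: a(:,:), b(:,:), c(:,:)

      allocate(a(nra,nca), b(nca,ncb), c(nra,ncb))
      call random_number(a)
      call random_number(b)

      call matmul_columns(a, b, c, nra, nca, ncb)

      ! compare against the intrinsic as a correctness check
      write(*,'(a,es10.3)') ' max abs difference vs MATMUL: ', &
            maxval(abs(c - matmul(a,b)))
end program try_matmul_columns
```

        With GFORTRAN, the sketch runs serially when compiled as is and in parallel when compiled with -fopenmp. Because each worker writes only to its own columns of [c], no synchronization is needed inside the loop.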

TEST PROGRAM AND PLATFORM

        The program gets the number of physical cores on the computer, initializes matrices [a] and [b], and then repeatedly executes the function laipe$matmul_8, collecting timing results with 1 soft core, 2 soft cores, and so on. The function laipe$matmul_8 is for 8-byte REAL variables, i.e., double precision.

        After timing the function laipe$matmul_8, the program also times the FORTRAN intrinsic function MATMUL for comparison. The FORTRAN program follows. Interested users can link the program against laipe2 and see the performance on their own computers.


program mx_8
      !! default attribute
      implicit none
  
      !! parameters
      integer*4, parameter :: NRA = 2400
      integer*4, parameter :: NCA = 1850
      integer*4, parameter :: NCB = 1900
  
      !! common variables
      integer*4 :: cores
      integer*4 :: i, j, ii, jj
      integer*4 :: time_array(8)
      real*8, save :: a(nra,nca), b(nca,ncb)
      real*8, save :: c(nra,ncb)
      real*4 :: elapsedTime
      real*4 :: userTime
      real*4 :: kernelTime
      real*4 :: totalTime
  
!
! number of physical cores
!
      call laipe$getce(cores)

!
! arbitrarily initialize matrices
!
      do j = 1, nca
            do i = 1, nra
                  call date_and_time(values=time_array)
                  ii = mod(time_array(7),i)
                  jj = mod(time_array(8),j)
                  a(i,j) = (ii-1)+(jj-1)
            end do
      end do
   
      do j = 1, ncb
            do i = 1, nca
                  call date_and_time(values=time_array)
                  ii = mod(time_array(7),i)
                  jj = mod(time_array(8),j)
                  b(i,j) = (ii-1)*(jj-1)
            end do
      end do
   
!
! collect timing results
!
      do i = 1,cores
            call laipe$use(i)
            write(*,*)
            write(*,'(''number of cores:'',i3)') i
   
      !
      ! initialize timer -- start collecting time
      !
            call laipe$resetUserTimes
   
      !
      ! do multiplication
      !
            call laipe$matmul_8(a,b,c,nra,nca,ncb)
   
      !
      ! output user times
      !
            call laipe$getUserTimes( &
                & elapsedTime, userTime,&
                & kernelTime,totalTime)
            write(*,'('' Elapsed Time (Seconds): '', &
                & F8.2)') elapsedTime
            write(*,'('' CPU Time in User Mode &
                & (Seconds): '', F8.2)') userTime
            write(*,'('' CPU Time in Kernel Mode &
                &(Seconds): '', F8.2)') kernelTime
            write(*,'('' Total CPU Time (Seconds): &
                & '',F8.2)') totalTime
      end do
   
!
! collect timing in implementing Fortran
! intrinsic function MATMUL
!
      write(*,*)
      write(*,'('' Timing in implementing Fortran &
          &intrinsic MATMUL'')')
      call laipe$resetUserTimes
      c = matmul(a,b)
      call laipe$getUserTimes(elapsedTime, &
          & userTime,kernelTime,totalTime)
      write(*,'('' Elapsed Time (Seconds): '', &
          & F8.2)') elapsedTime
      write(*,'('' CPU Time in User Mode &
          & (Seconds): '', F8.2)') userTime
      write(*,'('' CPU Time in Kernel Mode &
          &(Seconds): '', F8.2)') kernelTime
      write(*,'('' Total CPU Time (Seconds): &
          & '',F8.2)') totalTime
   
!
! deallocate soft cores
!
      call laipe$done

!
! end of program
!
end program mx_8
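
        The timing routines in the program above come from LAIPE2. For readers who want a baseline before linking against the library, the following is a minimal sketch that times only the intrinsic MATMUL with the standard SYSTEM_CLOCK and CPU_TIME routines. It is an assumption-laden stand-in, not part of the original test: the matrices are filled with RANDOM_NUMBER instead of the initialization above, and the program name time_intrinsic is hypothetical.

```fortran
program time_intrinsic
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      integer, parameter :: i8 = selected_int_kind(18)
      ! same problem size as the test program
      integer, parameter :: nra = 2400, nca = 1850, ncb = 1900
      real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
      integer(i8) :: tick0, tick1, rate
      real(dp) :: cpu0, cpu1

      allocate(a(nra,nca), b(nca,ncb), c(nra,ncb))
      call random_number(a)
      call random_number(b)

      ! wall-clock ticks and CPU seconds before the multiplication
      call system_clock(tick0, rate)
      call cpu_time(cpu0)

      c = matmul(a, b)

      call cpu_time(cpu1)
      call system_clock(tick1)

      write(*,'(a,f8.2)') ' Elapsed Time (Seconds): ', &
            real(tick1 - tick0, dp) / real(rate, dp)
      write(*,'(a,f8.2)') ' Total CPU Time (Seconds): ', cpu1 - cpu0

      ! use the result so the multiplication cannot be optimized away
      write(*,'(a,es12.5)') ' checksum: ', sum(c)
end program time_intrinsic
```

        CPU_TIME does not separate user mode from kernel mode, so this sketch reports only elapsed time and total CPU time.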

        The test platform is the same as in the previous post, "Programming Language is Not a Drug to Treat Poor Parallelism": Windows Server 2008 R2 running on eight Opteron 870 cores. The hardware this writer uses is outdated. The program was compiled with GFORTRAN at optimization level -O3.

TIMING RESULT

        We collected the following set of timing results, including elapsed time, CPU time in user mode, and CPU time in kernel mode.

  number of cores: 1
        Elapsed Time (Seconds): 19.89
        CPU Time in User Mode (Seconds): 19.87
        CPU Time in Kernel Mode (Seconds): 0.02
        Total CPU Time (Seconds): 19.89
 
  number of cores: 2
        Elapsed Time (Seconds): 9.98
        CPU Time in User Mode (Seconds): 19.84
        CPU Time in Kernel Mode (Seconds): 0.09
        Total CPU Time (Seconds): 19.94
 
  number of cores: 3
        Elapsed Time (Seconds): 6.76
        CPU Time in User Mode (Seconds): 20.14
        CPU Time in Kernel Mode (Seconds): 0.06
        Total CPU Time (Seconds): 20.20
 
  number of cores: 4
        Elapsed Time (Seconds): 5.15
        CPU Time in User Mode (Seconds): 20.40
        CPU Time in Kernel Mode (Seconds): 0.08
        Total CPU Time (Seconds): 20.48
 
  number of cores: 5
        Elapsed Time (Seconds): 4.21
        CPU Time in User Mode (Seconds): 20.61
        CPU Time in Kernel Mode (Seconds): 0.06
        Total CPU Time (Seconds): 20.67
 
  number of cores: 6
        Elapsed Time (Seconds): 3.59
        CPU Time in User Mode (Seconds): 21.20
        CPU Time in Kernel Mode (Seconds): 0.09
        Total CPU Time (Seconds): 21.29
 
  number of cores: 7
        Elapsed Time (Seconds): 3.17
        CPU Time in User Mode (Seconds): 21.90
        CPU Time in Kernel Mode (Seconds): 0.06
        Total CPU Time (Seconds): 21.96
 
  number of cores: 8
        Elapsed Time (Seconds): 2.92
        CPU Time in User Mode (Seconds): 22.90
        CPU Time in Kernel Mode (Seconds): 0.08
        Total CPU Time (Seconds): 22.98
 
 
Timing in implementing Fortran intrinsic MATMUL

        Elapsed Time (Seconds): 78.42
        CPU Time in User Mode (Seconds): 78.42
        CPU Time in Kernel Mode (Seconds): 0.00
        Total CPU Time (Seconds): 78.42

        First, consider the FORTRAN intrinsic function MATMUL. It required 78.42 seconds and ran on one core. Next, consider the function laipe$matmul_8 on one core: it took only 19.89 seconds to complete the identical job. On one core, laipe$matmul_8 is therefore about 3.9 times faster than the FORTRAN intrinsic function MATMUL (78.42/19.89).

        We summarize the timing results to see the speedup and efficiency of laipe$matmul_8. Speedup is the elapsed time on one soft core divided by the elapsed time on the given number of soft cores; efficiency is the speedup divided by the number of soft cores.

number of cores    elapsed time (sec.)    speedup    efficiency (%)
              1                  19.89       1.00            100.00
              2                   9.98       1.99             99.65
              3                   6.76       2.94             98.08
              4                   5.15       3.86             96.55
              5                   4.21       4.72             94.49
              6                   3.59       5.54             92.34
              7                   3.17       6.27             89.63
              8                   2.92       6.81             85.15

        From the above table, we can see that the elapsed time decreases almost in proportion to the number of soft cores. For example, the elapsed time is reduced from 19.89 seconds to 9.98 seconds when using 2 soft cores, a 1.99x speedup at 99.65% efficiency; 4 soft cores cut the elapsed time to 5.15 seconds, a 3.86x speedup at 96.55% efficiency. With all 8 soft cores the speedup is 6.81x at 85.15% efficiency, so the speedup stays close to linear in the number of cores.

        The above example uses 8-byte REAL variables, i.e., double precision. The following tables give the parallel performance of laipe$matmul for other data types. All of them consistently show efficient parallel performance.
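
        For readers who want to recompute the speedup and efficiency columns from their own timings, the following is a minimal sketch. It takes speedup as the one-core elapsed time divided by the elapsed time on p soft cores, and efficiency as speedup divided by p, which is consistent with the figures in the table for laipe$matmul_8. The program name speedup_table is hypothetical; the elapsed times are the ones reported above.

```fortran
program speedup_table
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      integer, parameter :: n = 8
      real(dp) :: t(n), speedup, efficiency
      integer :: p

      ! elapsed times (seconds) of laipe$matmul_8 on 1..8 soft cores
      t = (/ 19.89_dp, 9.98_dp, 6.76_dp, 5.15_dp, &
             4.21_dp, 3.59_dp, 3.17_dp, 2.92_dp /)

      write(*,'(a)') ' cores  elapsed (s)  speedup  efficiency (%)'
      do p = 1, n
            ! speedup relative to one soft core
            speedup = t(1) / t(p)
            ! efficiency as a percentage of ideal linear speedup
            efficiency = 100.0_dp * speedup / real(p, dp)
            write(*,'(i6,f13.2,f9.2,f16.2)') p, t(p), speedup, efficiency
      end do
end program speedup_table
```

        Replacing the values in the array t with the elapsed times from another run gives the corresponding table for that machine.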

LAIPE$MATMUL_4 (FOR 4-BYTE REAL VARIABLE)

number of cores    elapsed time (sec.)    speedup    efficiency (%)
              1                  16.29       1.00            100.00
              2                   8.20       1.99             99.33
              3                   5.71       2.85             95.10
              4                   4.40       3.70             92.56
              5                   3.57       4.56             91.26
              6                   3.01       5.41             90.20
              7                   2.65       6.15             87.82
              8                   2.36       6.90             86.28


LAIPE$MATMUL_10 (FOR 10-BYTE REAL VARIABLE)

number of cores    elapsed time (sec.)    speedup    efficiency (%)
              1                  80.36       1.00            100.00
              2                  40.26       2.00             99.80
              3                  27.35       2.94             97.94
              4                  20.83       3.86             96.45
              5                  17.08       4.70             94.10
              6                  14.51       5.54             92.30
              7                  12.71       6.32             90.32
              8                  11.47       7.01             87.58

LAIPE$MATMUL_Z4 (FOR 8-BYTE COMPLEX VARIABLE)

number of cores    elapsed time (sec.)    speedup    efficiency (%)
              1                  44.62       1.00            100.00
              2                  22.32       2.00             99.96
              3                  14.90       2.99             99.82
              4                  11.25       3.97             99.16
              5                   9.03       4.94             98.83
              6                   7.58       5.89             98.11
              7                   6.54       6.82             97.47
              8                   5.79       7.71             96.33

LAIPE$MATMUL_Z8 (FOR 16-BYTE COMPLEX VARIABLE)

number of cores    elapsed time (sec.)    speedup    efficiency (%)
              1                  71.28       1.00            100.00
              2                  35.90       1.99             99.28
              3                  24.54       2.90             96.82
              4                  18.74       3.80             95.09
              5                  15.41       4.63             92.51
              6                  13.18       5.41             90.14
              7                  11.61       6.14             87.71
              8                  10.62       6.71             83.90

LAIPE$MATMUL_Z10 (FOR 20-BYTE COMPLEX VARIABLE)

number of cores    elapsed time (sec.)    speedup    efficiency (%)
              1                 171.23       1.00            100.00
              2                  83.62       2.05            102.39
              3                  57.03       3.00            100.08
              4                  44.13       3.88             97.00
              5                  37.05       4.62             92.43
              6                  32.59       5.25             87.57
              7                  30.23       5.66             80.92
              8                  29.22       5.86             73.25

LAIPE$MATMUL_16 (FOR 16-BYTE REAL VARIABLE)
WITH A SMALLER PROBLEM SIZE (NRA=960, NCA=740, NCB=760)

number of cores    elapsed time (sec.)    speedup    efficiency (%)
              1                  71.14       1.00            100.00
              2                  35.93       1.98             99.00
              3                  23.92       2.97             99.14
              4                  18.00       3.95             98.81
              5                  14.63       4.86             97.25
              6                  12.35       5.76             96.01
              7                  10.42       6.83             97.53
              8                   9.02       7.89             98.59

LAIPE$MATMUL_Z16 (FOR 32-BYTE COMPLEX VARIABLE)
WITH A SMALLER PROBLEM SIZE (NRA=960, NCA=740, NCB=760)


number of cores    elapsed time (sec.)    speedup    efficiency (%)
              1                 243.38       1.00            100.00
              2                 122.91       1.98             99.01
              3                  81.95       2.97             99.00
              4                  61.53       3.96             98.89
              5                  49.97       4.87             97.41
              6                  42.26       5.76             95.99
              7                  35.85       6.79             96.98
              8                  30.89       7.88             98.49