Equation Solution  
    High Performance by Design

 
A Time to Introduce Soft Core Computing


[Posted by Jenn-Ching Luo on Feb. 08, 2012]

      Our demand to speed up computer applications never ends. Most of us remember the dial-up modem, and what a slow connection it was: watching movies on the internet over dial-up was impossible. Today we enjoy fast internet connections, with broadband service as a good example. New technologies make our daily life easier and more convenient.

      This post introduces another new technology: neuLoop, for soft core computing. neuLoop likewise meets our demand to speed up applications on modern computers.

      neuLoop is a technology different from other well-known programming tools, e.g., OpenMP. Most such tools are based on threading. neuLoop is not based on threading; it runs on soft cores.

      A soft core is a virtual abstraction of a physical core, sitting between the physical cores and the application. neuLoop opens the door to soft core computing.

      Programming with soft cores is quite different from technologies based on a team of cooperative tasks (e.g., threading). In threading, the program itself creates a team of threads to share a computation; we traditionally identify such an application as a parallel program. With soft cores, the program never creates a team of threads. A neuLoop program is written in a form that soft cores can read, and the soft cores then execute the program simultaneously so as to speed up the application. That is the fundamental difference between programming with soft cores and threading.
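As a point of contrast, here is what the conventional threading model looks like in ordinary code (a minimal Python sketch, not neuLoop): the program itself creates the team of workers that share the computation.

```python
# Threading model for contrast: the program itself creates a team of
# workers that share one computation.  Under soft cores, by contrast,
# the program never creates threads; it only rewrites the loop into a
# form the soft-core runtime can read and execute.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # each worker processes one slice of the data
    return sum(x * x for x in chunk)

data = list(range(1000))
chunks = [data[i::4] for i in range(4)]  # split the work among 4 threads

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total)  # same result as the serial loop
```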

      neuLoop (new loop) provides a simple syntax for rewriting a loop into a form that soft cores can read, and at present it provides two types of soft core: homogeneous and heterogeneous. Preliminary tests show that neuLoop incurs less overhead than OpenMP. In this introduction, we look at some overhead comparisons.

EXAMPLE

      For the comparison, we use the example from the post "Parallelizing Loops on the Face of a Program is Not Enough for Multicore Computing": a matrix multiplication that was previously parallelized with OpenMP. The timing results with OpenMP were also published there.

      This post rewrites the example program with neuLoop and collects timing results, so that we can compare the two sets of timings and see the difference in overhead. The FORTRAN program is rewritten as:

      PROGRAM MATMULT
 
!
! default attribute
!
      implicit none
 
!
! parameters
!
      ! rows of matrix [A]
      integer (4), parameter :: NRA = 2400
      ! columns of matrix [A]
      integer (4), parameter :: NCA = 1850
      ! columns of matrix [B]
      integer (4), parameter :: NCB = 1900
 
      !! variables
      integer (4) :: chunk
      integer (4) :: cores
      integer (4) :: i, j, k, m
      integer (4) :: ii, jj, kk
      real (8), save :: a(nra,nca)
      real (8), save :: b(nca,ncb)
      real (8), save :: c(nra,ncb)
      real (4) :: elapsedTime
      real (4) :: userTime
      real (4) :: kernelTime
      real (4) :: totalTime
 
!
! declaration of do-loop subroutines
!
      !! (This is required by neuLoop)
      external mata, matb, matc, mmul
 
!
! get the number of physical cores
!
      call nlp$getce(cores)
 
!
! define number of soft cores
!
      chunk = 10 !! assume a chunk
      do m = 1,cores
            !! use m soft cores
            call nlp$use(m)
            write(*,*)
            write(*,'(''number of cores:'',i3)') m
 
!
! initialize timer
!
            call nlp$resetUserTimes
 
!
! initialize matrices
!

            !
            ! !! the original loop
            ! do j = 1, nca, chunk
            !     do kk = j,min0(j+chunk-1,nca)
            !         do i = 1, nra
            !             call date_and_time &
            !                   &(values=time_array)
            !             ii = mod(time_array(7),i)
            !             jj = mod(time_array(8),kk)
            !             a(i,kk) = (ii-1)+(jj-1)
            !         end do
            !     end do
            ! end do
            !
            !
            ! The above is written into what
            ! soft cores can read -->
 
            call nlp$loop_8(mata,1,nca,chunk, &
                                      &chunk,nca,nra,a)
 
            ! mata: the loop statements.
 
 
            !
            ! !! the original loop
            ! do j = 1, ncb, chunk
            !     do kk = j, min0(j+chunk-1,ncb)
            !         do i = 1, nca
            !             call date_and_time &
            !                   &(values=time_array)
            !             ii = mod(time_array(7),i)
            !             jj = mod(time_array(8),kk)
            !             b(i,kk) = (ii-1)*(jj-1)
            !         end do
            !     end do
            ! end do
            !
 
            !
            ! The above loop is written into what
            ! soft cores can read -->
 
            call nlp$loop_8(matb,1,ncb,chunk, &
                                      &chunk,nca,ncb,b)
 
            ! matb: the loop statements.


            !
            ! !! The original loop
            ! do j = 1, ncb, chunk
            !     do kk = j, min0(j+chunk-1,ncb)
            !         do i = 1, nra
            !             c(i,kk) = 0
            !         end do
            !     end do
            ! end do
            !

            !
            ! The above loop is written into what
            ! soft cores can read -->

            call nlp$syncloop_8(matc,1,ncb,chunk, &
                                              &chunk,nra,ncb,c)

            ! matc: the loop statements.

!
! do matrix multiply
!

            !
            ! !! the original loop
            ! do j = 1, ncb, chunk
            !     do kk = j, min0(j+chunk-1,ncb)
            !         do k = 1, nca
            !             do i = 1, nra
            !                 c(i,kk) = c(i,kk)+ &
            !                             &a(i,k)*b(k,kk)
            !             end do
            !         end do
            !     end do
            ! end do
            !

            !
            ! The above loop is written into what
            ! soft cores can read -->

            call nlp$syncloop_11(mmul,1,ncb, &
                                                &chunk,chunk,a,b,c, &
                                                &nra,nca,ncb)
            ! mmul: the loop statements.

!
! output user times
!
            call nlp$getUserTimes( &
                  & elapsedTime,userTime, &
                  & kernelTime,totalTime)
            write(*,'('' Elapsed Time (Seconds): &
                        & '',F8.2)') elapsedTime
            write(*,'('' CPU Time in User Mode &
                        & (Seconds): '',F8.2)') userTime
            write(*, '('' CPU Time in Kernel Mode &
                        & (Seconds): '', F8.2)') kernelTime
            write(*,'('' Total CPU Time (Seconds): &
                        & '',F8.2)') totalTime

        end do

!
! deallocate soft cores
!
      call nlp$done

!
! end of program
!
      end program matmult


!
! do-subroutines
!
      recursive subroutine mata(j,chunk,nca,nra,a)
      implicit none
      integer (4) :: kk,j,chunk,nca,nra,i,ii,jj
      integer (4) :: time_array(8)
      real (8) :: a(nra,nca)
      do kk = j,min0(j+chunk-1,nca)
          do i = 1, nra
              call date_and_time(values=time_array)
              ii = mod(time_array(7),i)
              jj = mod(time_array(8),kk)
              a(i,kk) = (ii-1)+(jj-1)
          end do
      end do
      end subroutine mata


      recursive subroutine matb(j,chunk,nca,ncb,b)
      implicit none
      integer (4) :: kk,j,chunk,nca,ncb,ii,jj,i
      integer (4) :: time_array(8)
      real (8) :: b(nca,ncb)
      do kk = j, min0(j+chunk-1,ncb)
          do i = 1, nca
              call date_and_time(values=time_array)
              ii = mod(time_array(7),i)
              jj = mod(time_array(8),kk)
              b(i,kk) = (ii-1)*(jj-1)
          end do
      end do
      end subroutine matb
     
     
      recursive subroutine matc(j,chunk,nra,ncb,c)
      implicit none
      integer (4) :: kk,j,chunk,nra,ncb,i
      real (8) :: c(nra,ncb)
      do kk = j, min0(j+chunk-1,ncb)
          do i = 1, nra
              c(i,kk) = 0
          end do
      end do
      end subroutine matc


      recursive subroutine mmul &
                      & (j,chunk,a,b,c,nra,nca,ncb)
      implicit none
      integer (4) :: kk,j,chunk,nra,nca,ncb,i,k
      real (8) :: a(nra,nca), b(nca,ncb), c(nra,ncb)
      do kk = j, min0(j+chunk-1,ncb)
          do k = 1, nca
              do i = 1, nra
                  c(i,kk) = c(i,kk) + a(i,k) * b(k,kk)
              end do
          end do
      end do
      end subroutine mmul


TEST PLATFORM AND TIMING RESULTS

      For a fair comparison, the test platform is the same one used for the examples in the article "Parallelizing Loops on the Face of a Program is Not Enough for Multicore Computing". The computer system is a SunFire v40z with four dual-core Opteron 870 processors (a total of 8 cores) running Windows 2008 R2. The example program was compiled with GFORTRAN 4.7 without optimization (i.e., with option -O0).

      As introduced above, neuLoop has two types of soft core. First, we link the example program against homogeneous cores. The timing results are:

[Homogeneous Soft Cores]
number of cores   elapsed time (sec.)   speedup   efficiency (%)
       1                171.54           1.00        100.00
       2                 87.42           1.96         98.11
       3                 58.75           2.92         97.33
       4                 48.27           3.55         88.84
       5                 40.75           4.21         84.19
       6                 38.25           4.48         74.75
       7                 35.43           4.84         69.17
       8                 28.89           5.94         74.22

From the timing results, we can see that two soft cores reduce the elapsed time from 171.54 seconds to 87.42 seconds, a 1.96x speedup at 98.11% efficiency; four soft cores cut it to 48.27 seconds, a 3.55x speedup (88.84% efficiency); and eight cores bring it down to 28.89 seconds, a 5.94x speedup. The soft cores clearly speed up the computation.
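The speedup and efficiency columns follow directly from the elapsed times: speedup is T(1)/T(p), and efficiency is the speedup divided by the number of cores. A quick check with the homogeneous numbers from the table (a Python sketch):

```python
# speedup = T(1) / T(p); efficiency = 100 * speedup / p
elapsed = {1: 171.54, 2: 87.42, 4: 48.27, 8: 28.89}  # homogeneous soft cores

for p, t in sorted(elapsed.items()):
    speedup = elapsed[1] / t
    efficiency = 100.0 * speedup / p
    print(f"{p} cores: speedup {speedup:.2f}x, efficiency {efficiency:.2f}%")
# 2 cores: speedup 1.96x, efficiency 98.11%
# 8 cores: speedup 5.94x, efficiency 74.22%
```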

      Speedup, however, is not the focus here. Our interest is the comparison of overhead. Let us copy the OpenMP timing results from the article "Parallelizing Loops on the Face of a Program is Not Enough for Multicore Computing":

[OpenMP]
number of cores   elapsed time (sec.)   speedup   efficiency (%)
       1                193.83           1.00        100.00
       2                 98.61           1.97         98.28
       3                 66.13           2.93         97.70
       4                 52.82           3.67         91.74
       5                 45.69           4.24         84.85
       6                 42.42           4.57         76.16
       7                 40.45           4.79         68.45
       8                 37.33           5.19         64.90

Comparing the timings, we can see that soft cores consistently run faster than OpenMP. For example, one soft core completes the computation in 171.54 seconds, while one OpenMP thread takes 193.83 seconds. This is the important finding: soft cores require less overhead than threading, and a program written with neuLoop runs faster than the same program with OpenMP.

      The remaining timings show the same pattern: two soft cores take 87.42 seconds versus 98.61 seconds for two OpenMP threads; four soft cores take 48.27 seconds versus 52.82 seconds; and eight soft cores complete the example in 28.89 seconds versus 37.33 seconds for eight OpenMP threads. At every core count, soft cores in neuLoop incur less overhead.
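The overhead gap at each core count can be read straight off the two tables; the following sketch (Python, using the published elapsed times) computes the saving:

```python
# Elapsed times (seconds) from the two tables above, at matching core counts.
neuloop = {1: 171.54, 2: 87.42, 4: 48.27, 8: 28.89}
openmp  = {1: 193.83, 2: 98.61, 4: 52.82, 8: 37.33}

for p in sorted(neuloop):
    saving = openmp[p] - neuloop[p]
    pct = 100.0 * saving / openmp[p]
    print(f"{p} cores: neuLoop saves {saving:.2f} s ({pct:.1f}% of the OpenMP time)")
```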

WHICH TYPE OF SOFT CORE IS BETTER

      As mentioned previously, neuLoop has two types of soft core, and the above example was linked against the homogeneous soft cores. How do the heterogeneous cores perform? We re-link the program against the heterogeneous soft cores and measure again. A set of timing results follows:

[Heterogeneous Cores]
number of cores   elapsed time (sec.)   speedup   efficiency (%)
       1                169.74           1.00        100.00
       2                 85.36           1.99         99.43
       3                 57.35           2.96         98.66
       4                 47.42           3.58         89.49
       5                 41.87           4.05         81.08
       6                 39.25           4.32         72.08
       7                 35.46           4.79         68.38
       8                 32.87           5.16         64.55

From the timing results, we can see a performance difference between homogeneous and heterogeneous cores. But it is too early to say which type of soft core is better for parallel processing.

      The above timing results suggest that, when using more cores, homogeneous cores may yield a better speedup, while heterogeneous cores are best suited to a small number of cores. However, that is not a conclusion applicable to every example and every hardware platform.

      In this example, with two cores, the heterogeneous soft cores yield an almost perfect speedup of 1.99x, while the homogeneous soft cores yield 1.96x; at two cores, the heterogeneous cores are the more efficient.

      However, eight homogeneous cores yield a better speedup, 5.94x, versus 5.16x for eight heterogeneous cores. Each type of soft core has environments in which it performs best; neither dominates the other all the time. This writer will post more on soft core computing.
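The crossover between the two soft-core types in this example can be summarized from the two tables (a Python sketch with the published elapsed times):

```python
# Elapsed times (seconds) from the homogeneous and heterogeneous tables.
homogeneous   = {1: 171.54, 2: 87.42, 3: 58.75, 4: 48.27,
                 5: 40.75, 6: 38.25, 7: 35.43, 8: 28.89}
heterogeneous = {1: 169.74, 2: 85.36, 3: 57.35, 4: 47.42,
                 5: 41.87, 6: 39.25, 7: 35.46, 8: 32.87}

# which type finished first at each core count?
faster = {p: ("heterogeneous" if heterogeneous[p] < homogeneous[p]
              else "homogeneous") for p in homogeneous}
print(faster)
# heterogeneous wins at 1-4 cores, homogeneous at 5-8 (this example only)
```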