Equation Solution  
    High Performance by Design

 
 



Grandpa on Multicores (4)


[Posted by Jenn-Ching Luo on May 02, 2011 ]

        This post shows how grandpa LAIPE decomposes a general dense matrix on multicores. A dense matrix is the simplest type of matrix; for example, it has the form:
       /                       \
       |  x  x  x  x  x  x  x  |
       |  x  x  x  x  x  x  x  |
       |  x  x  x  x  x  x  x  |
 [A] = |  x  x  x  x  x  x  x  |
       |  x  x  x  x  x  x  x  |
       |  x  x  x  x  x  x  x  |
       |  x  x  x  x  x  x  x  |
       \                       /
We are going to see how efficiently grandpa LAIPE decomposes the matrix [A] into [L][U] in parallel. Multicore performance results are not often published on the internet. This blog conducts a series of performance tests, which gives us an opportunity to see speedup and efficiency. In this post, we look at how efficiently multicores speed up the decomposition of a dense matrix into triangular matrices.
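To make the [A] = [L][U] factorization concrete, here is a minimal serial sketch in Python. It is the textbook Doolittle elimination without pivoting, not the algorithm inside LAIPE, whose internals this post does not describe. The names `lu_decompose` and `matmul` and the 3x3 sample matrix are chosen only for illustration.

```python
def lu_decompose(a):
    """Factor a square matrix a into (lower, upper) with lower * upper == a.

    lower is unit lower triangular, upper is upper triangular.
    Textbook Doolittle elimination; this sketch does no pivoting.
    """
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [row[:] for row in a]  # work on a copy, leave a untouched
    for k in range(n):
        pivot = upper[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does not pivot")
        # For a fixed k, the updates of rows k+1..n-1 do not depend on
        # one another, which is where a parallel solver can find work.
        for i in range(k + 1, n):
            factor = upper[i][k] / pivot
            lower[i][k] = factor
            for j in range(k, n):
                upper[i][j] -= factor * upper[k][j]
    return lower, upper


def matmul(x, y):
    """Plain triple-loop product of two square matrices."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Illustrative 3x3 matrix (not the 3000x3000 test matrix of this post).
a = [[4.0, 3.0, 2.0],
     [8.0, 7.0, 9.0],
     [4.0, 6.0, 5.0]]
lower, upper = lu_decompose(a)
```

Multiplying `lower` by `upper` with `matmul` gives back `a`, which is a quick way to check a factorization of a small matrix.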

        The test matrix is of order 3000 (3000x3000) and is stored in 16-byte REAL variables. A pseudo-random process generated the matrix. The test program called the LAIPE subroutine decompose_DAG_16 to decompose the matrix. Grandpa LAIPE is ancient software. As mentioned previously, grandpa LAIPE was programmed with an ancient parallel concept, and its base language is Fortran 77. The testing platform remained the same as in the first post, "Grandpa on Multicores (1)". We obtained the following set of timing results.
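To get a feel for the size of this test, the short calculation below estimates the storage of the matrix and the amount of arithmetic in a dense decomposition. The order and the element size come from this post; the operation count uses the textbook leading-order estimate of about (2/3)n^3 for dense LU, which is an assumption and not a figure measured here.

```python
n = 3000                 # order of the test matrix (from the post)
bytes_per_element = 16   # 16-byte REAL variables (from the post)

elements = n * n
storage_bytes = elements * bytes_per_element
storage_mib = storage_bytes / 2**20

# Assumption: textbook leading-order operation count for dense LU,
# roughly (2/3) * n**3 floating-point operations.
approx_operations = 2 * n**3 // 3

print(f"elements:          {elements:,}")
print(f"storage:           {storage_bytes:,} bytes ({storage_mib:.1f} MiB)")
print(f"approx operations: {approx_operations:,}")
```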

core: 1
      Elapsed Time (Seconds): 2739.05
      CPU Time in User Mode (Seconds): 2738.96
      CPU Time in Kernel Mode (Seconds): 0.09
      Total CPU Time (Seconds): 2739.05

cores: 2
      Elapsed Time (Seconds): 1373.54
      CPU Time in User Mode (Seconds): 2735.79
      CPU Time in Kernel Mode (Seconds): 0.19
      Total CPU Time (Seconds): 2735.98

cores: 3
      Elapsed Time (Seconds): 929.64
      CPU Time in User Mode (Seconds): 2766.10
      CPU Time in Kernel Mode (Seconds): 0.19
      Total CPU Time (Seconds): 2766.29

cores: 4
      Elapsed Time (Seconds): 704.70
      CPU Time in User Mode (Seconds): 2784.48
      CPU Time in Kernel Mode (Seconds): 0.23
      Total CPU Time (Seconds): 2784.71

cores: 5
      Elapsed Time (Seconds): 568.64
      CPU Time in User Mode (Seconds): 2797.36
      CPU Time in Kernel Mode (Seconds): 0.19
      Total CPU Time (Seconds): 2797.55

cores: 6
      Elapsed Time (Seconds): 476.54
      CPU Time in User Mode (Seconds): 2801.04
      CPU Time in Kernel Mode (Seconds): 0.12
      Total CPU Time (Seconds): 2801.17

cores: 7
      Elapsed Time (Seconds): 410.95
      CPU Time in User Mode (Seconds): 2807.19
      CPU Time in Kernel Mode (Seconds): 0.19
      Total CPU Time (Seconds): 2807.38

cores: 8
      Elapsed Time (Seconds): 361.45
      CPU Time in User Mode (Seconds): 2809.34
      CPU Time in Kernel Mode (Seconds): 0.27
      Total CPU Time (Seconds): 2809.61

A quick look at the timing results shows that the elapsed time fell almost in inverse proportion to the number of cores; that is, the speedup was nearly linear. We summarize the timing data as speedup and efficiency:

number of cores   elapsed time (sec.)   speedup   efficiency (%)
       1               2739.05            1.00        100.00
       2               1373.54            1.99         99.71
       3                929.64            2.95         98.21
       4                704.70            3.89         97.17
       5                568.64            4.82         96.34
       6                476.54            5.75         95.80
       7                410.95            6.67         95.22
       8                361.45            7.58         94.72

On 2 cores, the speedup was 1.99x, which is equivalent to an efficiency of 99.71%; on 8 cores, grandpa sped up by 7.58x and reached an efficiency of 94.72%.
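The speedup and efficiency figures can be reproduced from the elapsed times alone. The sketch below uses the usual definitions: speedup is the 1-core elapsed time divided by the elapsed time on p cores, and efficiency is the speedup divided by p. The function name `speedup_and_efficiency` is chosen only for illustration.

```python
# Elapsed times in seconds, keyed by number of cores (from the timing results).
elapsed = {
    1: 2739.05, 2: 1373.54, 3: 929.64, 4: 704.70,
    5: 568.64, 6: 476.54, 7: 410.95, 8: 361.45,
}


def speedup_and_efficiency(elapsed_by_cores):
    """Return {cores: (speedup, efficiency in percent)} relative to 1 core."""
    base = elapsed_by_cores[1]
    results = {}
    for cores, seconds in sorted(elapsed_by_cores.items()):
        speedup = base / seconds
        efficiency = 100.0 * speedup / cores
        results[cores] = (speedup, efficiency)
    return results


results = speedup_and_efficiency(elapsed)
for cores, (speedup, efficiency) in results.items():
    print(f"{cores} cores: speedup {speedup:.2f}x, efficiency {efficiency:.2f}%")
```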

        The ancient software, grandpa LAIPE, once again showed an almost perfect speedup.