Equation Solution  
    High Performance by Design

Execution Time of One-Core-Enabled Parallel Code And Sequential Code


[Posted by Jenn-Ching Luo on Dec. 05, 2015 ]

        It is understandable that a parallel version of a computing algorithm, run with one core enabled, may perform differently from its sequential version. The parallel version includes extra code for synchronization, distribution, and dispatching, none of which is part of the computing algorithm itself, while the sequential code simply executes the algorithm. A reasonable prediction is that the one-core-enabled parallel version is slower than the sequential version. However, that is not always true. It may come as a big surprise that one-core-enabled parallel code can run faster than sequential code. This article shows a comparison.

        LAPACK is a standard software library for numerical linear algebra. The LAPACK subroutine DPBTRF performs a Cholesky decomposition of a real symmetric positive definite band matrix. This article introduces the parallel routine laipe$Decompose_CSP_8, which also performs a Cholesky decomposition. DPBTRF and laipe$Decompose_CSP_8 perform the same function; the difference is that DPBTRF is sequential code and laipe$Decompose_CSP_8 is parallel code. When one core is enabled, laipe$Decompose_CSP_8 is executed sequentially. This article compares the execution time of DPBTRF with that of laipe$Decompose_CSP_8 with one core enabled.
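        To make concrete what both routines compute, the following is a minimal sketch of an unblocked band Cholesky factorization in Python. It is not the code of DPBTRF or of laipe$Decompose_CSP_8. The function name band_cholesky and the storage layout (ab[d][j] holds A[j+d][j], modeled loosely on LAPACK's lower band storage) are assumptions made purely for illustration.

```python
import math

def band_cholesky(ab, n, k):
    """Overwrite the lower band of A with its Cholesky factor L (A = L * L^T).

    ab[d][j] holds A[j+d][j] for d = 0..k, so row d is the d-th subdiagonal.
    Entries that fall outside the matrix are padding and are never read.
    (Illustrative layout and name, not taken from LAPACK or LAIPE.)
    """
    for j in range(n):
        pivot = ab[0][j]
        if pivot <= 0.0:
            raise ValueError(f"matrix is not positive definite at column {j}")
        pivot = math.sqrt(pivot)
        ab[0][j] = pivot
        kn = min(k, n - 1 - j)          # subdiagonal entries present in column j
        for d in range(1, kn + 1):      # scale column j of L
            ab[d][j] /= pivot
        # Rank-1 update of the trailing kn-by-kn block (lower triangle only).
        for c in range(1, kn + 1):
            lc = ab[c][j]
            for r in range(c, kn + 1):
                ab[r - c][j + c] -= ab[r][j] * lc
    return ab

# Tridiagonal example (k = 1): A = [[4, 2, 0], [2, 5, 2], [0, 2, 5]]
ab = [[4.0, 5.0, 5.0],   # main diagonal
      [2.0, 2.0, 0.0]]   # first subdiagonal (last entry is padding)
band_cholesky(ab, n=3, k=1)
print(ab)  # [[2.0, 2.0, 2.0], [1.0, 1.0, 0.0]]
```

        The work per column is confined to the band, which is why the half bandwidth, and not only the order n, determines the cost of the decomposition.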

        The computing platform is a PowerEdge R815 with four 1.9 GHz 12-core Opteron processors (48 cores in total). The compiler is gfortran with optimization level -O3. The test example has order n=50000 and half bandwidth 7000.
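        To give a sense of the problem size, the sketch below estimates the memory needed for the test matrix. It assumes double precision and LAPACK's band storage for DPBTRF, which uses kd+1 rows by n columns; the storage scheme used by laipe$Decompose_CSP_8 is not described in this article and may differ.

```python
n = 50_000            # matrix order, from the test example
kd = 7_000            # half bandwidth, from the test example
bytes_per_double = 8  # assumption: 8-byte double precision values

band_entries = (kd + 1) * n   # LAPACK band storage: (kd + 1) rows by n columns
full_entries = n * n          # dense storage, for comparison only

band_gib = band_entries * bytes_per_double / 2**30
full_gib = full_entries * bytes_per_double / 2**30

print(f"band storage : {band_gib:.2f} GiB")
print(f"dense storage: {full_gib:.2f} GiB")
```

        Under these assumptions the band form needs roughly 2.6 GiB, a small fraction of what a dense n-by-n array would take.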

        Let us see the performance.

        DPBTRF of LAPACK took 3856.24 seconds to decompose the example matrix, while laipe$Decompose_CSP_8 took only 3171.88 seconds on one core to finish the same job. The parallel version with one core enabled ran significantly faster than the sequential DPBTRF, a difference of 684.36 seconds. It may be hard to imagine that parallel code with one core enabled could run faster than less-burdened sequential code.
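        The size of that gap is easier to judge in relative terms. The short calculation below uses only the two timings quoted above; the variable names are illustrative.

```python
t_dpbtrf = 3856.24       # seconds, sequential DPBTRF
t_laipe_1core = 3171.88  # seconds, laipe$Decompose_CSP_8 with one core enabled

saved = t_dpbtrf - t_laipe_1core          # absolute difference in seconds
ratio = t_dpbtrf / t_laipe_1core          # how many times faster the one-core run is
reduction_pct = 100.0 * saved / t_dpbtrf  # share of DPBTRF's time that is saved

print(f"time saved     : {saved:.2f} s")
print(f"speed ratio    : {ratio:.3f}x")
print(f"time reduction : {reduction_pct:.1f} %")
```

        In other words, the one-core run needs a little under 18% less time than DPBTRF on this example.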

        Besides running faster than a sequential version (e.g. DPBTRF of LAPACK), laipe$Decompose_CSP_8 can also be sped up by employing more cores. The following lists some timing results:

  Core: 1
      Elapsed Time (Seconds): 3171.88
      CPU Time in User Mode (Seconds): 3171.61
      CPU Time in Kernel Mode (Seconds): 0.25
      Total CPU Time (Seconds): 3171.86

  Cores: 2
      Elapsed Time (Seconds): 1599.85
      CPU Time in User Mode (Seconds): 3196.76
      CPU Time in Kernel Mode (Seconds): 0.50
      Total CPU Time (Seconds): 3197.26

  Cores: 3
      Elapsed Time (Seconds): 1075.69
      CPU Time in User Mode (Seconds): 3220.05
      CPU Time in Kernel Mode (Seconds): 0.66
      Total CPU Time (Seconds): 3220.70

  Cores: 4
      Elapsed Time (Seconds): 811.42
      CPU Time in User Mode (Seconds): 3235.82
      CPU Time in Kernel Mode (Seconds): 0.48
      Total CPU Time (Seconds): 3236.30

  Cores: 5
      Elapsed Time (Seconds): 651.21
      CPU Time in User Mode (Seconds): 3240.61
      CPU Time in Kernel Mode (Seconds): 0.75
      Total CPU Time (Seconds): 3241.36

  Cores: 6
      Elapsed Time (Seconds): 544.30
      CPU Time in User Mode (Seconds): 3243.49
      CPU Time in Kernel Mode (Seconds): 0.83
      Total CPU Time (Seconds): 3244.32

  Cores: 7
      Elapsed Time (Seconds): 468.07
      CPU Time in User Mode (Seconds): 3252.71
      CPU Time in Kernel Mode (Seconds): 0.53
      Total CPU Time (Seconds): 3253.25

  Cores: 8
      Elapsed Time (Seconds): 413.04
      CPU Time in User Mode (Seconds): 3264.29
      CPU Time in Kernel Mode (Seconds): 0.84
      Total CPU Time (Seconds): 3265.13

  Cores: 9
      Elapsed Time (Seconds): 368.49
      CPU Time in User Mode (Seconds): 3271.14
      CPU Time in Kernel Mode (Seconds): 1.17
      Total CPU Time (Seconds): 3272.31

  Cores: 10
      Elapsed Time (Seconds): 335.65
      CPU Time in User Mode (Seconds): 3291.82
      CPU Time in Kernel Mode (Seconds): 0.81
      Total CPU Time (Seconds): 3292.64

  Cores: 11
      Elapsed Time (Seconds): 309.82
      CPU Time in User Mode (Seconds): 3325.32
      CPU Time in Kernel Mode (Seconds): 1.47
      Total CPU Time (Seconds): 3326.78

  Cores: 12
      Elapsed Time (Seconds): 284.59
      CPU Time in User Mode (Seconds): 3340.68
      CPU Time in Kernel Mode (Seconds): 1.22
      Total CPU Time (Seconds): 3341.90

  Cores: 13
      Elapsed Time (Seconds): 264.16
      CPU Time in User Mode (Seconds): 3347.13
      CPU Time in Kernel Mode (Seconds): 1.56
      Total CPU Time (Seconds): 3348.69

  Cores: 14
      Elapsed Time (Seconds): 246.50
      CPU Time in User Mode (Seconds): 3350.51
      CPU Time in Kernel Mode (Seconds): 1.54
      Total CPU Time (Seconds): 3352.06

  Cores: 15
      Elapsed Time (Seconds): 233.05
      CPU Time in User Mode (Seconds): 3364.74
      CPU Time in Kernel Mode (Seconds): 1.83
      Total CPU Time (Seconds): 3366.56

  Cores: 16
      Elapsed Time (Seconds): 220.82
      CPU Time in User Mode (Seconds): 3387.50
      CPU Time in Kernel Mode (Seconds): 1.86
      Total CPU Time (Seconds): 3389.36

The total CPU time increases with the number of cores, which is normal, while the elapsed time decreases as more cores are employed: we can see speedup. Speedup and efficiency are listed in the following table:

  Number of Cores   Elapsed Time (sec)   Speedup   Efficiency (%)
                1              3171.88    1.0000           100.00
                2              1599.85    1.9826            99.13
                3              1075.69    2.9487            98.29
                4               811.42    3.9090            97.73
                5               651.21    4.8707            97.41
                6               544.30    5.8274            97.12
                7               468.07    6.7766            96.81
                8               413.04    7.6794            95.99
                9               368.49    8.6078            95.64
               10               335.65    9.4500            94.50
               11               309.82   10.2378            93.07
               12               284.59   11.1454            92.88
               13               264.16   12.0074            92.36
               14               246.50   12.8677            91.91
               15               233.05   13.6103            90.74
               16               220.82   14.3641            89.78
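
        The table values are consistent with the usual definitions: speedup on p cores is the one-core elapsed time divided by the p-core elapsed time, and efficiency is speedup divided by p, expressed as a percentage. The sketch below recomputes both from the elapsed times listed above; the function names are illustrative.

```python
# cores -> elapsed time in seconds, copied from the timing results above
elapsed = {
    1: 3171.88, 2: 1599.85, 3: 1075.69, 4: 811.42,
    5: 651.21, 6: 544.30, 7: 468.07, 8: 413.04,
    9: 368.49, 10: 335.65, 11: 309.82, 12: 284.59,
    13: 264.16, 14: 246.50, 15: 233.05, 16: 220.82,
}

t1 = elapsed[1]  # baseline: the same parallel code with one core enabled

def speedup(p):
    """Speedup on p cores relative to the one-core run."""
    return t1 / elapsed[p]

def efficiency(p):
    """Parallel efficiency in percent: speedup divided by the core count."""
    return 100.0 * speedup(p) / p

print("cores  elapsed(s)  speedup  efficiency(%)")
for p in sorted(elapsed):
    print(f"{p:5d}  {elapsed[p]:10.2f}  {speedup(p):7.4f}  {efficiency(p):13.2f}")
```

        Note that the baseline here is the one-core run of laipe$Decompose_CSP_8, not DPBTRF; measured against DPBTRF's 3856.24 seconds, the ratios would be larger.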

When using 16 cores, the efficiency drops below 90%. This comparison shows that parallel code has more advantages. Even with only one core enabled, the parallel version still has a chance to run faster than a sequential version. The greatest benefit is speedup. Let us go for parallel computing.