Equation Solution  
    High Performance by Design

 



Parallel Performance of laipe$Decompose_DAG_16 on 48 Cores


[Posted by Jenn-Ching Luo on Mar. 16, 2016]

        This post shares a set of parallel performance results for the LAIPE2 subroutine laipe$Decompose_DAG_16, a parallel dense solver for systems of equations. In this set of results, the computation was sped up by about 36x when using 48 cores.
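For readers who want a concrete picture of the kind of factorization a dense solver performs, [A] = [L][U], the following is a minimal serial sketch in Python. It is an illustration only, not the LAIPE2 algorithm: it uses Doolittle's method without pivoting, ordinary 8-byte floats rather than 16-byte reals, and no parallelism. The names `lu_decompose` and `matmul` are hypothetical helpers introduced for this example.

```python
def lu_decompose(a):
    """Factor a square matrix a into (lower, upper) with a = lower * upper.

    Doolittle's method without pivoting: lower has a unit diagonal.
    Assumption of this sketch: every pivot encountered is nonzero.
    """
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # Fill row k of the upper factor.
        for j in range(k, n):
            upper[k][j] = a[k][j] - sum(lower[k][m] * upper[m][j] for m in range(k))
        if upper[k][k] == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        # Fill column k of the lower factor.
        for i in range(k + 1, n):
            partial = sum(lower[i][m] * upper[m][k] for m in range(k))
            lower[i][k] = (a[i][k] - partial) / upper[k][k]
    return lower, upper


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x[i][m] * y[m][j] for m in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


# Small made-up test matrix (not the 5,000-by-5,000 matrix of the post).
A = [[4.0, 3.0, 2.0],
     [8.0, 8.0, 5.0],
     [4.0, 7.0, 9.0]]
L, U = lu_decompose(A)
print(L)
print(U)
print(matmul(L, U) == A)
```

The work grows roughly with the cube of the matrix order, which is why a slow arithmetic type makes even a moderate-size matrix expensive to factor.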

        The example was implemented on the homogeneous cores of neuLoop. The computation ran on a Dell PowerEdge R815 with four 12-core 1.9 GHz Opteron processors (48 cores in total) under Windows Server 2008 R2. The example decomposed a 16-byte real matrix [A] into [L][U]. Because 16-byte real arithmetic is very slow on the current computer, the example matrix is only 5,000-by-5,000. The timing results are as follows:

Timing Result

  Core: 1
      Elapsed Time (Seconds): 5786.84
      CPU Time in User Mode (Seconds): 5786.64
      CPU Time in Kernel Mode (Seconds): 0.20
      Total CPU Time (Seconds): 5786.84

  Cores: 2
      Elapsed Time (Seconds): 2896.97
      CPU Time in User Mode (Seconds): 5776.67
      CPU Time in Kernel Mode (Seconds): 0.09
      Total CPU Time (Seconds): 5776.76

  Cores: 3
      Elapsed Time (Seconds): 1941.53
      CPU Time in User Mode (Seconds): 5775.16
      CPU Time in Kernel Mode (Seconds): 0.27
      Total CPU Time (Seconds): 5775.42

  Cores: 4
      Elapsed Time (Seconds): 1459.73
      CPU Time in User Mode (Seconds): 5770.43
      CPU Time in Kernel Mode (Seconds): 0.45
      Total CPU Time (Seconds): 5770.88

  Cores: 5
      Elapsed Time (Seconds): 1174.95
      CPU Time in User Mode (Seconds): 5767.79
      CPU Time in Kernel Mode (Seconds): 0.76
      Total CPU Time (Seconds): 5768.56

  Cores: 6
      Elapsed Time (Seconds): 983.13
      CPU Time in User Mode (Seconds): 5764.30
      CPU Time in Kernel Mode (Seconds): 0.92
      Total CPU Time (Seconds): 5765.22

  Cores: 7
      Elapsed Time (Seconds): 848.18
      CPU Time in User Mode (Seconds): 5770.41
      CPU Time in Kernel Mode (Seconds): 0.87
      Total CPU Time (Seconds): 5771.29

  Cores: 8
      Elapsed Time (Seconds): 745.15
      CPU Time in User Mode (Seconds): 5778.98
      CPU Time in Kernel Mode (Seconds): 1.17
      Total CPU Time (Seconds): 5780.15

  Cores: 9
      Elapsed Time (Seconds): 667.59
      CPU Time in User Mode (Seconds): 5779.76
      CPU Time in Kernel Mode (Seconds): 1.29
      Total CPU Time (Seconds): 5781.05

  Cores: 10
      Elapsed Time (Seconds): 604.25
      CPU Time in User Mode (Seconds): 5788.20
      CPU Time in Kernel Mode (Seconds): 1.14
      Total CPU Time (Seconds): 5789.34

  Cores: 11
      Elapsed Time (Seconds): 552.85
      CPU Time in User Mode (Seconds): 5795.03
      CPU Time in Kernel Mode (Seconds): 1.28
      Total CPU Time (Seconds): 5796.31

  Cores: 12
      Elapsed Time (Seconds): 508.73
      CPU Time in User Mode (Seconds): 5795.78
      CPU Time in Kernel Mode (Seconds): 1.53
      Total CPU Time (Seconds): 5797.31

  Cores: 13
      Elapsed Time (Seconds): 473.29
      CPU Time in User Mode (Seconds): 5797.20
      CPU Time in Kernel Mode (Seconds): 0.97
      Total CPU Time (Seconds): 5798.17

  Cores: 14
      Elapsed Time (Seconds): 441.97
      CPU Time in User Mode (Seconds): 5805.08
      CPU Time in Kernel Mode (Seconds): 1.98
      Total CPU Time (Seconds): 5807.06

  Cores: 15
      Elapsed Time (Seconds): 415.27
      CPU Time in User Mode (Seconds): 5809.15
      CPU Time in Kernel Mode (Seconds): 2.36
      Total CPU Time (Seconds): 5811.51

  Cores: 16
      Elapsed Time (Seconds): 390.83
      CPU Time in User Mode (Seconds): 5809.98
      CPU Time in Kernel Mode (Seconds): 2.95
      Total CPU Time (Seconds): 5812.92

  Cores: 17
      Elapsed Time (Seconds): 370.89
      CPU Time in User Mode (Seconds): 5816.64
      CPU Time in Kernel Mode (Seconds): 2.29
      Total CPU Time (Seconds): 5818.93

  Cores: 18
      Elapsed Time (Seconds): 352.17
      CPU Time in User Mode (Seconds): 5818.68
      CPU Time in Kernel Mode (Seconds): 2.32
      Total CPU Time (Seconds): 5821.01

  Cores: 19
      Elapsed Time (Seconds): 335.81
      CPU Time in User Mode (Seconds): 5822.27
      CPU Time in Kernel Mode (Seconds): 2.68
      Total CPU Time (Seconds): 5824.95

  Cores: 20
      Elapsed Time (Seconds): 320.49
      CPU Time in User Mode (Seconds): 5829.04
      CPU Time in Kernel Mode (Seconds): 2.42
      Total CPU Time (Seconds): 5831.46

  Cores: 21
      Elapsed Time (Seconds): 307.60
      CPU Time in User Mode (Seconds): 5831.08
      CPU Time in Kernel Mode (Seconds): 2.68
      Total CPU Time (Seconds): 5833.77

  Cores: 22
      Elapsed Time (Seconds): 295.53
      CPU Time in User Mode (Seconds): 5834.83
      CPU Time in Kernel Mode (Seconds): 2.06
      Total CPU Time (Seconds): 5836.89

  Cores: 23
      Elapsed Time (Seconds): 284.52
      CPU Time in User Mode (Seconds): 5840.93
      CPU Time in Kernel Mode (Seconds): 2.54
      Total CPU Time (Seconds): 5843.47

  Cores: 24
      Elapsed Time (Seconds): 273.85
      CPU Time in User Mode (Seconds): 5845.76
      CPU Time in Kernel Mode (Seconds): 3.63
      Total CPU Time (Seconds): 5849.40

  Cores: 25
      Elapsed Time (Seconds): 264.95
      CPU Time in User Mode (Seconds): 5849.16
      CPU Time in Kernel Mode (Seconds): 3.04
      Total CPU Time (Seconds): 5852.21

  Cores: 26
      Elapsed Time (Seconds): 256.39
      CPU Time in User Mode (Seconds): 5852.08
      CPU Time in Kernel Mode (Seconds): 4.34
      Total CPU Time (Seconds): 5856.42

  Cores: 27
      Elapsed Time (Seconds): 248.62
      CPU Time in User Mode (Seconds): 5856.34
      CPU Time in Kernel Mode (Seconds): 4.15
      Total CPU Time (Seconds): 5860.49

  Cores: 28
      Elapsed Time (Seconds): 240.91
      CPU Time in User Mode (Seconds): 5862.89
      CPU Time in Kernel Mode (Seconds): 4.38
      Total CPU Time (Seconds): 5867.28

  Cores: 29
      Elapsed Time (Seconds): 234.45
      CPU Time in User Mode (Seconds): 5862.64
      CPU Time in Kernel Mode (Seconds): 4.46
      Total CPU Time (Seconds): 5867.10

  Cores: 30
      Elapsed Time (Seconds): 228.04
      CPU Time in User Mode (Seconds): 5867.53
      CPU Time in Kernel Mode (Seconds): 4.77
      Total CPU Time (Seconds): 5872.30

  Cores: 31
      Elapsed Time (Seconds): 222.25
      CPU Time in User Mode (Seconds): 5877.56
      CPU Time in Kernel Mode (Seconds): 4.57
      Total CPU Time (Seconds): 5882.13

  Cores: 32
      Elapsed Time (Seconds): 216.19
      CPU Time in User Mode (Seconds): 5882.50
      CPU Time in Kernel Mode (Seconds): 5.85
      Total CPU Time (Seconds): 5888.35

  Cores: 33
      Elapsed Time (Seconds): 211.37
      CPU Time in User Mode (Seconds): 5884.03
      CPU Time in Kernel Mode (Seconds): 4.04
      Total CPU Time (Seconds): 5888.07

  Cores: 34
      Elapsed Time (Seconds): 206.36
      CPU Time in User Mode (Seconds): 5891.78
      CPU Time in Kernel Mode (Seconds): 4.54
      Total CPU Time (Seconds): 5896.32

  Cores: 35
      Elapsed Time (Seconds): 201.90
      CPU Time in User Mode (Seconds): 5895.85
      CPU Time in Kernel Mode (Seconds): 3.90
      Total CPU Time (Seconds): 5899.75

  Cores: 36
      Elapsed Time (Seconds): 197.36
      CPU Time in User Mode (Seconds): 5902.24
      CPU Time in Kernel Mode (Seconds): 5.54
      Total CPU Time (Seconds): 5907.77

  Cores: 37
      Elapsed Time (Seconds): 193.61
      CPU Time in User Mode (Seconds): 5909.47
      CPU Time in Kernel Mode (Seconds): 6.19
      Total CPU Time (Seconds): 5915.67

  Cores: 38
      Elapsed Time (Seconds): 189.71
      CPU Time in User Mode (Seconds): 5912.98
      CPU Time in Kernel Mode (Seconds): 4.71
      Total CPU Time (Seconds): 5917.70

  Cores: 39
      Elapsed Time (Seconds): 186.00
      CPU Time in User Mode (Seconds): 5918.41
      CPU Time in Kernel Mode (Seconds): 5.69
      Total CPU Time (Seconds): 5924.11

  Cores: 40
      Elapsed Time (Seconds): 182.32
      CPU Time in User Mode (Seconds): 5924.47
      CPU Time in Kernel Mode (Seconds): 5.12
      Total CPU Time (Seconds): 5929.58

  Cores: 41
      Elapsed Time (Seconds): 179.28
      CPU Time in User Mode (Seconds): 5923.31
      CPU Time in Kernel Mode (Seconds): 5.40
      Total CPU Time (Seconds): 5928.71

  Cores: 42
      Elapsed Time (Seconds): 175.95
      CPU Time in User Mode (Seconds): 5928.37
      CPU Time in Kernel Mode (Seconds): 4.90
      Total CPU Time (Seconds): 5933.26

  Cores: 43
      Elapsed Time (Seconds): 173.08
      CPU Time in User Mode (Seconds): 5932.58
      CPU Time in Kernel Mode (Seconds): 5.76
      Total CPU Time (Seconds): 5938.33

  Cores: 44
      Elapsed Time (Seconds): 170.29
      CPU Time in User Mode (Seconds): 5943.11
      CPU Time in Kernel Mode (Seconds): 6.46
      Total CPU Time (Seconds): 5949.57

  Cores: 45
      Elapsed Time (Seconds): 167.65
      CPU Time in User Mode (Seconds): 5942.84
      CPU Time in Kernel Mode (Seconds): 7.47
      Total CPU Time (Seconds): 5950.32

  Cores: 46
      Elapsed Time (Seconds): 165.21
      CPU Time in User Mode (Seconds): 5948.08
      CPU Time in Kernel Mode (Seconds): 5.87
      Total CPU Time (Seconds): 5953.95

  Cores: 47
      Elapsed Time (Seconds): 162.93
      CPU Time in User Mode (Seconds): 5953.33
      CPU Time in Kernel Mode (Seconds): 7.64
      Total CPU Time (Seconds): 5960.97

  Cores: 48
      Elapsed Time (Seconds): 160.28
      CPU Time in User Mode (Seconds): 5953.36
      CPU Time in Kernel Mode (Seconds): 6.85
      Total CPU Time (Seconds): 5960.21

From the above, it can be seen that the elapsed time decreased as more cores were enabled. For example, one core took 5786.84 seconds to complete the computation, while two cores cut the elapsed time to 2896.97 seconds, almost half of the one-core time. Three cores completed the computation in 1941.53 seconds, almost 1/3 of the one-core time. With 48 cores, it took only 160.28 seconds to decompose the matrix.
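The timing results also report total CPU time, which stays close to the one-core figure while elapsed time drops. One rough way to read the two together is to compare the CPU time actually consumed with the core time available (cores multiplied by elapsed time). The sketch below does that arithmetic for a few rows copied from the listing above. It assumes that the reported total CPU time is summed over all threads of the run; the post does not state this explicitly, and the function name `busy_fraction` is introduced here for illustration only.

```python
# (cores, elapsed seconds, total CPU seconds), copied from the timing results.
samples = [
    (1, 5786.84, 5786.84),
    (2, 2896.97, 5776.76),
    (24, 273.85, 5849.40),
    (48, 160.28, 5960.21),
]


def busy_fraction(cores, elapsed, total_cpu):
    """Share of the available core time (cores * elapsed) spent as CPU time.

    Assumption: total_cpu is accumulated across all threads of the run.
    """
    return total_cpu / (cores * elapsed)


for cores, elapsed, total_cpu in samples:
    share = 100.0 * busy_fraction(cores, elapsed, total_cpu)
    print(f"{cores:2d} cores: {share:5.1f}% of available core time used")
```

Under that assumption, a share below 100% would indicate core time that was available but not spent computing. The post does not say what the cores were doing in that remaining time, so the figure should be read as an indicator rather than a diagnosis.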

        This example took advantage of multicore hardware. A job of about 96 minutes (5786.84 seconds) was completed in about 2.7 minutes (160.28 seconds) on 48 cores. That is the purpose for which multicore processors are designed: more cores can be used to speed up computing. The speedup and efficiency are listed below.

Speedup and Efficiency

Number of Cores  Elapsed Time (sec)  Speedup  Efficiency (%)
1 5786.84 1.0000 100.00
2 2896.97 1.9975 99.88
3 1941.53 2.9806 99.35
4 1459.73 3.9643 99.11
5 1174.95 4.9252 98.50
6 983.13 5.8861 98.10
7 848.18 6.8227 97.47
8 745.15 7.7660 97.08
9 667.59 8.6683 96.31
10 604.25 9.5769 95.77
11 552.85 10.4673 95.16
12 508.73 11.3751 94.79
13 473.29 12.2268 94.05
14 441.97 13.0933 93.52
15 415.27 13.9351 92.90
16 390.83 14.8065 92.54
17 370.89 15.6026 91.78
18 352.17 16.4320 91.29
19 335.81 17.2325 90.70
20 320.49 18.0562 90.28
21 307.60 18.8129 89.59
22 295.53 19.5812 89.01
23 284.52 20.3390 88.43
24 273.85 21.1314 88.05
25 264.95 21.8413 87.37
26 256.39 22.5705 86.81
27 248.62 23.2758 86.21
28 240.91 24.0208 85.79
29 234.45 24.6826 85.11
30 228.04 25.3764 84.59
31 222.25 26.0375 83.99
32 216.19 26.7674 83.64
33 211.37 27.3778 82.96
34 206.36 28.0425 82.48
35 201.90 28.6619 81.89
36 197.36 29.3212 81.45
37 193.61 29.8892 80.78
38 189.71 30.5036 80.27
39 186.00 31.1120 79.77
40 182.32 31.7400 79.35
41 179.28 32.2782 78.73
42 175.95 32.8891 78.31
43 173.08 33.4345 77.75
44 170.29 33.9823 77.23
45 167.65 34.5174 76.71
46 165.21 35.0272 76.15
47 162.93 35.5173 75.57
48 160.28 36.1046 75.22

The above table has four columns. The first column is the number of cores, and the second is the elapsed time in seconds. The third column is the speedup, that is, the one-core elapsed time divided by the elapsed time on the given number of cores; it is the main result of interest here. The example yielded almost perfect speedup up to 10 cores. For example, two cores yielded a speedup of 1.9975, three cores 2.9806, and four cores 3.9643. With 48 cores, the speedup reached 36.1046.
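The speedup and efficiency columns can be reproduced from the elapsed times alone. The sketch below assumes the usual definitions, which agree with the tabulated values: speedup is the one-core elapsed time divided by the elapsed time on p cores, and efficiency is the speedup divided by p, expressed as a percentage. Only a few rows of the table are included to keep the example short, and the function names are chosen for this illustration.

```python
# Elapsed times in seconds, keyed by number of cores (subset of the table).
elapsed = {1: 5786.84, 2: 2896.97, 3: 1941.53, 4: 1459.73, 48: 160.28}


def speedup(t_one, t_p):
    """One-core elapsed time divided by the elapsed time on p cores."""
    return t_one / t_p


def efficiency(t_one, t_p, cores):
    """Speedup per core, as a percentage."""
    return 100.0 * speedup(t_one, t_p) / cores


t_one = elapsed[1]
for cores, t_p in elapsed.items():
    print(f"{cores:2d}  {t_p:8.2f}  {speedup(t_one, t_p):8.4f}  "
          f"{efficiency(t_one, t_p, cores):6.2f}")
```

Running the same loop over all 48 elapsed times should regenerate the full table, up to rounding in the last digit.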

        This example yielded no super-linear speedup: the speedup never exceeded the number of cores in use.

        The fourth column is efficiency, that is, the speedup divided by the number of cores, expressed as a percentage. Up to 20 cores, the example achieved an efficiency above 90%. With 48 cores, it still reached about 75% parallel efficiency. This example shows that efficient parallel programs can be developed to take advantage of multicore hardware. The LAIPE2 subroutines are programmed in neuLoop.
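As a supplementary way to read the efficiency figures, the sketch below applies the Karp-Flatt metric, which is not used in the original post. Given a measured speedup S on p cores, it estimates the fraction of the run that behaves as if it were serial: e = (1/S - 1/p) / (1 - 1/p). The inputs are speedups copied from the table; the function name `karp_flatt` is chosen for this example, and the interpretation rests on the simplifying assumption that the loss of efficiency can be lumped into a single serial-like fraction.

```python
def karp_flatt(speedup, cores):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    if cores < 2:
        raise ValueError("the metric is defined for two or more cores")
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


# (cores, speedup) pairs copied from the speedup table.
rows = [(2, 1.9975), (10, 9.5769), (20, 18.0562), (48, 36.1046)]

for cores, s in rows:
    print(f"{cores:2d} cores: estimated serial fraction {karp_flatt(s, cores):.4f}")
```

A fraction that stays small and roughly constant as cores are added is usually taken to mean that a fixed serial-like portion limits the speedup, while a growing fraction points to overhead that increases with the core count. The post itself draws no such conclusion.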