Equation Solution  
    High Performance by Design

 



Parallel Dense Solver on 64 Cores


[Posted by Jenn-Ching Luo on Mar. 23, 2016 ]

        This post reports the performance of a LAIPE2 parallel dense solver on 64 cores. Parallel computing is the trend: in 2016, hardly anyone uses a single-core computer, and nearly every computer is equipped with a multicore processor. The question is how much a multicore processor can speed up a computation, and not everyone has a feel for the answer. In this example, 64 cores ran the computation about 53 times faster than one core. The detailed timing results are included in this post.

        The computation ran on a Dell PowerEdge R815 with four 16-core Opteron 6276 processors, 64 cores in total, under Windows Server 2008 R2. Because the Opteron 6276 can run at a higher frequency when 8 or fewer cores are in use, the processor's turbo boost was disabled so that parallel performance could be measured on an equal footing. The timing results were obtained by running the LAIPE2 subroutine laipe$Decompose_DAG_z16, which is a parallel dense solver for systems of equations.

        The example matrix is a 32-byte complex matrix of order 4,000-by-4,000, i.e., the real and imaginary parts of each complex number are both 16-byte variables. 32-byte complex arithmetic is extremely slow on current computers, so don't be surprised that one core took about 4 hours and 20 minutes to decompose this "small" matrix. First, let us look at the timing results.
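        For a sense of scale, the storage for this matrix can be worked out from the figures above. This is a minimal Python sketch, assuming the matrix is stored densely with no padding; the variable names are illustrative and not part of LAIPE2.

```python
# Storage for the example matrix, ASSUMING dense storage with no padding.
order = 4_000                            # the matrix is order-by-order
bytes_per_part = 16                      # real and imaginary parts are 16-byte variables
bytes_per_element = 2 * bytes_per_part   # one 32-byte complex number

elements = order * order
total_bytes = elements * bytes_per_element

print(f"{elements:,} elements, {total_bytes:,} bytes")
```

        Under that assumption the matrix holds 16 million elements in 512 million bytes, which is why it counts as "small" in memory terms even though the arithmetic is slow.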

Timing Result

  Core: 1
      Elapsed Time (seconds): 15714.46
      CPU Time in User Mode (seconds): 15713.43
      CPU Time in Kernel Mode (seconds): 1.00
      Total CPU Time (seconds): 15714.43

  Cores: 2
      Elapsed Time (seconds): 7883.67
      CPU Time in User Mode (seconds): 15742.51
      CPU Time in Kernel Mode (seconds): 0.98
      Total CPU Time (seconds): 15743.50

  Cores: 3
      Elapsed Time (seconds): 5519.60
      CPU Time in User Mode (seconds): 16503.66
      CPU Time in Kernel Mode (seconds): 1.39
      Total CPU Time (seconds): 16505.05

  Cores: 4
      Elapsed Time (seconds): 4252.40
      CPU Time in User Mode (seconds): 16934.08
      CPU Time in Kernel Mode (seconds): 1.31
      Total CPU Time (seconds): 16935.39

  Cores: 5
      Elapsed Time (seconds): 3356.05
      CPU Time in User Mode (seconds): 16683.14
      CPU Time in Kernel Mode (seconds): 1.48
      Total CPU Time (seconds): 16684.62

  Cores: 6
      Elapsed Time (seconds): 2845.16
      CPU Time in User Mode (seconds): 16947.50
      CPU Time in Kernel Mode (seconds): 1.39
      Total CPU Time (seconds): 16948.88

  Cores: 7
      Elapsed Time (seconds): 2416.47
      CPU Time in User Mode (seconds): 16769.44
      CPU Time in Kernel Mode (seconds): 1.26
      Total CPU Time (seconds): 16770.70

  Cores: 8
      Elapsed Time (seconds): 2143.41
      CPU Time in User Mode (seconds): 16972.97
      CPU Time in Kernel Mode (seconds): 2.09
      Total CPU Time (seconds): 16975.06

  Cores: 9
      Elapsed Time (seconds): 1892.39
      CPU Time in User Mode (seconds): 16831.20
      CPU Time in Kernel Mode (seconds): 1.68
      Total CPU Time (seconds): 16832.88

  Cores: 10
      Elapsed Time (seconds): 1705.95
      CPU Time in User Mode (seconds): 16834.52
      CPU Time in Kernel Mode (seconds): 1.78
      Total CPU Time (seconds): 16836.30

  Cores: 11
      Elapsed Time (seconds): 1558.08
      CPU Time in User Mode (seconds): 16888.95
      CPU Time in Kernel Mode (seconds): 1.51
      Total CPU Time (seconds): 16890.46

  Cores: 12
      Elapsed Time (seconds): 1441.40
      CPU Time in User Mode (seconds): 17019.33
      CPU Time in Kernel Mode (seconds): 1.97
      Total CPU Time (seconds): 17021.30

  Cores: 13
      Elapsed Time (seconds): 1325.51
      CPU Time in User Mode (seconds): 16927.25
      CPU Time in Kernel Mode (seconds): 1.72
      Total CPU Time (seconds): 16928.96

  Cores: 14
      Elapsed Time (seconds): 1240.40
      CPU Time in User Mode (seconds): 17030.44
      CPU Time in Kernel Mode (seconds): 2.22
      Total CPU Time (seconds): 17032.66

  Cores: 15
      Elapsed Time (seconds): 1154.30
      CPU Time in User Mode (seconds): 16954.66
      CPU Time in Kernel Mode (seconds): 1.87
      Total CPU Time (seconds): 16956.53

  Cores: 16
      Elapsed Time (seconds): 1089.39
      CPU Time in User Mode (seconds): 17049.07
      CPU Time in Kernel Mode (seconds): 1.76
      Total CPU Time (seconds): 17050.83

  Cores: 17
      Elapsed Time (seconds): 1023.05
      CPU Time in User Mode (seconds): 16979.24
      CPU Time in Kernel Mode (seconds): 1.84
      Total CPU Time (seconds): 16981.08

  Cores: 18
      Elapsed Time (seconds): 970.87
      CPU Time in User Mode (seconds): 17039.15
      CPU Time in Kernel Mode (seconds): 2.37
      Total CPU Time (seconds): 17041.52

  Cores: 19
      Elapsed Time (seconds): 919.42
      CPU Time in User Mode (seconds): 17001.32
      CPU Time in Kernel Mode (seconds): 2.15
      Total CPU Time (seconds): 17003.47

  Cores: 20
      Elapsed Time (seconds): 878.44
      CPU Time in User Mode (seconds): 17072.17
      CPU Time in Kernel Mode (seconds): 2.62
      Total CPU Time (seconds): 17074.79

  Cores: 21
      Elapsed Time (seconds): 834.67
      CPU Time in User Mode (seconds): 17004.97
      CPU Time in Kernel Mode (seconds): 2.25
      Total CPU Time (seconds): 17007.21

  Cores: 22
      Elapsed Time (seconds): 800.75
      CPU Time in User Mode (seconds): 17061.24
      CPU Time in Kernel Mode (seconds): 2.81
      Total CPU Time (seconds): 17064.04

  Cores: 23
      Elapsed Time (seconds): 765.17
      CPU Time in User Mode (seconds): 17013.89
      CPU Time in Kernel Mode (seconds): 2.20
      Total CPU Time (seconds): 17016.09

  Cores: 24
      Elapsed Time (seconds): 736.03
      CPU Time in User Mode (seconds): 17064.64
      CPU Time in Kernel Mode (seconds): 2.51
      Total CPU Time (seconds): 17067.15

  Cores: 25
      Elapsed Time (seconds): 707.09
      CPU Time in User Mode (seconds): 17036.21
      CPU Time in Kernel Mode (seconds): 2.54
      Total CPU Time (seconds): 17038.76

  Cores: 26
      Elapsed Time (seconds): 680.84
      CPU Time in User Mode (seconds): 17034.64
      CPU Time in Kernel Mode (seconds): 2.85
      Total CPU Time (seconds): 17037.49

  Cores: 27
      Elapsed Time (seconds): 656.79
      CPU Time in User Mode (seconds): 17039.01
      CPU Time in Kernel Mode (seconds): 2.89
      Total CPU Time (seconds): 17041.89

  Cores: 28
      Elapsed Time (seconds): 636.31
      CPU Time in User Mode (seconds): 17089.71
      CPU Time in Kernel Mode (seconds): 2.84
      Total CPU Time (seconds): 17092.55

  Cores: 29
      Elapsed Time (seconds): 613.72
      CPU Time in User Mode (seconds): 17042.36
      CPU Time in Kernel Mode (seconds): 2.82
      Total CPU Time (seconds): 17045.19

  Cores: 30
      Elapsed Time (seconds): 595.14
      CPU Time in User Mode (seconds): 17071.75
      CPU Time in Kernel Mode (seconds): 2.73
      Total CPU Time (seconds): 17074.48

  Cores: 31
      Elapsed Time (seconds): 575.69
      CPU Time in User Mode (seconds): 17039.79
      CPU Time in Kernel Mode (seconds): 2.92
      Total CPU Time (seconds): 17042.71

  Cores: 32
      Elapsed Time (seconds): 558.67
      CPU Time in User Mode (seconds): 17044.44
      CPU Time in Kernel Mode (seconds): 3.06
      Total CPU Time (seconds): 17047.49

  Cores: 33
      Elapsed Time (seconds): 542.85
      CPU Time in User Mode (seconds): 17048.38
      CPU Time in Kernel Mode (seconds): 2.96
      Total CPU Time (seconds): 17051.35

  Cores: 34
      Elapsed Time (seconds): 528.19
      CPU Time in User Mode (seconds): 17056.03
      CPU Time in Kernel Mode (seconds): 3.15
      Total CPU Time (seconds): 17059.18

  Cores: 35
      Elapsed Time (seconds): 513.77
      CPU Time in User Mode (seconds): 17045.25
      CPU Time in Kernel Mode (seconds): 3.26
      Total CPU Time (seconds): 17048.51

  Cores: 36
      Elapsed Time (seconds): 500.64
      CPU Time in User Mode (seconds): 17051.61
      CPU Time in Kernel Mode (seconds): 3.20
      Total CPU Time (seconds): 17054.81

  Cores: 37
      Elapsed Time (seconds): 487.21
      CPU Time in User Mode (seconds): 17038.32
      CPU Time in Kernel Mode (seconds): 3.68
      Total CPU Time (seconds): 17042.00

  Cores: 38
      Elapsed Time (seconds): 475.32
      CPU Time in User Mode (seconds): 17034.22
      CPU Time in Kernel Mode (seconds): 3.28
      Total CPU Time (seconds): 17037.49

  Cores: 39
      Elapsed Time (seconds): 463.18
      CPU Time in User Mode (seconds): 17015.95
      CPU Time in Kernel Mode (seconds): 3.49
      Total CPU Time (seconds): 17019.44

  Cores: 40
      Elapsed Time (seconds): 452.53
      CPU Time in User Mode (seconds): 17015.58
      CPU Time in Kernel Mode (seconds): 3.70
      Total CPU Time (seconds): 17019.27

  Cores: 41
      Elapsed Time (seconds): 442.43
      CPU Time in User Mode (seconds): 17028.63
      CPU Time in Kernel Mode (seconds): 4.13
      Total CPU Time (seconds): 17032.77

  Cores: 42
      Elapsed Time (seconds): 433.28
      CPU Time in User Mode (seconds): 17047.23
      CPU Time in Kernel Mode (seconds): 3.45
      Total CPU Time (seconds): 17050.68

  Cores: 43
      Elapsed Time (seconds): 423.54
      CPU Time in User Mode (seconds): 17029.97
      CPU Time in Kernel Mode (seconds): 3.62
      Total CPU Time (seconds): 17033.59

  Cores: 44
      Elapsed Time (seconds): 414.92
      CPU Time in User Mode (seconds): 17052.77
      CPU Time in Kernel Mode (seconds): 3.46
      Total CPU Time (seconds): 17056.23

  Cores: 45
      Elapsed Time (seconds): 405.81
      CPU Time in User Mode (seconds): 17037.18
      CPU Time in Kernel Mode (seconds): 3.93
      Total CPU Time (seconds): 17041.11

  Cores: 46
      Elapsed Time (seconds): 397.91
      CPU Time in User Mode (seconds): 17029.04
      CPU Time in Kernel Mode (seconds): 3.95
      Total CPU Time (seconds): 17032.99

  Cores: 47
      Elapsed Time (seconds): 390.13
      CPU Time in User Mode (seconds): 17041.39
      CPU Time in Kernel Mode (seconds): 3.51
      Total CPU Time (seconds): 17044.90

  Cores: 48
      Elapsed Time (seconds): 382.56
      CPU Time in User Mode (seconds): 17033.44
      CPU Time in Kernel Mode (seconds): 4.43
      Total CPU Time (seconds): 17037.87

  Cores: 49
      Elapsed Time (seconds): 375.64
      CPU Time in User Mode (seconds): 17068.04
      CPU Time in Kernel Mode (seconds): 3.87
      Total CPU Time (seconds): 17071.91

  Cores: 50
      Elapsed Time (seconds): 369.22
      CPU Time in User Mode (seconds): 17049.27
      CPU Time in Kernel Mode (seconds): 4.38
      Total CPU Time (seconds): 17053.65

  Cores: 51
      Elapsed Time (seconds): 362.67
      CPU Time in User Mode (seconds): 17062.06
      CPU Time in Kernel Mode (seconds): 4.46
      Total CPU Time (seconds): 17066.53

  Cores: 52
      Elapsed Time (seconds): 356.31
      CPU Time in User Mode (seconds): 17070.77
      CPU Time in Kernel Mode (seconds): 4.01
      Total CPU Time (seconds): 17074.78

  Cores: 53
      Elapsed Time (seconds): 349.83
      CPU Time in User Mode (seconds): 17054.36
      CPU Time in Kernel Mode (seconds): 4.37
      Total CPU Time (seconds): 17058.72

  Cores: 54
      Elapsed Time (seconds): 344.06
      CPU Time in User Mode (seconds): 17064.67
      CPU Time in Kernel Mode (seconds): 3.96
      Total CPU Time (seconds): 17068.63

  Cores: 55
      Elapsed Time (seconds): 338.35
      CPU Time in User Mode (seconds): 17066.43
      CPU Time in Kernel Mode (seconds): 4.27
      Total CPU Time (seconds): 17070.71

  Cores: 56
      Elapsed Time (seconds): 333.06
      CPU Time in User Mode (seconds): 17057.20
      CPU Time in Kernel Mode (seconds): 4.10
      Total CPU Time (seconds): 17061.30

  Cores: 57
      Elapsed Time (seconds): 327.98
      CPU Time in User Mode (seconds): 17092.94
      CPU Time in Kernel Mode (seconds): 4.43
      Total CPU Time (seconds): 17097.37

  Cores: 58
      Elapsed Time (seconds): 323.33
      CPU Time in User Mode (seconds): 17082.69
      CPU Time in Kernel Mode (seconds): 5.10
      Total CPU Time (seconds): 17087.79

  Cores: 59
      Elapsed Time (seconds): 318.15
      CPU Time in User Mode (seconds): 17081.35
      CPU Time in Kernel Mode (seconds): 4.73
      Total CPU Time (seconds): 17086.07

  Cores: 60
      Elapsed Time (seconds): 313.28
      CPU Time in User Mode (seconds): 17071.66
      CPU Time in Kernel Mode (seconds): 4.06
      Total CPU Time (seconds): 17075.71

  Cores: 61
      Elapsed Time (seconds): 308.91
      CPU Time in User Mode (seconds): 17077.71
      CPU Time in Kernel Mode (seconds): 4.96
      Total CPU Time (seconds): 17082.67

  Cores: 62
      Elapsed Time (seconds): 304.55
      CPU Time in User Mode (seconds): 17101.91
      CPU Time in Kernel Mode (seconds): 5.09
      Total CPU Time (seconds): 17106.99

  Cores: 63
      Elapsed Time (seconds): 300.51
      CPU Time in User Mode (seconds): 17109.16
      CPU Time in Kernel Mode (seconds): 4.52
      Total CPU Time (seconds): 17113.68

  Cores: 64
      Elapsed Time (seconds): 296.82
      CPU Time in User Mode (seconds): 17136.60
      CPU Time in Kernel Mode (seconds): 4.85
      Total CPU Time (seconds): 17141.45

From the above timing results, the first thing to note is how long one core took to decompose the matrix: 15714.46 seconds, i.e., about 4 hours and 20 minutes. That is extremely slow.

        Second, let us examine how much time 64 cores required to decompose the matrix. The timing results show that 64 cores took 296.82 seconds, so the solution is available in less than 5 minutes instead of the roughly 4 hours and 20 minutes needed on one core. This result gives no reason to reject multicore applications, even though they are relatively difficult to develop.

        Third, the elapsed time has not reached a limit yet. In parallel computing, the elapsed time may stop improving, or even get worse, as more cores are added, i.e., it reaches a limit. That did not happen in this example: in the above list, every additional core reduced the elapsed time. This suggests that the computing speed could be improved further if more cores were available.
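        The claim that every additional core reduced the elapsed time can be checked directly against the timing list. The following Python sketch copies the 64 elapsed times from the list above; the variable names are illustrative.

```python
# Elapsed times in seconds for 1..64 cores, copied from the timing list above.
elapsed = [
    15714.46, 7883.67, 5519.60, 4252.40, 3356.05, 2845.16, 2416.47, 2143.41,
    1892.39, 1705.95, 1558.08, 1441.40, 1325.51, 1240.40, 1154.30, 1089.39,
    1023.05, 970.87, 919.42, 878.44, 834.67, 800.75, 765.17, 736.03,
    707.09, 680.84, 656.79, 636.31, 613.72, 595.14, 575.69, 558.67,
    542.85, 528.19, 513.77, 500.64, 487.21, 475.32, 463.18, 452.53,
    442.43, 433.28, 423.54, 414.92, 405.81, 397.91, 390.13, 382.56,
    375.64, 369.22, 362.67, 356.31, 349.83, 344.06, 338.35, 333.06,
    327.98, 323.33, 318.15, 313.28, 308.91, 304.55, 300.51, 296.82,
]

# True if each entry is smaller than the one before it,
# i.e. every additional core reduced the elapsed time.
strictly_decreasing = all(b < a for a, b in zip(elapsed, elapsed[1:]))

# Seconds saved by the last core added (63 -> 64 cores).
last_gain = elapsed[-2] - elapsed[-1]

print(len(elapsed), strictly_decreasing, round(last_gain, 2))
```

        The check passes for all 64 entries, and the 64th core still saved about 3.7 seconds, although the gain per added core is much smaller than at low core counts.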

        The detailed speedup and efficiency are as follows.

Speedup and Efficiency

Number of Cores   Elapsed Time (sec)   Speedup   Efficiency (%)
1 15714.46 1.0000 100.00
2 7883.67 1.9933 99.66
3 5519.60 2.8470 94.90
4 4252.40 3.6954 92.39
5 3356.05 4.6824 93.65
6 2845.16 5.5232 92.05
7 2416.47 6.5031 92.90
8 2143.41 7.3315 91.64
9 1892.39 8.3040 92.27
10 1705.95 9.2116 92.12
11 1558.08 10.0858 91.69
12 1441.40 10.9022 90.85
13 1325.51 11.8554 91.20
14 1240.40 12.6689 90.49
15 1154.30 13.6138 90.76
16 1089.39 14.4250 90.16
17 1023.05 15.3604 90.36
18 970.87 16.1860 89.92
19 919.42 17.0917 89.96
20 878.44 17.8891 89.45
21 834.67 18.8272 89.65
22 800.75 19.6247 89.20
23 765.17 20.5372 89.29
24 736.03 21.3503 88.96
25 707.09 22.2241 88.90
26 680.84 23.0810 88.77
27 656.79 23.9262 88.62
28 636.31 24.6962 88.20
29 613.72 25.6053 88.29
30 595.14 26.4046 88.02
31 575.69 27.2967 88.05
32 558.67 28.1283 87.90
33 542.85 28.9481 87.72
34 528.19 29.7515 87.50
35 513.77 30.5866 87.39
36 500.64 31.3887 87.19
37 487.21 32.2540 87.17
38 475.32 33.0608 87.00
39 463.18 33.9273 86.99
40 452.53 34.7258 86.81
41 442.43 35.5185 86.63
42 433.28 36.2686 86.35
43 423.54 37.1027 86.29
44 414.92 37.8735 86.08
45 405.81 38.7237 86.05
46 397.91 39.4925 85.85
47 390.13 40.28 85.70
48 382.56 41.0771 85.58
49 375.64 41.8338 85.38
50 369.22 42.5612 85.12
51 362.67 43.3299 84.96
52 356.31 44.1033 84.81
53 349.83 44.9203 84.76
54 344.06 45.6736 84.58
55 338.35 46.4444 84.44
56 333.06 47.1821 84.25
57 327.98 47.9129 84.06
58 323.33 48.6019 83.80
59 318.15 49.3932 83.72
60 313.28 50.1611 83.60
61 308.91 50.8707 83.39
62 304.55 51.5989 83.22
63 300.51 52.2926 83.00
64 296.82 52.9427 82.72
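
        The speedup and efficiency columns can be reproduced from the elapsed times. The following is a minimal Python sketch under the usual definitions, which agree with the table's figures; only a few rows are copied here, and the function names are illustrative.

```python
# Elapsed times (seconds) for a few core counts, copied from the table above.
elapsed = {1: 15714.46, 2: 7883.67, 4: 4252.40, 8: 2143.41,
           16: 1089.39, 32: 558.67, 64: 296.82}


def speedup(cores):
    """One-core elapsed time divided by the elapsed time on `cores` cores."""
    return elapsed[1] / elapsed[cores]


def efficiency(cores):
    """Speedup per core, expressed as a percentage."""
    return 100.0 * speedup(cores) / cores


for cores in elapsed:
    print(f"{cores:2d}  {speedup(cores):8.4f}  {efficiency(cores):6.2f}")
```

        For 64 cores this gives a speedup of about 52.94 and an efficiency of about 82.72%, matching the last row of the table.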

The above table has four columns: the number of cores, the elapsed time in seconds, the speedup, and the efficiency. Speedup is the one-core elapsed time divided by the elapsed time on the given number of cores, and efficiency is the speedup divided by the number of cores. Our interest is in the speedup and the efficiency. Two points are worth noting.

        First, the example shows unusual behavior. Normally, efficiency decreases as cores are added. In the above table, however, the efficiency does not decrease monotonically. For example, the efficiency of four cores is 92.39%, so five cores would be expected to yield a lower efficiency; instead, the efficiency of five cores is 93.65%, which is higher than that of four cores. At this moment, the actual cause is uncertain. One possible explanation is the cost of accessing memory.
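        The core counts at which efficiency rises instead of falling can be picked out mechanically. This Python sketch uses only the first 16 elapsed times from the table and recomputes efficiency from them; the variable names are illustrative.

```python
# Elapsed times (seconds) for 1..16 cores, copied from the table above.
elapsed = [15714.46, 7883.67, 5519.60, 4252.40, 3356.05, 2845.16, 2416.47, 2143.41,
           1892.39, 1705.95, 1558.08, 1441.40, 1325.51, 1240.40, 1154.30, 1089.39]

# Efficiency (%) for p = 1..16 cores: one-core time / (p * time on p cores).
eff = [100.0 * elapsed[0] / (p * t) for p, t in enumerate(elapsed, start=1)]

# Core counts whose efficiency is higher than with one core fewer.
rises = [p for p in range(2, len(eff) + 1) if eff[p - 1] > eff[p - 2]]

print(rises)
```

        Within these 16 rows the rise occurs at 5, 7, 9, 13, and 15 cores, so the four-to-five-core case is not an isolated exception. The sketch only locates the rises; it does not explain them.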

        Second, 64 cores improved the computing speed by about 53x and yielded an efficiency of about 83%. This example provides one answer to how much a multicore machine can speed up a computation.
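
        One way to read the 83% figure, offered only as a sketch and not as the author's analysis, is to split it into two factors using the CPU times in the timing list: how the one-core time compares with the total CPU time spent by 64 cores, and how much of the available core time (cores times elapsed time) was actually spent on the CPU. The names below are illustrative.

```python
# Figures copied from the timing list above (seconds).
t1_elapsed = 15714.46     # elapsed time on 1 core
cores = 64
tp_elapsed = 296.82       # elapsed time on 64 cores
tp_cpu_total = 17141.45   # total CPU time on 64 cores

# Efficiency as defined for the table, as a fraction rather than a percentage.
efficiency = t1_elapsed / (cores * tp_elapsed)

# Factor 1: one-core time relative to the total CPU time spent by 64 cores.
work_ratio = t1_elapsed / tp_cpu_total
# Factor 2: share of the available core time (cores * elapsed) spent on the CPU.
busy_share = tp_cpu_total / (cores * tp_elapsed)

print(f"efficiency = {efficiency:.4f}")
print(f"work ratio = {work_ratio:.4f}, busy share = {busy_share:.4f}")
```

        The two factors multiply back to the efficiency by construction. With these figures each comes out a little above 0.9, which would suggest that the loss is split between extra CPU time and core time not spent on the CPU. This is only one reading of the numbers and does not identify the cause.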