Equation Solution  
    High Performance by Design

 



Parallel Performance of 10-byte Real Matrix Product on 48 Cores


[Posted by Jenn-Ching Luo on May 7, 2016 ]

        This post shares a set of parallel timing results for matrix product. The parallelism of matrix product is evident from its formulation, which makes it a good example of parallel processing. This example shows "mixed" efficiency: performance was highly efficient when a small number of cores was enabled, but degraded sharply when many cores were enabled.
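The parallelism is visible in the formulation because each entry of [C] depends only on one row of [A] and one column of [B], so disjoint blocks of rows of [C] can be computed without coordination. The following Python sketch illustrates that decomposition only. It is not the LAIPE2 implementation, and the names `row_block` and `parallel_matmul` are hypothetical. Python threads do not execute this pure-Python arithmetic concurrently, so the sketch shows the partitioning, not the speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def row_block(a, b, rows):
    """Compute the rows of C = A*B listed in `rows` (independent of all other rows)."""
    n_inner = len(b)
    n_cols = len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(n_inner)) for j in range(n_cols)]
        for i in rows
    ]


def parallel_matmul(a, b, workers):
    """Split the rows of C into `workers` contiguous blocks, one task per block."""
    n = len(a)
    bounds = [(w * n // workers, (w + 1) * n // workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        blocks = pool.map(lambda lo_hi: row_block(a, b, range(*lo_hi)), bounds)
        # Blocks come back in submission order, so concatenating them rebuilds C.
        return [row for block in blocks for row in block]


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0]]
c = parallel_matmul(a, b, workers=2)
print(c)
```

No task writes to a row that another task reads or writes, which is why the product needs no synchronization between workers.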

        For example, on 10 cores or fewer, it yielded an almost perfect speedup, with an efficiency above 97%; on 38 cores, however, parallel efficiency was below 50%. One cause of the degradation is that the matrices were allocated in a variable, which could not take advantage of NUMA. SMP cannot perform efficiently on many cores, especially because matrix product is a memory-hungry procedure. The following shows the performance.

Example for Test

Compute [C]=[A][B]. Matrices [A], [B], and [C] are 10-byte real matrices of order 8200 (8200-by-8200).
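To give a sense of the problem size, the following sketch works out the storage and operation counts implied by the test description. It assumes exactly 10 bytes per element, as the type name suggests. The stored size may be larger if the compiler pads each element, so `BYTES_PER_ELEMENT` is an assumption to adjust.

```python
N = 8200                 # matrix order, from the test description
BYTES_PER_ELEMENT = 10   # assumption: no padding added by the compiler

elements_per_matrix = N * N
bytes_per_matrix = elements_per_matrix * BYTES_PER_ELEMENT
total_bytes = 3 * bytes_per_matrix   # [A], [B], and [C]
multiply_adds = N ** 3               # one multiply-add per (i, j, k) triple

print(f"elements per matrix: {elements_per_matrix:,}")
print(f"bytes per matrix:    {bytes_per_matrix:,}")
print(f"bytes for A, B, C:   {total_bytes:,}")
print(f"multiply-adds:       {multiply_adds:,}")
```

Under this assumption the three matrices occupy roughly 2 GB, far more than any cache on the test machine, which is consistent with the description of matrix product as memory-hungry.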

Computing Environment:

Computer: Dell PowerEdge R815 with four 12-core Opteron 6168 processors, a total of 48 cores, Windows Server 2008 R2.
Compiler: gfortran with optimization -O3
Subroutine: LAIPE2 MATMUL_10 in neuloop4

Timing Result

(A) 1 Core to 10 Cores
        First, let us look at the performance from one core to ten cores, which was "almost perfect". Elapsed time fell in proportion to the reciprocal of the number of cores enabled. For example, one core took 4887.09 seconds to complete the computation, two cores required only 2437.81 seconds, and so on. The timing result from one core to 10 cores is as follows:

  number of cores: 1
      Elapsed Time (Seconds): 4887.09
      CPU Time in User Mode (Seconds): 4884.08
      CPU Time in Kernel Mode (Seconds): 2.68
      Total CPU Time (Seconds): 4886.76

  number of cores: 2
      Elapsed Time (Seconds): 2437.81
      CPU Time in User Mode (Seconds): 4870.63
      CPU Time in Kernel Mode (Seconds): 3.54
      Total CPU Time (Seconds): 4874.17

  number of cores: 3
      Elapsed Time (Seconds): 1625.73
      CPU Time in User Mode (Seconds): 4870.65
      CPU Time in Kernel Mode (Seconds): 3.28
      Total CPU Time (Seconds): 4873.92

  number of cores: 4
      Elapsed Time (Seconds): 1220.16
      CPU Time in User Mode (Seconds): 4871.91
      CPU Time in Kernel Mode (Seconds): 4.17
      Total CPU Time (Seconds): 4876.08

  number of cores: 5
      Elapsed Time (Seconds): 976.63
      CPU Time in User Mode (Seconds): 4875.59
      CPU Time in Kernel Mode (Seconds): 3.98
      Total CPU Time (Seconds): 4879.57

  number of cores: 6
      Elapsed Time (Seconds): 814.92
      CPU Time in User Mode (Seconds): 4878.78
      CPU Time in Kernel Mode (Seconds): 4.31
      Total CPU Time (Seconds): 4883.08

  number of cores: 7
      Elapsed Time (Seconds): 701.63
      CPU Time in User Mode (Seconds): 4899.90
      CPU Time in Kernel Mode (Seconds): 4.95
      Total CPU Time (Seconds): 4904.84

  number of cores: 8
      Elapsed Time (Seconds): 616.59
      CPU Time in User Mode (Seconds): 4918.01
      CPU Time in Kernel Mode (Seconds): 4.93
      Total CPU Time (Seconds): 4922.94

  number of cores: 9
      Elapsed Time (Seconds): 551.71
      CPU Time in User Mode (Seconds): 4945.98
      CPU Time in Kernel Mode (Seconds): 4.71
      Total CPU Time (Seconds): 4950.69

  number of cores: 10
      Elapsed Time (Seconds): 500.45
      CPU Time in User Mode (Seconds): 4985.56
      CPU Time in Kernel Mode (Seconds): 5.26
      Total CPU Time (Seconds): 4990.82
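The claim that elapsed time fell in proportion to the reciprocal of the core count can be checked against the listing above. The sketch below compares each measured elapsed time with the ideal value T1/p, where T1 is the one-core time. The helper name `deviation_percent` is hypothetical.

```python
# Elapsed times in seconds for 1 to 10 cores, copied from the listing above.
elapsed = {
    1: 4887.09, 2: 2437.81, 3: 1625.73, 4: 1220.16, 5: 976.63,
    6: 814.92, 7: 701.63, 8: 616.59, 9: 551.71, 10: 500.45,
}


def deviation_percent(cores):
    """Percent by which the measured time exceeds the ideal time T1/cores."""
    ideal = elapsed[1] / cores
    return 100.0 * (elapsed[cores] - ideal) / ideal


for cores in elapsed:
    ideal = elapsed[1] / cores
    print(f"{cores:2d} cores: ideal {ideal:8.2f} s, "
          f"measured {elapsed[cores]:8.2f} s, "
          f"deviation {deviation_percent(cores):+.2f}%")
```

A negative deviation means the run was slightly faster than the ideal 1/p scaling, as happens for the smallest core counts.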

(B) 11 Cores to 40 Cores
        Next, let us look at the performance from 11 cores to 40 cores. Elapsed time continued to fall as more cores were used, but the performance can no longer be considered "highly efficient". As more cores cooperated in the computation, "false caching" became significant in this example and degraded the performance.

        For example, on 11 cores the total CPU time was 5050.60 seconds, so each core took 5050.60/11=459.15 seconds on average, which is very close to the elapsed time of 460.50 seconds. Because the average CPU time per core is almost equal to the elapsed time, the parallelism kept all the enabled cores busy all the time; none was idle due to lack of parallelism. Clearly, this is efficient parallelism.

        By contrast, on 40 cores the total CPU time jumped to 10330.89 seconds, more than twice what 11 cores required. This extra cost is the main cause of the degraded overall performance. However, the degradation does not come from a lack of parallelism, which is easily verified: each core took 10330.89/40=258.27 seconds on average, close to the elapsed time of 260.96 seconds. The parallelism kept all enabled cores busy almost all the time; none was idle due to lack of parallelism. Matrix product evidently parallelizes efficiently. The timing result from 11 cores to 40 cores is as follows:

  number of cores: 11
      Elapsed Time (Seconds): 460.50
      CPU Time in User Mode (Seconds): 5044.99
      CPU Time in Kernel Mode (Seconds): 5.60
      Total CPU Time (Seconds): 5050.60

  number of cores: 12
      Elapsed Time (Seconds): 427.93
      CPU Time in User Mode (Seconds): 5113.15
      CPU Time in Kernel Mode (Seconds): 6.22
      Total CPU Time (Seconds): 5119.38

  number of cores: 13
      Elapsed Time (Seconds): 402.65
      CPU Time in User Mode (Seconds): 5210.82
      CPU Time in Kernel Mode (Seconds): 6.65
      Total CPU Time (Seconds): 5217.47

  number of cores: 14
      Elapsed Time (Seconds): 383.11
      CPU Time in User Mode (Seconds): 5333.94
      CPU Time in Kernel Mode (Seconds): 7.19
      Total CPU Time (Seconds): 5341.13

  number of cores: 15
      Elapsed Time (Seconds): 371.02
      CPU Time in User Mode (Seconds): 5535.51
      CPU Time in Kernel Mode (Seconds): 8.63
      Total CPU Time (Seconds): 5544.14

  number of cores: 16
      Elapsed Time (Seconds): 369.69
      CPU Time in User Mode (Seconds): 5884.15
      CPU Time in Kernel Mode (Seconds): 10.30
      Total CPU Time (Seconds): 5894.45

  number of cores: 17
      Elapsed Time (Seconds): 370.02
      CPU Time in User Mode (Seconds): 6253.14
      CPU Time in Kernel Mode (Seconds): 12.39
      Total CPU Time (Seconds): 6265.53

  number of cores: 18
      Elapsed Time (Seconds): 365.45
      CPU Time in User Mode (Seconds): 6534.69
      CPU Time in Kernel Mode (Seconds): 15.15
      Total CPU Time (Seconds): 6549.84

  number of cores: 19
      Elapsed Time (Seconds): 353.25
      CPU Time in User Mode (Seconds): 6666.06
      CPU Time in Kernel Mode (Seconds): 14.48
      Total CPU Time (Seconds): 6680.54

  number of cores: 20
      Elapsed Time (Seconds): 340.50
      CPU Time in User Mode (Seconds): 6762.57
      CPU Time in Kernel Mode (Seconds): 14.73
      Total CPU Time (Seconds): 6777.29

  number of cores: 21
      Elapsed Time (Seconds): 328.74
      CPU Time in User Mode (Seconds): 6852.42
      CPU Time in Kernel Mode (Seconds): 15.43
      Total CPU Time (Seconds): 6867.85

  number of cores: 22
      Elapsed Time (Seconds): 319.61
      CPU Time in User Mode (Seconds): 6982.82
      CPU Time in Kernel Mode (Seconds): 16.54
      Total CPU Time (Seconds): 6999.36

  number of cores: 23
      Elapsed Time (Seconds): 313.50
      CPU Time in User Mode (Seconds): 7152.72
      CPU Time in Kernel Mode (Seconds): 17.35
      Total CPU Time (Seconds): 7170.07

  number of cores: 24
      Elapsed Time (Seconds): 309.10
      CPU Time in User Mode (Seconds): 7355.17
      CPU Time in Kernel Mode (Seconds): 17.69
      Total CPU Time (Seconds): 7372.86

  number of cores: 25
      Elapsed Time (Seconds): 300.04
      CPU Time in User Mode (Seconds): 7442.42
      CPU Time in Kernel Mode (Seconds): 18.56
      Total CPU Time (Seconds): 7460.98

  number of cores: 26
      Elapsed Time (Seconds): 292.73
      CPU Time in User Mode (Seconds): 7547.08
      CPU Time in Kernel Mode (Seconds): 19.14
      Total CPU Time (Seconds): 7566.22

  number of cores: 27
      Elapsed Time (Seconds): 288.48
      CPU Time in User Mode (Seconds): 7728.09
      CPU Time in Kernel Mode (Seconds): 19.00
      Total CPU Time (Seconds): 7747.09

  number of cores: 28
      Elapsed Time (Seconds): 285.76
      CPU Time in User Mode (Seconds): 7932.20
      CPU Time in Kernel Mode (Seconds): 19.19
      Total CPU Time (Seconds): 7951.39

  number of cores: 29
      Elapsed Time (Seconds): 284.31
      CPU Time in User Mode (Seconds): 8169.74
      CPU Time in Kernel Mode (Seconds): 21.01
      Total CPU Time (Seconds): 8190.75

  number of cores: 30
      Elapsed Time (Seconds): 283.19
      CPU Time in User Mode (Seconds): 8413.79
      CPU Time in Kernel Mode (Seconds): 22.21
      Total CPU Time (Seconds): 8436.00

  number of cores: 31
      Elapsed Time (Seconds): 277.95
      CPU Time in User Mode (Seconds): 8529.79
      CPU Time in Kernel Mode (Seconds): 23.01
      Total CPU Time (Seconds): 8552.80

  number of cores: 32
      Elapsed Time (Seconds): 272.66
      CPU Time in User Mode (Seconds): 8633.00
      CPU Time in Kernel Mode (Seconds): 23.70
      Total CPU Time (Seconds): 8656.70

  number of cores: 33
      Elapsed Time (Seconds): 268.51
      CPU Time in User Mode (Seconds): 8771.39
      CPU Time in Kernel Mode (Seconds): 23.03
      Total CPU Time (Seconds): 8794.42

  number of cores: 34
      Elapsed Time (Seconds): 265.62
      CPU Time in User Mode (Seconds): 8936.24
      CPU Time in Kernel Mode (Seconds): 23.73
      Total CPU Time (Seconds): 8959.96

  number of cores: 35
      Elapsed Time (Seconds): 263.28
      CPU Time in User Mode (Seconds): 9120.66
      CPU Time in Kernel Mode (Seconds): 26.32
      Total CPU Time (Seconds): 9146.98

  number of cores: 36
      Elapsed Time (Seconds): 261.69
      CPU Time in User Mode (Seconds): 9317.39
      CPU Time in Kernel Mode (Seconds): 26.13
      Total CPU Time (Seconds): 9343.52

  number of cores: 37
      Elapsed Time (Seconds): 261.19
      CPU Time in User Mode (Seconds): 9551.63
      CPU Time in Kernel Mode (Seconds): 27.89
      Total CPU Time (Seconds): 9579.52

  number of cores: 38
      Elapsed Time (Seconds): 260.83
      CPU Time in User Mode (Seconds): 9787.89
      CPU Time in Kernel Mode (Seconds): 29.41
      Total CPU Time (Seconds): 9817.30

  number of cores: 39
      Elapsed Time (Seconds): 260.68
      CPU Time in User Mode (Seconds): 10037.93
      CPU Time in Kernel Mode (Seconds): 30.28
      Total CPU Time (Seconds): 10068.21

  number of cores: 40
      Elapsed Time (Seconds): 260.96
      CPU Time in User Mode (Seconds): 10300.40
      CPU Time in Kernel Mode (Seconds): 30.48
      Total CPU Time (Seconds): 10330.89
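The busy-core argument made for 11 and 40 cores can be reproduced from the listing above. The sketch below computes the average CPU time per core, the fraction of the elapsed time each core was busy, and how much the total CPU cost grew between the two runs. The function names are hypothetical.

```python
# (elapsed seconds, total CPU seconds) copied from the listing above.
runs = {
    11: (460.50, 5050.60),
    40: (260.96, 10330.89),
}


def average_cpu_per_core(cores):
    """Total CPU time divided evenly over the enabled cores."""
    _, total_cpu = runs[cores]
    return total_cpu / cores


def busy_fraction(cores):
    """Share of the elapsed time that an average core spent computing."""
    elapsed, _ = runs[cores]
    return average_cpu_per_core(cores) / elapsed


cpu_growth = runs[40][1] / runs[11][1]  # extra CPU cost of 40 cores vs 11

for cores in runs:
    print(f"{cores} cores: {average_cpu_per_core(cores):.2f} s per core, "
          f"busy {100 * busy_fraction(cores):.1f}% of elapsed time")
print(f"total CPU time grew by a factor of {cpu_growth:.2f}")
```

A busy fraction near 1 together with a growing total CPU time indicates that the cores were occupied but each unit of work cost more, which matches the explanation given in the text.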

(C) 41 Cores to 48 Cores
        Finally, let us look at the timing result from 41 cores to 48 cores. Enabling 41 or more cores could not speed up the computation any further. The problem is the extra cost: the cost added by each additional core was not offset by the work that core contributed, so using additional cores is not worthwhile in this situation. As mentioned at the beginning, the matrices in this example were allocated in a variable, so the computation can be regarded as running in an SMP environment. SMP degrades parallel performance when many cores cooperate on one application; in that situation, NUMA is more efficient. The timing result from 41 cores to 48 cores is as follows:

  number of cores: 41
      Elapsed Time (Seconds): 261.44
      CPU Time in User Mode (Seconds): 10580.33
      CPU Time in Kernel Mode (Seconds): 31.87
      Total CPU Time (Seconds): 10612.20

  number of cores: 42
      Elapsed Time (Seconds): 262.02
      CPU Time in User Mode (Seconds): 10861.20
      CPU Time in Kernel Mode (Seconds): 31.51
      Total CPU Time (Seconds): 10892.71

  number of cores: 43
      Elapsed Time (Seconds): 262.22
      CPU Time in User Mode (Seconds): 11113.93
      CPU Time in Kernel Mode (Seconds): 34.34
      Total CPU Time (Seconds): 11148.27

  number of cores: 44
      Elapsed Time (Seconds): 262.19
      CPU Time in User Mode (Seconds): 11371.97
      CPU Time in Kernel Mode (Seconds): 35.41
      Total CPU Time (Seconds): 11407.39

  number of cores: 45
      Elapsed Time (Seconds): 262.30
      CPU Time in User Mode (Seconds): 11632.15
      CPU Time in Kernel Mode (Seconds): 36.97
      Total CPU Time (Seconds): 11669.12

  number of cores: 46
      Elapsed Time (Seconds): 262.80
      CPU Time in User Mode (Seconds): 11910.65
      CPU Time in Kernel Mode (Seconds): 37.27
      Total CPU Time (Seconds): 11947.91

  number of cores: 47
      Elapsed Time (Seconds): 263.38
      CPU Time in User Mode (Seconds): 12195.57
      CPU Time in Kernel Mode (Seconds): 37.21
      Total CPU Time (Seconds): 12232.77

  number of cores: 48
      Elapsed Time (Seconds): 263.86
      CPU Time in User Mode (Seconds): 12465.74
      CPU Time in Kernel Mode (Seconds): 39.00
      Total CPU Time (Seconds): 12504.74
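One way to see where additional cores stop paying off is to look at the change in elapsed time caused by each extra core. The sketch below does this for 36 to 48 cores using the elapsed times listed in sections (B) and (C). The variable names are hypothetical.

```python
# Elapsed times in seconds for 36 to 48 cores, copied from the listings above.
elapsed = {
    36: 261.69, 37: 261.19, 38: 260.83, 39: 260.68, 40: 260.96,
    41: 261.44, 42: 262.02, 43: 262.22, 44: 262.19, 45: 262.30,
    46: 262.80, 47: 263.38, 48: 263.86,
}

# Core count with the smallest elapsed time in this range.
best_cores = min(elapsed, key=elapsed.get)

# Change in elapsed time from enabling one more core (negative means faster).
marginal = {p: elapsed[p] - elapsed[p - 1] for p in elapsed if p - 1 in elapsed}

print(f"fastest run in this range: {best_cores} cores, {elapsed[best_cores]:.2f} s")
for p, delta in marginal.items():
    print(f"{p - 1} -> {p} cores: {delta:+.2f} s")
```

Every run beyond the fastest one in this range takes longer than it, even though the total CPU time keeps rising, so the added cores consume resources without shortening the computation.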

Speedup and Efficiency

        Speedup and efficiency are summarized in the following table. The first column is the number of cores, the second column is the elapsed time in seconds, and the third column is the speedup. The table shows that an almost perfect speedup was achieved on 10 cores or fewer. For example, 2 cores improved the computing speed to 2.0047x, 3 cores improved it to 3.0061x, and so on. On 10 cores, the speed was improved to 9.7654x.

        The fourth column is the parallel efficiency, that is, the speedup divided by the number of cores. An efficiency above 90% was achieved on 14 cores or fewer. For example, 2 cores achieved an efficiency of 100.24%, 3 cores achieved 100.20%, and so on. On 14 cores, the efficiency is 91.12%. The list of speedup and efficiency is as follows:

Number of Cores   Elapsed Time (sec)   Speedup   Efficiency (%)
1                 4887.09              1.0000    100.00
2                 2437.81              2.0047    100.24
3                 1625.73              3.0061    100.20
4                 1220.16              4.0053    100.13
5                 976.63               5.0040    100.08
6                 814.92               5.9970    99.95
7                 701.63               6.9653    99.50
8                 616.59               7.9260    99.07
9                 551.71               8.8581    98.42
10                500.45               9.7654    97.65
11                460.50               10.6126   96.48
12                427.93               11.4203   95.17
13                402.65               12.1373   93.36
14                383.11               12.7564   91.12
15                371.02               13.1720   87.81
16                369.69               13.2194   82.62
17                370.02               13.2076   77.69
18                365.45               13.3728   74.29
19                353.25               13.8346   72.81
20                340.50               14.3529   71.76
21                328.74               14.8661   70.79
22                319.61               15.2908   69.50
23                313.50               15.5888   67.78
24                309.10               15.8107   65.88
25                300.04               16.2881   65.15
26                292.73               16.6949   64.21
27                288.48               16.9408   62.74
28                285.76               17.1021   61.08
29                284.31               17.1893   59.27
30                283.19               17.2573   57.52
31                277.95               17.5826   56.72
32                272.66               17.9238   56.01
33                268.51               18.2008   55.15
34                265.62               18.3988   54.11
35                263.28               18.5623   53.04
36                261.69               18.6751   51.88
37                261.19               18.7109   50.57
38                260.83               18.7367   49.31
39                260.68               18.7475   48.07
40                260.96               18.7274   46.82
41                261.44               18.6930   45.59
42                262.02               18.6516   44.41
43                262.22               18.6374   43.34
44                262.19               18.6395   42.36
45                262.30               18.6317   41.40
46                262.80               18.5962   40.43
47                263.38               18.5553   39.48
48                263.86               18.5215   38.59
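The speedup and efficiency columns follow directly from the elapsed times: speedup is the one-core time divided by the p-core time, and efficiency is the speedup divided by p, expressed as a percentage. A minimal sketch for a few of the rows, with hypothetical function names:

```python
T1 = 4887.09  # elapsed time on one core, in seconds


def speedup(elapsed_seconds):
    """How many times faster the run is than the one-core run."""
    return T1 / elapsed_seconds


def efficiency_percent(cores, elapsed_seconds):
    """Speedup per core, as a percentage of the ideal value."""
    return 100.0 * speedup(elapsed_seconds) / cores


# A few (cores, elapsed seconds) pairs copied from the table.
samples = [(2, 2437.81), (10, 500.45), (14, 383.11), (38, 260.83)]

for cores, seconds in samples:
    print(f"{cores:2d} cores: speedup {speedup(seconds):.4f}, "
          f"efficiency {efficiency_percent(cores, seconds):.2f}%")
```

An efficiency slightly above 100% on a few cores means the run beat the ideal 1/p scaling, which can happen when the one-core measurement carries a small amount of extra overhead.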