Equation Solution High Performance by Design
Parallel Performance of 10-byte Real Matrix Product on 48 Cores
[Posted by Jenn-Ching Luo on May 7, 2016]
This post shares a set of parallel performance results for matrix product. The parallelism of matrix product is evident from its formulation: every entry of the product can be computed independently of the others, which makes matrix product a good example of parallel processing. In this example, we are going to see a performance of "mixed" efficiency. The example achieved a highly efficient performance when a small number of cores was enabled. However, when many cores were enabled, performance degraded sharply.
For example, on 10 cores or fewer, it yielded an almost perfect speedup, with an efficiency above 97%; however, on 38 cores, parallel efficiency fell below 50%. One cause of the degradation is that the matrices were allocated in a variable, which could not take advantage of NUMA. SMP cannot yield an efficient performance on many cores, especially because matrix product is a memory-hungry procedure. The following shows the performance.
Example for Test
Compute [C]=[A][B]. Matrices [A], [B], and [C] are 10-byte real matrices of order 8200 (8200-by-8200).
Computing Environment:
Computer: Dell PowerEdge R815 with quad 6168 12-core Opteron, a total of 48 cores, Windows Server 2008 R2.
Compiler: gfortran with optimization -O3
Subroutine: LAIPE2 MATMUL_10 in neuloop4
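The parallelism noted at the top of this post can be illustrated with a short sketch. This is not the LAIPE2 MATMUL_10 implementation, whose internals are not described here; it is a minimal Python illustration, with hypothetical names (`row_block`, `parallel_matmul`), of how the rows of [C] can be divided among workers that never write to the same entry. In CPython, threads do not speed up pure-Python arithmetic, so the sketch shows only the decomposition, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def row_block(a, b, rows):
    """Compute the rows of C = A*B listed in `rows` (hypothetical helper)."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in rows
    ]


def parallel_matmul(a, b, workers=4):
    """Split the rows of C into contiguous blocks, one block per worker."""
    n = len(a)
    if n == 0:
        return []
    workers = max(1, min(workers, n))
    step = -(-n // workers)  # ceiling division: rows per block
    blocks = [range(s, min(s + step, n)) for s in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in block order, so rows stay in order.
        parts = pool.map(lambda rows: row_block(a, b, rows), blocks)
        return [row for part in parts for row in part]


if __name__ == "__main__":
    print(parallel_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], workers=2))
    # prints [[19, 22], [43, 50]]
```

Each block reads [A] and [B] but writes only its own rows of [C], so no locking is needed. That independence is why the timing results below can keep every enabled core busy.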
Timing Result
(A) 1 Core to 10 Cores
First, let us see the performance from one core to ten cores. The performance was "almost perfect". Elapsed time fell in proportion to the reciprocal of the number of cores enabled. For example, one core took 4887.09 seconds to complete the computation, two cores required only 2437.81 seconds, and so on. The timing results from one core to 10 cores are as follows:
Cores  Elapsed Time (s)  User CPU Time (s)  Kernel CPU Time (s)  Total CPU Time (s)
    1           4887.09            4884.08                 2.68             4886.76
    2           2437.81            4870.63                 3.54             4874.17
    3           1625.73            4870.65                 3.28             4873.92
    4           1220.16            4871.91                 4.17             4876.08
    5            976.63            4875.59                 3.98             4879.57
    6            814.92            4878.78                 4.31             4883.08
    7            701.63            4899.90                 4.95             4904.84
    8            616.59            4918.01                 4.93             4922.94
    9            551.71            4945.98                 4.71             4950.69
   10            500.45            4985.56                 5.26             4990.82

(B) 11 Cores to 40 Cores
Next, let us see the performance from 11 cores to 40 cores. The elapsed time could still be reduced by using more cores. However, the performance cannot be considered "highly efficient". When more cores cooperated in the computation, "false caching" became significant in this example and degraded the performance.
For example, when using 11 cores, the total CPU time was 5050.60 seconds. On that basis, one core took 5050.60/11=459.15 seconds on average, which is very close to the elapsed time of 460.50 seconds. Because the average CPU time per core is almost equal to the elapsed time, the parallelism kept all the enabled cores busy all the time; none of them was idle for lack of parallelism. Clearly, it is an efficient parallelism. When using 40 cores, however, the total CPU time jumped to 10330.89 seconds, more than twice what 11 cores required. This extra cost is the main cause of the degraded overall performance on 40 cores. The degradation does not come from the parallelism, which is easily verified. The total CPU time was 10330.89 seconds, so one core took 10330.89/40=258.27 seconds on average, close to the elapsed time of 260.96 seconds. That shows the parallelism kept all enabled cores busy almost all the time; none of them was idle for lack of parallelism. Evidently, matrix product has an efficient parallelism. The timing results from 11 cores to 40 cores are as follows:
Cores  Elapsed Time (s)  User CPU Time (s)  Kernel CPU Time (s)  Total CPU Time (s)
   11            460.50            5044.99                 5.60             5050.60
   12            427.93            5113.15                 6.22             5119.38
   13            402.65            5210.82                 6.65             5217.47
   14            383.11            5333.94                 7.19             5341.13
   15            371.02            5535.51                 8.63             5544.14
   16            369.69            5884.15                10.30             5894.45
   17            370.02            6253.14                12.39             6265.53
   18            365.45            6534.69                15.15             6549.84
   19            353.25            6666.06                14.48             6680.54
   20            340.50            6762.57                14.73             6777.29
   21            328.74            6852.42                15.43             6867.85
   22            319.61            6982.82                16.54             6999.36
   23            313.50            7152.72                17.35             7170.07
   24            309.10            7355.17                17.69             7372.86
   25            300.04            7442.42                18.56             7460.98
   26            292.73            7547.08                19.14             7566.22
   27            288.48            7728.09                19.00             7747.09
   28            285.76            7932.20                19.19             7951.39
   29            284.31            8169.74                21.01             8190.75
   30            283.19            8413.79                22.21             8436.00
   31            277.95            8529.79                23.01             8552.80
   32            272.66            8633.00                23.70             8656.70
   33            268.51            8771.39                23.03             8794.42
   34            265.62            8936.24                23.73             8959.96
   35            263.28            9120.66                26.32             9146.98
   36            261.69            9317.39                26.13             9343.52
   37            261.19            9551.63                27.89             9579.52
   38            260.83            9787.89                29.41             9817.30
   39            260.68           10037.93                30.28            10068.21
   40            260.96           10300.40                30.48            10330.89

(C) 41 Cores to 48 Cores
We now turn to the timing results from 41 cores to 48 cores. Enabling 41 or more cores could not speed up the computation any further. The problem is the extra cost: the cost incurred by enabling an additional core could not be offset by the work that core contributed. In such a situation, it is not worth using any additional core. As mentioned at the beginning, the matrices in this example were allocated in a variable, so the computation can be interpreted as running in an SMP environment. SMP degrades parallel performance when many cores cooperate on an application. In such a situation, NUMA is more efficient. The timing results from 41 cores to 48 cores are as follows:
Cores  Elapsed Time (s)  User CPU Time (s)  Kernel CPU Time (s)  Total CPU Time (s)
   41            261.44           10580.33                31.87            10612.20
   42            262.02           10861.20                31.51            10892.71
   43            262.22           11113.93                34.34            11148.27
   44            262.19           11371.97                35.41            11407.39
   45            262.30           11632.15                36.97            11669.12
   46            262.80           11910.65                37.27            11947.91
   47            263.38           12195.57                37.21            12232.77
   48            263.86           12465.74                39.00            12504.74

Speedup and Efficiency
Speedup and efficiency are summarized in the following table. The first column is the number of cores; the second column is the elapsed time in seconds; the third column is the speedup. From the table, it can be seen that an almost perfect speedup was achieved on 10 cores or fewer. For example, 2 cores improved the computing speed to 2.0047x, 3 cores improved it to 3.0061x, and so on. On 10 cores, the speed was improved to 9.7654x. The fourth column is the parallel efficiency. It can also be seen that an efficiency above 90% was achieved when enabling 14 cores or fewer. For example, 2 cores achieved an efficiency of 100.24%, 3 cores achieved 100.20%, and so on. On 14 cores, the efficiency is 91.12%. The list of speedup and efficiency is as follows:
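The speedup and efficiency columns can be recomputed from the elapsed times reported above. The following is a minimal sketch, assuming the usual definitions (speedup is the one-core elapsed time divided by the n-core elapsed time, and efficiency is speedup divided by n), which agree with the figures quoted in this post. The names `ELAPSED`, `speedup`, and `efficiency` are illustrative only.

```python
# Elapsed times in seconds for 1..48 cores, copied from the timing results.
ELAPSED = [
    4887.09, 2437.81, 1625.73, 1220.16, 976.63, 814.92, 701.63, 616.59,
    551.71, 500.45, 460.50, 427.93, 402.65, 383.11, 371.02, 369.69,
    370.02, 365.45, 353.25, 340.50, 328.74, 319.61, 313.50, 309.10,
    300.04, 292.73, 288.48, 285.76, 284.31, 283.19, 277.95, 272.66,
    268.51, 265.62, 263.28, 261.69, 261.19, 260.83, 260.68, 260.96,
    261.44, 262.02, 262.22, 262.19, 262.30, 262.80, 263.38, 263.86,
]


def speedup(cores):
    """Speedup relative to the one-core run: T(1) / T(cores)."""
    return ELAPSED[0] / ELAPSED[cores - 1]


def efficiency(cores):
    """Parallel efficiency: speedup divided by the number of cores."""
    return speedup(cores) / cores


if __name__ == "__main__":
    print("cores  elapsed(s)  speedup  efficiency")
    for n, t in enumerate(ELAPSED, start=1):
        print(f"{n:5d}  {t:10.2f}  {speedup(n):7.4f}  {efficiency(n):10.2%}")
```

Running the script prints one row per core count in the same four-column layout described above: number of cores, elapsed time, speedup, and efficiency.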
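The check used in part (B), dividing the total CPU time by the number of cores and comparing the result with the elapsed time, can be sketched in the same way. This is a minimal illustration using four of the runs reported above; the names `RUNS`, `average_cpu_per_core`, and `busy_fraction` are hypothetical.

```python
# (cores, elapsed seconds, total CPU seconds), copied from the timing results.
RUNS = [
    (1, 4887.09, 4886.76),
    (11, 460.50, 5050.60),
    (40, 260.96, 10330.89),
    (48, 263.86, 12504.74),
]


def average_cpu_per_core(cores, total_cpu):
    """Average CPU time one core spent on the computation."""
    return total_cpu / cores


def busy_fraction(cores, elapsed, total_cpu):
    """Share of the elapsed time that a core was busy, on average."""
    return average_cpu_per_core(cores, total_cpu) / elapsed


if __name__ == "__main__":
    baseline_cpu = RUNS[0][2]  # total CPU time of the one-core run
    for cores, elapsed, total_cpu in RUNS:
        avg = average_cpu_per_core(cores, total_cpu)
        busy = busy_fraction(cores, elapsed, total_cpu)
        growth = total_cpu / baseline_cpu
        print(f"{cores:2d} cores: avg CPU/core {avg:8.2f} s, "
              f"busy {busy:6.2%}, total CPU {growth:.2f}x one core")
```

A busy fraction near 100% together with a growing total CPU time is the pattern described in part (B): the cores are not idle, but each unit of work costs more CPU time as more cores are enabled.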