Equation Solution High Performance by Design 




Parallel Performance of 8-Byte Matrix Multiplication
[Posted by JennChing Luo on June 14, 2016]
This post presents parallel performance results for an 8-byte matrix product.
There are two related posts: one reports results for a 16-byte (quad precision) matrix product, and the other for a 10-byte (extended precision) matrix product. The parallel performance of the two differs. The 16-byte matrix product ran up to 60 times faster on 64 cores, while the 10-byte matrix product showed no improvement beyond 20 cores. This post looks at a third variable type, the 8-byte (double precision) matrix product, whose computation was sped up 40.5x on 48 cores. That result also differs from the 10-byte case. The reasons different variable types perform differently can be explained, but this post does not address those technical issues; it only reports the parallel performance of an 8-byte (double precision) matrix product, as follows.
TESTING EXAMPLE
Compute [C]=[A][B], where [A], [B], and [C] are 8-byte real matrices. Matrix [A] is 15000-by-11000, matrix [B] is 11000-by-12000, and matrix [C] is 15000-by-12000.
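To put the problem size in perspective, the following Python sketch estimates the arithmetic work and the storage needed for this example. It assumes the classical algorithm with one multiply and one add per inner-product term (2*m*k*n operations in total), which this post does not state; the helper name matmul_cost is hypothetical and is not part of any library mentioned here.

```python
def matmul_cost(m, k, n, bytes_per_element=8):
    """Estimate the cost of C = A * B with A (m-by-k) and B (k-by-n).

    Assumes the classical algorithm: each of the m*n entries of C needs
    k multiplications and k additions.
    """
    flops = 2 * m * k * n
    # Storage for all three matrices, in bytes.
    storage = bytes_per_element * (m * k + k * n + m * n)
    return flops, storage


# Dimensions from the testing example: A is 15000-by-11000, B is 11000-by-12000.
flops, storage = matmul_cost(15000, 11000, 12000)
print(f"floating-point operations: {flops:.3e}")
print(f"storage for A, B and C: {storage / 1e9:.3f} GB")
```

Under these assumptions the three matrices together occupy a little under 4 GB, and the product needs on the order of 10^12 floating-point operations.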
COMPUTING ENVIRONMENT
Computer: a Dell PowerEdge R815 with quad Opteron 6168, a total of 48 cores.
Operating System: Windows Server 2008 R2
Compiler: gfortran with -O3 optimization; the application links against neuloop4 for parallel processing.
Subroutine: laipe$matmul_8, which performs matrix multiplication in parallel
COMPARISON WITH GFORTRAN INTRINSIC FUNCTION MATMUL
gfortran provides the intrinsic function matmul for matrix multiplication. The intrinsic function matmul is sequential and cannot take advantage of multiple cores. Before showing the parallel performance of laipe$matmul_8, we compare laipe$matmul_8 on one core with the intrinsic function matmul. First, the timing result for matmul is as follows:
Elapsed Time (Seconds): 7265.82
CPU Time in User Mode (Seconds): 7264.33
CPU Time in Kernel Mode (Seconds): 1.50
Total CPU Time (Seconds): 7265.82
The intrinsic function matmul took 7265.82 seconds to compute the matrix multiplication. Next, the timing result for the parallel subroutine laipe$matmul_8 on one core is as follows:
Elapsed Time (Seconds): 7086.91
CPU Time in User Mode (Seconds): 7084.63
CPU Time in Kernel Mode (Seconds): 1.95
Total CPU Time (Seconds): 7086.58
With one core enabled, laipe$matmul_8 took 7086.91 seconds, faster than the intrinsic function matmul. laipe$matmul_8 is a parallel subroutine and carries extra code for parallel processing, so one might expect it to be slower than matmul; nevertheless, it ran faster.
TIMING RESULT
Timing results include "elapsed time", "CPU time in user mode", "CPU time in kernel mode", and "total CPU time".
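These four quantities are related in a simple way: total CPU time is the sum of user-mode and kernel-mode time, and dividing total CPU time by elapsed time gives the average number of busy cores. The Python sketch below checks this against the one-core figures reported in this post and also computes how much faster laipe$matmul_8 was than matmul on one core. The function name busy_cores is hypothetical, and the reported values are rounded to 0.01 second, so sums can differ in the last digit.

```python
def busy_cores(elapsed, user, kernel):
    """Return (total CPU time, average number of busy cores)."""
    total_cpu = user + kernel
    return total_cpu, total_cpu / elapsed


# One-core timing of laipe$matmul_8, copied from this post (seconds).
total_cpu, cores = busy_cores(elapsed=7086.91, user=7084.63, kernel=1.95)
print(f"total CPU time: {total_cpu:.2f} s")
print(f"average busy cores: {cores:.3f}")

# Elapsed times of the two one-core runs, copied from this post (seconds).
matmul_elapsed = 7265.82
laipe_elapsed = 7086.91
saving = (matmul_elapsed - laipe_elapsed) / matmul_elapsed
print(f"laipe$matmul_8 on one core took {saving:.1%} less time than matmul")
```

On more than one core, the same ratio of total CPU time to elapsed time should stay close to the number of cores in use as long as all of them are kept busy.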
The timing results on 1 to 48 cores are as follows (all times in seconds):

Cores   Elapsed Time   CPU Time in User Mode   CPU Time in Kernel Mode   Total CPU Time
  1       7086.91            7084.63                    1.95                7086.58
  2       3538.93            7076.46                    0.67                7077.13
  3       2360.84            7080.57                    0.89                7081.46
  4       1776.20            7085.67                    0.80                7086.47
  5       1419.08            7092.35                    0.70                7093.05
  6       1183.78            7099.17                    0.83                7100.00
  7       1021.89            7103.77                    0.72                7104.49
  8        890.70            7105.56                    0.87                7106.44
  9        796.12            7111.38                    0.98                7112.37
 10        716.06            7123.57                    0.86                7124.43
 11        654.52            7125.08                    0.92                7126.00
 12        597.61            7123.61                    0.81                7124.43
 13        559.47            7201.68                    0.97                7202.64
 14        522.09            7224.69                    0.80                7225.48
 15        486.16            7213.00                    0.66                7213.66
 16        458.07            7242.91                    0.92                7243.83
 17        436.55            7353.48                    0.94                7354.42
 18        414.09            7353.29                    0.81                7354.11
 19        394.50            7373.03                    0.75                7373.78
 20        368.94            7259.41                    0.95                7260.36
 21        350.57            7265.81                    0.76                7266.57
 22        336.23            7298.26                    1.05                7299.30
 23        320.69            7262.85                    1.05                7263.89
 24        308.52            7318.74                    1.00                7319.74
 25        297.68            7321.63                    0.81                7322.44
 26        286.96            7322.41                    1.00                7323.40
 27        276.98            7329.86                    0.69                7330.55
 28        267.32            7333.12                    0.80                7333.92
 29        258.70            7339.24                    0.94                7340.17
 30        249.40            7342.81                    0.78                7343.59
 31        240.38            7350.33                    1.06                7351.39
 32        233.81            7365.81                    0.67                7366.48
 33        229.56            7393.50                    0.56                7394.06
 34        221.99            7401.73                    0.83                7402.56
 35        216.19            7422.51                    1.05                7423.56
 36        211.94            7456.69                    0.80                7457.49
 37        206.97            7478.70                    0.90                7479.61
 38        203.18            7513.54                    0.75                7514.29
 39        198.26            7555.64                    0.87                7556.52
 40        195.20            7592.27                    0.84                7593.11
 41        190.12            7647.62                    0.98                7648.60
 42        187.48            7686.03                    1.03                7687.06
 43        185.58            7776.79                    0.95                7777.74
 44        183.57            7835.40                    0.89                7836.29
 45        178.98            7895.26                    0.95                7896.21
 46        178.65            8006.25                    1.25                8007.50
 47        178.11            8093.89                    0.86                8094.75
 48        174.81            8228.79                    1.05                8229.83

48 cores computed the matrix product in 174.81 seconds. One core took 7086.91 seconds, i.e., about 1 hour and 58 minutes. In other words, 48 cores finished an about-2-hour job in less than 3 minutes. Next, we look at parallel speedup and efficiency.
SPEEDUP AND EFFICIENCY
Speedup and efficiency are summarized in a table with four columns: the number of cores, the elapsed time in seconds, the parallel speedup, and the parallel efficiency. The table shows an almost linear speedup: on 48 cores, the computation ran 40.5x faster than on one core. It also shows that 42 cores achieve 90% efficiency.
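The speedup and efficiency figures quoted here can be reproduced from the elapsed times. The Python sketch below assumes the usual definitions, which this post does not spell out: speedup on p cores is the one-core elapsed time divided by the p-core elapsed time, and efficiency is speedup divided by p. Only a few of the 48 measurements are included, and the function name speedup_and_efficiency is hypothetical.

```python
# Elapsed times in seconds, copied from the timing results in this post.
elapsed = {1: 7086.91, 2: 3538.93, 12: 597.61, 24: 308.52, 42: 187.48, 48: 174.81}


def speedup_and_efficiency(elapsed, cores):
    """Return (speedup, efficiency) relative to the one-core elapsed time."""
    speedup = elapsed[1] / elapsed[cores]
    return speedup, speedup / cores


for cores in sorted(elapsed):
    s, e = speedup_and_efficiency(elapsed, cores)
    print(f"{cores:2d} cores: speedup {s:6.2f}x, efficiency {e:6.1%}")
```

With these definitions, the 48-core run gives the 40.5x speedup quoted in this post, and the 42-core run gives an efficiency of about 90%.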



