Equation Solution High Performance by Design
Parallel Performance of 8-Byte Matrix Multiplication
[Posted by Jenn-Ching Luo on June 14, 2016 ]
There are two related posts. One showed the performance of multiplying two 16-byte (quad precision) matrices, and the other showed the performance of a 10-byte (extended precision) matrix product. The efficiency was inconsistent: 64 cores sped up the product of two 16-byte matrices by up to 60 times, while the multiplication of two 10-byte matrices did not get any faster when using more than 20 cores.
This post shows the parallel performance of an 8-byte (double precision) matrix product: 48 cores sped up the 8-byte matrix product by up to 40.5x. This performance is also inconsistent with that of the 10-byte matrix product. There is an explanation for why different variable types lead to different efficiency, but this post does not address those technical issues; it only reports the parallel performance, which is as follows.
TESTING EXAMPLE
Compute [C]=[A][B], where matrices [A], [B], and [C] are 8-byte real matrices. Matrix [A] is of order (15000-by-11000), matrix [B] is of order (11000-by-12000), and matrix [C] is of order (15000-by-12000).
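To give a sense of the problem size, the short Python sketch below works out the storage for the three matrices and a rough operation count. It assumes a conventional multiplication that does one multiply and one add per inner-product term; the post does not say which algorithm laipe$matmul_8 uses, so the operation count is only an estimate. The variable names are illustrative.

```python
# Dimensions from the testing example: [A] is m-by-k, [B] is k-by-n, [C] is m-by-n.
m, k, n = 15000, 11000, 12000
BYTES_PER_ELEMENT = 8  # 8-byte (double precision) reals


def matrix_bytes(rows, cols):
    """Storage needed for a dense rows-by-cols matrix of 8-byte reals."""
    return rows * cols * BYTES_PER_ELEMENT


bytes_a = matrix_bytes(m, k)
bytes_b = matrix_bytes(k, n)
bytes_c = matrix_bytes(m, n)
total_bytes = bytes_a + bytes_b + bytes_c

# Assumption: conventional algorithm, where each of the m*n entries of [C]
# takes k multiplications and k additions.
flops = 2 * m * k * n

print(f"[A]: {bytes_a / 1e9:.3f} GB")
print(f"[B]: {bytes_b / 1e9:.3f} GB")
print(f"[C]: {bytes_c / 1e9:.3f} GB")
print(f"Total storage: {total_bytes / 1e9:.3f} GB")
print(f"Estimated floating-point operations: {flops:.3e}")
```

Under these assumptions the three matrices occupy a little under 4 GB in total, and the product takes on the order of 4 x 10^12 floating-point operations.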
COMPUTING ENVIRONMENT
Computer: a Dell PowerEdge R815 with quad Opteron 6168, a total of 48 cores.
Operating System: Windows Server 2008 R2
Compiler: gfortran with optimization -O3. The application links against neuloop4 for parallel processing.
Subroutine: laipe$matmul_8, which performs matrix multiplication in parallel
COMPARISON WITH GFORTRAN INTRINSIC FUNCTION MATMUL
gfortran has an intrinsic function, matmul, for matrix multiplication. The intrinsic function matmul is sequential and cannot take advantage of multiple cores. Before showing the parallel performance of laipe$matmul_8, we compare laipe$matmul_8, running on one core, with the intrinsic function matmul.
First, let us see the performance of the intrinsic function matmul. The timing result is as follows:
Elapsed Time (Seconds): 7265.82
CPU Time in User Mode (Seconds): 7264.33
CPU Time in Kernel Mode (Seconds): 1.50
Total CPU Time (Seconds): 7265.82
The intrinsic function matmul took 7265.82 seconds to compute the matrix multiplication.
Next, let us see the performance of the parallel subroutine laipe$matmul_8 on one core. We have the following timing result:
Elapsed Time (Seconds): 7086.91
CPU Time in User Mode (Seconds): 7084.63
CPU Time in Kernel Mode (Seconds): 1.95
Total CPU Time (Seconds): 7086.58
With one core enabled, the subroutine laipe$matmul_8 ran faster than the intrinsic function matmul. laipe$matmul_8 is a parallel subroutine and carries extra code for parallel processing, so one would expect it to be slower than matmul. However, with only one core enabled, laipe$matmul_8 took only 7086.91 seconds to perform the matrix multiplication, compared with 7265.82 seconds for matmul.
TIMING RESULT
Timing results include "elapsed time," "CPU time in user mode," "CPU time in kernel mode," and "total CPU time."
The timing results for 1 to 48 cores are as follows (all times in seconds):

Cores   Elapsed Time   CPU Time (User)   CPU Time (Kernel)   Total CPU Time
    1        7086.91           7084.63                1.95          7086.58
    2        3538.93           7076.46                0.67          7077.13
    3        2360.84           7080.57                0.89          7081.46
    4        1776.20           7085.67                0.80          7086.47
    5        1419.08           7092.35                0.70          7093.05
    6        1183.78           7099.17                0.83          7100.00
    7        1021.89           7103.77                0.72          7104.49
    8         890.70           7105.56                0.87          7106.44
    9         796.12           7111.38                0.98          7112.37
   10         716.06           7123.57                0.86          7124.43
   11         654.52           7125.08                0.92          7126.00
   12         597.61           7123.61                0.81          7124.43
   13         559.47           7201.68                0.97          7202.64
   14         522.09           7224.69                0.80          7225.48
   15         486.16           7213.00                0.66          7213.66
   16         458.07           7242.91                0.92          7243.83
   17         436.55           7353.48                0.94          7354.42
   18         414.09           7353.29                0.81          7354.11
   19         394.50           7373.03                0.75          7373.78
   20         368.94           7259.41                0.95          7260.36
   21         350.57           7265.81                0.76          7266.57
   22         336.23           7298.26                1.05          7299.30
   23         320.69           7262.85                1.05          7263.89
   24         308.52           7318.74                1.00          7319.74
   25         297.68           7321.63                0.81          7322.44
   26         286.96           7322.41                1.00          7323.40
   27         276.98           7329.86                0.69          7330.55
   28         267.32           7333.12                0.80          7333.92
   29         258.70           7339.24                0.94          7340.17
   30         249.40           7342.81                0.78          7343.59
   31         240.38           7350.33                1.06          7351.39
   32         233.81           7365.81                0.67          7366.48
   33         229.56           7393.50                0.56          7394.06
   34         221.99           7401.73                0.83          7402.56
   35         216.19           7422.51                1.05          7423.56
   36         211.94           7456.69                0.80          7457.49
   37         206.97           7478.70                0.90          7479.61
   38         203.18           7513.54                0.75          7514.29
   39         198.26           7555.64                0.87          7556.52
   40         195.20           7592.27                0.84          7593.11
   41         190.12           7647.62                0.98          7648.60
   42         187.48           7686.03                1.03          7687.06
   43         185.58           7776.79                0.95          7777.74
   44         183.57           7835.40                0.89          7836.29
   45         178.98           7895.26                0.95          7896.21
   46         178.65           8006.25                1.25          8007.50
   47         178.11           8093.89                0.86          8094.75
   48         174.81           8228.79                1.05          8229.83

Forty-eight cores completed the computation in 174.81 seconds; one core took 7086.91 seconds, i.e., about 1 hour and 58 minutes. In other words, 48 cores completed a roughly 2-hour job in about 3 minutes. Next, we look at parallel speedup and efficiency.
SPEEDUP AND EFFICIENCY
The following table summarizes speedup and efficiency. The first column is the number of cores, the second is elapsed time, the third is parallel speedup, and the fourth is parallel efficiency. The speedup is almost linear: 48 cores improved the speed to 40.5x. The table also shows that 42 cores still achieved 90% efficiency.
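The speedup and efficiency figures can be recomputed from the elapsed times reported above. The post does not state the formulas it used, so this Python sketch assumes the usual definitions: speedup is the one-core elapsed time of laipe$matmul_8 divided by the elapsed time on p cores, and efficiency is speedup divided by p. Only a subset of the 48 runs is included, and the function names are illustrative.

```python
# Elapsed times (seconds) copied from the timing results in this post,
# keyed by number of cores. Only a subset of the 48 runs is listed here.
elapsed = {
    1: 7086.91,
    2: 3538.93,
    4: 1776.20,
    8: 890.70,
    16: 458.07,
    24: 308.52,
    32: 233.81,
    42: 187.48,
    48: 174.81,
}


def speedup(cores):
    """Assumed definition: one-core elapsed time over p-core elapsed time."""
    return elapsed[1] / elapsed[cores]


def efficiency(cores):
    """Assumed definition: speedup divided by the number of cores."""
    return speedup(cores) / cores


print(f"{'Cores':>5} {'Elapsed (s)':>12} {'Speedup':>8} {'Efficiency':>11}")
for cores in sorted(elapsed):
    print(
        f"{cores:5d} {elapsed[cores]:12.2f} "
        f"{speedup(cores):8.2f} {efficiency(cores) * 100:10.1f}%"
    )
```

With these definitions, the 48-core run works out to roughly 40.5x and the 42-core run to roughly 90% efficiency, matching the figures quoted in the text.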